How software is like a wooden boat, musings on maintainability

Posted on October 14, 2020 by wjwh

Of late I’ve been enjoying this video series about the reconstruction of the Tally Ho, a 100+ year old wooden sailing yacht that was in very poor condition and needed an almost complete rebuild. Throughout the series it comes up again and again how wooden boat design emphasizes maintainability: every part of a boat can (and is expected to) be easily replaced during periodic maintenance. Planking, masts and even keels may seem quite central to the construction, but in reality they are sacrificial. All these elements are to be replaced as necessary so that the whole may survive. Doing this properly requires not just time and resources but also people with the required experience, and a lot of attention is given to maintaining the ‘human infrastructure’ that makes maintenance possible. In this, there is a lot to be learned for the software world.

It seems like every few weeks there is another horror story about some giant legacy software system lurking untouched for decades in the bowels of a giant corporation, finally about to collapse under its own weight while everyone who could rescue it has long since passed away from old age. Whether it’s a big bank, the Department of Veterans Affairs over in the US or a national healthcare organisation, big old software systems keep working until they suddenly don’t, and then require massive effort to bring back up.

A common thread in all these IT failures is that a complex system was set up decades ago and because it “works”, it has been left mostly alone since initial construction. This lack of maintenance then inevitably leads to failure down the road.

Well-maintained and maintainable

In the context of software, the concepts of being well maintained and being maintainable are often confused. I propose the following definitions:

Typically a maintainable program will also be kept well maintained, though that is not a guarantee. Since maintainability depends a lot on the team, it is usually a very good idea to keep working on a program so that knowledge of its workings remains fresh.

To maintain or not to maintain

There is an obvious moment where building for maintainability will not pay off: for programs with a sufficiently low remaining lifespan, additional maintenance is going to cost more than the benefits you get from an up-to-date program. Of course, it is extremely hard to accurately determine the remaining lifespan of a program, as evidenced by all the “temporary” solutions running in production today. There is a huge and enduring underestimation of how long any given program will remain needed.
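To make the trade-off a bit more concrete, here is a rough back-of-the-envelope sketch in Python. The model is my own simplification and not taken from any real project: it assumes a one-off investment in maintainability, a fixed yearly maintenance cost and a fixed yearly benefit, all in the same arbitrary unit. The function names and all numbers are made up for illustration.

```python
def break_even_years(upfront_cost, yearly_cost, yearly_benefit):
    """Years of remaining lifespan needed before maintenance pays for itself.

    Returns None when the yearly benefit never exceeds the yearly cost,
    in which case the investment never pays off under this simple model.
    """
    net_per_year = yearly_benefit - yearly_cost
    if net_per_year <= 0:
        return None
    return upfront_cost / net_per_year


def maintenance_pays_off(remaining_years, upfront_cost, yearly_cost, yearly_benefit):
    """True if total benefit over the remaining lifespan exceeds total cost."""
    total_cost = upfront_cost + yearly_cost * remaining_years
    total_benefit = yearly_benefit * remaining_years
    return total_benefit > total_cost


# Hypothetical numbers: 40 units up front, 5 per year to maintain,
# 15 per year in benefit from having an up-to-date program.
print(break_even_years(40, 5, 15))         # 4.0
print(maintenance_pays_off(2, 40, 5, 15))  # False: retired before break-even
print(maintenance_pays_off(6, 40, 5, 15))  # True: lives past break-even
```

The catch is the `remaining_years` input: if the “temporary” program that was supposed to last two years ends up running for ten, the answer flips.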

A second reason not to maintain a software system is that it is perceived to be “working”, leading to the often heard phrase “if it ain’t broke don’t fix it”. This is a recipe for disaster, since with the passage of time all the original developers will eventually move on and the project will be left almost unmaintainable. Ironically, the fewer problems any given program displays, the higher the chance of it failing catastrophically in the future and the less anyone will be able to do about it. Documentation is only a partial answer to this knowledge gap. All serious programs I have seen require extensive domain knowledge from the programmer to understand them. The only way to gain this domain knowledge is experience, hopefully accompanied by advice and pointers from someone who is already an expert.

Summary

In the context of any long-lived asset, “if it ain’t broke don’t fix it” really means “don’t start fixing it until it breaks”. For software this is more insidious, since cultural norms about maintenance of physical assets have been fairly well established. Of course you have to maintain your bridge/house/factory machines, otherwise they break. Even when these are left to deteriorate, the declining state is often clearly visible. On the other hand, software keeps “working” for quite a while until it suddenly doesn’t. In reality its state has been deteriorating for a while, but that is often not visible from the “outside”. When it finally breaks for real, not only is there a lot of deferred maintenance to catch up on, but the original team will have long since moved on. The only way around this seems to be to constantly keep up low-level work on the project, so that knowledge of the internals never gets lost. Making this work in short-term driven companies and/or underfunded open source projects is left as an exercise for the reader. :)