Don’t Let Failure Become Inevitable
15 May 2013, by David Simons
In 2003, the space shuttle Columbia embarked on its 28th mission. On many of the previous launches, debris had broken off the body of the launch vehicle. No major damage had ever resulted, so it was treated as a minor issue: fixed and forgotten about. Soon it became a matter of routine that something flew off. It wasn’t as expected, but it was tolerable. This time, however, a piece of foam insulation broke off during launch and damaged the orbiter’s thermal protection system.
Just over two weeks later, Columbia disintegrated during re-entry into the Earth’s atmosphere. All seven crew members died.
Fortunately, most of the software we write doesn’t have life-or-death consequences. However, there are still lessons the software industry can learn from this, as I found out when attending the QCon talk “The Inevitability of Failure.”
In the past we’ve had to work with fragile legacy systems inherited from other suppliers: fighting the odds with temperamental servers, poor regression suites, and non-standard server architecture.
After a while, the team in question works out how to “babysit” the system: doing things in a certain way, and only deviating from that process with good reason.
But occasionally, when time pressures force the team to deviate from the safe process, complications arise. At this point we get to use our excellent debugging and problem-solving abilities to fix the system promptly, and with any luck we avoid any negative consequences.
So, given that there were no negative consequences, why should we worry about these issues? I would suggest that this is analogous to the ‘margin of error’ that NASA engineers believed they had before the Challenger disaster, a margin that let bigger issues lie dormant.
In our industry, there’s always an expectation to move forward and a pressure to do things as fast as possible. If we went with the flow, the problem would be fixed but not understood. If it happened again, we’d be able to make the same fix, but again with no knowledge of why we’re doing it. Soon, it becomes part of the process that we check for these particular issues, and the defect is now a ‘beauty spot’ on the server that we have learnt to live with.
This boundary-shifting of what is acceptable is known as the “normalisation of deviance.” The phenomenon suggests that the more we experience problems, or deviations from the norm, the more we are able to live with them as part of everyday life. As the scope of what is deemed acceptable grows, we are more likely to encounter problems.
There is a different industry that is very aware of this phenomenon, and tackles it head-on: about 20% of all patients receiving anaesthetics in a hospital experience ‘complications.’ Fortunately, very few of these are fatal.
Doctors are competent and have such a depth of knowledge that they can fix most problems as they arise. It would be easy for anaesthetists to label these cases as successes and not record any issues, but because the stakes are high, they don’t want imperfections to become part of a daily routine. Any abnormalities are flagged as problematic, because they realise it takes just one instance of a complication outside their knowledge for the impact to be devastating.
I feel that software developers can learn from this philosophy, by connecting it with Softwire’s great core value: Quality. The lesson I took from the talk was that we shouldn’t be content with technical debt that means our software only works as we expect 99% of the time. The 1% of the time it doesn’t could well be the most important time it’s ever used.