fissionable materials always seemed capable of finding a flaw in the best intentions.
– James Mahaffey, Atomic Accidents
I have just finished reading Atomic Accidents – it’s a fascinating and frightening book. Don’t read it if your sleep is easily disturbed (I mean it – I have started having dreams where critical quantities of uranium are about to come together and I cannot stop them).
This post is going to draw some analogies between nuclear engineering and software engineering. Respecting Martin Fowler’s quote about analogies (discussed here), I am going to try to extract questions from the comparison, not answers. In any case, many of the answers from nuclear engineering are unlikely to help us in the software field. For example (off-topic fact of the day), the shape of the vessel in which radioactive material is stored has a big impact on whether it will go critical or not. For safety you want as much surface area per unit volume as possible. Long, skinny tubes are safe; spheres and scaled-up tomato soup cans are not (food cans are designed to use as little tin as possible compared to the volume that they hold). Given the contents of some office coffee machines, you might want to stop using those nice cylindrical coffee mugs.
There are a number of themes that run through the book. I have picked out a few for discussion:
- An impossible situation. Something that could not possibly happen did. In line with Murphy’s Law, any container of the wrong shape will inevitably end up containing fissionable material, regardless of how many safety precautions are in place. In one accident, the container in question was a cooking pot from the kitchen that was not physically connected to anything else in the system.
- Complex systems. Complexity is mentioned often in the book, and never in a good way. Complexity kills, quite literally in the case of nuclear engineering. The more complex the system, the more difficult it is to predict how it will react.
- Things are (unknowingly) going wrong even in apparently normal operation. The system appears to be working fine; however, an excess of material is collecting in an unexpected place, or some amount of coolant is being lost. None of this matters until the moment it does.
- A lack of visibility into what is going wrong. Many of the accidents were made worse because the operators did not have a clear picture of what was happening. Or they did have a clear picture but didn’t believe it or understand it.
- Lessons are not always learned.
All of these apply in the context of software:
- An impossible situation. I have often seen an engineer (sometimes me) state that something “cannot be happening”, despite the fact that it is quite plainly happening. My rule is that I get to say “that’s impossible” once, then I move on, deal with reality and fix the problem.
- Complex systems. We have talked about complexity on this blog before. As if complexity wasn’t bad enough on its own, it makes every other problem far worse.
- Things are going wrong even in normal-looking operation. Every time I dive into a system that I believe is working perfectly I discover something happening that I didn’t expect. I might be running a debugger and see that some pointer is NULL when it shouldn’t be, or looking at a log file showing that function B is called before function A despite that being incorrect, or studying profiler output demonstrating that some function is called 1,000 times when it should only be called once. In Writing Solid Code, Steve Maguire suggests stepping through all new code in the debugger to ensure that it is doing the right thing. There is a difference between “looking like it’s working” and “actually working” (a small sketch of the kind of invariant checking that helps close that gap appears after this list).
- A lack of visibility into what is going wrong. Modern software systems are complicated and have many moving parts. Imagine a video delivery system that runs over the internet with a web front end, a database back end and a selection of different codecs used for different browsers on different devices. There are many places for things to go wrong, and too often we lack the tools to drill down to the core of the problem.
- Lessons are not always learned. I could rant about this for a very long time. At some point I probably will, but the full rant deserves a post of its own. I have seen far too many instances in this industry of willful ignorance of best practices and the accumulated knowledge of our predecessors and experts. We’re constantly re-inventing the wheel, and we’re not even making it round.
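As promised above, here is a minimal C sketch of what I mean by checking that things are actually working rather than merely looking like they are. The names (function_a, function_b, the shared buffer and the expected call count) are hypothetical, invented purely for illustration; the point is that a few assertions catch the NULL pointer, the out-of-order call and the excessive call count at the moment they happen, rather than much later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical module state, purely for illustration. */
static int  a_initialised = 0;     /* set once function_a has run            */
static int  b_call_count  = 0;     /* how many times function_b has been hit */
static int *shared_buffer = NULL;  /* set up by function_a                   */

void function_a(void)
{
    static int storage[16];
    shared_buffer = storage;       /* function_b relies on this being set */
    a_initialised = 1;
}

void function_b(void)
{
    /* Catch the "B called before A" ordering bug at the call site. */
    assert(a_initialised && "function_b called before function_a");

    /* Catch the NULL pointer before it is dereferenced. */
    assert(shared_buffer != NULL);

    b_call_count++;
    shared_buffer[0] = b_call_count;
}

int main(void)
{
    function_a();
    function_b();

    /* Catch "called 1,000 times when it should be called once". */
    assert(b_call_count == 1 && "function_b called more often than expected");

    printf("invariants held\n");
    return 0;
}
```

None of this is sophisticated, and that is rather the point: the checks cost a few lines each, and they turn “the system appears to be working fine” into something you can actually verify.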
I am quite happy that I do not work on safety critical systems, although I did once write software to control a robot that was 12′ tall, weighed 1,000 pounds and was holding a lit welding torch. I stood well back when we turned it on.
In conclusion, one more quote from the book:
A safety disk blew open, and sodium started oozing out a relief vent, hit the air in the reactor building, and made a ghastly mess.
No one was hurt. When such a complicated system is built using so many new ideas and mechanisms, there will be unexpected turns, and this was one of them. The reactor was in a double-hulled stainless steel container, and it and the entire sodium loop were encased in a domed metal building, designed to remain sealed if a 500-pound box of TNT were exploded on the main floor. It was honestly felt that Detroit was not in danger, no matter what happened.
Atomic Accidents. Read it, unless you’re of a nervous disposition.