Kristian Glass - Do I Smell Burning?

Mostly technical things

There Is No Root Cause

One of my biggest pet peeves, when it comes to incident management and response, is the term “the root cause”.

At best it feels naïve, at worst downright harmful - this notion that there is a singular “root cause” that we can just dig down to, and if we find that, then we’ve cracked it.

Incidents, even the simplest, tend to come from a complex web of context and considerations.

There may be a singular triggering event - the one thing that made it all come crashing down - but to call that a “root cause”, like the thinking stops there, is to ignore all the many potential contributing factors, big or small, that made it such a problem.

Consider a hypothetical warehouse fire. Maybe the triggering event was a discarded cigarette by the back door. But maybe a great many factors contributed to that ending up as a raging inferno rather than just some minor littering:

  • Maybe it’s a common nighttime hangout spot for local teenagers to smoke - forced out of other places by poor urban planning. So rather than a secure and clear area, there are people and litter, all in a place not really designed for it.
  • Maybe the firefighters took longer to turn up than ideal - maybe because of budget cuts, or poor rota design, or a local traffic issue.
  • Maybe your warehouse inventory was stored in such a way that flammables were kept along the back wall, instead of more centrally where they would have been more isolated. Maybe this was a systemic failure of your storage system, maybe this was just a temporary situation because of recent growth in your organisation.

Yes, look hard at that discarded cigarette - without it you might not be looking at a financial and logistical nightmare. But to call it a “root cause”, like it all stops there, like the whole situation doesn’t emerge from a complex web of factors, some situational but some systemic, some within your control but some totally beyond it - that feels like unhelpfully short-sighted thinking.

Incidents are learning opportunities. If you stop your exploration at some “root cause”, you risk throwing away lots of potential value. Inevitably there will be opportunities to improve across the board. They might not always be worth actively doing anything about, but they are still usually worth recording.

What Instead?

Stop looking for a “root cause”, start looking at:

  1. The Triggering Event - what happened to make things go “bang”?
  2. As many Contributing Factors as you can - what made things worse than they could have been?

I generally find that the Triggering Event is, though worth noting, often the least interesting part. A component failed, a mistake was made, whatever. Components fail, people make mistakes - and a healthy system should cope. But you’re probably doing this exercise because it didn’t. It’s worth knowing that a discarded cigarette started the fire, but could you necessarily have done much about it, and could any one of a number of similar things have led to the same outcome?

The Contributing Factors are where I find all the really interesting things turn up. There’s often a point of diminishing returns, but sometimes even small-seeming things can be valuable to think about. Maybe the fire was made worse by flammable material being temporarily stored nearby because of a lack of space due to recent growth - in isolation, seemingly simple misfortune, but potentially part of a pattern of issues forcing a rethink of the growth strategy, or at least a cost worth considering.

You may look at a Contributing Factor and decide it’s not a priority to address, and that’s fine! But it’s still valuable to record it, and consider mitigations and/or improvements.
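
If it helps to make that concrete, here is a minimal sketch of how an incident write-up might be structured so that it has one Triggering Event, many Contributing Factors, and no slot for a “root cause”. All the names here (`IncidentReview`, `ContributingFactor`, `Decision`, `to_address`) are hypothetical, made up for illustration rather than taken from any particular tool, and the data is the warehouse fire from earlier.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    """What we chose to do about a factor (hypothetical categories)."""
    ADDRESS = "address"
    RECORD_ONLY = "record only"


@dataclass
class ContributingFactor:
    description: str
    within_our_control: bool
    # Default to recording only: a factor is worth writing down
    # even if nobody is going to act on it.
    decision: Decision = Decision.RECORD_ONLY
    possible_mitigations: list[str] = field(default_factory=list)


@dataclass
class IncidentReview:
    summary: str
    triggering_event: str
    contributing_factors: list[ContributingFactor] = field(default_factory=list)

    def to_address(self) -> list[ContributingFactor]:
        """Return the factors we decided to actively do something about."""
        return [
            factor
            for factor in self.contributing_factors
            if factor.decision is Decision.ADDRESS
        ]


review = IncidentReview(
    summary="Warehouse fire",
    triggering_event="Discarded cigarette by the back door",
    contributing_factors=[
        ContributingFactor(
            description="Back door is a nighttime hangout spot",
            within_our_control=False,
        ),
        ContributingFactor(
            description="Firefighters took longer to arrive than ideal",
            within_our_control=False,
        ),
        ContributingFactor(
            description="Flammables stored along the back wall",
            within_our_control=True,
            decision=Decision.ADDRESS,
            possible_mitigations=["Move flammables to a more isolated area"],
        ),
    ],
)

for factor in review.to_address():
    print(factor.description)
```

The design choice worth noticing is the default: every factor gets recorded, and choosing to act on one is a separate, explicit decision.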
