Kristian Glass - Do I Smell Burning?

Mostly technical things

GitLab - An Amazing Recovery

At around 23:30 on Tuesday 31st January 2017, a GitLab engineer deleted the data directory of their primary database instance.

What followed was one of the most open and brilliant incident responses I’ve ever seen; an inspirational example that many should aspire to.

There’s no denying that what happened was bad. Six hours of data loss, including around 700 users and 4500 projects. “5 backup/replication techniques deployed” and “none are working reliably or set up in the first place”.

How many people can honestly say they haven’t been there, though, or at least had some very close calls? It’s easy to talk about the value of backups, the value of testing your backups, and most importantly the value of testing your restores. Actually doing so is much harder. I knew a company that diligently backed up their data, tested those backups by periodically restoring to a secondary system, and shipped them to a secondary location for resilience and safety (after encrypting them, due to the PII contained within). It was only after a catastrophic datacenter incident that they discovered they didn’t have the key to decrypt their offsite backups…
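To make "testing your restores" a little more concrete, here is a minimal sketch of a restore drill in Python. It is only an illustration under some loud assumptions: the function names `checksums` and `restore_drill` are hypothetical, the "backup" is a plain gzipped tarball rather than a database dump, and there is no encryption step at all. The idea it shows is simply that a backup only counts once it has been restored somewhere fresh and compared against the original.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_drill(source: Path, archive: Path) -> bool:
    """Back up `source`, restore the archive into a scratch directory,
    and report whether the restored files match the originals.

    Assumption: `archive` lives outside `source`, so the backup does not
    try to include itself.
    """
    # Step 1: take the backup.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")

    # Step 2: restore it somewhere that has never seen the original data.
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)

        # Step 3: compare file names and contents, not just "did it extract".
        return checksums(source) == checksums(Path(scratch))
```

In a real drill the restore step would presumably read the offsite copy, on a machine that holds only what would survive losing the primary site, decryption keys included. That is the part the company above never exercised.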

These events so rarely have one clear root cause, but come from a series of small failures combining to cause catastrophe; the tired individual undertakes a manual and exceptional process and makes a mistake that gets magnified by ongoing, long-term, silent background errors. GitLab’s downtime and data loss are, in that sense, educational but unremarkable; one in a long series of cautionary tales and horror stories.

What was truly remarkable and commendable was the reaction.

They published their live incident notes. They tweeted and blogged about it. They live-streamed the recovery process. Finally, they published an excellent postmortem writeup.

Honesty and openness about mistakes and failures are as invaluable as they are hard. For an individual to admit their error and its scope to their colleagues is hard, and requires a certain amount of bravery that should be commended, however open, encouraging, and safe the environment may be. For a company to do so to the world, to customers, partners, and investors both present and potential, is truly astounding.

That’s a fantastic commitment to transparency all the way through the company. Well done GitLab.
