The GitLab system crash and what we can learn from it
Last night, GitLab’s hosted service (gitlab.com) suffered a database crash and the service went down for a day[1].
I’m not going to discuss the technicalities of the downtime (which are covered extensively in the blog post linked above), except to note that “shit happens” – my main takeaways from it are basically two:
- Don’t let tired people handle critical system issues, and if you ever find yourself juggling the third production issue at midnight after a full day of work – just say: “no, I’m not going to fix this – someone else must step in or we leave the system down until tomorrow”.
- The GitLab process for handling the failure was nothing short of amazing, and they deserve all the kudos for that: After figuring out how deep in shit they were, and posting a “sorry we’re down” page on the main website, they:
- Posted frequent updates on their Twitter status account (to the point of posting restore progress percentages)
- Published the technical root cause analysis as a Google Document while it was still being written
- Streamed the operations team’s ongoing restoration work on a live YouTube channel (!!). The operations people even responded live to comments in the chat, discussed the process and their architecture, and were generally awesome.
I think this should be the standard for handling system crashes in public-facing applications from now on – 1000% transparency is how these things should be handled if you have any hope of recovering the community’s trust in your service.
[1] At the time of writing, the service is still not back up, but it’s not even been 24 hours since the crash happened.