The GitLab system crash and what we can learn from it

Last night, GitLab’s hosted service (gitlab.com) suffered a database crash and the service went down for a day(1).

I’m not going to discuss the technical details of the downtime (they are covered extensively in the blog post linked above), except to note that “shit happens”. My main takeaways from this are basically two:

  1. Don’t let tired people handle critical system issues. If you ever find yourself juggling the third production issue at midnight after a full day of work, just say: “no, I’m not going to fix this – someone else must step in, or we leave the system down until tomorrow”.
  2. The GitLab process for handling the failure was nothing short of amazing, and they deserve all the kudos for that: after figuring out how deep in shit they were, and posting a “sorry we’re down” page on the main web site, they handled the entire recovery in the open – publicly shared, continuously updated recovery notes, a live stream of the restore work, and constant status updates throughout.

I think this should be the standard for handling crashes of public-facing applications from now on – 1000% transparency is how these things should be handled if you have any hope of recovering the community’s trust in your service.


  1. At the time of writing the service is still not back up, but it’s not yet even 24 hours since the crash happened.
