GitLab Fails on 5 out of 5 Backup Methods

Recently, GitLab.com accidentally deleted a live production database while performing routine maintenance. The Guardian reports that a sysadmin working late at night mistakenly ran the sudo rm -rf command (a Linux command that, run as superuser, forcibly and recursively deletes all files and folders in the specified folder) on a live production folder containing database files and backups.

The sysadmin was in the process of replicating data from one folder to another, and did not realize that the folder being deleted was the production folder until only 4.5GB of the original 300GB remained. With good backups in place, this would normally be a minor inconvenience with minimal data loss, but the backups are exactly where the major failure in this situation occurred.

GitLab decided to bring everything in-house last November in an attempt to give customers better performance and reliability. Unfortunately, they found out that over-complicated solutions can have disastrous results. They had 5 backup methods in place, but all 5 failed in various ways, and some of them were never configured properly in the first place.

Here are the backup methods GitLab had employed and how each one failed:

LVM snapshots of all servers, but only configured to run once every 24 hours. Thankfully, the last LVM snapshot was only 6 hours old, but as you’ll see later in this article, even just 6 hours can mean a massive loss of data.
Regular file-level backups had also been set up to run every 24 hours. However, GitLab has not been able to locate where any of these backups were stored, and it appears the backups had failed, producing files only a few bytes in size. Misconfiguration of this file-level backup, along with never verifying that the backups worked, caused this method to fail.
Database dumps had been set up, but had failed due to version conflicts between the database binaries and the dump configuration.
Disk snapshots through Azure had been set up for the file servers but not for the database servers.
Backups to Amazon S3 appear not to have been configured properly, as the S3 bucket was empty.

These failures all have two common themes: the backup methods were overly complicated and difficult to configure, and there was no method in place to verify the backups. Unfortunately, the backup industry is full of products with these same failings: overly complicated setup and configuration, along with no easy, automated way to verify a backup. Flashback from Idealstor addresses both of these issues with an easy and intuitive web-based configuration, along with a screenshot backup verification feature that actually launches your backup as a virtual machine on the backup appliance and takes a screenshot of your machine fully booted in a virtual environment. It also replicates your backup data to the cloud, so you’ve got multiple copies of your backed-up data.
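As a rough illustration of what even a basic automated check could look like, here is a minimal Python sketch that flags a backup file that is missing, suspiciously small (like the few-byte backups described above), or older than expected. The function name check_backup, the thresholds, and the example path are all hypothetical and are not taken from GitLab's setup or from any particular backup product. A check like this only catches the most obvious failures; it does not prove that a backup can actually be restored.

```python
import time
from pathlib import Path


def check_backup(path, min_bytes=1024, max_age_hours=24.0, now=None):
    """Return a list of problems with a backup file (an empty list means none found).

    min_bytes and max_age_hours are illustrative defaults, not recommendations.
    """
    backup = Path(path)
    if not backup.is_file():
        return [f"missing: {backup}"]

    problems = []
    info = backup.stat()

    # A backup of only a few bytes almost certainly means the job failed.
    if info.st_size < min_bytes:
        problems.append(
            f"too small: {info.st_size} bytes (expected at least {min_bytes})"
        )

    # A stale file suggests the scheduled job has stopped running.
    current = time.time() if now is None else now
    age_hours = (current - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(
            f"stale: {age_hours:.1f} hours old (limit {max_age_hours} hours)"
        )

    return problems


if __name__ == "__main__":
    # Hypothetical path, used only for illustration.
    for problem in check_backup("/var/backups/db/latest.dump"):
        print("BACKUP PROBLEM:", problem)
```

Run from a scheduler after each backup job, a script along these lines could alert someone the first time a backup comes out empty rather than months later. A periodic test restore is still the only real proof that a backup works.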

In the end, GitLab lost 6 hours’ worth of database data, which included 5037 total customer projects, around 5000 comments, 707 users, and any webhooks created after the last good backup, which was 6 hours old. You can read their full report in a Google Doc they created, which shows amazing transparency. Hopefully this disaster will show other businesses the pitfalls of typical backup solutions and how important a true business continuity and disaster recovery plan is. In the end, it doesn’t matter whether your data is lost to natural disaster, viruses, hacking, or simple human error: you’ll never get that downtime or data back if you’re not protected.
