
Facebook suffers biggest outage in four years

by Pete Mason on 24 September 2010, 16:26

Tags: Facebook

Quick Link: HEXUS.net/qaz7g

Have you tried turning it off and on again?

Anyone using Facebook yesterday may have noticed some slight 'hiccups' in the service.  In fact, the social-networking site was down for a total of two and a half hours - the longest outage that it has experienced at any point in the last four years.

In a rather sheepish blog post, the company's director of software engineering, Robert Johnson, explained exactly what brought the whole site to its knees.  The problem started when the development team introduced a new system for verifying data, which was supposed to detect invalid values in cached copies and replace them with fresh values from the central data store.  So far, so good.
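
As a rough illustration of the intended behaviour, here is a minimal Python sketch of a cache that repairs itself from a central store. Every name in it (is_valid, read_config, max_connections) and the validity rule itself are assumptions made up for this example, not details from Facebook's post.

```python
# Hypothetical sketch only; none of these names come from Facebook's code.

def is_valid(value):
    """Toy validity rule: a setting must be a non-negative integer."""
    return isinstance(value, int) and value >= 0


def read_config(key, cache, store):
    """Return the cached value, repairing it from the store if it looks invalid."""
    value = cache.get(key)
    if not is_valid(value):
        value = store[key]   # one query to the central data store
        cache[key] = value   # overwrite the bad cached copy
    return value


store = {"max_connections": 100}   # persistent copy (assumed valid here)
cache = {"max_connections": -1}    # corrupted cached copy
result = read_config("max_connections", cache, store)
```

When the persistent copy is good, one query per client is enough to repair the cache, which is presumably why the design looked safe.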

Image courtesy of Channel 4

Unfortunately, things went a little haywire when the devs changed a value in the persistent copy held in the central data store.  The algorithm decided that the new value was invalid, and every single client system attempted to correct it, flooding the central database with requests.

As if that wasn't bad enough, each attempted correction returned an error that the algorithm also saw as an invalid value, causing a massive feedback loop.  Every time the error came back, the clients sent more requests to the central servers.
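
The feedback loop can be sketched with a toy simulation. This is an illustration under stated assumptions, not a reconstruction of Facebook's system: the Store class, client_tick, the client count and the validity rule are all hypothetical.

```python
# Hypothetical simulation; names and numbers are invented for illustration.

def is_valid(value):
    """Toy validity rule: a setting must be a non-negative integer."""
    return isinstance(value, int) and value >= 0


class Store:
    """Stand-in for the central database; counts how often it is queried."""

    def __init__(self, value):
        self.value = value
        self.queries = 0

    def fetch(self):
        self.queries += 1
        return self.value


def client_tick(cache, store):
    """One validation pass by one client."""
    if is_valid(cache.get("setting")):
        return
    # The flaw: whatever comes back is cached, even if it is itself invalid,
    # so the next pass will query the store again.
    cache["setting"] = store.fetch()


def simulate(store_value, clients=1000, rounds=5):
    """Return the total number of store queries over all rounds."""
    store = Store(store_value)
    caches = [{"setting": None} for _ in range(clients)]
    for _ in range(rounds):
        for cache in caches:
            client_tick(cache, store)
    return store.queries


healthy = simulate(100)   # valid persistent copy: each client asks once
broken = simulate(-1)     # invalid persistent copy: every client asks every round
```

With these made-up numbers the healthy run issues 1,000 queries in total, while the broken run issues 5,000 and would keep growing for as long as the clients keep running.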

The only way to stop the damage was to pull the plug on the entire site, shutting everything down until the problem could be found.  With the error correction system disabled, the system could safely be brought back online.

The complete site is now back up, and the developers are busy investigating new ways to implement an error-correction algorithm that doesn't cause epic system failures.
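
The article does not say what the replacement will look like. One common way to avoid this kind of stampede, shown here purely as a hedged sketch, is to treat a store error as an error rather than as a value, keep the last cached copy, and back off exponentially before retrying. StoreError, next_delay and refresh are hypothetical names, and the delays are arbitrary.

```python
# Hypothetical mitigation sketch; not Facebook's actual fix.

class StoreError(Exception):
    """Assumed error raised when the central store cannot answer."""


def is_valid(value):
    """Toy validity rule: a setting must be a non-negative integer."""
    return isinstance(value, int) and value >= 0


def next_delay(failures, base=1.0, cap=60.0):
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** failures))


def refresh(cache, fetch, failures):
    """Try one repair; return (failure_count, seconds_to_wait_before_retry)."""
    try:
        value = fetch()
    except StoreError:
        # An error is not a value: leave the cache alone and slow down.
        return failures + 1, next_delay(failures)
    if is_valid(value):
        cache["setting"] = value
        return 0, 0.0
    # The store itself holds a bad value: do not cache it, and back off.
    return failures + 1, next_delay(failures)


def failing_fetch():
    raise StoreError("store overloaded")


cache = {"setting": 7}
first = refresh(cache, failing_fetch, 0)          # store is down
second = refresh(cache, failing_fetch, first[0])  # still down, wait longer
recovered = refresh(cache, lambda: 42, second[0]) # store answers again
```

The point of the design is that a struggling store sees fewer requests over time rather than more, which is the opposite of the loop described above.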



HEXUS Forums :: 14 Comments

I fail to understand how they didn't spot that error before releasing it to the live system. It doesn't sound like a particularly subtle error… but I guess the devil is in the details.
Facebook breaks, meanwhile worker productivity rises 800% and England climbs out of the recession…
Wonder how many people contemplated killing themselves because Facebook wasn't working for so long.
One word: “ha!” Honestly, I had both sisters on the phone to me about this; I had to laugh about it.
Fraz wrote: "I fail to understand how they didn't spot that error before releasing it to the live system. It doesn't sound like a particularly subtle error… but I guess the devil is in the details."
That's easy. Have no QA process whatsoever. It's actually rather common for stupid, annoying bugs to make it into ‘production’ Failbook.