Have you tried turning it off and on again?
Anyone using Facebook yesterday may have noticed some slight 'hiccups' in the service. In fact, the social-networking site was down for two and a half hours - its longest outage in the last four years.
In a rather sheepish blog post, the company's director of software engineering, Robert Johnson, explained exactly what brought the whole site to its knees. The problem started when the development team introduced a new system for verifying data, which was supposed to detect invalid values in cached versions of files and update them from the central data-store. So far, so good.
Unfortunately, things went a little haywire when the devs made a change to a value in the persistent copies. The algorithm decided that the change was invalid, and every single client system attempted to correct it, flooding the central database with requests.
As if that wasn't bad enough, each attempted correction returned an error that the algorithm also saw as an invalid value, creating a massive feedback loop. Every time a client received the error, it sent yet more requests to the central servers.
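For the curious, the feedback loop can be reproduced in a few lines of Python. This is only a toy sketch, not Facebook's actual code: the `ERROR` sentinel, the `is_valid` rule (non-negative integers only), the `simulate` function and the idea that an overloaded database fails every query in a round are all assumptions made for illustration.

```python
ERROR = object()  # hypothetical sentinel for a failed database query


def is_valid(value):
    """Assumed rule for this sketch: only non-negative integers are valid.

    ERROR fails this check too, which is the flaw described above.
    """
    return isinstance(value, int) and value >= 0


def simulate(num_clients, capacity, persisted_values):
    """Return the number of database queries issued in each round.

    persisted_values lists the centrally stored value for each round, so a
    correction made part-way through can be modelled.
    """
    caches = [persisted_values[0]] * num_clients
    queries_per_round = []
    for persisted in persisted_values:
        # Every client whose cached value looks invalid asks the database.
        stale = [i for i, v in enumerate(caches) if not is_valid(v)]
        # Assumption: an overloaded database fails every query in that round.
        overloaded = len(stale) > capacity
        for i in stale:
            caches[i] = ERROR if overloaded else persisted
        queries_per_round.append(len(stale))
    return queries_per_round


# Bad value in round one, corrected afterwards, but the database is swamped:
print(simulate(5, 2, [-1, 7, 7]))   # [5, 5, 5]
# Same correction with enough database capacity: the clients recover.
print(simulate(5, 10, [-1, 7, 7]))  # [5, 5, 0]
```

In the first run the corrected value never reaches the clients, because every query fails and every failure counts as another invalid value. Short of adding capacity, the loop only ends when the clients stop asking.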
The only way to stop the damage was to pull the plug on the entire site, shutting everything down until the problem could be found. With the error-correction system disabled, the site could safely be brought back online.
The complete site is now back up, and the developers are busy investigating new ways to implement an error-correction algorithm that doesn't cause epic system failures.