Microsoft States the Reason Behind September 8 Outage

A couple of weeks ago, on a Thursday, people across social networks suddenly started complaining that Windows Live services such as Mail and SkyDrive weren’t working for them.

Until now, Microsoft had made no statement about the sudden outage, but yesterday it posted the reason behind the September 8 outage on the WindowsTeamBlog: a DNS issue that disrupted the services for users. Microsoft also stated that no customer data was compromised or lost during the outage.

Here’s what happened, in Microsoft’s words:

A tool that helps balance network traffic was being updated and the update did not work correctly. As a result, configuration settings were corrupted, which caused a service disruption.

At 10:23 PM PDT we began to see service restoration. We confirmed that the incident was resolved by 11:35 PM PDT, although it took some time for the changes to replicate around the world and reach all our customers.

We determined the cause to be a corrupted file in Microsoft’s DNS service.  The file corruption was a result of two rare conditions occurring at the same time.  The first condition is related to how the load balancing devices in the DNS service respond to a malformed input string (i.e., the software was unable to parse an incorrectly constructed line in the configuration file). The second condition was related to how the configuration is synchronized across the DNS service to ensure all client requests return the same response regardless of the connection location of the client.  Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service.

After restoring service, we have identified two streams of work to drive specific service improvements around monitoring, problem identification, and recovery.  Along with these service improvements, Microsoft is focused on further hardening the DNS service to improve its overall redundancy and fail-over capability.
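The first condition in the statement, a malformed line that the software could not parse, is the kind of problem that is usually caught by validating a configuration file before it is synchronized anywhere. Microsoft has not published its file format or tooling, so the sketch below is purely illustrative: the record layout (hostname, TTL, IPv4 address) and the function names `validate` and `publish` are assumptions made up for this example.

```python
import re

# Hypothetical record format, assumed for illustration only:
#   "<hostname> <ttl-seconds> <IPv4 address>"
RECORD = re.compile(
    r"^(?P<host>[A-Za-z0-9.-]+)\s+(?P<ttl>\d+)\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})$"
)

def validate(lines):
    """Return (records, errors). Malformed lines are reported, never silently skipped."""
    records, errors = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # ignore blank lines and comments
        match = RECORD.match(text)
        if match is None:
            errors.append((number, text))
            continue
        records.append((match["host"], int(match["ttl"]), match["ip"]))
    return records, errors

def publish(lines, current):
    """Return the new records only if every line parsed; otherwise keep the current ones."""
    records, errors = validate(lines)
    return current if errors else records
```

The design choice here is to treat the file as all-or-nothing: a single bad line keeps the last known good configuration in place instead of letting a partially parsed file replicate to every location.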

 

Microsoft says it is working on an additional recovery process that adds the ability to fail over to restore service when such a scenario occurs and then fail back once the DNS service is restored. It is also working on its recovery tools.
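To make the fail-over and fail-back idea concrete, here is a minimal sketch of how such a switch might behave. It is not Microsoft's implementation; the class name `FailoverSelector`, the `recover_after` parameter, and the rule of waiting for several healthy checks before failing back are all assumptions chosen for illustration.

```python
class FailoverSelector:
    """Pick between a primary and a backup target based on health reports (hypothetical)."""

    def __init__(self, primary, backup, recover_after=3):
        self.primary = primary
        self.backup = backup
        self.recover_after = recover_after  # healthy checks required before failing back
        self.active = primary
        self._healthy_streak = 0

    def report(self, primary_healthy):
        """Record one health check of the primary and return the target to use."""
        if not primary_healthy:
            # Fail over immediately and reset the recovery counter.
            self._healthy_streak = 0
            self.active = self.backup
        elif self.active == self.backup:
            # Primary looks healthy again; fail back only after a stable streak.
            self._healthy_streak += 1
            if self._healthy_streak >= self.recover_after:
                self.active = self.primary
                self._healthy_streak = 0
        return self.active
```

With `recover_after=2`, a failed check switches traffic to the backup at once, and the primary is used again only after two consecutive healthy checks, which avoids flapping between the two while the primary is still recovering.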