By Wing Ko
I came across this “Stress tests rain on Amazon’s cloud” article from itnews for Australian Business about a week ago. A team of researchers in Australia spent 7 months stress testing Amazon’s EC2, Google’s AppEngine and Microsoft’s Azure cloud computing services, and found that these cloud providers suffered from regular performance and availability issues.
The researchers released more data yesterday – http://www.itnews.com.au/News/153819,more-data-released-on-cloud-stress-tests.aspx. It turns out Google’s AppEngine problem was “by design”: no single processing task can last more than 30 seconds, to prevent denial-of-service attacks on AppEngine. It would be nice to warn customers ahead of time, but it is nevertheless a reasonable security feature.
The reason for Amazon’s problem was not so reasonable: a power and back-up generator failure. It’s kind of hard to believe that, for a company as sophisticated as Amazon, a simple power failure could cause outages and performance degradation. Or is it?
I was personally involved in 3 major data center outages due to “simple” power problems. Obviously, no names will be associated with these incidents, to protect the innocent, blah, blah, blah …
We had just launched a new state-of-the-art data center, and it had been in use for less than 6 months. A summer power outage knocked out half of the data center for less than an hour, but it took us about 2 days to restore services to all the customers because some high-end equipment was fried, disks crashed, etc. – you know the deal.
Initially everyone was puzzled: why was half of the data center out of power when we had separate power sources from 2 utility companies, battery banks, and diesel generators with 2 separate diesel refilling companies for our brand-new data center? We should have been able to stay up for as long as we needed, even without any outside power sources. Well, the post-mortem revealed that the electricians hadn’t connected one set of the PDUs to the power systems, which is why every other rack was out of power. We were in such a hurry to light up that center that we didn’t test everything. Since all systems were fed with dual power through multiple levels, we couldn’t tell that half the systems weren’t fully powered. When we tested the power, we happened to test the half that worked.
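The trap in that story is that a dual-fed rack looks healthy as long as either feed is live. A minimal sketch of the check we should have run, assuming a hypothetical inventory where each rack reports its A-side and B-side feed status separately (the `Rack` class, field names, and rack names are all made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    feed_a_live: bool  # hypothetical reading from the A-side PDU
    feed_b_live: bool  # hypothetical reading from the B-side PDU

    @property
    def is_up(self) -> bool:
        # A dual-fed rack stays up as long as either feed is live.
        return self.feed_a_live or self.feed_b_live


def racks_missing_redundancy(racks):
    """Return names of racks that are up but running on a single feed."""
    return [
        r.name
        for r in racks
        if r.is_up and not (r.feed_a_live and r.feed_b_live)
    ]


# Illustrative data only: every other rack has no B-side feed connected.
racks = [
    Rack("rack-01", True, True),
    Rack("rack-02", True, False),
    Rack("rack-03", True, True),
    Rack("rack-04", True, False),
]

print(all(r.is_up for r in racks))      # every rack looks fine from the outside
print(racks_missing_redundancy(racks))  # but these have no redundancy left
```

Checking "is everything up?" passes here, which is roughly what happened to us. Only the per-feed check exposes the racks that will go dark when the one working feed fails.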
Another summer storm came through around 11PM and knocked out power to a slightly older data center. Somehow it blew the main circuit to the water pumps. The good news was that the backup power worked and all the systems were up and running. The bad news was that the A/C systems depended on the cool water, so no cool water, no A/C. Well, we had mainframes, mainframe-class UNIX servers, enterprise-class Windows servers, SANs, DASes, and many more power-hungry heat monsters in that data center. Only a few night-shift people were there, and they didn’t know much, but they did follow the escalation process and called the data center and operations managers. I got the call and immediately instructed my staff to log in remotely and shut down the systems while I drove in. On a normal day, I hated to stay on that data center floor for more than 30 minutes because it was so cold. It took me about 20 minutes to get there, and boy, our 25-foot-high, 150,000-square-foot data center had reached over 90 degrees. Some equipment initiated thermal shutdowns on its own, but some simply overheated and crashed. That outage caused several million dollars in damage on equipment alone.
This time there were no summer storms, just a bird. A bird was in the back room and somehow decided to end its life by plunging into the power relay. Again, normally all these systems are redundant, so it should have been fine. Unfortunately, as luck would have it, a relay had gone bad earlier that week, and the data center manager hadn’t bothered to rush a repair. Well, you probably know the rest – no power except emergency lights in the data center – $$$.
I don’t know what caused the power outage in the case of Amazon, but the moral of this long story is: pay special attention to your power systems. Test, retest, and triple-test your systems with different scenarios.
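One way to make "different scenarios" concrete is to list the failure combinations that would take the site down, rather than testing one failure at a time. The sketch below is only an illustration: the power paths and component names are invented, not taken from any of the data centers above, and a real site would have a far larger model.

```python
from itertools import combinations

# Hypothetical model: the site has power if at least one path has
# every one of its components working. Names are made up.
POWER_PATHS = {
    "utility_a": {"utility_a_feed", "relay_1"},
    "utility_b": {"utility_b_feed", "relay_2"},
    "generator": {"diesel_generator", "relay_2"},
}


def has_power(failed):
    """True if some path contains none of the failed components."""
    return any(not (parts & failed) for parts in POWER_PATHS.values())


def outage_scenarios(max_failures):
    """List every combination of up to max_failures failed components
    that leaves the site without power."""
    components = sorted(set().union(*POWER_PATHS.values()))
    scenarios = []
    for n in range(1, max_failures + 1):
        for combo in combinations(components, n):
            if not has_power(set(combo)):
                scenarios.append(combo)
    return scenarios


print(outage_scenarios(1))  # single failures that cause an outage
print(outage_scenarios(2))  # single or double failures that cause an outage
```

In this toy model no single failure causes an outage, but some pairs do, which is the bird story in miniature: one relay already down, then a second failure. A list like this tells you which unrepaired faults deserve a rushed repair and which combinations to rehearse.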