Amazon AWS – 11 9’s of reliability?

Amazon recently added a new redundancy service to their S3 data storage service. Amazon now claims that data stored in the “durable storage” class is 99.999999999% “durable” (not to be confused with availability – more on this later).

“If you store 10,000 objects with us, on average we may lose one of them every 10 million years or so. This storage is designed in such a way that we can sustain the concurrent loss of data in two separate storage facilities.” –Jef;
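
As a quick sanity check on that quote, here is a minimal sketch of the arithmetic. It assumes the 11 nines figure is a per-object, per-year durability; the quote itself does not spell that out, and the variable names are just for illustration.

```python
from fractions import Fraction

# Assumption: 99.999999999% is a per-object, per-year durability figure.
annual_loss_probability = Fraction(1, 10**11)  # 1 - 0.99999999999
objects_stored = 10_000

# Expected number of objects lost per year across everything stored.
expected_losses_per_year = objects_stored * annual_loss_probability

# Average number of years between single-object losses.
years_per_lost_object = 1 / expected_losses_per_year

print(years_per_lost_object)  # 10000000
```

Under that assumption, the "one object every 10 million years" figure is simply the 11 nines restated for 10,000 objects.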

So how exactly does Amazon arrive at this claim? Well, reading further, they also offer a “REDUCED_REDUNDANCY” storage class (which is 33% cheaper than normal) that guarantees 99.99% and is “designed to sustain the loss of data in a single facility.” From this we can extrapolate that Amazon is simply storing the data in multiple physical data centers. If the chance of each one becoming unavailable (burning down, cable cut, etc.) is something like 0.01%, then storing at two data centers means a 0.000001% chance that both will fail at the same time (or on the flip side: a 99.999999% durability guarantee), three data centers give us a 0.0000000001% chance of loss (a 99.9999999999% durability guarantee), and so on. I’m not sure of the exact numbers that Amazon is using, but you get the general idea: a small chance of failure, combined with multiple locations, makes for a very, very small chance of failure at all the locations at the same time.
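
The multiplication behind that reasoning can be sketched in a few lines. The 0.01% per-site figure is my guess from above, not a number Amazon has published, and `loss_if_independent` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumption: each site independently loses the data with probability 0.01%.
PER_SITE_LOSS = Fraction(1, 10_000)

def loss_if_independent(per_site_loss, sites):
    """Chance that every copy is lost, if site failures are independent."""
    return per_site_loss ** sites

for sites in (1, 2, 3):
    loss = loss_if_independent(PER_SITE_LOSS, sites)
    # The numerator is always 1 here, so the denominator gives the odds.
    print(f"{sites} site(s): 1 in {loss.denominator} chance of losing every copy")
```

Each extra site multiplies the odds by another factor of 10,000, which is how a modest per-site number turns into a long string of nines. Note that the whole calculation rests on the word "independent".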

Except there is a huge, gaping hole in this logic. To expose it, let’s revisit history, specifically the Hubble Space Telescope. The Hubble Space Telescope can be pointed in specific directions using six onboard gyroscopes. By adding momentum to a single gyroscope or applying the brakes to it, you can cause Hubble to spin clockwise or counterclockwise about a single axis. With two of these gyroscopes you can move Hubble in three axes to point anywhere. Of course, having three sets of gyroscopes makes maneuvering it easier, and having spare gyroscopes ensures that a failure or three won’t leave you completely unable to point the Hubble at interesting things.

But what happens when you have a manufacturing defect in the gyroscopes, specifically the use of regular air instead of inert nitrogen during manufacturing? Well, having redundancy doesn’t do much, since the gyroscopes start failing in the same manner at around the same time (which would have left Hubble almost useless if not for the first servicing mission).

The lesson here is that having redundant and backup systems that are identical to the primary systems may not increase the availability of the system significantly. And I’m willing to bet that Amazon’s S3 data storage facilities are near carbon copies of each other with respect to the hardware and software they use (to say nothing of configuration, access controls, authentication and so on). A single flaw in the software, for example a software-related issue that results in the loss or mangling of data, may hit multiple sites at the same time as the bad data is propagated. Alternatively, a security flaw in the administrative end of things could let an attacker gain access to, and start deleting data from, the entire S3 “cloud”.

You can’t just take the chance of failure and square it for two sites if the two sites are identical. The same goes for 3, 4 or 27 sites. Oh, and also read the fine print: “durability” means the data is stored somewhere, but Amazon makes no claims about availability or whether or not you can get at it.
Something to keep in mind as you move your data into the cloud.
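
To put rough numbers on the identical-sites problem, here is a minimal sketch of a common-cause failure model. Both probabilities are made-up illustrative values, and the model (one shared flaw that takes out every site at once, plus independent per-site failures) is my simplification, not anything Amazon has described.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
PER_SITE_LOSS = Fraction(1, 10_000)        # independent failure of one site
COMMON_CAUSE_LOSS = Fraction(1, 1_000_000)  # shared bug or breach hitting all sites

def loss_with_common_cause(per_site_loss, common_cause, sites):
    """Data is lost if the shared flaw strikes, or if it does not
    and every site happens to fail independently."""
    return common_cause + (1 - common_cause) * per_site_loss ** sites

for sites in (1, 2, 3, 27):
    loss = loss_with_common_cause(PER_SITE_LOSS, COMMON_CAUSE_LOSS, sites)
    print(f"{sites:2d} site(s): chance of loss = {float(loss):.3e}")
```

With these inputs, going from one site to two helps a lot, but after that the result barely moves: it sits just above the common-cause probability no matter how many identical sites you add.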

2 thoughts on “Amazon AWS – 11 9’s of reliability?”

  1. Availability of S3 is described in the SLA, but you only get a minor refund if you can’t (temporarily) get to your data. It would be good if one could get an actual contract where AWS actually pays you if they lose data or access.

    BTW: your article would be more helpful if we actually knew what Amazon is doing. I bet they have a multi-tier system in place, so single software failures are less likely. I don’t know if different sites use different methods.

    BTW2: you should also mention replication and scrubbing, since if you do not actively detect missing data at one site and actively re-replicate missing files, your 3 sites don’t help much. AFAIK Amazon is claiming to do both of the latter. Detection seems to be based on access misses only. But I am guessing here, too. I guess we should ask Werner 🙂

  2. Guys, Amazon is not trying to be accurate in their prediction; it’s their marketing line. Of course, hosting/storing your data with a single provider is always risky in one way or another. But I’m not really buying into that Hubble manufacturing defect analogy either; I’m pretty sure their drives die all the time, and are replaced all the time. It’s almost impossible for all of them to fail at the same time, or it takes some wild imagination to come up with a plausible scenario in which that could happen (some year-2k-like failure).
