Amazon recently added a new redundancy service to their S3 data storage service. Amazon now claims that data stored in the “durable storage” class is 99.999999999% “durable” (not to be confused with availability – more on this later).
“If you store 10,000 objects with us, on average we may lose one of them every 10 million years or so. This storage is designed in such a way that we can sustain the concurrent loss of data in two separate storage facilities.”
So how exactly does Amazon arrive at this claim? Well, reading further, they also offer a “REDUCED_REDUNDANCY” storage class (which is 33% cheaper than normal) that guarantees 99.99% durability and is “designed to sustain the loss of data in a single facility.” From this we can extrapolate that Amazon is simply storing the data in multiple physical data centers. If the chance of each one becoming unavailable (burning down, cable cut, etc.) is something like 0.01%, then storing at two data centers means a 0.000001% chance that both will fail at the same time (or on the flip side: a 99.999999% durability guarantee), three data centers give us a 0.0000000001% chance of loss (a 99.9999999999% durability guarantee) and so on. I’m not sure of the exact numbers that Amazon is using but you get the general idea: a small chance of failure, combined with multiple locations, makes for a very, very small chance of failure at all the locations at the same time.
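To make the arithmetic concrete, here is a minimal Python sketch of the "multiply the odds" reasoning. The 0.01% per-site figure is only the illustrative guess used in this post, not a number Amazon publishes, and the function name is made up for the example. It assumes every site fails completely independently of the others.

```python
from fractions import Fraction

# Assumed per-site chance of loss: 0.01%, the illustrative guess from this post.
P_SITE = Fraction(1, 10_000)

def loss_if_independent(p_site, sites):
    """Chance that every site loses the data, assuming independent failures."""
    return p_site ** sites

# Exact fractions avoid rounding noise when the numbers get this small.
for n in (1, 2, 3):
    loss = loss_if_independent(P_SITE, n)
    print(f"{n} site(s): 1 in {loss.denominator:,} chance of loss")
```

Each extra site multiplies the odds by another factor of 10,000, which is how you get to a long string of nines so quickly.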
Except there is a huge gaping hole in this logic. To expose it let’s revisit history, specifically the Hubble Space Telescope. The Hubble Space Telescope can be pointed in specific directions using six onboard gyroscopes. By adding momentum to a single gyroscope or applying the brakes to it you can cause Hubble to spin clockwise or counterclockwise around a single axis. With two of these gyroscopes you can move Hubble in three axes to point anywhere. Of course, having three sets of gyroscopes makes maneuvering it easier, and having spare gyroscopes ensures that a failure or three won’t leave you completely unable to point the Hubble at interesting things.
But what happens when the gyroscopes share a manufacturing defect, specifically the use of regular air instead of inert nitrogen when they were built? Well, having redundancy doesn’t do much, since the gyroscopes start failing in the same manner at around the same time (almost leaving Hubble useless if not for the first servicing mission).
The lesson here is that having redundant and backup systems that are identical to the primary systems may not increase the availability of the system significantly. And I’m willing to bet that Amazon’s S3 data storage facilities are near carbon copies of each other with respect to the hardware and software they use (to say nothing of configuration, access controls, authentication and so on). A single flaw in the software, for example a bug that results in a loss or mangling of data, may hit multiple sites at the same time as the bad data is propagated. Alternatively, a security flaw in the administrative end of things could let an attacker gain access to and start deleting data from the entire S3 “cloud”.
You can’t just take the chance of failure and square it for two sites if the two sites are identical. The same goes for 3, 4 or 27 sites. Oh, and also read the fine print: “durability” means the data is stored somewhere, but Amazon makes no claims about availability, that is, whether or not you can actually get at it.
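One rough way to see why is to borrow the simple "beta-factor" idea from reliability engineering: assume some fraction of each site's failure risk comes from a cause shared by every site (the same bug, the same bad configuration, the same stolen admin password). This is my own illustration, not anything Amazon describes, and both the 0.01% per-site figure and the 1% shared fraction below are made-up numbers.

```python
def loss_with_common_cause(p_site, sites, beta):
    """Chance of losing data at every site under a simple beta-factor model.

    beta is the assumed fraction of a site's failure probability that comes
    from a cause shared by all sites; the rest is treated as independent.
    """
    shared = beta * p_site                         # one event hits every site
    independent = ((1 - beta) * p_site) ** sites   # each site fails on its own
    return shared + (1 - shared) * independent

P_SITE = 1e-4   # hypothetical 0.01% per-site chance of loss
BETA = 0.01     # hypothetical: 1% of that risk is shared by all sites

for n in (2, 3, 27):
    naive = P_SITE ** n
    realistic = loss_with_common_cause(P_SITE, n, BETA)
    print(f"{n} sites: naive {naive:.1e}, with shared causes {realistic:.1e}")
```

With these numbers the shared term dominates almost immediately, so going from 3 sites to 27 barely moves the result. The exact values don't matter; the point is that the floor is set by whatever the sites have in common.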
Something to keep in mind as you move your data into the cloud.