Multi-tenancy and bad landlords

So there’s been a lot of discussion about multi-tenancy recently and what it means for cloud providers and users. To put it simply: multi-tenancy is highly desirable to providers because they can offer a service or platform (such as WordPress) and cram a kajillion users into it without having to constantly customize it, modify it, or otherwise do much work to sell it individually. The reality is that whether or not users like multi-tenancy, providers love it, so it’s here to stay.

So what happens when you have a bad, or just unlucky, landlord? In the last few months, WordPress.com has had a number of outages:

What Happened: We are still gathering details, but it appears an unscheduled change to a core router by one of our datacenter providers messed up our network in a way we haven’t experienced before, and broke the site. It also broke all the mechanisms for failover between our locations in San Antonio and Chicago. All of your data was safe and secure, we just couldn’t serve it.

And more recently:

If you tried to access TechCrunch any time in the last hour or so, you probably noticed that it wasn’t working at all. Instead, you were greeted by the overly cheery notice “WordPress.com will be back in a minute!” Had we written that message ourselves, there would have been significantly more profanity.

So what can we do to support this leg (availability) of the A-I-C (availability, integrity, confidentiality) triad of information security?

I honestly don’t know. It’s such a service- and provider-specific issue (Do they control DNS, or do you? Can you redirect to another provider offering the same service who has a recent copy of your data? If you do, can you then export any updates, orders, and so on back to your original provider when they come back?) that pretty much any answer you’ll get is useless unless it’s tailored specifically to that provider or service.
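There’s no one-size-fits-all answer, but for the case where you do control your own DNS, at least the failover *decision* can be automated. The sketch below is a minimal illustration under that assumption: the health-check URL and standby IP are hypothetical placeholders, and the actual record update (which depends entirely on your DNS host’s API) is deliberately left as a print statement.

```python
# Hypothetical sketch of the "you control DNS" case: poll the primary
# provider's health endpoint and decide whether DNS should be repointed
# at a standby copy of the service. The URL and IP below are made-up
# placeholders (203.0.113.0/24 is a documentation-only address range).

import urllib.request
from typing import Optional

PRIMARY_HEALTH_URL = "https://primary.example.com/health"  # hypothetical endpoint
SECONDARY_IP = "203.0.113.10"                              # hypothetical standby


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, DNS failure, timeout, non-2xx, etc.
        return False


def choose_target(primary_ok: bool, standby_ip: str) -> Optional[str]:
    """Return the IP that DNS should point at, or None to leave DNS alone."""
    return None if primary_ok else standby_ip


# One iteration of a polling loop:
target = choose_target(is_healthy(PRIMARY_HEALTH_URL), SECONDARY_IP)
if target is not None:
    # In practice: call your registrar's or DNS host's update API here.
    print(f"primary down; repoint the A record to {target}")
```

Even this only helps if the record’s TTL is short enough that resolvers pick up the change quickly, and it says nothing about the harder problem of syncing data back to the original provider afterward.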

If you have an answer to this, please post it in the comments.

2 thoughts on “Multi-tenancy and bad landlords”

  1. So, I don’t see this being a problem with multi-tenancy per se; definitely more of a “bad landlord” situation. Multi-tenancy may have exacerbated and brought to a larger audience the problems caused by the outage, but the cause of the outage itself had more to do with this “unscheduled change to a core router.”
    As with any computing environment, poor change control will always cause troubles. The number of people those troubles affect may be astronomically higher in a distributed cloud environment, but those troubles are not elementally any different than the ones that occur on single-user systems.
    As such, any recommendations for improving the A-leg, as it were, are the same recommendations for “good” systems development in general: don’t make changes to production until they’ve been thoroughly vetted through architecture, development, and testing environments, and then make sure all potential changes are communicated to your users well ahead of time so they can plan accordingly.
    Someone in operations should have a lot of ’splainin’ to do.

  2. I think the root of this is the size relationship between a provider and its clients. With cloud computing we will most likely end up with extremely large providers (Google, Microsoft, etc. are all quite large already, and chances are they will only get larger) that dwarf most customers. Even large customers (e.g., the Fortune 100) won’t be as important as they were in the past (WordPress currently hosts something like 16 million blogs; you’d have to have a pretty amazing blog for them to care deeply about retaining you).
