Friday, April 22, 2011

Who owns my availability?

Hey, did you know EC2 had problems today? Yeah, nothing major, just a total effing collapse of the EBS system in US-EAST-1.

You know what that means....

"Hey guys, can anyone tell me who owns my availability?"
"Internet learns lesson of putting "all eggs in the EC2 basket". Buy your own machines, brothers."

I could go on....but I won't. I'm also going to stop short of posting a CeeLo video at this point.

Your stupid little comments mean nothing. I especially find it hilarious that someone from Twitter would make a comment about availability. I also find the short memories of some people hilarious (paraphrasing here):

"Thank god we're hosted on Joyent/Linode/My mom's basement"

Please. Your attempts to curry favor and free service from your provider are transparent and, frankly, make you look stupid.

Yo Netflix/SimpleGeo/JRandomDude, I'm happy for you and all. I'ma let you finish but....

So who DOES own my availability?
Here's a hint: it's not always that simple.

Yes, the ultimate responsibility lies with those who were impacted, but let's look at a few facts (or excuses, if you're being a dick about it):

Not everyone has the resources of a Netflix
Comparing anyone else's EC2 usage to Netflix's is simply absurd. It's a lot like working with an ex-Google employee (I've worked with a few). They have some awesome ideas and learned some great stuff there, but guess what? About 85% of it is USELESS to anyone except someone the size of Google. What works at Google doesn't work at my company.

It's not even a matter of scaling down the concept. It's simply NOT possible. Yeah let me just go buy a shipping container and build a datacenter in a box. Hardware failure? Replace the box with one off the shelf. Oh wait, not everyone has a warehouse of replacement servers. People have trouble getting a few spare hard drives to swap out.

Telling someone that they should just do what Netflix does makes you look stupid. Not them.

WE used Joyent/Linode/GoGrid/My mom's basement
Really? Really? I'm not being an AWS fanboy here but here is a simple fact: No other 'cloud' provider comes even REMOTELY close to the feature set of AWS. No one. Not only does no one come close but Amazon is CONSTANTLY iterating on new stuff to widen the gap even more.

It's not like your provider hasn't had a major outage in recent memory. And comparing an effing VPS provider to Amazon? You seriously just don't get it.

You should have designed around this possibility
Well, no shit, Sherlock. Guess what? It was rejected. Why? Who knows? Who cares? It's irrelevant. Sometimes the decision isn't ours to make. In the REAL world, people have to balance risk vs. reward.

Here's a tidbit of information. At EVERY single company I've been at where I was involved with architecting a solution from the ground up, we never had redundancy built in from the get-go. Did I find it appalling? Absolutely, but the choice wasn't mine. I did the best I could to prevent anything that would make adding it TOO difficult later on, but we didn't have our DR site online from day one. We sometimes had to accrue a little technical debt. The best we could do was to minimize it as much as possible.
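
For what it's worth, here's roughly what I mean by not making it TOO difficult later. This is a minimal, made-up sketch (Python, hypothetical names and hosts, not anything I actually shipped): keep the single point of failure in configuration, so bolting on a second site later is a config change instead of a hunt through every module that hard-coded one host.

import os

# Hypothetical example: endpoints come from config/environment instead of
# being hard-coded. Today the list has exactly one entry (no redundancy),
# but adding a DR site later means changing DATABASE_HOSTS, not the code.
DATABASE_HOSTS = os.environ.get(
    "DATABASE_HOSTS", "db1.us-east-1.example.com"
).split(",")

def get_database_host(attempt=0):
    """Return a database host; with more than one configured, retries rotate."""
    return DATABASE_HOSTS[attempt % len(DATABASE_HOSTS)]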

Designing around failure is not the same as designing for the worst-case scenario. Sometimes you just have to accept that "if component X has Y number of failures, we're going to have an outage". If you have the ability to deal with it now (resources/money/whatever), then that's awesome. Sometimes you just have to accept that risk.
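
To put some (completely made-up) numbers on that risk-vs-reward call, here's the kind of back-of-the-envelope math that actually drives these decisions. None of these figures come from a real company; swap in your own.

# Back-of-the-envelope risk math with invented numbers.
outage_probability_per_year = 0.10   # component X fails maybe once a decade
outage_duration_hours = 12           # and takes half a day to recover from
revenue_lost_per_hour = 500          # dollars

expected_annual_loss = (
    outage_probability_per_year * outage_duration_hours * revenue_lost_per_hour
)  # 0.10 * 12 * 500 = $600 a year

cost_of_redundancy_per_year = 40000  # second site, extra gear, extra ops time

print("Expected annual loss from accepting the risk: $%.0f" % expected_annual_loss)
print("Annual cost of designing the risk away:       $%d" % cost_of_redundancy_per_year)
# With these numbers, accepting the risk is the rational call.
# Flip them around and the redundancy pays for itself.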

Oh sure I'd love to use (insert buzzword/concurrent/distributed language of the day) here. But I can't. It would be totally awesome if everything were designed from the ground up to handle that level of failure but it's not.

And another thing
The thing that bothers me most is the two-faced attitude around it all.

On one hand people are telling you it's stupid to host your own hardware. On the other hand they'll laugh at you when your provider has an outage and tell you that you should have built your own.

On one hand they'll tell you it's stupid to use some non-traditional, new-fangled language, and on the other hand they'll laugh at you because you could have avoided all these problems if you had just used that non-traditional, new-fangled language.

On one hand they'll tell you that you should use (insert traditional RDBMS here) and on the other hand say that it's your fault for not rearchitecting your entire codebase around some NoSQL data store.

Not everyone has the same options. I hate the phrase "hindsight is 20/20". Why? Because it's all relative. Sometimes you don't know that something is the wrong choice till it bites you in the ass. Hindsight in technology is only valuable for about a year. Maybe six months. Technology moves fast. It's easy to say that someone should have used X when you don't realize they started working on things six months before X came along. If you have that kind of foresight, I'd love to hire you to play the stock market for me.

Not everyone has the luxury of switching midstream. You have to make the most of what technology is available. If you keep chasing the latest and greatest, you'll never actually accomplish anything.

Are these excuses? Absolutely, but there's nothing inherently wrong with excuses. You live and learn. So to those affected by the outage (still ongoing, mind you), take some comfort. Learn from your mistakes. The worst thing you could do at this point would be to NOT change anything. At a minimum, if you aren't the decision maker, you should document your recommendations and move on. If you are the decision maker, you need to...you know...decide if the risk of this happening again is acceptable.
