Friday, April 22, 2011

Who owns my availability?

Hey did you know EC2 had problems today? Yeah nothing major just a total effing collapse of the EBS system at US-EAST-1.

You know what that means....

"Hey guys, can anyone tell me who owns my availability?"
"Internet learns lesson of putting "all eggs in the EC2 basket". Buy your own machines, brothers."

I could go on....but I won't. I'm also going to stop short of posting a CeeLo video at this point.

Your stupid little comments mean nothing. I especially find it hilarious that someone from Twitter would make a comment about availability. I also find the short memories of some people hilarious (paraphrasing here):

"Thank god we're hosted on Joyent/Linode/My mom's basement"

Please. Your attempts to curry favor and free service from your provider are transparent and, frankly, make you look stupid.

Yo Netflix/SimpleGeo/JRandomDude I'm happy for you and all. I'ma let you finish but....

So who DOES own my availability?
Here's a hint; it's not always that simple.

Yes, the ultimate responsibility for those impacted lies with those who were impacted but let's look at a few facts (or excuses - if you're being a dick about it):

Not everyone has the resources of a Netflix
Comparing anyone else's EC2 usage to Netflix is simply retarded. It's a lot like working with an ex-Google employee (I've worked with a few). They have some awesome ideas and learned some great stuff there but guess what? About 85% of it is USELESS to anyone except someone the size of Google. What works at Google doesn't work at my company.

It's not even a matter of scaling down the concept. It's simply NOT possible. Yeah let me just go buy a shipping container and build a datacenter in a box. Hardware failure? Replace the box with one off the shelf. Oh wait, not everyone has a warehouse of replacement servers. People have trouble getting a few spare hard drives to swap out.

Telling someone that they should just do what Netflix does makes you look stupid. Not them.

WE used Joyent/Linode/GoGrid/My mom's basement
Really? Really? I'm not being an AWS fanboy here but here is a simple fact: No other 'cloud' provider comes even REMOTELY close to the feature set of AWS. No one. Not only does no one come close but Amazon is CONSTANTLY iterating on new stuff to widen the gap even more.

It's not like your provider hasn't had a major outage in recent memory. And comparing an effing VPS provider to Amazon? You seriously just don't get it.

You should have designed around this possibility
Well no shit, sherlock. Guess what, it was rejected. Why? Who knows? Who cares? It's irrelevant. Sometimes the decision isn't ours to make. In the REAL world, people have to balance risk vs. reward.

Here's a tidbit of information. At EVERY single company I've been at where I was involved with architecting a solution from the ground up, we never had redundancy built in from the get-go. Did I find it appalling? Absolutely, but the choice wasn't mine. I did the best I could to prevent anything that would make adding it TOO difficult later on but we didn't have our DR site online from day one. We sometimes had to accrue a little technical debt. The best we could do was to minimize it as much as possible.

Designing around failure is not the same as designing for the worst-case scenario. Sometimes you just have to accept that "if component X has Y number of failures, we're going to have an outage". If you have the ability to deal with it now (resources/money/whatever), then that's awesome. Sometimes you just have to accept that risk.

Oh sure I'd love to use (insert buzzword/concurrent/distributed language of the day) here. But I can't. It would be totally awesome if everything were designed from the ground up to handle that level of failure but it's not.

And another thing
The thing that bothers me most is the two-faced attitude around it all.

On one hand people are telling you it's stupid to host your own hardware. On the other hand they'll laugh at you when your provider has an outage and tell you that you should have built your own.

On one hand they'll tell you it's stupid to use some non-traditional new-fangled language and on the other hand laugh at you when you could have avoided all these problems if you had just used non-traditional new-fangled language.

On one hand they'll tell you that you should use insert-traditional-RDBMS here and on the other hand say that it's your fault for not rearchitecting your entire codebase around some NoSQL data store.

Not everyone has the same options. I hate the phrase "hindsight is 20/20". Why? Because it's all relative. Sometimes you don't know that something is the wrong choice till it bites you in the ass. Hindsight in technology is only valuable for about a year. Maybe 6 months. Technology moves fast. It's easy to say that someone should have used X when you don't realize that they started working on things six months before X came along. If you have that kind of foresight, I'd love to hire you to play the stock market for me.

Not everyone has the luxury of switching midstream. You have to make the most of what technology is available. If you keep chasing the latest and greatest, you'll never actually accomplish anything.

Are these excuses? Absolutely, but there's nothing inherently wrong with excuses. You live and learn. So to those affected by the outage (still ongoing, mind you), take some comfort. Learn from your mistakes. The worst thing you could do at this point would be to NOT change anything. At a minimum, if you aren't the decision maker, you should document your recommendations and move on. If you are the decision maker, you need to...you know...decide if the risk of this happening again is acceptable.

Friday, April 15, 2011

Sinatra, Noah and CloudFoundry - the dirty details

So via some magical digital god, my signup for Cloud Foundry got processed. Obviously my first thought was to try and get Noah up and running. Cloud Foundry is a perfect fit for Noah because I have access to Redis natively. I have a working setup now but it took a little bit of effort.

Getting set up
As with everything these days, my first action was to create a gemset. I'll not bore you with that process but for the sake of this walkthrough, let's use a 1.9.2 gemset called 'cfdev'.

The VMC getting started guide has most of the information you'll need but I'm going to duplicate some of it here for completeness:


 gem install vmc
 vmc target api.cloudfoundry.com
 vmc login


And we're ready to rock. The VMC command line help is very good with the exception that the optional args aren't immediately visible.


vmc help options


will give you a boatload of optional flags you can pass in. One that was frequently used during the demos at launch was '-n'. I would suggest you NOT use that for now. The prompts are actually pretty valuable.

So in the case of Noah, we know we're going to need a Redis instance. Because everything is allocated dynamically, CloudFoundry makes heavy use of environment variables to provide you with important settings you'll need.

First Attempt
If you watched the demo (or read the quickstart Sinatra example), there's a demo app called 'env' that they walk you through. You're going to want to use that when troubleshooting things. My first task was to duplicate the env demo so I could take a gander at the variables I would need for Redis. For the record, the steps I'm documenting here might appear out of order and result in some wasted time. I'm one of those guys who reads the instructions 2 days after I've broken something so you have an idea of what I did here:


 vmc help
 vmc services
 vmc create-service redis redis-noah
 vmc services


At this point, I now have a named instance of redis. The reason I felt safe enough doing this now is that I noticed in the help two service commands - 'bind-service' and 'unbind-service'. I figured it was easy enough to add the service to my app based on those options.

So go ahead and create the env app per the getting started documentation. If you followed my suggestion and DIDN'T disable prompts, you'll get the option to bind your app to a service when you push the first time. If you're running without prompts (using the '-n' option), you'll probably want to do something like this:


vmc push myenvapp --url ohai-env.cloudfoundry.com
vmc bind-service my_redis_service myenvapp


If you visit the url you provided (assuming it wasn't already taken) at /env, you'll get a big dump of all the environment variables. The ones you'll be using most are probably going to be under `VCAP_SERVICES`. What you'll probably also notice is that `VCAP_SERVICES` is a giant JSON blob. Now you may also notice that there's a nice `VMC_REDIS` env variable there. It's pretty useless, primarily because there's a GIANT warning in the env output that all `VMC_` environment variables are deprecated, but also because your redis instance requires a password to access, which means you need to traverse the JSON blob ANYWAY.

So if we paste the blob into an IRB session we can get a better representation. I wish I had done that first. Instead, I reformatted it with jsonlint and dutifully wrote the following madness:
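(Something along these lines; the 'redis-2.2' label and the credential key names here are from memory, so treat them as approximations.)

require 'json'

vcap = JSON.parse(ENV['VCAP_SERVICES'])
# WRONG - this assumes the value under the service label is a hash
creds = vcap['redis-2.2']['credentials']
redis_host = creds['hostname']
redis_port = creds['port']
redis_pass = creds['password']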

which I spent a good 30 minutes troubleshooting before I realized that it's actually an array. It should have been this:
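(Same caveat on the key names; note the extra array index.)

require 'json'

vcap = JSON.parse(ENV['VCAP_SERVICES'])
# each service label actually maps to an ARRAY of bound instances
creds = vcap['redis-2.2'].first['credentials']
redis_host = creds['hostname']
redis_port = creds['port']
redis_pass = creds['password']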


So now that I had all the variables in place, I went about converting my Heroku Noah demo. That demo uses a Gemfile and a rackup file so I figured it would work just fine here. No such luck. This is where things get hairy.

Sinatra limitations
The short of it is that Sinatra application support on CF right now is a little bit of a CF. It's very basic and somewhat brute force. If you're running a single file sinatra application, it will probably work. However if you're running anything remotely complex, it's not going to work without considerable effort. Noah is even more of a special case because it's distributed as a gem. This actually has some benefit as I'll mention farther down. However it's not really "compatible" with the current setup on Cloud Foundry. Here's the deal:

If you look here, you'll see that the way your sinatra application is started is by calling ruby (with or without bundler, depending) against what it detects as your main app file. This is done here, which leads us all the way to this file:

`https://github.com/cloudfoundry/vcap/blob/master/cloud_controller/staging/manifests/sinatra.yml`

Essentially, for sinatra applications, the first .rb file it comes across with 'require sinatra' is considered the main app file. Bummer. So config.ru is out. The next step is to rename it to a '.rb' file and try again. This is where I spent most of my troubleshooting. There's a gist of the things I tried (including local testing) here:

`https://gist.github.com/920552`

Don't jump to the solution just yet because it's actually incomplete. This troubleshooting led to another command you'll want to remember:


vmc files myapp logs/stderr.log


I found myself typing it a lot during this process. For whatever reason (possibly due to bundler or some other vcap magic I've not discovered yet), what works at home does not work exactly the same on Cloud Foundry. That's fine; it's just a matter of knowing about it. It also didn't help that I wasn't getting any output at all for the entire time I was trying to figure out why config.ru didn't work.

Thanks to Konstantin Haase for his awesome suggestion in #sinatra. The trick here was to mimic what rackup does. Because the currently released Noah gem has a hard requirement on rack 1.2.1, his original suggestion wasn't an exact fit but I was able to get something working:

https://gist.github.com/921292

So what did we do?
  • Ensure that the wrapper file is picked up first by making sure it's the ONLY .rb file uploaded with `require sinatra` at the top.
  • Because of a bug in rack 1.2.1 with Rack::Server.new, I HAD to create a file called config.ru. The fix in rack 1.2.2 actually honors passing all the options into the constructor without needing the config.ru file.
  • Explicitly connect to redis before we start the application up.

The last one was almost as big of a pain in the ass as getting the application to start up.

I think (and I'm not 100% sure) that you are prohibited from setting environment variables inside your code. Because of the convoluted way I had to get the application started, I couldn't use my sinatra configuration block properly (`set :redis_url, blahblahblah`). I'm sure it's possible but I'm not an expert at rack and sinatra. I suppose I could have used Noah::App.set but at this point I was starting to get frustrated. Explicitly setting it via Ohm.connect worked.
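To make that concrete, the wrapper ended up looking roughly like this (a from-memory sketch rather than the exact gist; the 'redis-2.2' label, the credential keys and VCAP_APP_PORT are assumptions based on my env dump):

# noah_wrapper.rb - the ONLY .rb file in the push, so vcap picks it as the main app file
require 'sinatra'   # has to be here so the staging plugin detects a sinatra app
require 'json'
require 'ohm'
require 'noah'

# explicitly connect to the bound redis instance before Noah::App boots
creds = JSON.parse(ENV['VCAP_SERVICES'])['redis-2.2'].first['credentials']
Ohm.connect(:host     => creds['hostname'],
            :port     => creds['port'],
            :password => creds['password'])

# mimic what rackup does; rack 1.2.2+ honors these constructor options,
# 1.2.1 ignores them, which is why the config.ru below still has to exist
Rack::Server.new(:config => 'config.ru',
                 :Port   => ENV['VCAP_APP_PORT'].to_i).start

with the config.ru sitting next to it:

require 'noah'
run Noah::App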

I'm almost confident of this environment variable restriction because you can see options in 'vmc help' that allow you to pass environment variables into your application. That would work fine for most cases except that I don't know what the redis values are outside of the app and they're set dynamically anyway.

So where can things improve?
First off, this thing is in beta. I'm only adding this section because it'll serve as a punch list of bugs for me to fix in vcap ;)


  • Sinatra support needs to be more robust.

You can see that the developers acknowledged that in the staging plugin code. There are TODOs listed. It's obvious that a sinatra application of any moderate complexity wasn't really tested and that's fine. The building blocks are there and the code is opensource. I'll fix it myself (hopefully) and submit a pull request.

  • Allow override of the main app file from VMC.

It appears from the various comments that the node.js support suffers from some of the same brute-force detection routines. An option to pass in what the main application file is would solve some of that.

  • Document the environment variable restrictions.

I didn't see any documentation anywhere about that restriction (should it exist). I could be doing something wrong too. It's worth clarifying.

  • Better error reporting for failed startups

I'm not going to lie: I spent a LONG time troubleshooting the fact that the app simply wasn't starting up. The default output when a failure happens during deploy is the staging.log file. All this EVER contained was the output from bundler. It should include the output of stderr.log and stdout.log as well. Also, an explicit message should be returned if the main app file can't be detected. That would have solved much of my frustration up front.

That's just the stuff I ran into to get things going. The first item is the biggest one. If you're writing a monolithic single-file sinatra app, the service will work GREAT. If you aren't, you'll have to jump through hoops and wrapper scripts for now. Supporting rackup files for Sinatra and Rack apps will go a long way to making things even more awesome.

One pleasant surprise I found was that, despite what I was told, I didn't need to include every gem in my Gemfile. Because Noah itself declares its own deps, Bundler pulls those in for me.

I've created a git repo with the code as well as a quickstart guide for getting your own instance running. You can find it here:

https://github.com/lusis/noah-cloudfoundry-demo

Thursday, April 14, 2011

Operational Primitives

"Infrastructure as code". I love the phrase. Where devops is a word that is sadly open to so much (mis)interpretation, "Infrastructure as code" is pretty clear. Treat your infrastructure as code. Programmable. Testable. Deployable.

But when you start to really think about that concept, there's a deep dive you can take, navigating various programming and computer science constructs and applying those to your infrastructure.

I've been working pretty heavily on getting the first API-stable release of Noah out the door. It's been a challenge with the schedule I have to work on it - which is essentially "when everyone else in the house is asleep and I'm awake". Last night, I came to a fork in the road where I needed to make a decision. This decision would lock me into an API path that I was unwilling to change for a while. Nobody wants to use a service or tool with a constantly changing API. I needed to shit or get off the pot, to use a creative euphemism. With the announcements of both Doozer and riak_zab, it was clear that I wasn't the only person attempting to tackle the ZooKeeper space.

Since Github lacks any facility for soliciting project feedback (hint hint, @github), I decided to create a  Wufoo form and tweet it out. I don't have a very big audience but I was hoping it would at least get to the people who were likely to use Noah. The form was fairly simple with one question on something that I had pretty summarily dismissed early on - HATEOAS (hypermedia as the engine of application state).

A small HATEOAS diversion


The HATEOAS debate is a lot like Linux vs. GNU/Linux. It's fairly esoteric but there's some meat to the matter. My problem with it was simply that, despite what Roy Fielding and others intended, REST had taken on a new definition and it wasn't the strict HATEOAS one. Additionally, I found it VERY difficult to map HATEOAS concepts to JSON. JSON is a great format but a rich document structure is not (rightly so) part of the format. It's intended to be simple, easily read and cleanly mapped to a machine-readable format. It also felt like extra work on the part of the API consumer. The concepts that we use when reading a website (click this link, read this list, click this link) are simply not necessary when you have a contextually relevant (or descriptive) URL scheme. True, as a human I don't make changes in the URL bar to navigate a site (I use the links provided by the site) but when it comes to dealing with an API, I don't exhibit the same usage patterns as a web browser. I'm making distinct atomic transactions (DELETE this resource, PUT this resource) at a given endpoint. These simply aren't the same as filling out forms and are only tangentially related. I'm simply not willing to force someone to parse a JSON object to tell them how to create a new object in the system. The API for Noah is fairly simple as it is. Objects in the system have only two or three required attributes for a given operation and normally one of those attributes is directly inferable from the URL.

But based on the poll results thus far, I wanted to give the idea fair consideration which led me to think about what types of objects Noah had in its system.

Primitives


For those who aren't familiar or simply don't know, there's a term in computer science and programming called "Primitive". It essentially means a basic data type in a language from which other complex data types are created. A building block, if you will. Some easily grokkable examples of primitives are Characters and Integers. Some languages actually have ONE primitive like Object and everything is built on top of that. You could get into a semantic argument about a lot of this so I'm going to leave it at that.

But back to the phrase "Infrastructure as code". If we start looking at how we "program" our infrastructure, what are the "primitives" that our language supports? I inadvertently created some of these in Noah. I've been calling them the "opinionated models" but really, in the infrastructure programming language of Noah, they're primitives.

When this hit me last night, I immediately pulled out the tablet and went to work on a mind map. I laid out what I had already implemented as primitives in Noah:


  • Host
  • Service
  • Application
  • Configuration


I then started to think about other concepts in Noah. Were Ephemerals really a primitive? Not really. If anything, Ephemerals are more similar to ruby's BasicObject. The only real attribute Ephemerals have is a path (similar to the object_id).

So what else would be our modern operational primitives? Remember that we're talking about building blocks here. I don't want to abstract out too much. For instance you could simply say that a "Resource" is the only real operational primitive and that everything else is built on top of that.  Also consider that languages such as Python have some richer primitives built-in like tuples.

One interesting thought I had was the idea that "State" was a primitive. Again, in the world of operations and infrastructure, one of your basic building blocks is if something is available or not - up or down. At first glance it would appear that this maps pretty cleanly to a Boolean (which is a primitive in most languages) however I think it's a richer primitive than that.

In the world of operations, State is actually quaternary (if that's the right word) rather than binary. There are two distinct areas between up and down that have dramatically different implications on how you interact with it:


  • Up
  • Down
  • Pending Up
  • Pending Down


Currently in Noah, we simply have Up, Down and Pending, but something that is in the state of shutting down is grossly different from something in the state of starting up. Look at a database that is quiescing connections. It's in a state of "Pending Down". It's still servicing existing requests. However, a database in the state of "Pending Up" is NOT servicing any requests.
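If you sketched that out as code, it might look something like this (the module name and helper methods are hypothetical; the point is just that a plain boolean loses information):

module Noah
  module State
    UP           = :up
    DOWN         = :down
    PENDING_UP   = :pending_up    # starting; NOT servicing requests yet
    PENDING_DOWN = :pending_down  # quiescing; still servicing existing requests

    # still answering the requests it already has?
    def self.servicing_requests?(state)
      [UP, PENDING_DOWN].include?(state)
    end

    # safe to hand NEW work to?
    def self.accepting_new_work?(state)
      state == UP
    end
  end
end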

So I'm curious what other thoughts people have. What else are the basic building blocks of modern operations when viewed through the lens of "infrastructure as code"?

For the record, I'm still pretty confident that Noah still has a place in the Doozer, riak_zab, ZooKeeper world. All three of those still rely on the persistent connection for signaling and broadcast whereas Noah is fundamentally about the disconnected and asynchronous world.

Tuesday, April 5, 2011

It does not follow and Wheaton's Law

"I'm not a smart guy".
I say this quite a bit. I don't say it to fish for compliments or as a chance to have my ego boosted. I say it because I realize that, out of the vast corpus of computer science knowledge that exists, the part that I DO know is a blade of grass on a football field.

"I'm not a developer"

I say this a lot too. This is not meant as a slight to developers. It's meant as a compliment. There are REAL developers out there and I'm just pretending (after a fashion). I have never worked a professional gig as a developer. I've had honest discussions with people who want to pay me lots of money to be a developer. The best way I can explain it to them is that it would be unfair to you, as an employer, to hire me for a developer position because you would be unhappy with the results. In general it takes me twice as long to solve a development problem as it takes a real developer.

There are lots of factors to this: education, regular skill use and a general affinity for picking up concepts. I never graduated college and I pretty much suck at math. That's not to say I couldn't learn it but there are some things I know I'll never be as good at as someone else and that's fine by me. I'm not settling for mediocrity; I just know my limitations. I'll still take a stab at it.

There are, however, some REALLY smart people out there. I used to follow a bunch of them on Twitter because they would link to or drop ideas that really made me want to go research something. I noticed an interesting trend though about some of them. They had a tendency to be dicks. Not just the occasional "Only an idiot would do X" but outright vitriol. Was it trolling? In some cases, sure, but I honestly got the impression that they actually looked down on people who didn't use a certain technology or chose any path different than the one they would have chosen.

At the other extreme, you have the folks who make snide remarks or drop a non sequitur about a given technology presumably in an attempt to make the in-crowd giggle and the rest of us poor saps wonder what the hell we're doing wrong. I mean these are smart people, right? If they know something I don't about a given technology, then by god, I'd love to know what it is. I'd love to learn why they feel that way. In the end, though, all you hear is giggling in the background and wonder what the big joke was.

When the hell did we, the people who were typically on the outside of the in-crowd, turn into the people who gave us the most shit growing up? It's like a fucking geek Stockholm Syndrome thing that's gone off the deep end but instead of just sympathizing with our abuser, we're the abuser and we relish it.

I'm guilty of this behavior. I'm the first in line to criticize MongoDB, for instance. The difference? I'll actually sit down with you and tell you WHY I don't like MongoDB and why I feel it's a bad choice in many situations.

What I'm asking, as one of the people on the outside, is this: educate me. As much as I think Ted Dziuba is a big troll, at least he takes the time to write it down and tries to defend his position. Ben Bleything had an awesome tweet today:

I guess what I meant is, I don't have the experience to form that opinion, I'd like to learn from you.

That's my attitude exactly. "Put up or shut up" is a bit harsh but in the broadest terms, that's what needs to happen. If you think X is superior to Z then say why. There are some of us who could benefit from it.

Sidebar on Semantics


Additionally, let's make sure we're also on the same page in terms of semantics. If we're talking about queues, clarify if you're talking about data structures versus a message queue, because there's a big f'ing difference in my mind.

When I hear queue, I don't think data structure. I think of a message queue in the product sense. That's just my background. I think about things like guaranteed delivery and message durability.

Tuesday, March 8, 2011

Ad-Hoc Configuration, Coordination and the value of change

For those who don't know, I'm currently in Boston for DevOps Days. It's been amazing so far and I've met some wonderful people. One thing that was REALLY awesome was the open space program that Patrick set up. You won't believe it works until you've tried it. It's really powerful.

In one of our open spaces, the topic of ZooKeeper came up. At this point I made a few comments and, at the additional prodding of everyone, went into a discussion about ZooKeeper and Noah. I have a tendency to monopolize discussions around topics I'm REALLY passionate about, so many thanks to everyone who insisted I go on ;)

Slaughter the deviants!
The most interesting part of the discussion about ZooKeeper (or at least the part I found most revealing) was that people tended to have trouble really seeing the value in it. One of the things I've really wanted to do with Noah is provide (via the wiki) some really good use cases about where it makes sense.

I was really excited to get a chance to talk with  Alex Honor (one of the co-founders of DTO along with Damon Edwards) about his ideas after his really interesting blog post around ad-hoc configuration. If you haven't read it, I suggest you do so.

Something that often gets brought up and, oddly, overlooked at the same time is where ad-hoc change fits into a properly managed environment (using a tool like puppet or chef).

At this point, many of you have gone crazy over the thought of polluting your beautifully organized environment with something so dirty as ad-hoc changes. I mean, here we've spent all this effort on describing our infrastructure as code and you want to come in and make a random, "undocumented" change? Perish the thought!

However, as with any process or philosophy, strict adherence without understanding WHEN to deviate will only lead to frustration. Yes, there is a time to deviate and knowing when is the next level of maturity in configuration management.

So when do I deviate
Sadly, knowing when it's okay to deviate is as much a learning experience as it was getting everything properly configured in the first place. To make it even worse, that knowledge is most often specific to the environment in which you operate. The whole point of the phrase ad-hoc is that it's...well...ad-hoc. It's one part improvisation, half a part stumbling in the dark, and the rest is backfilled with a corpus of experience. I don't say this to sound elitist.

So, really, when do I deviate? When/where/why and how do I deviate from this beautifully described environment? Let's go over some use cases and point out that you're probably ALREADY doing it to some degree.

Production troubleshooting
The most obvious example of acceptable deviation is troubleshooting. We pushed code, our metrics are all screwed up and we need to know what the hell just happened. Let's crank up our logging.

At this point, by changing your log level, you've deviated from what your system of record (your CM tool) says you should be. Our manifests, our cookbooks, our templates all have us using a loglevel of ERROR but we just bumped one server up to DEBUG so we could troubleshoot. That system is now a snowflake. Unless you change that log level back to ERROR, you now have one system that, until you do a puppet or chef-client run, is different from all the other servers of that class/role.

Would you codify that in the manifest? No. This is an exception. A (should be) short-lived exception to the rules you've defined.

Dynamic environments
Another area where you might deviate is in highly elastic environments. Let's say you've reached the holy grail of elasticity. You're growing and shrinking capacity based on some external trigger. You can't codify this. I might run 20 instances of my app server now but drop back down to 5 instances when the "event" has passed. In a highly elastic environment, are you running your convergence tool after every spin up? Not likely. In an "event" you don't want to have to take down your load balancer (and thus affect service to the existing instances) just to add capacity. A bit of a contrived example but you get the idea.

So what's the answer?
I am by far not the smartest cookie in the tool shed but I'm opinionated so that has to count for something. These "exception" events are where I see additional tools like Zookeeper (or my pet project Noah) stepping in to handle things.

Distributed coordination, dynamically reconfigurable code, elasticity and environment-aware applications.
These are all terms I've used to describe this concept to people. Damon Edwards provided me with the last one and I really like it.

Enough jibber-jabber, hook a brother up!
So before I give you the ability to shoot yourself in the foot, you should be aware of a few things:


  • It's not a system of record

Your DDCS (dynamic distributed coordination service as I'll call it because I can't ever use enough buzzwords) is NOT your system of record. It can be but it shouldn't be. Existing tools provide that service very well and they do it in an idempotent manner.


  • Know your configuration

This is VERY important. As I said before, much of this is environment specific. The category of information you're changing in this way is more "transient" or "point-in-time". Any given atom of configuration information has a specific value associated with it, and different atoms have different levels of volatility. Your JDBC connection string is probably NOT going to change that often. However, the number of application servers might sit at different levels of capacity based on some dynamic external factor.


  • Your environment is dynamic and so should be your response

This is where I probably get some pushback. Just as one of the goals of "devops" was to deal with what Jesse Robbins described today as misalignment of incentive, there's an internal struggle where some values are simply fluctuating in near real time. This is what we're trying to address.


  • It is not plug and play

One thing that Chef and Puppet do very well is that you can, with next to no change to your systems, predefine how something should look or behave and have those tools "make it so".

With these realtime/dynamic configuration atoms your application needs to be aware of them and react to them intelligently.

Okay seriously. Get to the point
So let's walk through a scenario where we might implement this ad-hoc philosophy in a way that gives us the power we're seeking.

The base configuration

  • Our application server (fooapp) uses memcached, two internal services called "lookup" and "evaluate", and a data store of some kind.
  • "lookup" and "evaluate" are internally developed applications that provide private REST endpoints for providing a dictionary service (lookup) and a business rule parser of some kind (evaluate).
  • Every component's base configuration (including the data source that "lookup" and "evaluate" use) is managed, configured and controlled by puppet/chef.


In a standard world, we store the ip/port mappings for "lookup" and "evaluate" in our CM tool and tag those. When we do a puppet/chef client run, the values for those servers are populated based on the ip/port information of our EXISTING "lookup"/"evaluate" servers.

This works. It's being done right now.

So where's the misalignment?
What do you do when you want to spin up another "lookup"/"evaluate" server? Well you would probably use a bootstrap of some kind and apply, via the CM tool, the changes to those values. However this now means that for this to take effect across your "fooapp" servers you need to do a manual run of your CM client. Based on the feedback I've seen across various lists, this is where the point of contention exists.

What about any untested CM changes (a new recipe, for instance)? I don't want to apply that, but if I run my CM tool, I've now not only pulled those unintentional changes but also forced a bounce of all of my fooapp servers. So as a side product of scaling capacity to meet demand, I've now reduced my capacity at another point just to make my application aware of the new settings.

Enter Noah
This is where the making your application aware of its environment and allowing it to dynamically reconfigure itself pays off.

Looking at our base example now, let's do a bit of architectural work around this new model.


  • My application no longer hardcodes a base list of servers providing "lookup" and "evaluate" services.
  • My application understands the value of a given configuration atom
  • Instead of the hardcoded list, we convert those configuration atoms into something akin to a singleton pattern that points to a bootstrap endpoint.
  • FooApp provides some sort of "endpoint" where it can be notified of changes to the number/ip addresses or urls available for a given one of our services. This can also be proxied via another endpoint.
  • The "bootstrap" location is managed by our CM tool based on some more concrete configuration - the location of the bootstrap server.


Inside our application, we're now:


  • Pulling a list of "lookup"/"evaluate" servers from the bootstrap url (i.e. http://noahserver/s/evaluate)
  • Registering a "watch" on the above "path" and providing an in-application endpoint to be notified when they change.
  • Validating at startup that the results of the bootstrap call provide valid information (i.e. doing a quick connection test to each of the servers provided by the bootstrap lookup, or a subset thereof)


If we dynamically add a new transient "lookup" server, Noah fires a notification to the provided endpoint with the details of the change. The application will receive a message saying "I have a new 'lookup' server available". It will run through some sanity checks to make sure that the new "lookup" server really does exist and works. It then appends the new server to the list of existing (permanent) servers and starts taking advantage of the increase in capacity.
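A rough sketch of what that might look like inside FooApp (the watch registration path, parameter name and callback url are guesses on my part, not the documented Noah API):

require 'json'
require 'net/http'
require 'sinatra'

NOAH = 'http://noahserver'   # the bootstrap location our CM tool laid down

# bootstrap: pull the current list of "evaluate" servers
evaluate_servers = JSON.parse(Net::HTTP.get(URI("#{NOAH}/s/evaluate")))

# register a watch on that path, pointing back at an endpoint inside FooApp
Net::HTTP.post_form(URI("#{NOAH}/w/s/evaluate"),
                    'endpoint' => 'http://fooapp.internal/watches/evaluate')

# Noah POSTs here when the "evaluate" list changes
post '/watches/evaluate' do
  change = JSON.parse(request.body.read)
  # sanity check change (quick connection test against the new server),
  # then append it to evaluate_servers and start using the extra capacity
  status 200
end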

That's it. How you implement the "refresh" and "validation" mechanisms is entirely language specific. This also doesn't, despite my statements previously, have to apply to transient resources. The new "lookup" server could be a permanent addition to my infra. Of course this would have been captured as part of the bootstrapping process if that were the case.

Nutshell
And that's it in a nutshell. All of this is available in Noah and ZooKeeper right now. Noah is currently restricted to http POST endpoints but that will be expanded. ZooKeeper treats watches as ephemeral. Once the event has fired, you must re-register that same watch. With Noah, watches are permanent.

Takeaway
I hope the above has made sense. This was just a basic introduction to some of the concepts and design goals. There are plenty of OTHER use cases for ZooKeeper alone. So the key takeaways are:


  • Know the value of your configuration data
  • Know when and where to use that data
  • Don't supplant your existing CM tool but instead enhance it.


Links
Noah
ZooKeeper
Hadoop Book (which has some AMAZING detail around ZooKeeper, the technology and use cases)

Friday, February 25, 2011

Thank You

In case you hadn't heard, Amazon went all Top Gun today and gave the world CloudFormation. This, of course, gave rise to tweets and one-offs from pundits all over the world stating that it was the death knell of tools like Chef and Puppet.

Amazon had usurped yet another business model with the stroke of its mighty hand!

Let's ignore for a moment the fact that:
  • Amazon had the Chef and Puppet folks in beta
  • Chef and Puppet are on the block to be supported as part of CloudFormation
  • CloudFormation is actually nothing like Chef and Puppet and serves an entirely different purpose
I was pretty heads down at the office today (as was everyone else) so I didn't get a chance to catch up a bit until tonight. That's when I saw some of the most ignorant tweets from some supposedly smart people that I've ever seen. I ended up having to prune quite a bit from my Twitter list.

These were obviously inspired by the CloudFormation announcement and discussions around how it relates to existing CM tools. There were gems like this:

"process of orchestration, policy, governance, stacks, cross clouds, billback, etc. way too complex for some scripts"

"Scripts also wouldn't cover complexity of trying to mng a variety of clouds, all w/differing APIs & Front ends"

"You heard it here first. All you need for cloud automation, orchestration and provisioning is some Perl and you're golden! #DevFlOps"

Now maybe I'm taking these a bit out of context. Maybe I was just being a pissy bastard but these really got me riled up. Mind you not so riled up that I ran downstairs because "someone was wrong on the internet". I put my son to bed, fell asleep and when I woke up, I was still pissed off about it. I figured an hour of sleeping on it was enough justification so here I am.

Thank You

Before I get into the bitching and moaning though, I want to say "Thank you" to some people.

To Mark Burgess, Luke Kanies, Adam Jacob, Damon Edwards and any other system administrator who got so fed up with the bullshit to write the tools that we're using today, THANK YOU.

Thank you for not accepting that we had to manage systems the way we always had. Thank you for stepping outside the comfort zone and writing amazing code. Thank you for taking 20 minutes to actually think about it when you were only given 10 minutes to get it done. Thank you.

To Patrick Debois, John Allspaw, Andrew Clay Shafer and everyone who has promoted the idea of what we call "devops" today, THANK YOU.

Thank you for pushing the idea into the mainstream with a phrase that so accurately captures what is trying to be accomplished. Thank you for being innovative and being open and sharing about it.

To everyone else whose blog posts, newsgroup postings, tweets, emails, books, and irc discussions I've had the extreme pleasure of learning from over these past 17 years in this business, THANK YOU.

Thank you for sharing. Thank you for saying it even if you thought no one was reading or listening. Thank you for challenging me to learn more and inspiring me to grow as a person and as, what I'll always be at heart, a system administrator.

To everyone above and those who I didn't mention, thank you. I thank you because it's ideas like "opensource" and "devops" and "configuration management" that free us up as individuals to think and achieve more as individuals personally and professionally. It frees us up to spend time with our families instead of answering a page at 2AM troubleshooting a stupid issue that should have never happened in the first place.

These things are more valuable than gold.

And to the haters...

Seriously.

To the vendors who write stupid applications that require me to have X11 installed on a freaking server against ALL best practices forcing me to click through a goddamn powerpoint to install your "enterprise" software, FU.

 I don't need your shit and I'm luckily at a point in my career where I don't have to put up with it anymore.

To the virtualization companies who require me to have a goddamn Windows VM to manage my Linux servers because, after how many f'ing years?, you can't write a Linux port even though your product IS BASED ON LINUX? FU.

Don't worry. I can Xen and KVM like a mofo. I can go to Amazon or GoGrid or Rackspace or any other provider if I don't need to host it in house. And guess what? I can do it all from the same platform I'm deploying without jumping through any hoops.

To the networking vendors who give me a choice between telnet or some overpriced java gui to do configuration of your gear, FU.

"Oh sorry about the downtime. Because we have to drop and recreate rule sets just to add a new rule, we used copy/paste from Wordpad into HyperTerminal durdurdur".

To the pundits who think that "devops" is just a bunch of perl scripts that can't "cover the complexity of blah blah blah"...I think you know the drill by now.

Really? A bunch of scripts can't cover the complexity of the various cloud providers? Interesting. I guess fog or jclouds or libcloud are just toys then.

Oh wait, what's this? You mean I can use the same commands in my CM tool regardless of where my systems are hosted? I mean Chef's command-line tool uses Fog. Any provider Fog supports, Chef will support.

But really I feel for you all. I do. You're in a losing battle. Here's the thing. People like me. People like those I mentioned above. The up and coming decision makers? We won't settle for your shitty mediocrity anymore. We won't be beholden to doing it your way. When we pick a vendor or a product or a provider, we're going to go with the ones that provide us the flexibility to manage our infrastructure in the way that's best for our company. Not for you.

We've tasted what it's like to do things the "right way" and we won't take anything less.

Friday, January 14, 2011

Follow up to "No Operations Team Left Behind"

Jim Bird over at the swreflections blog recently posted an article entitled "What I like (and don't like) about DevOps". I've attempted to post a comment but something about my comment is making Blogger go batshit so I'm posting it here instead along with some additional notes. Jim, for the record, I don't think it's anything on the Blogger side. My comment is triggering an HTTP post too large error.

Here's my original comment:

As the author of one of your links, I should probably qualify a few things that weren't originally clear. I don't think that DevOps and ITIL are mutually exclusive and I don't think that anything about DevOps inherently subverts any existing policy. The point of my original post was that the enthusiasm that so many of us have can cause a negative reaction. I've often told people that you can get to the point where you can do things like continuous deployment without actually "flipping the switch". I clarified some of this in a presentation I made to the local Atlanta devops user group:
http://devops-culture-hurdles.heroku.com/
One thing that's not clear in the slides regarding "boogeymen" is that very little of the regulation from things like HIPAA and SOX imposes specific technical requirements. Much of the policy is around auditability and accountability. The problem is that companies use a checklist approach to addressing those regulations because it's most cost-effective. If, for instance, the requirement is that all user access and actions are logged, why is it not acceptable to simply eliminate that user access altogether and use an automated tool instead?
Auditor: "Show me who logged on to the server and what they did"
Me: "I can do you one better. No one logs onto the servers. Here's an exact list of every single configuration change applied to the server and when."
In fact, tools like puppet, chef, mcollective, rundeck and the like actually encourage MORE security, auditability and accountability. By approaching your infrastructure as code, using configuration management tools and automation, you can eliminate most if not all of the cases where, for instance, a person needs to physically log in to a server. You get disaster recovery built in because you've now codified in "code" how to define your infrastructure and you can "compile" that infrastructure into a finished product. You attack the root cause and not just band-aid it.
I think companies like WealthFront (originally Kaching) are a good example of what's possible in a regulated industry. It will be interesting to see how Facebook deals with the additional regulation should they ever go public. 

Sadly my original post has been used as "See? DevOps isn't for REAL enterprises" fodder. That was not my intention. The intention was simply this:

Do not let the "cool" factor of DevOps cloud the practical factor of DevOps. 

Yes, continuous deployment and fully automated environments are freaking awesome and they are truly laudable goals but they aren't the only reason to adopt these practices. Using configuration management is a no-brainer. Automated testing is a no-brainer. Having teams work more closely together SHOULD be a no-brainer. You can implement 100% of the capabilities that allow you to do those things and never actually do them. If you do flip that switch, don't belittle another person who can't flip that switch for whatever reason.

THAT was the point of my original post.