Tuesday, March 8, 2011

Ad-Hoc Configuration, Coordination and the value of change

For those who don't know, I'm currently in Boston for DevOps Days. It's been amazing so far and I've met some wonderful people. One thing that was REALLY awesome was the open space program that Patrick set up. You won't believe it works until you've tried it. It's really powerful.

In one of our open spaces, the topic of ZooKeeper came up. At this point I made a few comments, and at the additional prodding of everyone went into a discussion about ZooKeeper and Noah. I have a tendency to monopolize discussions around topics I'm REALLY passionate about so many thanks for everyone who insisted I go on ;)

Slaughter the deviants!
The most interesting part of the discussion about ZooKeeper (or at least the part I found most revealing) was that people tended to have trouble really seeing the value in it. One of the things I've really wanted to do with Noah is provide (via the wiki) some really good use cases about where it makes sense.

I was really excited to get a chance to talk with  Alex Honor (one of the co-founders of DTO along with Damon Edwards) about his ideas after his really interesting blog post around ad-hoc configuration. If you haven't read it, I suggest you do so.

Something that often gets brought up and, oddly, overlooked at the same time is the where ad-hoc change fits into a properly managed environment (using a tool like puppet or chef).

At this point, many of you have gone crazy over the thought of polluting your beautifully organized environment with something so dirty as ad-hoc changes. I mean, here we've spent all this effort on describing our infrastructure as code and you want to come in and make a random, "undocumented" change? Perish the thought!

However, as with any process or philosophy, strict adherence with out understanding WHEN to deviate will only lead to frustration. Yes, there is a time to deviate and knowing when is the next level of maturity in configuration management.

So when do I deviate
Sadly, knowing when it's okay to deviate is as much a learning experience as it was getting everything properly configured in the first place. To make it even worse, that knowledge is most often specific to the environment in which you operate. The whole point of the phrase ad-hoc is that it's..well...ad-hoc. It's 1 part improvisation/.5 parts stumbling in the dark and the rest is backfilled with a corpus of experience. I don't say this to sound elitist.

So, really, when do I deviate. When/where/why and how do I deviate from this beautifully described environment? Let's go over some use cases and point out that you're probably ALREADY doing it to some degree.

Production troubleshooting
The most obvious example of acceptable deviation is troubleshooting. We pushed code, our metrics are all screwed up and we need to know what the hell just happened. Let's crank up our logging.

At this point, changing your log level, you've deviated from what your system of record (your CM tool) says you should be. Our manifests, our cookbooks, our templates all have us using a loglevel of ERROR but we just bumped up one server to DEBUG. so we could troubleshoot. That system is now a snowflake. Unless you change that log level back to ERROR, you know have one system that will, until you do a puppetrun of chef-client run is different than all the other servers of the class/role.

Would you codify that in the manifest? No. This is an exception. A (should be) short-lived exception to the rules you've defined.

Dynamic environments
Another area where you might deviate is in highly elastic environments. Let's say you've reached the holy grail of elasticity. You're growing and shrinking capacity based on some external trigger. You can't codify this. I might run 20 instances of my app server now but drop back down to 5 instances when the "event" has passed. In a highly elastic environment, are you running your convergence tool after every spin up? Not likely. In an "event" you don't want to have to take down your load balancer (and thus affect service to the existing intstances) just to add capacity. A bit of a contrived example but you get the idea.

So what's the answer?
I am by far not the smartest cookie in the tool shed but I'm opinionated so that has to count for something. These "exception" events are where I see additional tools like Zookeeper (or my pet project Noah) stepping in to handle things.

Distributed coordination, dynamically reconfigurable code, elasticity and environment-aware applications.
These are all terms I've used to describe this concept to people. Damon Edwards provided me with the last one and I really like it.

Enough jibber-jabber, hook a brother up!
So before I give you the ability to shoot yourself in the foot, you should be aware of a few things:


  • It's not a system of record

Your DDCS (dynamic distributed coordination service as I'll call it because I can't ever use enough buzzwords) is NOT your system of record. It can be but it shouldn't be. Existing tools provide that service very well and they do it in an idempotent manner.


  • Know your configuration

This is VERY important. As I said before, much of this is environment specific. The category of information you're changing in this way is more "transient" or "point-in-time". Any given atom of configuration information has a specific value associated with it. Different levels of volatility. Your JDBC connection string is probably NOT going to change that often. However, the number of application servers might be at different amounts of capacity based on some dynamic external factor.


  • Your environment is dynamic and so should be your response

This is where I probably get some pushback. Just as one of the goals of "devops" was to deal with, what Jesse Robbins described to day as misalignment of incentive, there's an internal struggle where some values are simply fluctuating in near real time. This is what we're trying to address.


  • It is not plug and play

One thing that Chef and Puppet do very well is that you can, with next to no change to your systems, predefine how something should look or behave and have those tools "make it so".

With these realtime/dynamic configuration atoms your application needs to be aware of them and react to them intelligently.

Okay seriously. Get to the point
So let's take walk through a scenario where we might implement this ad-hoc philosophy in a way that gives us the power we're seeking.

The base configuration

  •  application server (fooapp) uses memcached, two internal services called "lookup" and "evaluate" and a data store of somekind.
  • "lookup" and "evaluate" are internally developed applications that provide private REST endpoints for providing a dictionary service (lookup) and a business rule parser of some kind (evaluate).
  • Every component's base configuration (including the data source that "lookup" and "evaluation" use) is managed, configured and controlled by puppet/chef.


In a standard world, we store the ip/port mappings for "lookup" and "evaluate" in our CM tool and tags those. When we do a puppet/chef client run, the values for those servers are populated based on the ip/port information our EXISTING "lookup"/"evaluate" servers.

This works. It's being done right now.

So where's the misalignment?
What do you do when you want to spin up another "lookup"/"evaluate" server? Well you would probably use a bootstrap of some kind and apply, via the CM tool, the changes to those values. However this now means that for this to take effect across your "fooapp" servers you need to do a manual run of your CM client. Based on the feedback I've seen across various lists, this is where the point of contention exists.

What about any untested CM changes (a new recipe for instance). I don't want to apply that but if I run my CM tool, I've now not only pulled those unintentional changes but also forced a bounce of all of my fooapp servers. So as a side product of scaling capacity to meet demand, I've now reduced my capacity at another point just to make my application aware of the new settings.

Enter Noah
This is where the making your application aware of its environment and allowing it to dynamically reconfigure itself pays off.

Looking at our base example now, let's do a bit of architectural work around this new model.


  • My application no longer hardcodes a base list of servers prodviding "lookup" and "evaluate" services.
  • My application understands the value of a given configuration atom
  • Instead of the hardcoded list, we convert those configuration atoms akin to something like a singleton pattern that points to a bootstrap endpoint.
  • FooApp provides some sot of "endpoint" where it can be notified of changes to the number/ip addresses or urls available a a given of our services. This can also be proxied via another endpoint.
  • The "bootstrap" location is managed by our CM tool based on some more concrete configuration - the location of the bootstrap server.


Inside our application, we're now:


  • Pulling a list of "lookup"/"evaluate" servers from the bootstrap url (i.e. http://noahserver/s/evaluate)
  • Registering a "watch" on the above "path" and providing an in-application endpoint to be notified when they change.
  • validating at startup if the results of the bootstrap call provide valid information (i.e. doing a quick connection test to each of the servers provided by the bootstrap lookup or a subset thereof)


If we dynamically add a new transient "lookup" server, Noah fires a notification to the provided endpoint with the details of the change. The application will receive a message saying "I have a new 'lookup' server available". It will run through some sanity checks to make sure that the new "lookup" server really does exist and works. It then appends the new server to the list of existing (permanent servers) and start taking advantage of the increase in capacity.

That's it. How you implement the "refresh" and "validation" mechanisms is entirely language specific. This also doesn't, despite my statements previously, have to apply to transient resources. The new "lookup" server could be a permanent addition to my infra. Of course this would have been captured as part of the bootstrapping process if that were the case.

Nutshell
And that's it in a nutshell. All of this is availalbe in Noah and Zookeeer right now. Noah is currently restricted to http POST endpoints but that will be expanded. Zookeeper treats watches as ephemeral. Once the event has fired, you must register that same watch. With Noah, watches are permanent.

Takeaway
I hope the above has made sense. This was just a basic introduction to some of the concepts and design goals. There are plenty of OTHER use cases for ZooKeeper alone. So the key take aways are:


  • Know the value of your configuration data
  • Know when and where to use that data
  • Don't supplant your existing CM tool but instead enhance it.


Links
Noah
ZooKeeper
Hadoop Book (which has some AMAZING detail around ZooKeeper, the technology and use cases

4 comments:

Ulf MÃ¥nsson said...

Interesting article about ad-hoc configuration, but I think still I haven't found a good use scenario for using zookeeper. I think in most of those use cases the databags in Chef is a good solution to propagate configuration changes without any need to change any recipe. Then it's quite easy to use "knife ssh" to trigger an update on all the nodes. This could even be valid for debug modes as well.

I also must admit that we in all of our one code has implemented rereading of the config files every minute, that means that there are no need to restart an application to get the config updated.

Ernest Mueller said...

We'll have to look into Noah, we used Zookeeper as the heart of our own in-house provisioning system. We define an XML model (kinda like CloudFormation) and spin up our cloud, push software to it, and then zookeeper serves as the runtime registry and event trigger. In the cloud a lot of stuff needs dynamic config and needs to respond to scaling events etc. Zookeeper is a very fast and highly available way of doing that.

In Microsoft Azure, they have a built in 'fabric' you query to get role and endpoint information - basically serving the same purpose. It's extremely useful.

John E. Vincent said...

I would tend to agree that chef databags or puppet's new ext lookup functionality solves some of that arbitrary data. The problem I have right now with using knife is that it's not the most efficient way of propagating changes out to nodes. It's still a manual process. Mind you, I could just as easily define an ssh callback in Noah to do the client run.

As for the "reread config files every minute", that always felt like a hack to me. Why should my app be burning those cycles (however small now). If my app is already listening on the network, it's no real overhead to run an additional endpoint that waits to be told when something has changed.

Thanks very much for the comments! I'm really interested in how people feel about the idea.

John E. Vincent said...

I was not aware of the Azure fabric stuff. Microsoft still surprises me now and then.

What's really cool about ZooKeeper is how flexible it is in that regard. You're using it in two different aspects (provisioning descriptors and runtime registry). I'm hoping Noah can be that flexible too.