Tuesday, March 8, 2011

Ad-Hoc Configuration, Coordination and the value of change

For those who don't know, I'm currently in Boston for DevOps Days. It's been amazing so far and I've met some wonderful people. One thing that was REALLY awesome was the open space program that Patrick set up. You won't believe it works until you've tried it. It's really powerful.

In one of our open spaces, the topic of ZooKeeper came up. At this point I made a few comments, and at the additional prodding of everyone went into a discussion about ZooKeeper and Noah. I have a tendency to monopolize discussions around topics I'm REALLY passionate about so many thanks for everyone who insisted I go on ;)

Slaughter the deviants!
The most interesting part of the discussion about ZooKeeper (or at least the part I found most revealing) was that people tended to have trouble really seeing the value in it. One of the things I've really wanted to do with Noah is provide (via the wiki) some really good use cases about where it makes sense.

I was really excited to get a chance to talk with Alex Honor (one of the co-founders of DTO along with Damon Edwards) about his ideas after his really interesting blog post around ad-hoc configuration. If you haven't read it, I suggest you do so.

Something that often gets brought up and, oddly, overlooked at the same time is where ad-hoc change fits into a properly managed environment (using a tool like puppet or chef).

At this point, many of you have gone crazy over the thought of polluting your beautifully organized environment with something so dirty as ad-hoc changes. I mean, here we've spent all this effort on describing our infrastructure as code and you want to come in and make a random, "undocumented" change? Perish the thought!

However, as with any process or philosophy, strict adherence without understanding WHEN to deviate will only lead to frustration. Yes, there is a time to deviate, and knowing when is the next level of maturity in configuration management.

So when do I deviate?
Sadly, knowing when it's okay to deviate is as much a learning experience as getting everything properly configured in the first place was. To make it even worse, that knowledge is most often specific to the environment in which you operate. The whole point of the phrase ad-hoc is that it's...well...ad-hoc. It's one part improvisation, half a part stumbling in the dark, and the rest is backfilled with a corpus of experience. I don't say this to sound elitist.

So, really, when do I deviate? When/where/why and how do I deviate from this beautifully described environment? Let's go over some use cases and point out that you're probably ALREADY doing it to some degree.

Production troubleshooting
The most obvious example of acceptable deviation is troubleshooting. We pushed code, our metrics are all screwed up and we need to know what the hell just happened. Let's crank up our logging.

At this point, having changed your log level, you've deviated from what your system of record (your CM tool) says you should be running. Our manifests, our cookbooks, our templates all have us using a loglevel of ERROR, but we just bumped one server up to DEBUG so we could troubleshoot. That system is now a snowflake. Unless you change that log level back to ERROR, you now have one system that, until the next puppet or chef-client run, is different from all the other servers of that class/role.

Would you codify that in the manifest? No. This is an exception. A (should be) short-lived exception to the rules you've defined.

Dynamic environments
Another area where you might deviate is in highly elastic environments. Let's say you've reached the holy grail of elasticity: you're growing and shrinking capacity based on some external trigger. You can't codify this. I might run 20 instances of my app server now but drop back down to 5 instances when the "event" has passed. In a highly elastic environment, are you running your convergence tool after every spin-up? Not likely. During an "event" you don't want to have to take down your load balancer (and thus affect service to the existing instances) just to add capacity. A bit of a contrived example, but you get the idea.

So what's the answer?
I am by far not the smartest cookie in the tool shed but I'm opinionated so that has to count for something. These "exception" events are where I see additional tools like ZooKeeper (or my pet project Noah) stepping in to handle things.

Distributed coordination, dynamically reconfigurable code, elasticity and environment-aware applications.
These are all terms I've used to describe this concept to people. Damon Edwards provided me with the last one and I really like it.

Enough jibber-jabber, hook a brother up!
So before I give you the ability to shoot yourself in the foot, you should be aware of a few things:


  • It's not a system of record

Your DDCS (dynamic distributed coordination service as I'll call it because I can't ever use enough buzzwords) is NOT your system of record. It can be but it shouldn't be. Existing tools provide that service very well and they do it in an idempotent manner.


  • Know your configuration

This is VERY important. As I said before, much of this is environment specific. The category of information you're changing in this way is more "transient" or "point-in-time". Any given atom of configuration information has a specific value associated with it, and different atoms have different levels of volatility. Your JDBC connection string is probably NOT going to change that often. However, the number of application servers might fluctuate based on some dynamic external factor.


  • Your environment is dynamic and so should be your response

This is where I probably get some pushback. Just as one of the goals of "devops" was to deal with what Jesse Robbins described today as a misalignment of incentives, there's an internal struggle where some values are simply fluctuating in near real time. This is what we're trying to address.


  • It is not plug and play

One thing that Chef and Puppet do very well is that you can, with next to no change to your systems, predefine how something should look or behave and have those tools "make it so".

With these realtime/dynamic configuration atoms your application needs to be aware of them and react to them intelligently.

Okay seriously. Get to the point
So let's walk through a scenario where we might implement this ad-hoc philosophy in a way that gives us the power we're seeking.

The base configuration

  • Our application server (fooapp) uses memcached, two internal services called "lookup" and "evaluate", and a data store of some kind.
  • "lookup" and "evaluate" are internally developed applications that provide private REST endpoints for providing a dictionary service (lookup) and a business rule parser of some kind (evaluate).
  • Every component's base configuration (including the data source that "lookup" and "evaluate" use) is managed, configured and controlled by puppet/chef.


In a standard world, we store the ip/port mappings for "lookup" and "evaluate" in our CM tool and tag those. When we do a puppet/chef client run, the values for those servers are populated based on the ip/port information of our EXISTING "lookup"/"evaluate" servers.
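In Chef terms, that usually looks something like this (the role name, attributes and template here are made up for illustration):

# find whatever "lookup" servers the Chef server knows about right now
lookup_servers = search(:node, 'role:lookup').map do |n|
  { 'ip' => n['ipaddress'], 'port' => n['lookup']['port'] }
end

# render fooapp's service config with the servers that existed at run time
template '/etc/fooapp/services.yml' do
  source 'services.yml.erb'
  mode '0644'
  variables(:lookup_servers => lookup_servers)
end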

This works. It's being done right now.

So where's the misalignment?
What do you do when you want to spin up another "lookup"/"evaluate" server? Well you would probably use a bootstrap of some kind and apply, via the CM tool, the changes to those values. However this now means that for this to take effect across your "fooapp" servers you need to do a manual run of your CM client. Based on the feedback I've seen across various lists, this is where the point of contention exists.

What about untested CM changes (a new recipe, for instance)? I don't want to apply those, but if I run my CM tool, I've now not only pulled those unintentional changes but also forced a bounce of all of my fooapp servers. So as a side product of scaling capacity to meet demand, I've reduced my capacity at another point just to make my application aware of the new settings.

Enter Noah
This is where the making your application aware of its environment and allowing it to dynamically reconfigure itself pays off.

Looking at our base example now, let's do a bit of architectural work around this new model.


  • My application no longer hardcodes a base list of servers providing "lookup" and "evaluate" services.
  • My application understands the value of a given configuration atom
  • Instead of the hardcoded list, we convert those configuration atoms into something akin to a singleton pattern that points to a bootstrap endpoint.
  • FooApp provides some sort of "endpoint" where it can be notified of changes to the number, ip addresses or urls available for a given one of our services. This can also be proxied via another endpoint.
  • The "bootstrap" location is managed by our CM tool based on some more concrete configuration - the location of the bootstrap server.


Inside our application, we're now:


  • Pulling a list of "lookup"/"evaluate" servers from the bootstrap url (i.e. http://noahserver/s/evaluate)
  • Registering a "watch" on the above "path" and providing an in-application endpoint to be notified when they change.
  • Validating at startup if the results of the bootstrap call provide valid information (i.e. doing a quick connection test to each of the servers provided by the bootstrap lookup or a subset thereof)


If we dynamically add a new transient "lookup" server, Noah fires a notification to the provided endpoint with the details of the change. The application will receive a message saying "I have a new 'lookup' server available". It will run through some sanity checks to make sure that the new "lookup" server really does exist and works. It then appends the new server to the list of existing (permanent) servers and starts taking advantage of the increase in capacity.
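To make that a bit more concrete, here's a rough sketch of what the application side might look like in Ruby. The bootstrap URL follows the same pattern as above, but the payload format and the notification endpoint path are invented for illustration:

require 'json'
require 'net/http'
require 'socket'
require 'sinatra'

LOOKUP_SERVERS = []

# sanity check: can we actually open a connection to the advertised ip/port?
def healthy?(server)
  TCPSocket.open(server['ip'], server['port']).close
  true
rescue StandardError
  false
end

# bootstrap: ask Noah for the current list of "lookup" servers at startup
JSON.parse(Net::HTTP.get(URI.parse('http://noahserver/s/lookup'))).each do |server|
  LOOKUP_SERVERS << server if healthy?(server)
end

# the in-application endpoint that gets POSTed to when the watch fires
post '/config/lookup_changed' do
  server = JSON.parse(request.body.read)
  LOOKUP_SERVERS << server if healthy?(server)
  'ok'
end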

That's it. How you implement the "refresh" and "validation" mechanisms is entirely language specific. This also doesn't, despite my statements previously, have to apply to transient resources. The new "lookup" server could be a permanent addition to my infra. Of course this would have been captured as part of the bootstrapping process if that were the case.

Nutshell
And that's it in a nutshell. All of this is available in Noah and ZooKeeper right now. Noah is currently restricted to http POST endpoints but that will be expanded. ZooKeeper treats watches as ephemeral: once the event has fired, you must re-register that same watch. With Noah, watches are permanent.

Takeaway
I hope the above has made sense. This was just a basic introduction to some of the concepts and design goals. There are plenty of OTHER use cases for ZooKeeper alone. So the key takeaways are:


  • Know the value of your configuration data
  • Know when and where to use that data
  • Don't supplant your existing CM tool but instead enhance it.


Links
Noah
ZooKeeper
Hadoop Book (which has some AMAZING detail around ZooKeeper, the technology and use cases)

Friday, February 25, 2011

Thank You

In case you hadn't heard, today Amazon went all Top Gun and gave the world CloudFormation. This, of course, gave rise to tweets and one-offs from pundits all over the world stating that it was the death knell of tools like Chef and Puppet.

Amazon had usurped yet another business model with the stroke of its mighty hand!

Let's ignore for a moment the fact that:
  • Amazon had the Chef and Puppet folks in beta
  • Chef and Puppet are on the block to be supported as part of CloudFormation
  • CloudFormation is actually nothing like Chef and Puppet and serves an entirely different purpose
I was pretty heads down at the office today (as was everyone else) so I didn't get a chance to catch up a bit until tonight. That's when I saw some of the most ignorant tweets from some supposedly smart people that I've ever seen. I ended up having to prune quite a bit from my Twitter list.

These were obviously inspired by the CloudFormation announcement and discussions around how it relates to existing CM tools. There were gems like this:

"process of orchestration, policy, governance, stacks, cross clouds, billback, etc. way too complex for some scripts"

"Scripts also wouldn't cover complexity of trying to mng a variety of clouds, all w/differing APIs & Front ends"

"You heard it here first. All you need for cloud automation, orchestration and provisioning is some Perl and you're golden! #DevFlOps"

Now maybe I'm taking these a bit out of context. Maybe I was just being a pissy bastard but these really got me riled up. Mind you not so riled up that I ran downstairs because "someone was wrong on the internet". I put my son to bed, fell asleep and when I woke up, I was still pissed off about it. I figured an hour of sleeping on it was enough justification so here I am.

Thank You

Before I get into the bitching and moaning though, I want to say "Thank you" to some people.

To Mark Burgess, Luke Kanies, Adam Jacob, Damon Edwards and any other system administrator who got so fed up with the bullshit to write the tools that we're using today, THANK YOU.

Thank you for not accepting that we had to manage systems the way we always had. Thank you for stepping outside the comfort zone and writing amazing code. Thank you for taking 20 minutes to actually think about it when you were only given 10 minutes to get it done. Thank you.

To Patrick Debois, John Allspaw, Andrew Clay Shafer and everyone who has promoted the idea of what we call "devops" today, THANK YOU.

Thank you for pushing the idea into the mainstream with a phrase that so accurately captures what is trying to be accomplished. Thank you for being innovative and being open and sharing about it.

To everyone else whose blog posts, newsgroup postings, tweets, emails, books and IRC discussions I've had the extreme pleasure of learning from over these past 17 years in this business, THANK YOU.

Thank you for sharing. Thank you for saying it even if you thought no one was reading or listening. Thank you for challenging me to learn more and inspiring me to grow as a person and as, what I'll always be at heart, a system administrator.

To everyone above and those who I didn't mention, thank you. I thank you because it's ideas like "opensource" and "devops" and "configuration management" that free us up as individuals to think and achieve more as individuals personally and professionally. It frees us up to spend time with our families instead of answering a page at 2AM troubleshooting a stupid issue that should have never happened in the first place.

These things are more valuable than gold.

And to the haters...





Seriously.

To the vendors who write stupid applications that require me to have X11 installed on a freaking server against ALL best practices forcing me to click through a goddamn powerpoint to install your "enterprise" software, FU.

 I don't need your shit and I'm luckily at a point in my career where I don't have to put up with it anymore.

To the virtualization companies who require me to have a goddamn Windows VM to manage my Linux servers because, after how many f'ing years?, you can't write a Linux port even though your product IS BASED ON LINUX? FU.

Don't worry. I can Xen and KVM like a mofo. I can go to Amazon or GoGrid or Rackspace or any other provider if I don't need to host it in house. And guess what? I can do it all from the same platform I'm deploying without jumping through any hoops.

To the networking vendors who give me a choice between telnet or some overpriced java gui to do configuration of your gear, FU.

"Oh sorry about the downtime. Because we have to drop and recreate rule sets just to add a new rule, we used copy/paste from Wordpad into HyperTerminal durdurdur".

To the pundits who think that "devops" is just a bunch of perl scripts that can't "cover the complexity of blah blah blah"...I think you know the drill by now.

Really? A bunch of scripts can't cover the complexity of the various cloud providers? Interesting. I guess fog or jclouds or libcloud are just toys then.

Oh wait, what's this? You mean I can use the same commands in my CM tool regardless of where my systems are hosted? I mean Chef's command-line tool uses Fog. Any provider Fog supports, Chef will support.

But really I feel for you all. I do. You're in a losing battle. Here's the thing. People like me. People like those I mentioned above. The up and coming decision makers? We won't settle for your shitty mediocrity anymore. We won't be beholden to doing it your way. When we pick a vendor or a product or a provider, we're going to go with the ones that provide us the flexibility to manage our infrastructure in the way that's best for our company. Not for you.

We've tasted what it's like to do things the "right way" and we won't take anything less.

Friday, January 14, 2011

Follow up to "No Operations Team Left Behind"

Jim Bird over at the swreflections blog, recently posted an article entitled "What I like (and don't like) about DevOps". I've attempted to post a comment but something about my comment is making Blogger go batshit so I'm posting it here instead along with some additional notes. Jim, for the record I don't think it's anything on the Blogger side. My comment is triggering an HTTP post too large error.

Here's my original comment:

As the author of one of your links, I should probably qualify a few things that weren't originally clear. I don't think that DevOps and ITIL are mutually exclusive and I don't think that anything about DevOps inherently subverts any existing policy. The point of my original post was that the enthusiasm that so many of us have can cause a negative reaction. I've often told people that you can get to the point where you can do things like continuous deployment without actually "flipping the switch". I clarified some of this in a presentation I made to the local Atlanta devops user group:
http://devops-culture-hurdles.heroku.com/
One thing that's not clear in the slides regarding "boogeymen" is that very little of the regulation from things like HIPAA and SOX imposes specific technical requirements. Much of the policy is around auditability and accountability. The problem is that companies use a checklist approach to addressing those regulations because it's most cost-effective. If, for instance, the requirement is that all user access and actions are logged, why is it not acceptable to simply eliminate that user access altogether and use an automated tool instead?
Auditor: "Show me who logged on to the server and what they did"
Me: "I can do you one better. No one logs onto the servers. Here's an exact list of every single configuration change applied to the server and when."
In fact, tools like puppet, chef, mcollective, rundeck and the like actually encourage MORE security, auditability and accountability. By approaching your infrastructure as code, using configuration management tools and automation, you can eliminate most if not all of the cases where, for instance, a person needs to physically log in to a server. You get disaster recovery built in because you've now codified in "code" how to define your infrastructure and you can "compile" that infrastructure into a finished product. You attack the root cause and not just band-aid it.
I think companies like WealthFront (originally Kaching) are a good example of what's possible in a regulated industry. It will be interesting to see how Facebook deals with the additional regulation should they ever go public. 

Sadly my original post has been used as "See? DevOps isn't for REAL enterprises" fodder. That was not my intention. The intention was simply this:

Do not let the "cool" factor of DevOps cloud the practical factor of DevOps. 

Yes, continuous deployment and fully automated environments are freaking awesome and they are truly laudable goals but they aren't the only reason to adopt these practices. Using configuration management is a no-brainer. Automated testing is a no-brainer. Having teams work more closely together SHOULD be a no-brainer. You can implement 100% of the capabilities that allow you to do those things and never actually do them. If you do flip that switch, don't belittle another person who can't flip that switch for whatever reason.

THAT was the point of my original post.

Wednesday, January 5, 2011

Chef and Encrypted Data Bags - Revisited

In my previous post here I described the logic behind wanting to store data in an encrypted form in our Chef data bags. I also described some general encryption techniques and gotchas for making that happen.

I've since done quite a bit of work in that regard and implemented this at our company. I wanted to go over a bit of detail about how to use my solution. Fair warning, this is a long post. Lots of scrolling.

A little recap

As I mentioned in my previous post, the only reliable way to do the encryption of data bag items in an automated fashion is to handle key management yourself outside of Chef. I mentioned two techniques:

  • storing the decryption key on the server in a flat file
  • calling a remote resource to grab the key

Essentially the biggest problem here is key management and, in an optimal world, how to automate it reliably. For this demonstration, I've gone with storing a flat text file on the server. As I also said in my previous post, this assumes you tightly control access to that server. We're going with the original assumption that if a malicious person gets on your box, you're screwed no matter what.

Creating the key file

I used the knife command to handle my key creation for now:

# open an interactive knife ssh session against every node
knife ssh '*:*' interactive
# then, inside the interactive session, run on all nodes:
echo "somedecryptionstringblahblahblah" > /tmp/.chef_decrypt.key
chmod 0640 /tmp/.chef_decrypt.key

Setting up the databags and the rake tasks

One of the previous things I mentioned is knowing when and what to encrypt. Be sensible and keep it simple. We don't want to throw out the baby with the bath water. The Chef platform has lots of neat search capabilities that we'd like to keep. In this vein, I've created a fairly opinionated method for storing the encrypted data bag items.

We're going to want to create a new databag called "passwords". The format of the data bag is VERY simple:
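For example, here's the item we'll create and encrypt later in this post:

{
  "id": "supersecretpassword",
  "data": "mysupersecretpassword"
}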


We have an "id" that we want to use and the plaintext value that we want to encrypt.

Rake tasks

In my local chef-repo, I've created a 'tasks' folder. In that folder, I've added the following file:
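Something along these lines - this is a simplified sketch rather than the exact file, but the encrypted_strings call and the knife upload at the end are the important parts:

# tasks/encrypt_databag_item.rake - simplified sketch
require 'json'
require 'encrypted_strings'

# the same passphrase we dropped into /tmp/.chef_decrypt.key on the clients
PASSPHRASE = 'somedecryptionstringblahblahblah'

desc 'Encrypt a databag item in the passwords databag'
task :encrypt_databag, [:databag_item] do |t, args|
  item_file = File.join('data_bags', 'passwords', "#{args[:databag_item]}.json")
  item = JSON.parse(File.read(item_file))

  puts "Found item: #{item['id']}. Encrypting"
  item['data'] = item['data'].encrypt(:symmetric, :password => PASSPHRASE)
  puts "Encrypted data is #{item['data']}"

  # write the encrypted copy alongside the plaintext one
  crypted_file = item_file.sub(/\.json$/, '_crypted.json')
  File.open(crypted_file, 'w') { |f| f.write(JSON.generate(item)) }

  puts 'Uploading to Chef server'
  sh "knife data bag from file passwords #{crypted_file}"
end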


As you can see, this requires a rubygem called encrypted_strings. I've done a cursory glance over the code and I can't see anything immediately unsafe about it. It only provides an abstraction to the native OpenSSL support in Ruby with an additional String helper. However I'm not a cryptographer by any stretch so you should do your own due diligence.

At the end of your existing Rakefile, add the following:

load File.join(TOPDIR, 'tasks','encrypt_databag_item.rake')

If you now run rake -T you should see the new task listed:

rake encrypt_databag[databag_item]  # Encrypt a databag item in the passwords databag

If you didn't already create a sample data bag and item, do so now:

mkdir data_bags/passwords/
echo '{"id":"supersecretpassword","data":"mysupersecretpassword"}' > data_bags/passwords/supersecretpassword.json

Now we run the rake task:

rake encrypt_databag[supersecretpassword]

Found item: supersecretpassword. Encrypting
Encrypted data is <some ugly string>
Uploading to Chef server
INFO: Updated data_bag_item[supersecretpassword_crypted.json]

You can test that the data was uploaded successfully:

knife data bag show passwords supersecretpassword

{
"data": "<some really ugly string>",
"id": "supersecretpassword"
}

Additionally, you should have in your 'data_bags/passwords' directory a new file called 'supersecretpassword_crypted.json'. The reason for keeping both files around is for key management. Should you need to change your passphrase/key, you'll need the original file around to re-encrypt with the new key. You can decide to remove the unencrypted file if you want as long as you have a way of recreating it.

Using the encrypted data

So now that we have a data bag item uploaded that we need to use, how do we get it on the client?
That will require two cookbooks:

The general idea is that, in any cookbook where you need decrypted data, you essentially do three things:
  • include the decryption recipe
    include_recipe "databag_decrypt::default"
  • assign the crypted data to a value via a databag search
    password = search(:passwords, "id:supersecretpassword").first
  • assign the decrypted data to a value for use in the rest of the recipe
    decrypted_password = item_decrypt(password[:data])
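For reference, item_decrypt isn't doing anything magical. Conceptually it just reads the key file we dropped on the node earlier and hands the string back to encrypted_strings. A stripped-down sketch (not the actual cookbook code):

def item_decrypt(crypted)
  require 'encrypted_strings'
  # the passphrase file we pushed out with knife ssh earlier
  passphrase = ::File.read('/tmp/.chef_decrypt.key').chomp
  crypted.decrypt(:symmetric, :password => passphrase)
end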

From there, it's no different than any other recipe. Here's an example of how I use it to securely store Amazon S3 credentials as databag items:

include_recipe "databag_decrypt::default"

# decrypt the S3 credentials stored in the passwords data bag
s3_access_key = item_decrypt(search(:passwords, "id:s3_access_key").first[:data])
s3_secret_key = item_decrypt(search(:passwords, "id:s3_secret_key").first[:data])

# fetch the erlang tarball from our private bucket using the decrypted keys
s3_file erlang_tar_gz do
  bucket "our-packages"
  object_name erlang_file_name
  aws_access_key_id s3_access_key
  aws_secret_access_key s3_secret_key
  checksum erl_checksum
end

Changing the key

Should you need to change the key, you'll need to jump through a few hoops:

  • Update the passphrase on each client. Ease depends on your method of key distribution
  • Update the passphrase in the rake task
  • Re-encrypt all your data bag items.
The last one can be a pain in the ass. Since Chef currently doesn't support multiple items in a data bag json file, I created a small helper script in my chef-repo called 'split-em.rb'.
I store all of my data bag items in large json files and use split-em.rb to break them into individual files. Those files I upload with knife:

bin/split-em.rb -f data_bags/passwords/passwords.json -d passwords -o

Parsing data for svnpass into file data_bags/passwords/svnpass.json
Parsing data for s3_access_key into file data_bags/passwords/s3_access_key.json
Parsing data for s3_secret_key into file data_bags/passwords/s3_secret_key.json
#Run the following command to load the split bags into the passwords in chef
for i in svnpass s3_access_key s3_secret_key; do knife data bag from file passwords $i.json; done

You could then run that through the rake task to reupload the encrypted data:

for i in svnpass s3_access_key s3_secret_key; do rake encrypt_databag[$i]; done

Limitations/Gotchas/Additional Tips

Take note of the following, please.

Key management

The current method of key management is somewhat cumbersome. Ideally, the passphrase should be moved outside of the rake task. Additionally, the rekey process should be made a distinct rake task. I imagine a workflow similar to this:

  • rake accepts a path to the encryption key
  • additional rake task to change the encryption key in the form of oldpassfile/newpassfile.
  • Existing data is decrypted using oldpassfile, reencrypted using new passfile and sent back to the chef server.

Optimally, the rake task would understand the same attributes that the decryption cookbook does so it can handle key management on the client for you. I'd also like to make the cipher selection configurable as well and integrate it into the above steps.

Duplicate work

Seth Falcon at Opscode is already in the process of adding official support for encrypted data bags to Chef. His method involves converting the entire databag sans "id" to YAML and encrypting it. I wholeheartedly support that effort but that would obviously require a universal upgrade to Chef as well. The purpose of my cookbook and tasks is to work with the existing version.

AWS IAM

If you're an Amazon EC2 user, you should start using IAM NOW. Stop putting your master credentials in to recipes and limit your risk. I've created a 'chef' user who I give limited access to certain AWS operations. You can see the policy file here. It gives the chef user read-only access to 'my_bucket' and 'my_other_bucket'.
If you wanted to get REALLY sneaky, you could use fake two-factor authentication to store your key in S3:

  • Encrypt data bag items with the "credential B" password except for one item, "s3_credentials"
  • s3_credentials (credential A) is encrypted with a passphrase and managed similar to this article
  • Use transient credentials to access S3 and grab a passphrase file (credential B)
  • Decrypt data with secondary credentials
You would have to heavily modify the cookbook to do this. I think the current implementation is fine.

File-based passphrases

I'm not a big fan of the file-based passphrase method. While we agreed that you should consider yourself screwed if someone gets on the box, that still leaves poorly coded applications running as an attack vector. Imagine you have an application that must run as root. Now it can read the passphrase. Should that application become remotely exploitable, the passphrase file is vulnerable. I'm leaning toward the method of a private server that allows RESTful access to grab the key. I've already added support in the cookbook for a passphrase type of 'url'.

Wrapup

I think that covers everything. I'd love some feedback on what people think. We've already implemented this in a limited scope for using IAM credentials in our cookbooks. I can easily revoke those should they get compromised without having to generate all new master keys.

Wednesday, December 15, 2010

Chef and encrypted data bags.

As part of rolling out Chef at the new gig, we had a choice - stand up our own Chef server and maintain it or use the Opscode platform. From a cost perspective, the 50 node platform cost was pretty much break even with standing up another EC2 instance of our own. The upshot was that I didn't have to maintain it.

 

However, part of due diligence was making sure everything was covered from a security perspective. We use quite a few hosted/SaaS tools but this one had the biggest possible security risk. The biggest concern is dealing with sensitive data such as database passwords and AWS credentials. The Opscode platform as a whole is secure. It makes heavy use of SSL not only for transport layer encryption but also for authentication and authorization. That wasn't a concern. What was a concern was what should happen if a copy of our CouchDB database fell into the wrong hands or a "site reliability engineer" situation happened. That's where the concept of "encrypted data bags" came from for me.

 

Atlanta Chef Hack Day

I had the awesome opportunity to stop by the Atlanta Chef Hack day this past weekend. I couldn't stay long and came in fairly late in the afternoon. However I happened to come in right at the time that @botchagalupe (John Willis) and @schisamo (Seth Chisamore) brought up encrypted data bags. Of course, Willis proceeded to turn around and put me on the spot. After explaining the above use case, we all threw out some ideas but I think everyone came to the conclusion that it's a tough nut to crack with a shitload of gotchas.

 

Before I left, I got a chance to talk with @sfalcon (Seth Falcon) about his ideas. While he totally understood the use cases and mentioned that other people had asked about it as well, he had a few ideas but nothing that stood out as the best way.

 

So what are the options? I'm going to list a few here but I wanted to discuss a little bit about the security domain we're dealing with and what inherent holes exist.

 

Reality Checks

  • Nothing is totally secure.

          Deal with it. Even though it's a remote chance in hell, your keys and/or data are going to be decrypted somewhere at some point in time. The type of information we need to read, unfortunately, can't use a one-way hashing algorithm like MD5 or SHA because we NEED to know what the data actually is. I need that MySQL password to provide to my application server to talk to the database. That means it has to be decrypted, and during that process and during usage of that data, it's going to exist in a place where it can be snagged.

  • You don't need to encrypt everything

          You need to understand what exactly needs to be encrypted and why. Yes, there's the "200k winter coats to troops" scenario and every bit of information you expose provides additional material for an attack vector but really think about what you need to encrypt. Application database account usernames? Probably not. The passwords for those accounts? Yes. Consider the "value" of the data you're considering encrypting.

  • Don't forget the "human" factor

          So you've got this amazing library worked out, added it to your cookbooks and you're only encrypting what you need to really encrypt. Then some idiot puts the decryption key on the wiki or the master password is 5 alphabetical characters. As we often said when I was a kid, "Smooth move, exlax"

  • There might be another way

          There might be another way to approach the issue. Make sure you've looked at all the options.

 

Our Use Case

So understanding that, we can narrow down our focus a bit. Let's use the use case of our application's database password because it's a simple enough case. It's a single string.

 

Now in a perfect world, Opscode would encrypt each CouchDB database with customer specific credentials (like say an organizational level client cert) and discard the credentials once you've downloaded them.

 

That's our first gotcha - What happens when the customer loses the key? All that data is now lost to the world. 

 

But let's assume you were smart and kept a backup copy of the key in a secure location. There's another gotcha inherent in the platform itself - Chef Solr. If that entire database is encrypted, unless Opscode HAS the key, they can't index the data with Solr, and all those handy searches you're using in your recipes to pull in all your users are gone. Now you'll have to manage the map/reduce views yourself and deal with the performance impact where you don't have one of those views in place.

 

So that option is out. The Chef server has to be able to see the data to actually work.

 

What about a master key? That has several problems.

 

You have to store the key somewhere accessible to the client (i.e. the client chef.rb or in an external file that your recipes can read to decrypt those data bag items).

  • How do you distribute the master key to the clients?
  • How do you revoke the master key to the clients and how does that affect future runs? See the previous line - how do you then distribute the updated key?

 

I'm sure someone just said "I'll put it in a data bag" and then promptly smacked themselves in the head. Chicken - meet Egg. Or is it the other way around?

 

You could have the Chef client ASK you for the key (remember apache SSL startups where the startup script required a password? Yeah, that sucked).

 

 

Going the Master Key Route

So let's assume that we want to go this route and use a master key. We know we can't store it with Opscode because that defeats the purpose. We need a way to distribute the master key to the clients so they can decrypt the data, so how do we do it?

 

If you're using Amazon, you might say "I'll store it in S3 or on an EBS volume". That's great! Where do you store the AWS credentials? "In a data ba...oh wait. I've seen this movie before, haven't I?"

 

So we've come to the conclusion that we must store the master key somewhere ourselves, locally available to the client. Depending on your platform, you have a few options:

  • Make it part of the base AMI
  • Make it part of your kickstart script
  • Make it part of your vmware image

 

All of those are acceptable but they don't deal with updating/revocation. Creating new AMIs is a pain in the ass and you have to update all your scripts with new AMI ids when you do that. Golden images are never golden. Do you really want to rekick a box just to update the key?

 

Now we realize we have to make it dynamic. You could make it a part of a startup script in the AMI, first boot of the image or the like. Essentially, "when you startup, go here and grab this key". Of course now you've got to maintain a server to distribute the information and you probably want two of them just to be safe, right? Now we're spreading our key around again.

 

This is starting to look like an antipattern.

 

But let's just say we got ALL of that worked out. We have a simple easy way for clients to get and maintain the key. It works and your data is stored "securely" and you feel comfortable with it.

 

Then your master key gets compromised. No problem, you think. I'll just use my handy update mechanism to update the keys on all the clients and...shit...now I've got to re-encrypt EVERYTHING and re-upload my data bags. Where the hell is the plaintext of those passwords again? This is getting complicated, no?

 

So what's the answer? Is there one? Obviously, if you were that hypersensitive to the security implications you'd just run your own server anyway. You still have the human factor and backups can still be stolen but that's an issue outside of Chef as a tool. You just move the security up the stack a bit. You've got to secure the Chef server itself. But can you still use the Opscode platform? I think so. With careful deliberation and structure, you can reach a happy point that allows you to still automate your infrastructure with Chef (or some other tool) and host the data off-site.

 

Some options

Certmaster

 

Certmaster spun out of the Func project. It's essentially an SSL certificate server at the base. It's another thing you have to manage but it can handle all the revocation and distribution issues.

Riak

 

This is one idea I came up with tonight. The idea is that you run a very small Riak instance on all the nodes that require the ability to decrypt the data. Every node is a part of the same cluster and this can all be easily managed with Chef. It will probably have a single bucket containing the master key. You get the fault tolerance built in and you can pull the keys as part of your recipe using basic Chef resources. Resource utilization on the box should be VERY low for the erlang processes. You'll have a bit more network chatter as the intra-cluster gossip goes on though. Revocation is still an issue but that's VERY easily managed since it's a simple HTTP put to update. And while the data is easily accessible to anyone who can get access to the box, you should consider yourself "proper f'cked" if that happens anyway.
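Pulling the key out in a recipe would then just be an HTTP GET against the local node. Something like this, though the bucket and key names are made up and the URL layout depends on your Riak version:

require 'net/http'

# grab the master key from the local Riak node that's part of the shared cluster
master_key = Net::HTTP.get(URI.parse('http://localhost:8098/riak/secrets/master_key'))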

 

But you still have the issue of re-encrypting the databags should that need to happen. My best suggestion is to store the encrypted values in a single data bag and add a rake task that does the encryption/revocation for you. Then you minimize the impact of something that simply should not need to happen that often.

 

Another option is to still use Riak but store the credentials themselves (as opposed to a decryption key) and pull them in when the client runs. The concern I have there is how that affects idempotence: would it cause the recipe to run every single time just because it can't checksum properly? You can probably get around this with a file on the filesystem telling Chef to skip the update using "not_if".
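That guard might look something like this (the resource, script and marker file names are purely illustrative):

# pull the credentials from Riak and rewrite the JNDI string only once;
# the script is expected to touch the marker file when it succeeds
execute 'configure-db-credentials' do
  command '/usr/local/bin/update_jndi_from_riak.sh'
  not_if { ::File.exist?('/etc/appserver/.db_credentials_configured') }
end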

 

Wrap Up

 

As you can see, there's no silver bullet here. Right now I have two needs, storing credentials for S3/EBS access and storing database passwords. That's it. We don't use passwords for user accounts at all. You can't even use password authentication with SSH on our servers. If I don't have your pubkey in the users data bag, you can't log in.  

 

The AWS credentials are slowly becoming less of an issue. With the Identity Access beta product, I can create limited use keys that can only do certain things and grant them access to specific AWS products. I can make it a part of node creation to generate that access programmatically. That means I still have the database credentials issue though. For that, I'm thinking that the startup script for an appserver, for instance, will just have to pull the credentials from Riak (or whatever central location you choose) and update a JNDI string. It spreads your configuration data out a bit, but these things shouldn't need to change too often and with a properly documented process you know exactly how to update it.

 

One downside to all of this is that it begins to break down the ability to FULLY automate everything. I don't like running the knife command to do things. I want to be able to programmatically run the same thing that knife does from my own scripts. I suppose I could simply popen and run the knife commands, but shelling out always feels like an anti-pattern to me.

 

I'd love some feedback on how other people are addressing the same issues!

 

Thursday, December 2, 2010

Automating EBS Snapshot validation with @fog - Part 2

This is part 2 in a series of posts I'm doing - You can read part 1 here

Getting started

I'm not going to go into too much detail on how to get started with Fog. There's plenty of documentation on the github repo (protip: read the test cases) and Wesley a.k.a @geemus has done some awesome screencasts. I'm going to assume at this point that you've at least got Fog installed, have an AWS account set up and have Fog talking to it. The best way to verify is to create your .fog yaml file, start the fog command line tool and start looking at some of the collections available to you.

For the purpose of this series of posts, I've actually created a small script that you can use to spin up two ec2 instances (m1.small) running CentOS 5.5, create four (4) 5GB EBS volumes and attach them to the first instance. In addition to the fog gem, I also have awesome_print installed and use it in place of prettyprint. This is, of course, optional but you should be aware.

WARNING: The stuff I'm about to show you will cost you money. I tried to stick to minimal resource usage but please be aware you need to clean up after yourself. If, at any time, you feel like you can't follow along with the code or something isn't working - terminate your instances/volumes/resources using the control panel or command-line tools. PLEASE DO NOT JUST SIMPLY RUN THESE SCRIPTS WITHOUT UNDERSTANDING THEM.

The setup script

The full setup script is available as gist on github - https://gist.github.com/724912#file_fog_ebs_demo_setup.rb

Things to note:

  • Change the key_name to a valid key pair you have registered with EC2
  • There's a stopping point halfway down after the EBS volumes are created. You should actually stop there and read the comments.
  • You can run everything inside of an irb session if you like.

The first part of the setup script does some basic work for you - it reads in your fog configuration file (~/.fog) and creates an object you can work with (AWS). As I mentioned earlier, we're creating two servers - hdb and tdb. HDB is the master server - say your production MySQL database. TDB is the box which will be running as the validation of the snapshots.
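The relevant chunk of the setup script looks more or less like this (the AMI id and key pair name below are placeholders, and fog's constructor options have shifted a bit between versions, so treat it as a sketch rather than gospel):

require 'fog'

# Compute connection to AWS - credentials come from ~/.fog
AWS = Fog::Compute.new(:provider => 'AWS')

# hdb is the "production" database box, tdb is the throwaway validation box
hdb = AWS.servers.create(:image_id => 'ami-xxxxxxxx', :key_name => 'your-keypair')
tdb = AWS.servers.create(:image_id => 'ami-xxxxxxxx', :key_name => 'your-keypair')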

In the Fog world, there are two big concepts - models and collections. Regardless of cloud provider, there are typically at least two models available - Compute and Storage. Collections are data objects under a given model. For instance in the AWS world, you might have under the Compute model - servers, volumes, snapshots or addresses. One thing that's nice about Fog is that, once you establish your connection to your given cloud, most of your interactions are the same across cloud providers. In the example above, I've created a connection with Amazon using my credentials and have used that Compute connection to create two new servers - hdb and tdb. Notice the options I pass in when I instantiate those servers.

  • image_id
  • key_name

If I wanted to make these boxes bigger, I might also pass in 'flavor_id'. If you're running the above code in an irb session, you might see something like the following when you instantiate those servers: Not all of the fields may be available depending on how long it takes Amazon to spin up the instance. The above shot is after the instance was up and running. For instance, when you first created 'tdb', you'll probably see "state" as pending for quite some time. Fog has a nice helper method for all models called 'wait_for'. In my case I could do:

tdb.wait_for { print "."; ready?}

And it would print dots across the screen until the instance is ready for me to log in. At the end, it will tell you the amount of time you spent waiting. Very handy. You have direct access to all of the attributes above via the instance 'tdb' or 'hdb'. You can use 'tdb.dns_name' to get the dns name for use in other parts of your script for example. In my case, after the server 'hdb' is up and running, I now want to create the four 5GB EBS volumes and attach them to the instance:
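Again, roughly (the device names match the setup script, and the attach-before-save trick is explained below):

# create four 5GB volumes in the same AZ as hdb and attach them as sdi-sdl
%w[sdi sdj sdk sdl].each do |dev|
  volume = AWS.volumes.new(
    :availability_zone => hdb.availability_zone,
    :size => 5,
    :device => "/dev/#{dev}"
  )
  # binding the volume to the server before saving attaches it on creation
  volume.server = hdb
  volume.save
end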

I've provided four device names (sdi through sdl) and I'm using the "volumes" collection to create them (AWS.volumes.new). As I mentioned earlier, all of the attributes for 'hdb' and 'tdb' are accessible by name. In this case, I have to create my volumes in the same availability zone as the hdb instance. Since I didn't specify where to create it when I started it, Amazon has graciously chosen 'us-east-1d' for me. As you can see, I can easily access that as 'hdb.availability_zone' and pass it to the volume creation section. I've also specified that the volume should be 5GB in size.

At the point where I've created the volume with '.new' it hasn't actually been created. I want to bind it to a server first so I simply set the volume.server attribute equal to my server object. Then I 'save' it. If I were to log into my running instance, I'd probably see something like this in the 'dmesg' output now:

sdj: unknown partition table

sdk: unknown partition table

sdl: unknown partition table

sdi: unknown partition table

As you can see from the comments in the full file, you should stop at this point and set up the volumes on your instance. In my case, I used mdadm and created a RAID0 array using those four volumes. I then formatted them, made a directory and mounted the md0 device to that directory. If you look, you should now have an additional 20GB of free space mounted on /data. Here I might make this the data directory for mysql (which is the case in our production environment). Let's just pretend you've done all that. I simulated it with a few text files and a quick 1GB dd. We'll consider that the point-in-time that we want to snapshot from. Since there's no actual constant data stream going to the volumes, I can assume for this exercise that we've just locked mysql, flushed everything and frozen the XFS filesystem. Let's make our snapshots. In this case I'm going to be using Fog to do the snapshots but in our real environment we're using the ec2-consistent-snapshot script from Alestic. First let's take a look at the state of the hdb object:

Notice that the 'block_device_mapping' attribute now consist of an array of hashes. Each hash is a subset of the data about the volume attached to it. If you aren't seeing this, you might have to run 'hdb.reload' to refresh the state of the object. To create our snapshots, we're going to iterate over the block_device_mapping attribute and use the 'snapshots' collection to make those snapshots:
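In code, that looks roughly like this. The 'volumeId' and 'deviceName' hash keys are what the EC2 API hands back and may differ slightly between fog versions, and the description is just illustrative:

snapshots = []
hdb.block_device_mapping.each do |device|
  # one snapshot per attached EBS volume
  snapshot = AWS.snapshots.new
  snapshot.volume_id = device['volumeId']
  snapshot.description = "hdb #{device['deviceName']}"
  snapshot.save
  snapshots << snapshot
end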

One thing you'll notice is that I'm being fairly explicit here. I could shorthand and chain many of these method calls but for clarity, I'm not.

And now we have 4 snapshots available to us. The process is fairly instant but sometimes it can lag. As always, you should check the status via the .state attribute of an object to verify that it's ready for the next step. Here's a shot of our snapshots right now:

That's the end of Part 2. In the next part, we'll have a full fledged script that does the work of making the snapshots usable on the 'tdb' instance.

Automating EBS Snapshot validation with @fog - Part 1

Background

One thing that's very exciting about the new company is that I'm getting to use quite a bit of Ruby and also the fact that we're entirely hosted on Amazon Web Services. We currently leverage EBS, ELB, EC2, S3 and CloudFront for our environment. The last time I used AWS in a professional setting, they didn't even have Elastic IPs, much less EBS with snapshots and all the nice stuff that makes it viable for a production environment. I did, however, manage to keep abreast of changes using my own personal AWS account.

Fog

Of course the combination of Ruby and AWS really means one thing - Fog. And lots of it.

When EngineYard announced the sponsorship of the project, I dove headlong into the code base and spent what time I could trying to contribute code back. The half-assed GoGrid code in there right now? Sadly, some of it is mine. Time is hard to come by these days. Regardless, I'm no stranger to Fog, and when I had to dive into the environment and start getting it documented and automated, Fog was the first tool I pulled out. When the challenge of verifying our EBS snapshots (of which we currently have a little over 700) came up, I had no choice but to automate it.

Environment

A little bit about the environment:

  • A total of 9 EBS volumes are snapshotted each day
  • 8 of the EBS volumes are actually raid0 mysql data stores across two DB servers (so 4 disks on one/4 disks on another)
  • The remaining EBS volume is a single mysql data volume
  • Filesystem is XFS and backups are done using the Alestic ec2-consistent-snapshot script (which currently doesn't support tags)

The end result of this is to establish a rolling set of validated snapshots. 7 daily, 3 weekly, 2 monthly. Fun!

Mapping It Out

Here was the attack plan I came up with:

  • Identify snapshots and groupings where appropriate (raid0, remember?)
  • create volumes from snapshots
  • create an m1.xlarge EC2 instance to test the snapshots
  • attach volume groups to the test instance
  • assemble the array on the test instance
  • start MySQL using the snapshotted data directory
  • run some validation queries using some timestamp columns in our schema
  • stop MySQL, unmount volume, stop the array
  • detach and destroy the volumes from the test instance
  • tag the snapshots as "verified"
  • roll off any old snapshots based on retention policy
  • automate all of the above!

I've got lots of code samples and screenshots so I'm breaking this up into multiple posts. Hopefully part 2 will be up some time tomorrow.