As part of rolling out Chef at the new gig, we had a choice - stand up our own Chef server and maintain it or use the Opscode platform. From a cost perspective, the 50 node platform cost was pretty much break even with standing up another EC2 instance of our own. The upshot was that I didn't have to maintain it.
However, part of due diligence was making sure everything was covered from a security perspective. We use quite a few hosted/SaaS tools but this one had the biggest possible security risk. The biggest concern is dealing with sensitive data such as database passwords and AWS credentials. The Opscode platform as a whole is secure. It makes heavy use of SSL not only for transport layer encryption but also for authentication and authorization. That wasn't a concern. What was a concern was what should happen if a copy of our CouchDB database fell into the wrong hands or a "site reliability engineer" situation happened. That's where the concept of "encrypted data bags" came from for me.
Atlanta Chef Hack Day
I had the awesome opportunity to stop by the Atlanta Chef Hack day this past weekend. I couldn't stay long and came in fairly late in the afternoon. However I happened to come in right at the time that @botchagalupe (John Willis) and @schisamo (Seth Chisamore) brought up encrypted data bags. Of course, Willis proceeded to turn around and put me on the spot. After explaining the above use case, we all threw out some ideas but I think everyone came to the conclusion that it's a tough nut to crack with a shitload of gotchas.
Before I left, I got a chance to talk with @sfalcon (Seth Falcon) about his ideas. While he totally understood the use cases and mentioned that other people had asked about it as well, he had a few ideas but nothing that stood out as the best way.
So what are the options? I'm going to list a few here but I wanted to discuss a little bit about the security domain we're dealing with and what inherent holes exist.
Reality Checks
- Nothing is totally secure.
Deal with it. Even though it's a remote chance in hell, your keys and/or data are going to be decrypted somewhere at some point in time. The type of information we need to read, unfortunately, can't use a one-way encryption algo like MD5 or SHA because we NEED to know what the data actually is. I need that MySQL password to provide to my application server to talk to the database. That means it has to be decrypted and during that process and during usage of that data, it's going to exist in a possible place that it can be snagged.
- You don't need to encrypt everything
You need to understand what exactly needs to be encrypted and why. Yes, there's the "200k winter coats to troops" scenario and every bit of information you expose provides additional material for an attack vector but really think about what you need to encrypt. Application database account usernames? Probably not. The passwords for those accounts? Yes. Consider the "value" of the data you're considering encrypting.
- Don't forget the "human" factor
So you've got this amazing library worked out, added it to your cookbooks and you're only encrypting what you need to really encrypt. Then some idiot puts the decryption key on the wiki or the master password is 5 alphabetical characters. As we often said when I was a kid, "Smooth move, exlax"
- There might be another way
There might be another way to approach the issue. Make sure you've looked at all the options.
Our Use Case
So understanding that, we can narrow down our focus a bit. Let's use the use case of our applications database password because it's a simple enough case. It's a single string.
Now in a perfect world, Opscode would encrypt each CouchDB database with customer specific credentials (like say an organizational level client cert) and discards the credentials once you've downloaded them.
That's our first gotcha - What happens when the customer loses the key? All that data is now lost to the world.
But let's assume you were smart and kept a backup copy of the key in a secure location. There's another gotcha inherent in the platform itself - Chef Solr. If that entire database is encrypted, unless Opscode HAS the key, they can't index the data with Solr and all those handy searches you're using in your recipes to pull in all your users is gone. Now you'll have to manage the map/reduce views yourself and deal with the performance impact where you don't have one of those views in place.
So that option is out. The Chef server has to be able to see the data to actually work.
What about a master key? That has several problems.
You have to store the key somewhere accessible to the client (i.e. the client chef.rb or in an external file that your recipes can read to decrypt those data bag items).
- How do you distribute the master key to the clients?
- How do you revoke the master key to the clients and how does that affect future runs? See the previous line - how do you then distribute the updated key?
I'm sure someone just said "I'll put it in a data bag" and then promptly smacked themselves in the head. Chicken - meet Egg. Or is it the other way around?
You could have the Chef client ASK you for the key (remember apache SSL startups where the startup script required a password? Yeah, that sucked.
Going the Master Key Route
So let's assume that we want to go this route and use a master key. We know we can't store in with Opscode because that defeats the purpose. We need a way to distribute the master key to the clients so they can decrypt the data so how do we do it?
If you're using Amazon, you might say "I'll store it in S3 or on an EBS volume". That's great! Where do you store the AWS credentials? "In a data ba...oh wait. I've seen this movie before, haven't I?"
So we've come to the conclusion that we must store the master key somewhere ourselves locally available to the client. Depending on your platforming, you have a few options:
- Make it part of the base AMI
- Make it part of your kickstart script
- Make it part of your vmware image
All of those are acceptable but they don't deal with updating/revocation. Creating new AMIs is a pain in the ass and you have to update all your scripts with new AMI ids when you do that. Golden images are never golden. Do you really want to rekick a box just to update the key?
Now we realize we have to make it dynamic. You could make it a part of a startup script in the AMI, first boot of the image or the like. Essentially, "when you startup, go here and grab this key". Of course now you've got to maintain a server to distribute the information and you probably want two of them just to be safe, right? Now we're spreading our key around again.
This is starting to look like an antipattern.
But let's just say we got ALL of that worked out. We have a simple easy way for clients to get and maintain the key. It works and your data is stored "securely" and you feel comfortable with it.
Then your master key gets compromised. No problem, you think. I'll just use my handy update mechanism to update the keys on all the clients and...shit...now I've got to re-encrypt EVERYTHING and re-upload my data bags. Where the hell is the plaintext of those passwords again? This is getting complicated, no?
So what's the answer? Is there one? Obviously, if you were that hypersensitive to the security implications you'd just run your own server anyway. You still have the human factor and backups can still be stolen but that's an issue outside of Chef as a tool. You just move the security up the stack a bit. You've got to secure the Chef server itself. But can you still use the Opscode platform? I think so. With careful deliberation and structure, you can reach a happy point that allows you to still automate your infrastructure with Chef (or some other tool) and host the data off-site.
Some options
Certmaster
Certmaster spun out of the Func project. It's essentially an SSL certificate server at the base. It's another thing you have to manage but it can handle all the revocation and distribution issues.
Riak
This is one idea I came up with tonight. The idea is that you run a very small Riak instance on all the nodes that require the ability to decrypt the data. Every node is a part of the same cluster and this can all be easily managed with Chef. It will probably have a single bucket containing the master key. You get the fault tolerance built in and you can pull the keys as part of your recipe using basic Chef resources. Resource utilization on the box should be VERY low for the erlang processes. You'll have a bit more network chatter as the intra-cluster gossip goes on though. Revocation is still an issue but that's VERY easily managed since it's a simple HTTP put to update. And while the data is easily accessible to anyone who can get access to the box, you should consider yourself "proper f'cked" if that happens anyway.
But you still have the issue of re-encrypting the databags should that need to happen. My best suggestion is to store the encrypted values in a single data bag and add a rake task that does the encryption/revocation for you. Then you minimize the impact of something that simply should not need to happen that often.
Another option is to still use Riak but store the credentials themselves (as opposed to a decryption key) and pull them in when the client runs. The concern I have there is how that affects idempotence and would it cause the recipe to be run every single time just because it can't checksum properly? You probably get around this with a file on the filesystem telling Chef to skip the update using "not_if".
Wrap Up
As you can see, there's no silver bullet here. Right now I have two needs, storing credentials for S3/EBS access and storing database passwords. That's it. We don't use passwords for user accounts at all. You can't even use password authentication with SSH on our servers. If I don't have your pubkey in the users data bag, you can't log in.
The AWS credentials are slowly becoming less of an issue. With the Identity Access beta product, I can create limited use keys that can only do certain things and grant them access to specific AWS products. I can make it a part of node creation to generate that access programatically. That means I still have the database credentials issue though. For that, I'm thinking that the startup script for an appserver, for instance, will just have to pull the credentials from Riak (or whatever central location you choose) and update a JNDI string. It spreads your configuration data out a bit but these things shouldn't need to change to often and with proper documented process you know exactly how to update it.
One thing that this whole thing causes is that it begins to break down the ability to FULLY automate everything. I don't like running the knife command to do things. I want to be able to programatically run the same thing that Knife does from my own scripts. I suppose I could simply popen and run the knife commands but shelling out always feels like an anti-pattern to me.
I'd love some feedback on how other people are addressing the same issues!