Showing posts with label system administration.

Sunday, June 5, 2011

Why Monitoring Sucks

Why Monitoring Sucks (and what we're doing about it)

About two weeks ago someone made a tweet. At this point, I don't remember who said it but the gist was that "monitoring sucks". I happened to be knee-deep in frustrating bullshit around that topic and was currently evaluating the same effing tools I'd evaluated at every other company over the past 10 years or so. So I did what seems to be S.O.P for me these days. I started something.

But does monitoring REALLY suck?

Heck no! Monitoring is AWESOME. Metrics are AWESOME. I love it. Here's what I don't love:

  • Having my hands tied with the model of host and service bindings
  • Having to set up "fake" hosts just to group arbitrary metrics together
  • Having to collect metrics twice - once for alerting and again for trending
  • Only being able to see my metrics in 5 minute intervals
  • Having to choose between a shitty interface but great monitoring, or shitty monitoring but a great interface
  • Dealing with a monitoring system that thinks IT is the system of truth for my environment
  • Perl (I kid...sort of)
  • Not actually having any real choices

Yes, yes I know:

You can just combine Nagios + collectd + graphite + cacti + pnp4nagios and you have everything you need!

Seriously? Kiss my ass. I'm a huge fan of the Unix pipeline philosophy but, christ, have you ever heard the phrase "antipattern"?

So what the hell are you going to do about it?

I'm going to let smart people be smart and do smart things.

Step one was getting everyone who had similar complaints together on IRC. That went pretty damn well. Step two was creating a github repo. Seriously. Step two should ALWAYS be "create a github repo". Step three? Hell if I know.

Here's what I do know. There are plenty of frustrated system administrators, developers, engineers, "devops" and everything under the sun who don't want much. All they really want is for shit to work. When shit breaks, they want to be notified. They want pretty graphs. They want to see business metrics alongside operational ones. They want to have a 52-inch monitor in the office that everyone can look at and say:

See that red dot? That's bad. Here's what was going on when we got that red dot. Let's fix that shit and go get beers

About the "repo"

So the plan I have in place for the repository is this. We don't really need code. What we need is an easy way for people to contribute ideas. That plan is partially underway. There's now a monitoringsucks organization on Github. Pretty much anyone who is willing to contribute can get added to the team. The idea is that, as smart people think of smart shit, we can create a new repository under some unifying idea and put blog posts, submodules, reviews, ideas...whatever into that repository so people have an easy place to go get information. I'd like to assign someone per repository to be the owner. We're all busy but this is something we're all highly interested in. If we spread the work out and allow easy contribution, then we can get some real content up there.

I also want to keep the repos as light and cacheable as possible. The organization is under the github "free" plan right now and I'd like to keep it that way.

Blog Posts Repo

This repo serves as a place to collect general information about blog posts people come across. Think of it as hyper-local delicious in a DVCS.

Currently, by virtue of the first commit, Michael Conigliaro is the "owner". You can follow him on twitter and github as @mconigliaro

IRC Logs Repo

This repo is a log of any "scheduled" irc sessions. Personally, I don't think we need a distinct #monitoringsucks channel but people want to keep it around. The logs in this repo are not full logs. Just those from when someone says "Hey smart people. Let's think of smart shit at this date/time" on twitter.

Currently I appear to be the owner of this repo. I would love for someone who can actually make the logs look good to take this over.

Tools Repo

This repo is really more of a "curation" repo. The plan is that each directory is the name of some tool with two things in it:

  • A README.md as a review of the tool
  • A submodule link to the tool's repo (where appropriate)

Again, I think I'm running point on this one. Please note that the submodule links APPEAR to have some sort of UI issue on github. Every submodule appears to point to Dan DeLeo's 'critical' project.

Metrics Catalog Repo

This is our latest member and it already has an official manager! Jason Dixon (@obfuscurity on github/twitter - jdixon on irc) suggested it so he gets to run it ;) The idea here is that this will serve as a set of best practices around what metrics you might want to collect and why. I'm leaving the organization up to Jason but I suggested a per-app/service/protocol directory.

Wrap Up

So that's where we are. Where it goes, I have no idea. I just want to help wherever I can. If you have any ideas, hit me up on twitter/irc/github/email and let me know. It might help to know that if you suggest something, you'll probably be made the person responsible for it ;)

Update!

It was our good friend Sean Porter (@portertech on twitter) whom we have to thank for all of this ;)


Update (again)

It was kindly pointed out that I never actually included a link to the repositories. Here they are:

https://github.com/monitoringsucks

Wednesday, December 15, 2010

Chef and encrypted data bags.

As part of rolling out Chef at the new gig, we had a choice - stand up our own Chef server and maintain it or use the Opscode platform. From a cost perspective, the 50 node platform cost was pretty much break even with standing up another EC2 instance of our own. The upshot was that I didn't have to maintain it.

 

However, part of due diligence was making sure everything was covered from a security perspective. We use quite a few hosted/SaaS tools but this one had the biggest possible security risk. The biggest concern is dealing with sensitive data such as database passwords and AWS credentials. The Opscode platform as a whole is secure. It makes heavy use of SSL not only for transport layer encryption but also for authentication and authorization. That wasn't a concern. What was a concern was what should happen if a copy of our CouchDB database fell into the wrong hands or a "site reliability engineer" situation happened. That's where the concept of "encrypted data bags" came from for me.

 

Atlanta Chef Hack Day

I had the awesome opportunity to stop by the Atlanta Chef Hack day this past weekend. I couldn't stay long and came in fairly late in the afternoon. However I happened to come in right at the time that @botchagalupe (John Willis) and @schisamo (Seth Chisamore) brought up encrypted data bags. Of course, Willis proceeded to turn around and put me on the spot. After explaining the above use case, we all threw out some ideas but I think everyone came to the conclusion that it's a tough nut to crack with a shitload of gotchas.

 

Before I left, I got a chance to talk with @sfalcon (Seth Falcon) about his ideas. While he totally understood the use cases and mentioned that other people had asked about it as well, he had a few ideas but nothing that stood out as the best way.

 

So what are the options? I'm going to list a few here but I wanted to discuss a little bit about the security domain we're dealing with and what inherent holes exist.

 

Reality Checks

  • Nothing is totally secure.

          Deal with it. Even though it's a remote chance in hell, your keys and/or data are going to be decrypted somewhere at some point in time. The type of information we need to read, unfortunately, can't be protected with a one-way hash like MD5 or SHA because we NEED to know what the data actually is. I need that MySQL password to provide to my application server to talk to the database. That means it has to be decrypted, and during that process and during usage of that data, it's going to exist somewhere it can be snagged.

  • You don't need to encrypt everything

          You need to understand what exactly needs to be encrypted and why. Yes, there's the "200k winter coats to troops" scenario and every bit of information you expose provides additional material for an attack vector but really think about what you need to encrypt. Application database account usernames? Probably not. The passwords for those accounts? Yes. Consider the "value" of the data you're considering encrypting.

  • Don't forget the "human" factor

          So you've got this amazing library worked out, added it to your cookbooks and you're only encrypting what you need to really encrypt. Then some idiot puts the decryption key on the wiki or the master password is 5 alphabetical characters. As we often said when I was a kid, "Smooth move, exlax"

  • There might be another way

          There might be another way to approach the issue. Make sure you've looked at all the options.

 

Our Use Case

So understanding that, we can narrow down our focus a bit. Let's use the use case of our application's database password because it's a simple enough case. It's a single string.

 

Now in a perfect world, Opscode would encrypt each CouchDB database with customer-specific credentials (like, say, an organizational level client cert) and discard the credentials once you've downloaded them.

 

That's our first gotcha - What happens when the customer loses the key? All that data is now lost to the world. 

 

But let's assume you were smart and kept a backup copy of the key in a secure location. There's another gotcha inherent in the platform itself - Chef Solr. If that entire database is encrypted, unless Opscode HAS the key, they can't index the data with Solr, and all those handy searches you're using in your recipes to pull in all your users are gone. Now you'll have to manage the map/reduce views yourself and deal with the performance impact anywhere you don't have one of those views in place.

 

So that option is out. The Chef server has to be able to see the data to actually work.

 

What about a master key? That has several problems.

 

  • You have to store the key somewhere accessible to the client (i.e. in the client's chef.rb or in an external file that your recipes can read to decrypt those data bag items).
  • How do you distribute the master key to the clients?
  • How do you revoke the master key from the clients, and how does that affect future runs? See the previous point - how do you then distribute the updated key?

 

I'm sure someone just said "I'll put it in a data bag" and then promptly smacked themselves in the head. Chicken - meet Egg. Or is it the other way around?

 

You could have the Chef client ASK you for the key (remember apache SSL startups where the startup script required a password?). Yeah, that sucked.

 

 

Going the Master Key Route

So let's assume that we want to go this route and use a master key. We know we can't store it with Opscode because that defeats the purpose. We need a way to distribute the master key to the clients so they can decrypt the data, so how do we do it?

 

If you're using Amazon, you might say "I'll store it in S3 or on an EBS volume". That's great! Where do you store the AWS credentials? "In a data ba...oh wait. I've seen this movie before, haven't I?"

 

So we've come to the conclusion that we must store the master key somewhere ourselves, locally available to the client. Depending on your platform, you have a few options:

  • Make it part of the base AMI
  • Make it part of your kickstart script
  • Make it part of your vmware image

 

All of those are acceptable but they don't deal with updating/revocation. Creating new AMIs is a pain in the ass and you have to update all your scripts with new AMI ids when you do that. Golden images are never golden. Do you really want to rekick a box just to update the key?

 

Now we realize we have to make it dynamic. You could make it a part of a startup script in the AMI, first boot of the image or the like. Essentially, "when you startup, go here and grab this key". Of course now you've got to maintain a server to distribute the information and you probably want two of them just to be safe, right? Now we're spreading our key around again.

 

This is starting to look like an antipattern.

 

But let's just say we got ALL of that worked out. We have a simple easy way for clients to get and maintain the key. It works and your data is stored "securely" and you feel comfortable with it.

 

Then your master key gets compromised. No problem, you think. I'll just use my handy update mechanism to update the keys on all the clients and...shit...now I've got to re-encrypt EVERYTHING and re-upload my data bags. Where the hell is the plaintext of those passwords again? This is getting complicated, no?

 

So what's the answer? Is there one? Obviously, if you were that hypersensitive to the security implications you'd just run your own server anyway. You still have the human factor and backups can still be stolen but that's an issue outside of Chef as a tool. You just move the security up the stack a bit. You've got to secure the Chef server itself. But can you still use the Opscode platform? I think so. With careful deliberation and structure, you can reach a happy point that allows you to still automate your infrastructure with Chef (or some other tool) and host the data off-site.

 

Some options

Certmaster

 

Certmaster spun out of the Func project. It's essentially an SSL certificate server at the base. It's another thing you have to manage but it can handle all the revocation and distribution issues.

Riak

 

This is one idea I came up with tonight. The idea is that you run a very small Riak instance on all the nodes that require the ability to decrypt the data. Every node is a part of the same cluster and this can all be easily managed with Chef. It will probably have a single bucket containing the master key. You get the fault tolerance built in and you can pull the keys as part of your recipe using basic Chef resources. Resource utilization on the box should be VERY low for the Erlang processes. You'll have a bit more network chatter as the intra-cluster gossip goes on though. Revocation is still an issue but that's VERY easily managed since it's a simple HTTP PUT to update. And while the data is easily accessible to anyone who can get access to the box, you should consider yourself "proper f'cked" if that happens anyway.
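
To make that concrete, pulling the key off the local Riak node and decrypting a data bag value in a recipe could look roughly like this. This is a sketch only - it assumes Riak's /riak/<bucket>/<key> HTTP interface, values encrypted offline with AES-256-CBC, and it skips IV handling to stay short. The bucket, item and file names are all hypothetical:

require 'net/http'
require 'openssl'
require 'base64'

# Grab the master key out of the local Riak node (one bucket, one key).
master_key = Net::HTTP.get(URI.parse('http://127.0.0.1:8098/riak/secrets/master_key'))

# The data bag item holds base64-encoded ciphertext, encrypted offline with that key.
encrypted = data_bag_item('passwords', 'mysql')['app_password']

cipher = OpenSSL::Cipher.new('aes-256-cbc')
cipher.decrypt
cipher.key = OpenSSL::Digest::SHA256.digest(master_key)
mysql_password = cipher.update(Base64.decode64(encrypted)) + cipher.final

template '/etc/myapp/database.yml' do
  source 'database.yml.erb'
  mode '0600'
  variables(:password => mysql_password)
end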

 

But you still have the issue of re-encrypting the databags should that need to happen. My best suggestion is to store the encrypted values in a single data bag and add a rake task that does the encryption/revocation for you. Then you minimize the impact of something that simply should not need to happen that often.
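
A bare-bones version of that rake task might look something like this - hypothetical file layout, nothing clever about key handling, it just walks a plaintext JSON file and spits out an encrypted copy of each value:

# Rakefile snippet: rake encrypt_bag KEY=/path/to/master.key
require 'json'
require 'base64'
require 'openssl'

desc 'Encrypt data_bags/secrets/plaintext.json into data_bags/secrets/secrets.json'
task :encrypt_bag do
  key   = OpenSSL::Digest::SHA256.digest(File.read(ENV.fetch('KEY')))
  input = JSON.parse(File.read('data_bags/secrets/plaintext.json'))

  encrypted = input.reject { |k, _| k == 'id' }.map do |name, value|
    cipher = OpenSSL::Cipher.new('aes-256-cbc')
    cipher.encrypt
    cipher.key = key
    [name, Base64.encode64(cipher.update(value.to_s) + cipher.final)]
  end

  output = { 'id' => 'secrets' }.merge(Hash[encrypted])
  File.open('data_bags/secrets/secrets.json', 'w') do |f|
    f.write(JSON.pretty_generate(output))
  end
end

Re-keying is then just a matter of changing the key file, re-running the task and re-uploading the item with knife data bag from file.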

 

Another option is to still use Riak but store the credentials themselves (as opposed to a decryption key) and pull them in when the client runs. The concern I have there is how that affects idempotence - would it cause the resource to run every single time just because it can't checksum properly? You could probably get around this with a file on the filesystem telling Chef to skip the update using "not_if".
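
Something like the following is what I have in mind - pull the credentials from the local Riak node and drop a guard file so "not_if" keeps it from re-running. Sketch only; the paths and bucket names are hypothetical:

# Pull the database credentials straight out of the local Riak node and write
# them to a file, but only on the first run - the guard file keeps this from
# firing again on every chef-client run.
ruby_block 'fetch_db_credentials' do
  block do
    require 'net/http'
    creds = Net::HTTP.get(URI.parse('http://127.0.0.1:8098/riak/credentials/mysql'))
    ::File.open('/etc/myapp/db_credentials', 'w', 0600) { |f| f.write(creds) }
    ::File.open('/etc/myapp/.credentials_fetched', 'w') { |f| f.write(Time.now.to_s) }
  end
  not_if { ::File.exist?('/etc/myapp/.credentials_fetched') }
end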

 

Wrap Up

 

As you can see, there's no silver bullet here. Right now I have two needs, storing credentials for S3/EBS access and storing database passwords. That's it. We don't use passwords for user accounts at all. You can't even use password authentication with SSH on our servers. If I don't have your pubkey in the users data bag, you can't log in.  
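
For the curious, the users setup is the usual data-bag-plus-search pattern - roughly this (a simplified sketch; the 'ssh_keys' attribute name is just the convention I'm assuming here):

# Build local accounts from the 'users' data bag and lay down their public keys.
search(:users, '*:*').each do |u|
  home_dir = "/home/#{u['id']}"

  user u['id'] do
    shell u['shell'] || '/bin/bash'
    home  home_dir
  end

  directory "#{home_dir}/.ssh" do
    owner     u['id']
    mode      '0700'
    recursive true
  end

  file "#{home_dir}/.ssh/authorized_keys" do
    owner   u['id']
    mode    '0600'
    content Array(u['ssh_keys']).join("\n")
  end
end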

 

The AWS credentials are slowly becoming less of an issue. With the Identity and Access Management (IAM) beta, I can create limited-use keys that can only do certain things and grant them access to specific AWS products. I can make it a part of node creation to generate that access programmatically. That means I still have the database credentials issue though. For that, I'm thinking that the startup script for an appserver, for instance, will just have to pull the credentials from Riak (or whatever central location you choose) and update a JNDI string. It spreads your configuration data out a bit but these things shouldn't need to change too often, and with a properly documented process you know exactly how to update it.

 

One side effect of all this is that it begins to break down the ability to FULLY automate everything. I don't like running the knife command to do things. I want to be able to programmatically run the same thing that Knife does from my own scripts. I suppose I could simply popen and run the knife commands but shelling out always feels like an anti-pattern to me.

 

I'd love some feedback on how other people are addressing the same issues!

 

Thursday, December 2, 2010

Automating EBS Snapshot validation with @fog - Part 2

This is part 2 in a series of posts I'm doing - You can read part 1 here

Getting started

I'm not going to go into too much detail on how to get started with Fog. There's plenty of documentation on the github repo (protip: read the test cases) and Wesley a.k.a @geemus has done some awesome screencasts. I'm going to assume at this point that you've at least got Fog installed, have an AWS account set up and have Fog talking to it. The best way to verify is to create your .fog yaml file, start the fog command line tool and start looking at some of the collections available to you.

For the purpose of this series of posts, I've actually created a small script that you can use to spin up two ec2 instances (m1.small) running CentOS 5.5, create four (4) 5GB EBS volumes and attach them to the first instance. In addition to the fog gem, I also have awesome_print installed and use it in place of prettyprint. This is, of course, optional but you should be aware.

WARNING: The stuff I'm about to show you will cost you money. I tried to stick to minimal resource usage but please be aware you need to clean up after yourself. If, at any time, you feel like you can't follow along with the code or something isn't working - terminate your instances/volumes/resources using the control panel or command-line tools. PLEASE DO NOT JUST SIMPLY RUN THESE SCRIPTS WITHOUT UNDERSTANDING THEM.

The setup script

The full setup script is available as gist on github - https://gist.github.com/724912#file_fog_ebs_demo_setup.rb

Things to note:

  • Change the key_name to a valid key pair you have registered with EC2
  • There's a stopping point halfway down after the EBS volumes are created. You should actually stop there and read the comments.
  • You can run everything inside of an irb session if you like.

The first part of the setup script does some basic work for you - it reads in your fog configuration file (~/.fog) and creates an object you can work with (AWS). As I mentioned earlier, we're creating two servers - hdb and tdb. HDB is the master server - say your production MySQL database. TDB is the box that will be used to validate the snapshots.
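
If you're not following along with the gist, the connection and server creation bit looks roughly like this. It's a sketch rather than the gist verbatim - swap in your own image_id and key_name:

require 'rubygems'
require 'yaml'
require 'fog'

# Read the default credentials out of ~/.fog
credentials = YAML.load_file(File.expand_path('~/.fog'))[:default]

AWS = Fog::Compute.new(
  :provider              => 'AWS',
  :aws_access_key_id     => credentials[:aws_access_key_id],
  :aws_secret_access_key => credentials[:aws_secret_access_key]
)

# hdb plays the "production" MySQL box, tdb is the snapshot validation box.
# Leaving out :flavor_id gets you the default m1.small.
hdb = AWS.servers.create(:image_id => 'ami-xxxxxxxx', :key_name => 'your-keypair')
tdb = AWS.servers.create(:image_id => 'ami-xxxxxxxx', :key_name => 'your-keypair')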

In the Fog world, there are two big concepts - models and collections. Regardless of cloud provider, there are typically at least two models available - Compute and Storage. Collections are data objects under a given model. For instance, in the AWS world, under the Compute model you might have servers, volumes, snapshots or addresses. One thing that's nice about Fog is that, once you establish your connection to your given cloud, most of your interactions are the same across cloud providers. In the example above, I've created a connection with Amazon using my credentials and have used that Compute connection to create two new servers - hdb and tdb. Notice the options I pass in when I instantiate those servers:

  • image_id
  • key_name

If I wanted to make these boxes bigger, I might also pass in 'flavor_id'. If you're running the above code in an irb session, you'll see the server's attributes dumped back at you when you instantiate those servers. Not all of the fields may be available depending on how long it takes Amazon to spin up the instance - when you first create 'tdb', you'll probably see "state" as pending for quite some time. Fog has a nice helper method on all models called 'wait_for'. In my case I could do:

tdb.wait_for { print "."; ready?}

And it would print dots across the screen until the instance is ready for me to log in. At the end, it will tell you the amount of time you spent waiting. Very handy. You have direct access to all of the attributes above via the instance 'tdb' or 'hdb'. You can use 'tdb.dns_name' to get the dns name for use in other parts of your script for example. In my case, after the server 'hdb' is up and running, I now want to create the four 5GB EBS volumes and attach them to the instance:
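
That chunk of the script looks something like this (again, a rough sketch - the real thing is in the gist):

# Four 5GB volumes in the same availability zone as hdb, attached as sdi..sdl
%w[/dev/sdi /dev/sdj /dev/sdk /dev/sdl].each do |device|
  volume = AWS.volumes.new(
    :device            => device,
    :size              => 5,
    :availability_zone => hdb.availability_zone
  )
  # Nothing exists on Amazon's side yet - bind it to the server, then save
  volume.server = hdb
  volume.save
end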

I've provided four device names (sdi through sdl) and I'm using the "volumes" collection to create them (AWS.volumes.new). As I mentioned earlier, all of the attributes for 'hdb' and 'tdb' are accessible by name. In this case, I have to create my volumes in the same availability zone as the hdb instance. Since I didn't specify where to create it when I started it, Amazon has graciously chosen 'us-east-1d' for me. As you can see, I can easily access that as 'hdb.availability_zone' and pass it to the volume creation section. I've also specified that the volume should be 5GB in size.

At the point where I've created the volume with '.new' it hasn't actually been created. I want to bind it to a server first so I simply set the volume.server attribute equal to my server object. Then I 'save' it. If I were to log into my running instance, I'd probably see something like this in the 'dmesg' output now:

sdj: unknown partition table
sdk: unknown partition table
sdl: unknown partition table
sdi: unknown partition table

As you can see from the comments in the full file, you should stop at this point and set up the volumes on your instance. In my case, I used mdadm and created a RAID0 array using those four volumes. I then formatted them, made a directory and mounted the md0 device to that directory. If you look, you should now have an additional 20GB of free space mounted on /data. Here I might make this the data directory for mysql (which is the case in our production environment).

Let's just pretend you've done all that. I simulated it with a few text files and a quick 1GB dd. We'll consider that the point-in-time that we want to snapshot from. Since there's no actual constant data stream going to the volumes, I can assume for this exercise that we've just locked mysql, flushed everything and frozen the XFS filesystem. Let's make our snapshots. In this case I'm going to be using Fog to do the snapshots but in our real environment we're using the ec2-consistent-snapshot script from Alestic. First let's take a look at the state of the hdb object:

Notice that the 'block_device_mapping' attribute now consists of an array of hashes. Each hash is a subset of the data about the volume attached to it. If you aren't seeing this, you might have to run 'hdb.reload' to refresh the state of the object. To create our snapshots, we're going to iterate over the block_device_mapping attribute and use the 'snapshots' collection to make those snapshots:
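
In code that iteration is short - something like this (a sketch; the mapping hashes come back with Amazon's camelCase keys, so double-check them against your own output):

# One snapshot per EBS volume currently attached to hdb
hdb.reload
snapshots = hdb.block_device_mapping.map do |mapping|
  AWS.snapshots.create(
    :volume_id   => mapping['volumeId'],
    :description => "hdb #{mapping['deviceName']} #{Time.now.utc}"
  )
end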

One thing you'll notice is that I'm being fairly explicit here. I could shorthand and chain many of these method calls but for clarity, I'm not.

And now we have 4 snapshots available to us. The process is fairly instant but sometimes it can lag. As always, you should check the status via the .state attribute of an object to verify that it's ready for the next step.

That's the end of Part 2. In the next part, we'll have a full fledged script that does the work of making the snapshots usable on the 'tdb' instance.

Tuesday, November 9, 2010

Fix it or Kick It and the ten minute maxim

One of the things I brought up in my presentation to the Atlanta DevOps group was the concept of "Payment". One of the arguments that people like to trot out when you suggest an operational shift is that "We can't afford to change right now". My argument is that you CAN'T afford NOT to change. It's going to cost you more in the long run. The problem is that in many situations, the cost is detached from the original event.

Take testing. Let's assume you don't make unit testing an enforced part of your development cycle. There are tons of reasons people do this but much of it revolves around time. We don't have time to write tests. We don't have time to wait for tests to run. We've heard them all. Sure you get lucky. Maybe things go out the door with no discernible bugs. But what happens 3 weeks down the road when the same bug that you solved 6 weeks ago crops up again? It's hard to measure the cost when it's so far removed from the origination.

Configuration management is the same way. I'm not going to lie. Configuration management is a pain in the ass especially if you didn't make it a core concept from inception. You have to think about your infrastructure a bit. You'll have to duplicate work initially (i.e. templating config files). It's not easy but it pays off in the long run. However as with so many things, the cost is detached from the original purchase.

Fix it?

Walk with me into my imagination. A scary place where a server has started to misbehave. What's your initial thought? What's the first thing you do? You've seen this movie and done this interview:

  • log on to the box
  • perform troubleshooting
  • think
  • perform troubleshooting
  • call vendor support (if it's an option)
  • update trouble ticket system
  • wait
  • troubleshoot
  • run vendor diag tools

What's the cost of all that work? What's the cost of that downtime? Let's be generous. Let's assume this is a physical server and you paid for 24x7x4 hardware support and a big old RHEL subscription. How much time would you spend on each task? What's the turn around time to getting that server back into production?

Let's say that the problem was resolved WITHOUT needing replacement hardware but came in at the four hour mark. That's four hours that the server was costing you money instead of making you money. Assuming a standard SA salary of $75k/year in Georgia (call it roughly $36/hour), that works out to about $150. That's just doing a base salary conversion, not calculating all the other overhead associated with staffing an employee. What if that person consulted with someone else during that time, a coworker at the same rate, for two of those hours? Now you're at about $225. Not too bad, right? Still a tangible cost. Maybe one you're willing to eat.

But let's assume the end result was to wipe and reinstall. Let's say it takes another hour to get back to operational status. Woops. Forgot to make that tweak to Apache that we made a few weeks ago. Let's spend an hour troubleshooting that.

But we're just talking manpower at this point. This doesn't even take into account end-user productivity, loss of customers from degraded performance or any of a host of other issues. God forbid that someone misses something that causes problems in other parts of the environment (like not setting the clock and inserting invalid timestamps into the database or something. Forget that you shouldn't let your app server handle timestamps). Now there's cleanup. All told your people spent 5 hours to get this server back into production while you've been running in a degraded state. What does that mean when our LOB is financial services and we have an SLA and attached penalties? I'm going to go easy on you and let you off with $10k per hour of degraded performance.

Get ready to credit someone $50k or, worse, cut a physical check.

Kick it!

Now I'm sure everyone is thinking about things like having enough capacity to maintain your SLA even with the loss of one or two nodes but be honest. How many companies actually let you do that? Companies will cut corners. They roll the dice or worse have a misunderstanding of HA versus capacity planning.

What you should have done from the start was kick the box. By kicking the box, I mean performing the equivalent of a kickstart or jumpstart. You should, at ANY time, be able to reinstall a box with no user interaction (other than the action of kicking it) and return it to service in 10 minutes. I'll give you 15 minutes for good measure and bad cabling. My RHEL/CentOS kickstarts are done in 6 minutes on my home network and most of that time is the physical hardware power cycling. With virtualization you don't even have a discernible bootup time.

Unit testing for servers

I'll go even farther. You should be wiping at least one of your core components every two weeks. Yes. Wiping. It should be a part of your deploy process in fact. You should be absolutely sure that, should you ever need to reinstall under duress, you can get that server back into service in an acceptable amount of time. Screw the yearly DR tests. I'm giving you a world where you can perform biweekly DR tests as a matter of standard operation. All it takes is a little bit of up front planning.

The 10 minute maxim

I have a general rule. Anything that has to be done in ten minutes can be afforded twenty minutes to think it through. Obviously, it's a general rule. The guy holding the gun might not give you twenty minutes. And twenty minutes isn't a hard number. The point is that nothing is generally so critical that it has to be SOLVED that instant. You can spend a little more time up front to do things right or you can spend a boatload of time on the backside trying to fix it.

Given the above scenario, you would think I'm being hypocritical or throwing out my own rule. I'm not. The above scenario should have never happened. This is a solved problem. You should have spent 20 minutes actually putting the config file you just changed into puppet instead of making undocumented ad-hoc changes. You should have spent an hour when bringing up the environment to stand up a CM tool instead of just installing the servers and doing everything manually. That's the 10 minute maxim. Take a little extra time now or take a lot of time later.

You decide how much you're willing to spend.

Wednesday, October 6, 2010

.plan

TODO


Wednesday, September 22, 2010

Hiring for #devops - a primer

I've written about this previously as part of another post but I've had a few things on my mind recently about the topic and needed to do a brain dump.

As I mentioned in that previous post, I'm currently with a company where devops is part of the title of our team. I won't go into the how and why again for that use case. What I want to talk about is why organizations are using DevOps as a title in both hiring and as an enumerated skillset.

We know that what makes up DevOps isn't anything new. I tend to agree with what John Willis wrote on the Opscode blog about CAMS as what it means to him. The problem is that even with such a clear cut definition, companies are still struggling with how to hire people who approach Operations with a DevOps "slant". Damon Edwards says "You wouldn't hire an Agile" but I don't think that's the case at all. While the title might not have Agile, it's definitely an enumerated skill set. A quick search on Monster in a 10-mile radius from my house turned up 102 results with "Agile" in the description such as:

  • experienced Project Manager with heavy Agile Scrum experience
  • Agile development methodologies 
  • Familiar with agile development techniques
  • Agile Scrum development team 

Yes, it's something of a misuse of the word Agile in many situations but the fact of the matter is that when a company is looking for a specific type of person, they tend to list that as a skill or in the job description. Of course Agile development is something of a formal methodology whereas DevOps isn't really. I think that's why I like the term "Agile Operations" more in that regard. But in the end, you don't have your "Agile Development" team and so you really wouldn't have your "Agile Operations" team. You have development and you have operations.

So what's a company to do? They want someone who "does that devops thing". How do they find that person? Some places are listing "tools like puppet, chef and cfengine" as part of skill sets. That goes a long way to helping job seekers key off of the mindset of an organization but what about the organization? How do they determine if the person actually takes the message of DevOps to heart? I think CAMS provides that framework.

Culture and Sharing

What kind of culture are you trying to foster? Is it one where Operations and Development are silos, or one where, as DevOps promotes, the artificial barriers between the groups are torn down? Ask questions of potential employees that attempt to draw that out of them. Relevance to each role is in parentheses.

  • Should developers have access to production? Why or why not? (for Operations staff)
  • Should you have access to production? Why or why not? (for Development staff)
  • Describe a typical release workflow at a previous company. What were the gaps? Where did it fail? (Both)
  • Describe your optimal release workflow. (Both)
  • Have you ever been to a SCRUM? (Operations)
  • Have you ever had operations staff in a SCRUM? (Development)
  • At what point should your team start being involved/stop being involved in a product lifecycle? (Both)
  • What are the boundaries between Development and Operations? (Both)
  • Do you have any examples of documentation you've written? (Both)
  • What constitutes a deployable product? (Both)
  • Describe your process for troubleshooting an outage? What's the most important aspect of an outage? (Both)

Automation and Metrics

This is somewhat equivalent to a series of technical questions. The key is to deduce the thought process a person uses to approach a problem. Some of these aren't devops specific but have ties to it. Obviously these might be tailored to the specific environment you're hiring for.

  • Describe your process for troubleshooting an outage? What's the most important aspect of an outage? (Both)
  • Do you code at all? What languages? Any examples? Github repo? (Operations)
  • Do you code outside of work at all? Any examples? Github repo? (Development)
  • Using pseudo-code, describe a server. An environment. A deployable. (Operations)
  • How might you "unit test" a server? (Operations)
  • Have you ever exposed application metrics to operations staff? How would you go about doing that? (Development)
  • What process would you use to recreate a server from bare metal to running in production? (Operations)
  • How would you automate a process that does X in your application? How do you expose that automation? (Development)
  • What does a Dashboard mean to you? (Both)
  • How would you go about automating production deploys? (Both)

A few of these questions straddle both aspects. Some questions are "trick questions". I'm going to assume that these questions are also tailored to the specifics of your environment. I'm also assuming that basic vetting has been done.

So what are some answers I like to hear versus ones I never want to hear? Anything that sounds like an attitude of "pass the buck" is a red flag. I really like seeing an operations person who has some sort of code they've written. I also like the same from developers outside of work. I don't expect everyone to live, breathe and eat code but I've known too many people who ONLY code at work and have no interest in keeping abreast of new technologies. They might as well be driving a forklift as opposed to writing code.

I think companies will benefit more from a "technologist" than someone who is only willing to put in 9to5 and never step outside of a predefined box of responsibilities. I'm not suggesting that someone forsake family life for the job. What I'm saying is that there are people who will drag your organization down because they have no aspirations or motivations to make things better. I love it when someone comes in the door and says "Hey I saw this cool project online and it might be useful around here". I love it from both developers and operations folks.

Do with these what you will. I'd love to hear other examples that people might have.

Sunday, September 12, 2010

Follow up to #vogeler post

Patrick Debois was kind enough to comment on my previous post and asked some very good questions. I thought they would fit better in a new post instead of a comment box so here it is:

I read your post and I must say I'm puzzled on what you are actually achieving. Is this a CMDB in the traditional way? Or is it an autodiscover type of CMDB, that goes out to the different systems for information? In the project page you mention à la mcollective. Does this mean you are providing GUI for the collected information? Anyway, I'm sure you are working on something great. But for now, the end goal is not so clear to me. Enlighten me!

Good question ;) I think it sits in an odd space at the moment because it tries to be flexible and by design could do all of those things. Mentioning Mcollective may have clouded the issue but it was more of a nod to similar architectural decisions - using a queue server to execute commands on multiple nodes.

My original goal (outside of learning Python) was to address two key things. I mentioned these on the Github FAQ for Vogeler but it doesn't hurt to repost them here for this discussion: 

  • What need is Vogeler trying to fill?

Well, I would consider it a “framework” for establishing a configuration management database. One problem that something like a CMDB can create is that, to meet every individual need, it tends to overcomplicate. One thing I really wanted to do was avoid forcing you into my model, and instead provide ways for you to customize the application.

I went the other way. Vogeler at the core, provides two things – a place to dump “information” about “things” and a method for getting that information in a scalable manner. By using a document database like CouchDB, you don’t have to worry about managing a schema. I don’t need to know what information is actually valuable to you. You know best what information you want to store. By using a message queue with some reasonable security precautions, you don’t have to deal with another listening daemon. You don’t have to worry about affecting the performance of your system because you’re opening 20 SSH connections to get information or running some statically linked off-the-shelf binary that leaks memory and eventually zombies (Why hello, SCOM agent!).

In the end, you define what information you need, how to get it and how to interpret it. I just provide the framework to enable that.

So to address the question:

If we're being semantic, yes it's probably more of a configuration database than a configuration MANAGEMENT database. Autodiscovery, though not in the traditional sense, is indeed a feature. Install the client, stand up the server side parts and issue a facter command via the runner. You instantly have all the information that facter understands about your systems in CouchDB viewable via Futon. I could probably easily write something that scanned the network and installed the client but I have a general aversion to anything that sweeps networks that way. More than likely, you would install Vogeler when you kicked a new server and managed the "plugins" via puppet.

 

I hope that makes sense. Vogeler is the framework that allows you to get whatever information about your systems you need, store it, keep that information up to date and interpret it however you want. That's one reason I'm not currently providing a web interface for reporting right now. I just simply don't know what information is valuable to you as an operations team. Tools like puppet, cfengine, chef and the like are great and I have no desire to replace them but you COULD use this to build that replacement. That's also why I use facter as an example plugin with the code. I don't want to rewrite facter. It just provides a good starting tool for getting some base data from all your systems.

Let's try a use case:

I need to know which systems have X rpm package installed.

You could write an SSH script, hit each box and parse the results or you could have Vogeler tell you. Let's assume that the last run of "package inventory" was a week ago:

vogeler-runner -c rpms -n all

The architecture is already pretty clear. Runner pushes a message on the broadcast queue, all clients see it ('-n all' means all nodes online) and they in turn push the results into another queue. Server pops the messages and dumps them into the CouchDB document for each node. You could then load up Futon or a custom interface you wrote and load the CouchDB design doc that does the map reduce for that information. You have your answer.

Now let's try something of a more complicated example:

I need to know what JMX port all my JBoss instances are listening on in my network.

Well I don't provide a "plugin" for you to get that information, a key for you to store it under in CouchDB or a design doc to parse it by default. But I don't need to. We take the Nagios approach. You define what command returns that information. A shell script, a python script, a ruby script - whatever works for you. All you need to tell me is what key you want to store it under and something about the basic structure of the data itself. Maybe your script emits JSON. Maybe it emits YAML. Maybe it's a single string. Maybe you run multiple JBoss instances per machine, each listening on different JMX ports (as opposed to aliasing IPs and using the standard port). I'll take that and create a new key with that data in the Couch document for that system. You can peruse it with a custom web interface or, again, just use Futon.

Does that help?

 

Notes on #vogeler and #devops

UPDATE: There's some additional information about Vogeler in the followup post to this one.

Background

So I've been tweeting quite a bit about my current project Vogeler. Essentially it's a basic configuration management database built on RabbitMQ and CouchDB. I had to learn Python for work, and we may or may not be using those two technologies, so Vogeler was born.

There's quite a bit of information on Github about it but essentially the basic goals are these:

  • Provide a place to store configuration about systems
  • Provide a way to update that configuration easily and scalably
  • Provide a way for users to EASILY extend it with the information they need

I'm not doing a default web interface or much else right now. There are three basic components - a server process, a client process and a script runner. The first two don't act as traditional daemons but instead monitor a queue server for messages and act on them.

In the case of the client, it waits for a command alias and acts on that alias. The results are stuck on another queue for the server. The server sits and monitors that queue. When it sees a message, it takes it and inserts it into the database with some formatting based on the message type. That's it. The server doesn't initiate any connections directly to the clients and neither do the clients talk directly to the server. All messages that the clients see are initiated by the runner script only.

That's it in a nutshell.

0.7 release

I just released 0.7 of the library to PyPi (no small feat with a teething two year old and 5 month old) and with it, what I consider the core functionality it needs to be useful for people who really are interested in testing it. Almost everything is configurable now. Server, Client and Runner can specify where each component it needs lives on the network. CouchDB and RabbitMQ are running in different locations from the server process? No problem. Using authentication in CouchDB? You can configure that too. Want to use different RabbitMQ credentials? Got it covered.

Another big milestone was getting it working with Python 2.6. No distro out there that I know of is using 2.7, which is what I was using to develop Vogeler. The reason I chose 2.7 is that it was the version we standardized on and, since I was learning a new language and 2.7 was a bridge to 3, I chose that one. But when I started looking at trying the client on other machines at home, I realized I didn't want to compile and set up the whole virtualenv thing on each of them. So I got it working with 2.6, which is what Ubuntu is using. For CentOS and RedHat testing, I just used ActivePython 2.7 in /opt/.

Milestones

As I said 0.7 was a big milestone release for me because of the above things. Now I've got to do some of the stuff I would have done before if I hadn't been learning a new language:

  • Unit Tests - These are pretty big for me. Much of my work on Padrino has been as the Test nazi. Your test fails, I'm all up in your grill.
  • Refactor - Once the unit tests are done, I can safely begin to refactor the codebase. I need to move everything out of a single .py with all the classes. This also paves the way for allowing swappable messaging and persistence layers. This is where unit tests shine, IMHO. Additionally, I'll finish up configuration file setup at this point.
  • Logging and Exception handling - I need to setup real loggers and stop using print messages. This is actually pretty easy. Exception handling may come as a result of the refactor but I consider it a distinct milestone.
  • Plugin stabilization - I'm still trying to figure out the best way to handle default plugins and what basic document layout I want.

Once those are done, I should be ready for a 1.0 release however before I cut that release, I have one last test.....

The EC2 blowout

This is the part I'm most excited about. When I feel like I'm ready to cut 1.0, I plan on spinning up a few hundred EC2 vogeler-client instances of various flavors (RHEL, CentOS, Debian, Ubuntu, Suse...you name it). I'll also stand up distinct RabbitMQ, CouchDB and vogeler-server instances.

Then I fire off the scripts. Multiple vogeler-runner invocations concurrently from different hosts and distros. I need to work out the final matrix but I'll probably use Hudson to build it.

While you might think that this is purely for load testing, it's not. Load testing is a part of it but another part is seeing how well Vogeler works as a configuration management database - the intended usage. What better way than to build out a large server farm and see where the real gaps are in the default setup? Additionally, this will allow me to really standardize on some things in the default based on the results.

At THAT point, I cut 1.0 and see what happens.

How you can help

What I really need help with now is feedback. I've seen about 100 or so total downloads on PyPi across releases but no feedback on Github yet. That's probably mostly due to such minimal functionality before now and the initial hurdle. I've tried to keep the Github docs up to date. I think if I convert the github markdown to rst and load it on PyPi, that will help.

I also need advice from real Python developers. I know I'm doing some crazy stupid shit. It's all a part of learning. Know a way to optimize something I'm doing? Please tell me. Is something not working properly? Tell me. I've tried to test in multiple virtualenvs on multiple distros between 2.6 and 2.7 but I just don't know if I've truly isolated each manual test.

Check the wiki on github and try to install it yourself. Please!

I'm really excited about how things are coming along and about the project itself. If you have ANY feedback or comments, whatsoever, please pass it on even if it's negative. Feel free to tell me that it's pointless but at least tell me why you think so. While this started out as a way to learn Python, I really think it could be useful to some people and that's kept me going more than anything despite the limited time I've had to work on it (I can't work on it as part of my professional duties for many reasons). I've been trying to balance my duties as a father of two, husband, Padrino team member along with this and I think my commitment (4AM...seriously?) is showing.

Thanks!

Tuesday, July 13, 2010

No operations team left behind - Where DevOps misses the mark

I'm a big fan of the "DevOps" movement. I follow any and everyone on twitter who's involved. I've watched SlideShare presentation after presentation. I've pined for a chance to go to Velocity. I watched the keynote live. I've got "the book". These guys are my heroes, not because they did something new per se but because they put a name on it. Gave it a face from the formless mass. Brought it to the forefront.

Any operations guy worth his salt has been doing some part of what constitutes DevOps for a long time. We automated builds. If we had to do something more than once, we wrote a script to handle it. My favorite item from ThinkGeek was a sticker that said "Go away or I will replace you with a very small shell script". We PXE-booted machines from kickstart files. We were lazy and didn't want to have to deal with the same bullshit mistakes over and over. When I read the intro to the Web Operations book, I was shouting out loud because this was the FIRST book that accurately described what I've been doing for the past 15 years.

I tell you all that so you don't think I'm down on the "tribe" (as Tim O'Reilly called us). These are my people. We're on the same wavelength. I love you guys. Seriously. But just like any intervention, someone has to speak out. There's a "trend" that seems to be forming that's leaving some operations teams behind and those folks don't have a choice.

I mentioned in a previous post that I'm working for a new company. Because of legal restrictions and company security policy, among other things, I can't go into too many details. However, the same things I'm going to be talking about apply to more than just our company.

The company recently formed a dedicated group called "DevOps". The traditional SA/Operations team was reformed into a "DevOps Support" and a handful of other folks were formed into a "DevOps Architecture" team. Right now that second group consists of me and two of the senior staff who moved over from the original SA team. Now you might look at this and say "Yer doin' it wrong!" but there's some logic behind this thought process. Without breaking out a few people from the daily operational support issues, no headway could be really made on implementing anything. This isn't to imply anything about how the company operates or the quality of the product. It's simply a fact of trying to retrofit a new operational model on top of an already moving traditional business process. The same issues arose when teams started migrating from a waterfall to agile. Sure you could implement agile in the NEXT project but forget about upsetting the boat on the current product line. In addition to changing how developers operated, you had a whole host of other stakeholders who needed to be convinced.

I once had a manager who I really disliked but he had a saying - "It's like changing the tires on the race car while it's going around the track"

That's the position many traditional companies are in right now. Walking in the door and telling them they really should be doing X instead of Y is nice. Everyone with a brain knows it makes sense. It's obviously more efficient, reduces support issues, makes for a better work environment and cures cancer but it simply cannot be implemented by burning the boat. So, yes, some companies will have to form dedicated groups and work with stakeholders and go through the whole process that a DevOps mentality is trying to replace just to implement it.

But that's not the only roadblock.

Sarbanes-Oxley

Any publicly traded company regardless of industry has its hands tied by three letters - SOX. Excluding specific sector requirements - HIPAA for medical, PCI for financial, FCPA, GLBA or (insert acronym here) - Sarbanes-Oxley puts vague and onerous demands on public companies. Hell, you don't even have to be publicly traded. You could be a vendor to a publicly traded company and be subject to it by proxy. Sarbanes-Oxley is notoriously ambiguous about what you actually have to DO to pass an audit. Entire industries have sprung up around it, from hardware and software to wetware.

What's most amazing about it is that I personally think implementing a DevOps philosophy across the board would make compliance EASIER. All change control is AUTOMATICALLY documented. Traditional access rules aren't an issue because no human actually logs onto servers, for instance.

However in the end you have to convince the auditor that what you are doing matches with the script they have. In every company I've been at we've had the same workflow. It's like all the auditors went to the same fly by night school based on some infomercial: "Make big money as a SOX auditor. Call now for your free information packet!"
  • Change is requested by person W
  • Change is approved by X stake holders
  • Change is approved by Y executive
  • Change is performed by person Z
  • The person who requested the change can't approve it.
  • The person approving the change can't perform the actual work.
  • So on and so forth.

Continuous deployment? Not gonna happen. It can't be done with that level of handcuffing.

Security Controls

Moving past the whole SOX issue, there are also security concerns that prevent automation. It's not uncommon for companies to have internal VPNs that have to be used to reach the production environment. That means the beautiful automated build system you have is good up until, say, QA. Preproduction and on requires manual access to even GET to the servers. This model is used in companies all over the world. Mandatory encryption requirements can further complicate things.

Corporate Hierarchy

I was recently asked what I found was the biggest roadblock to implementing the DevOps philosophy. In my standard roundabout way of thinking something through, I realized that the biggest roadblock is people. People with agendas. People who have control issues. People who are afraid of sharing knowledge for fear of losing some sort of role as "Keeper of the Knowledge". Those issues can extend all the way to the top of a company. There's also the fear of change. It's a valid concern and it's even MORE valid when a misstep can cost your company millions of dollars. You have no choice but to move slow and use what works because you know it works. It's not the most efficient but when it's bringing in the money, you can afford to throw bodies at the issue. You can afford to have 10 developers on staff focused on nothing but maintaining the current code base and another 10 working on new features.

The whole point of this long-winded post is to say "Don't write us off". We know. You're preaching to the choir. It takes baby steps and we have to pursue it in a way that works with the structure we have in place. It's great that you're a startup and don't have the legacy issues older companies have. We're all on the same team. Don't leave us behind.

Thursday, July 8, 2010

Locked down!

So I started at the new company today. I was supposed to start Tuesday but there was some delay in my on-boarding. 
But that's neither here nor there. Here's the interesting thing. The company is a publicly traded financial services company. It's not enough to be publicly traded; being in financial services on top of that is like taking that giant cake of government scrutiny and slapping on another layer for fun.

Did I mention they're also international?

Anyway, I get my laptop and get logged in. This thing is locked down TIGHT. The kicker is that it's running Windows XP. Because of corporate policy, the only tools I'm allowed to install are cygwin, putty and winscp3. If I want to use an IDE, it's got to be eclipse. Nothing else is approved. Boot-level disk encryption. OS level disk encryption. Locked....down.

So I pretty much spent my entire afternoon trying to get cygwin running in something resembling usefulness. Mind you, I haven't used Cygwin in AGES. I haven't used a windows machine for work in at least 6+ years. I've been fortunate enough to work for companies that allowed me to wipe the corporate install and run Linux as long as I didn't bother to ask for help with it. The next problem I ran into was dealing with random cygwin issues.

So I hit google and start searching. Click the first result:

KA-BLOCK as Kevin Smith is fond of saying on twitter.

Blocked because it's a blog. Next result. Same thing. Finally I find a mailing list archive that isn't blocked and get most of the issues resolved. Meanwhile I've probably set off 20 alerts, not because of any malicious activity, but because I couldn't tell whether a search result would be a proxy violation until I clicked it. Hell, half the Cygwin mirrors I tried were blocked under the category of "Software Downloads". Really frustrating.

I finally get some semblance of a working system, but I find myself wondering how I'm going to manage my standard workflow on this machine. It's going to be a challenge to say the least. In talking with my peers, it's pretty clear that they all have the same concerns and issues. Most of the time they work entirely in windowed screen sessions on one of the internal servers. That's fine by me, but it's a big change in my workflow. I've been using the same keybinds for the past 6 years or so, and I pretty much have to unlearn ALL of them because I can't use them on Windows. The upshot is that I got gnome-terminator installed via Cygwin Ports. The hardest part was the fact that the homepage for gnome-terminator was blocked - you guessed it - because it was a "blog".

The point of this post is not to disparage the company in any way, shape, form or fashion. It got me wondering, though: how in the world do people accomplish anything in environments like this?

Forget the standard employee who uses email and the standard MS Office suite. What about developers who are writing code that runs on an entirely different OS? How many bugs and delays have companies eaten because the developer couldn't use an OS that mirrors the production environment? This particular company is a Java shop. Java is a little more forgiving in this area, but you still have oddities like "c:/path/to/file" that are entirely different on the server side (the sketch below shows the kind of thing I mean). And how many extra steps have to be injected into the workflow to get around that kind of issue?
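As a rough illustration (not from anyone's real codebase - the class name, variable names and paths here are made up), this is the sort of thing that bites you when the dev box is Windows and the servers are Linux. Building paths from a base directory instead of hard-coding drive letters keeps the same code working on both:

    import java.io.File;

    public class PathExample {
        public static void main(String[] args) {
            // Hard-coded Windows path: fine on the XP laptop, broken on the Linux servers.
            File bad = new File("c:/app/config/settings.properties");

            // Build the path from a base directory supplied by the environment and let
            // java.io.File pick the right separator for whatever OS it's running on.
            // APP_HOME is a made-up variable name; any config mechanism would do.
            String appHome = System.getenv("APP_HOME");
            File good = new File(new File(appHome, "config"), "settings.properties");

            System.out.println("bad:  " + bad.getAbsolutePath());
            System.out.println("good: " + good.getAbsolutePath());
        }
    }

Trivial, sure, but multiply it by every config file, log directory and temp path in an application and you see where the extra workflow steps come from.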

While I really HATE working on OS X, at least it's more POSIX-compliant than Windows. My biggest headaches are how differently services are managed and the fact that it's not quite Unix-like enough for my tastes. It's like the uncanny valley.

I guess I'm feeling some trepidation because, in addition to having to learn a new workflow - and a slower one at that - I'm also going to be working in Python. I'm excited about the work I'll be doing (DevOps - see my previous post about DevOps as a title) and the impact it will have, but I also feel like I'm doubly behind - new workflow and a new language. The only thing that could make me more nervous would be if the entire backend were Solaris - my weakest Unix ;)

Anyway, I'll be fine. One upshot is that I AM allowed (as far as I've been told) to run VirtualBox in host-only mode. With some guest/host shared folder magic - something along the lines of the sketch below - I should be able to minimize the impact of the slower workflow.
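For anyone curious, the rough shape of that setup looks something like this. The VM name, adapter name and paths are placeholders, and your VirtualBox version may differ slightly, so treat it as a sketch rather than a recipe:

    # Put the VM's first NIC on the host-only network (Windows host adapter name shown).
    VBoxManage modifyvm "devbox" --nic1 hostonly --hostonlyadapter1 "VirtualBox Host-Only Ethernet Adapter"

    # Expose a host directory to the guest as a shared folder named "code".
    VBoxManage sharedfolder add "devbox" --name "code" --hostpath "C:\work\code"

    # Inside the Linux guest (with Guest Additions installed), mount the share:
    sudo mount -t vboxsf code /home/me/code

Edit in the guest with real tools, keep the files visible to the host for whatever Windows-only processes demand them.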

Monday, March 29, 2010

DevOps - Operations to Developers

This is part 2 in a general set of discussions on DevOps. Part 1 is here

Production
I have a general rule I've lived by that has served me well and it's NSFW. I learned it in these exact words from a manager many years ago:

"Don't f*** with production"

Production is sacrosanct. You can dick around with anything else, but if it's critical to business operations, don't mess with it without good reason and without an audit trail. There is nothing more frustrating than trying to diagnose an outage because someone did something they THOUGHT was irrelevant (like a DNS change - I speak from experience) and caused a two-hour outage of a critical system. It's even more frustrating when there's no audit trail of WHAT was done so that it can be undone. Meanwhile, you've got 20 different concerned parties calling you every five minutes asking "are we there yet?". How much development work would get done if you operated in the same interrupt-driven environment?

Change Control
Yes, it's a hassle and boring and not very rockstar, but it's not only critical, sometimes it's the law.

Side note: I pretty much hate meetings in general but they do serve a purpose. My main frustration is that meetings take away time where work could actually be getting done. They always devolve into a glorified gossip session. What should have taken 15 minutes to discuss ends up taking an hour as conversations that started while waiting for that last person to show up carry over into the meeting proper. Sadly, the person who is late is usually Red Leader and we can't seem to stay on target. Everyone has something they'd rather be doing, and usually it's something that would actually accomplish more than the stupid meeting.

The exception, for me, has always been change control meetings. I typically enjoy those because that's when things happen. We're finally going to release your cool new feature into production that you've spent a month developing and fine-tuning. Of course, this is when we find out that you neglected to mention that you needed firewall rules to go along with it. This is when we find out exactly what that new table is going to be used for and that we MIGHT want to put it in its own bufferpool. All of the things you didn't think of? We bring them to the surface in these meetings because these are pain points we've seen in the past. We think of these things.

Auditing
As mentioned in the Production section, we typically don't have the benefit of reviewing changes in source control. We can't check a physical object into SVN. Sure, there are amazing products like Puppet and Cfengine that make managing server configurations easier. We have applications that can track changes. We have applications that map our switch ports. But it's simply not that easy for us to track down what changed.

Your application is encapsulated in that way. You know what changed, who changed it and (with appropriate comments) WHY it was changed.

Meanwhile a DNS change may have happened, a VLAN change, a DAS change... you name it. Production isn't just your application. It's all the moving parts underneath that power it. The application you developed and tested on a single server doesn't always account for the database being on a different machine or the firewall rules that come with it.

Yes, we'd love to have a preproduction environment that mimics production, but that's not always an option. We have to have an audit trail. Things have to be repeatable. So no, we can't just change a line in a JSP for you to fix a bug that didn't get caught in testing. It would take us longer to do that by hand on 10 servers than it would to just push a new build.

Outages
Outages are bad, mmkay? You probably won't lose your job over a bug, but I've had to deal with someone being fired because he didn't follow the process and caused an outage. It sucks, but we're the ones who get the phone call at 2AM when something is amiss.

And even AFTER the outage, we have to fill out Root Cause Analysis reports, sometimes after being up for 24 hours straight fixing a serious issue. You can either write a unit test for a bit of code or you can keep fixing the same bug after every release. We'd prefer you write the unit test, personally.

I know all of these things make us look like a slow, unmoving beast. I know you hate sitting in meeting after meeting explaining that the bug will be fixed just as soon as ops pushes the code. I know that we get pissy and blame you for everything that goes wrong with an application. We're sorry. We're just running on two hours of sleep over three days getting the new hardware installed for your application that someone decided had to be online yesterday. Meanwhile, we're dealing with a full disk on this server and a flaky network connection on another. Cut us some slack.