Wednesday, September 29, 2010

Distributions and Dynamic Languages - A Manifesto

Background

There's been a lot of talk recently on Twitter and in various posts across the Intertubes about how the various distributions handle dynamic languages and the package systems those languages use. This has been a sore spot for me for a LONG time. Recently I had a chance to "stub out" my feelings in a comment on HN. I've been meaning to write this post for a few weeks but just haven't had the time. I'm making the time now.

Distro vendors find themselves in an interesting spot. In general, the difference between Linux distributions has boiled down to a few categories:

  • Support
  • Management tools
  • Package format
  • Default desktop

For the home/desktop user, the last two (and more importantly the last one) are the biggest deciding factors. For the "enterprise" user, the first is typically key. But not all enterprises are alike. Would anyone argue that Facebook or Google or Twitter are not enterprise users? Of course not. However, those companies don't tend to need the same level of support or have the same hang-ups as Coca-Cola or Home Depot. The latter two are traditional enterprises that do things like troubleshoot servers when they fail. The former are the forward-thinking companies that say "Fuck it. Pull the server and put another one in. We don't have time for this 'bench' shit."

In the same vein, the first group of companies uses Linux as a platform, whereas the second group uses RedHat or Suse as an OS to host JBoss or Oracle or DB2. Those vendors say "We run on distros X and Y and we only support those." You don't have a choice in the second group. The first group may have standardized on a distro, but the distro itself is irrelevant. Those companies use Chef and Puppet and similar tools to abstract it away entirely. The distro becomes a commodity. They just want Linux.

This is the new type of company and this is the type of company that distro vendors have to worry about.

So having said that, how do those tie into the dynamic language debacle of late? Increasingly, applications in the PaaS/SaaS space are being written in dynamic languages. The product is just different from Oracle or DB2. So these companies need to consider which distro will make using those dynamic languages as easy as possible. Frankly, they've all pretty much fucked it up. The main reason? Traditional software products.

The biggest selling point of an enterprise distro was support. That, or the fact that you were required to run RedHat or Suse for your RAC cluster. One of the main reasons that enterprise distros were able to be supported platforms for Oracle or DB2 is that they "stabilized" things. In this case that meant long term support (LTS) models and a consistent base operating system. If you ported your product to run on RHEL4, you could guarantee that RedHat would never break compatibility for the life of that product support cycle (I think it's 7 years right now?). You could also be assured that version X of a package would be available for the platform should you need it.

The Problem

That worked fine for binary COTS products. Not so fine for the world of dynamic languages, where new versions of a Gem or Python package come out daily. And ESPECIALLY not when the language package system allows multiple versions of the same package to be installed alongside each other. But is this really a big deal? The distros can just upgrade Python to 2.7, right? Nope, and the reason why?

Management tools

I don't fault the distro vendors for using Python (as an example) as the higher-level management language for the OS. In fact, having now gotten into Python, I think it's a wonderful idea. It is, language wars aside, a very approachable and consistent language. It allows them to iterate quickly on those tools, and in the case of Python especially, the core language changes very little. It's mature.

So now distro vendors have gone and written core parts of the operating system to use Python. Combine that with the package manager restrictions and LTS and you have a system where, if you upgrade Python, you've broken the system beyond repair. This is why RHEL5 is still on Python 2.4.

This is where we find ourselves today. Distro vendors have to continually package, in native package format, every Python module they want to supply, built against the version of the runtime they ship. Eventually the module/gem maintainer is going to stop supporting that module on such old runtimes. Now the vendors essentially have to maintain backports for the life of the LTS term. This is madness. Why would you put yourself in this situation? I didn't know this, but FreeBSD evidently solved this problem a while ago by moving all core scripts away from Perl.

The Manifesto

So here's my manifesto. My suggestion, if you will, as a long-time Linux user, enterprise customer and dynamic language programmer.

Stop it. Get out of the game now. As much as you would like to think your customers care about LTS for Perl/Python/Ruby, they don't. Your LTS is irrelevant six months after you cut a new release of a distro. RHEL6 is shipping with Ruby 1.8.6. Seriously? Not even 1.8.7? I understand they have a long development cycle for new distro versions, which is exactly why I'm saying get out. You can't keep up.

But what about our management tools?

I've solved that for you too: system-python, system-ruby, system-perl. Isolate them. Treat them as you would /opt/python or /opt/ruby. Make them untouchable. Minimize your reliance on any module/gem/library you don't directly maintain (e.g. a GTK Python module). Understand that you will otherwise be wasting resources backporting that module for 5 or 7 years. No more '/usr/bin/env python'. Shebang that bastard to something like '/usr/lib/system-python/bin/python'.
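
To make that concrete, here's a minimal sketch of what a distro management script could look like under this scheme. The /usr/lib/system-python prefix is my made-up example, not any distro's actual layout:

    #!/usr/lib/system-python/bin/python
    # Hypothetical: the distro's private interpreter lives under
    # /usr/lib/system-python and never gets upgraded out from under the
    # management tools. User-facing runtimes can move as fast as they like.

    import sys

    def main():
        # Belt and suspenders: refuse to run on anything but the isolated VM.
        if not sys.executable.startswith("/usr/lib/system-python/"):
            sys.stderr.write("refusing to run on a non-system interpreter\n")
            return 1
        # ... actual management logic would live here ...
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The point isn't the guard clause; it's that nothing a user does to /usr/bin/python or a parallel /usr/lib/python27 can break this script.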

So now that you've isolated that dependency, what about people who don't WANT to compile a new ruby or python vm? How do you provide value to them? The ActiveState model. /usr/lib/python27, /usr/lib/python31, /usr/lib/ruby187.

But wasn't the point of this whole discussion dynamic language package management? We don't want to maintain one package per VM version of some library.

Then don't.

This is where the onus is on the language writers. Your package format needs to FULLY support installing from a locally hosted repo of some kind. You may not believe it, but not every server has internet access. At our company, NONE of the servers can get to the Internet. They still serve content TO the internet but can't get out. Not by proxy. Not at all.

We're essentially forced to download Python packages or jar files and copy them to a Maven server or host them from Apache to use them internally. Either that, or package them as RPMs. With the Python packages, it's especially annoying because, while pip will happily pull from any Apache-served directory of tarballs, we can't push to it from setup.py. There's no metadata associated with it at all.

So Ruby/Python/Perl guys, you need to either provide a PyPI/Gem server package that operates the same way as your public repos do, or make those tools operate EXACTLY the same with a local file path as they do with a URL. Look at createrepo for RPMs for an idea of how it can work if you need to. Additionally, tools like RVM and virtualenv really need to work with distro vendors. RVM does a stellar job at this point. Virtualenv has a way to go.
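
To sketch the trivial end of what a createrepo analog could look like on the Python side (paths and host names are made up, and this only generates the flat link page that pip's -f/--find-links option can consume, not real metadata):

    # Hypothetical createrepo-for-pip: regenerate a flat HTML link page
    # from a directory of sdist tarballs served by Apache.
    import os

    PKG_DIR = "/var/www/pypi"  # made-up local mirror root

    def write_index(pkg_dir):
        links = []
        for name in sorted(os.listdir(pkg_dir)):
            if name.endswith((".tar.gz", ".zip", ".egg")):
                links.append('<a href="%s">%s</a><br/>' % (name, name))
        index = os.path.join(pkg_dir, "index.html")
        with open(index, "w") as f:
            f.write("<html><body>\n%s\n</body></html>\n" % "\n".join(links))

    if __name__ == "__main__":
        write_index(PKG_DIR)

Clients would then point pip at it with something like 'pip install --no-index -f http://pkgs.internal/pypi/ somepackage'. The pull side mostly works today; it's the push side and the metadata that are missing, which is exactly the gap I'm complaining about.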

So now the distro vendors have things isolated. They ship said language repo server and by default point all the local language package tools to that repo path or server. Now if the user chooses to grab module X from PyPI to host locally, they've made that decision. It doesn't break the OS. You don't offer support for it unless you really want to, and this whole fucking problem goes away.

EDIT:

I realize I'm not saying anything new here. I also know that distro vendors understand the distro itself is a commodity. RedHat figured that out a long time ago; look at the JBoss purchase and everything since then. Additionally, virtualization removes any reason you might have for picking distro X over distro Y based on hardware support in the distro.

Wednesday, September 22, 2010

Hiring for #devops - a primer

I've written about this previously as part of another post but I've had a few things on my mind recently about the topic and needed to do a brain dump.

As I mentioned in that previous post, I'm currently with a company where devops is part of our team's title. I won't go into the how and why of that use case again. What I want to talk about is why organizations are using DevOps as a title in both hiring and as an enumerated skillset.

We know that what makes up DevOps isn't anything new. I tend to agree with what John Willis wrote on the Opscode blog about CAMS as what it means to him. The problem is that even with such a clear-cut definition, companies are still struggling with how to hire people who approach Operations with a DevOps "slant". Damon Edwards says "You wouldn't hire an Agile", but I don't think that's the case at all. While the title might not contain Agile, it's definitely an enumerated skill set. A quick search on Monster in a 10-mile radius from my house turned up 102 results with "Agile" in the description, such as:

  • experienced Project Manager with heavy Agile Scrum experience
  • Agile development methodologies 
  • Familiar with agile development techniques
  • Agile Scrum development team 

Yes, it's something of a misuse of the word Agile in many situations, but the fact of the matter is that when a company is looking for a specific type of person, they tend to list that as a skill or in the job description. Of course, Agile development is something of a formal methodology whereas DevOps isn't really. I think that's why I like the term "Agile Operations" more in that regard. But in the end, you don't have an "Agile Development" team, and so you really wouldn't have an "Agile Operations" team. You have development and you have operations.

So what's a company to do? They want someone who "does that devops thing". How do they find that person? Some places are listing "tools like puppet, chef and cfengine" as part of skill sets. That goes a long way to helping job seekers key off of the mindset of an organization but what about the organization? How do they determine if the person actually takes the message of DevOps to heart? I think CAMS provides that framework.

Culture and Sharing

What kind of culture are you trying to foster? Is it one where Operations and Development are silos, or one where, as DevOps promotes, the artificial barriers between the groups are torn down? Ask questions of potential employees that attempt to draw that out of them. Relevance to each role is in parentheses.

  • Should developers have access to production? Why or why not? (for Operations staff)
  • Should you have access to production? Why or why not? (for Development staff)
  • Describe a typical release workflow at a previous company. What were the gaps? Where did it fail? (Both)
  • Describe your optimal release workflow. (Both)
  • Have you ever been to a SCRUM? (Operations)
  • Have you ever had operations staff in a SCRUM? (Development)
  • At what point should your team start being involved/stop being involved in a product lifecycle? (Both)
  • What are the boundaries between Development and Operations? (Both)
  • Do you have any examples of documentation you've written? (Both)
  • What constitutes a deployable product? (Both)
  • Describe your process for troubleshooting an outage. What's the most important aspect of an outage? (Both)

Automation and Metrics

This is somewhat equivalent to a series of technical questions. The key is to deduce the thought process a person uses to approach a problem. Some of these aren't devops-specific but have ties to it. Obviously these might be tailored to the specific environment you work in.

  • Describe your process for troubleshooting an outage. What's the most important aspect of an outage? (Both)
  • Do you code at all? What languages? Any examples? Github repo? (Operations)
  • Do you code outside of work at all? Any examples? Github repo? (Development)
  • Using pseudo-code, describe a server. An environment. A deployable. (Operations)
  • How might you "unit test" a server? (Operations; see the sketch after this list)
  • Have you ever exposed application metrics to operations staff? How would you go about doing that? (Development)
  • What process would you use to recreate a server from bare metal to running in production? (Operations)
  • How would you automate a process that does X in your application? How do you expose that automation? (Development)
  • What does a Dashboard mean to you? (Both)
  • How would you go about automating production deploys? (Both)
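
To make the "unit test a server" question concrete, here's the flavor of answer I like to see sketched out. The host and ports are made up; the point is treating the server like code under test:

    # A trivial "server spec": assert the things that make this box a web
    # server, the same way you'd assert behavior in application code.
    import socket
    import unittest

    class TestWebServer(unittest.TestCase):
        HOST = "web01.example.com"  # hypothetical box under test

        def _port_open(self, port):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(2)
            try:
                return s.connect_ex((self.HOST, port)) == 0
            finally:
                s.close()

        def test_ssh_is_listening(self):
            self.assertTrue(self._port_open(22))

        def test_http_is_listening(self):
            self.assertTrue(self._port_open(80))

    if __name__ == "__main__":
        unittest.main()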

A few of these questions straddle both aspects. Some questions are "trick questions". I'm going to assume that these questions are also tailored to the specifics of your environment. I'm also assuming that basic vetting has been done.

So what are some answers I like to hear versus ones I don't ever want to hear? Anything that sounds like an attitude of "pass the buck" is a red flag. I really like seeing an operations person who has some sort of code they've written. I also like the same from developers outside of work. I don't expect everyone to live, breathe and eat code, but I've known too many people who ONLY code at work and have no interest in keeping abreast of new technologies. They might as well be driving a forklift as opposed to writing code.

I think companies will benefit more from a "technologist" than someone who is only willing to put in 9-to-5 and never step outside a predefined box of responsibilities. I'm not suggesting that someone forsake family life for the job. What I'm saying is that there are people who will drag your organization down because they have no aspirations or motivation to make things better. I love it when someone comes in the door and says "Hey, I saw this cool project online and it might be useful around here". I love it from both developers and operations folks.

Do with these what you will. I'd love to hear other examples that people might have.

Sunday, September 12, 2010

Follow up to #vogeler post

Patrick Debois was kind enough to comment on my previous post and asked some very good questions. I thought they would fit better in a new post instead of a comment box so here it is:

I read your post and I must say I'm puzzled on what you are actually achieving. Is this a CMDB in the traditional way? Or is it an autodiscover type of CMDB, that goes out to the different systems for information? In the project page you mention à la mcollective. Does this mean you are providing GUI for the collected information? Anyway, I'm sure you are working on something great. But for now, the end goal is not so clear to me. Enlighten me!

Good question ;) I think it sits in an odd space at the moment because it tries to be flexible and by design could do all of those things. Mentioning Mcollective may have clouded the issue, but it was more of a nod to similar architectural decisions - using a queue server to execute commands on multiple nodes.

My original goal (outside of learning Python) was to address two key things. I mentioned these on the Github FAQ for Vogeler but it doesn't hurt to repost them here for this discussion: 

  • What need is Vogeler trying to fill?

Well, I would consider it a "framework" for establishing a configuration management database. One problem something like a CMDB can create is that, in trying to meet every individual need, it tends to overcomplicate. One thing I really wanted to do was avoid forcing you into my model, and instead provide ways for you to customize the application.

I went the other way. Vogeler, at the core, provides two things – a place to dump "information" about "things" and a method for getting that information in a scalable manner. By using a document database like CouchDB, you don't have to worry about managing a schema. I don't need to know what information is actually valuable to you. You know best what information you want to store. By using a message queue with some reasonable security precautions, you don't have to deal with another listening daemon. You don't have to worry about affecting the performance of your system because you're opening 20 SSH connections to get information or running some statically linked off-the-shelf binary that leaks memory and eventually zombies (Why hello, SCOM agent!).

In the end, you define what information you need, how to get it and how to interpret it. I just provide the framework to enable that.

So to address the question:

If we're arguing semantics, yes, it's probably more of a configuration database than a configuration MANAGEMENT database. Autodiscovery, though not in the traditional sense, is indeed a feature. Install the client, stand up the server-side parts and issue a facter command via the runner. You instantly have all the information that facter understands about your systems in CouchDB, viewable via Futon. I could probably easily write something that scanned the network and installed the client, but I have a general aversion to anything that sweeps networks that way. More than likely, you would install Vogeler when you kicked a new server and manage the "plugins" via puppet.

 

I hope that makes sense. Vogeler is the framework that allows you to get whatever information about your systems you need, store it, keep it up to date and interpret it however you want. That's one reason I'm not providing a web interface for reporting right now. I simply don't know what information is valuable to you as an operations team. Tools like puppet, cfengine, chef and the like are great and I have no desire to replace them, but you COULD use Vogeler to build that replacement. That's also why I ship facter as an example plugin with the code. I don't want to rewrite facter. It just provides a good starting tool for getting some base data from all your systems.

Let's try a use case:

I need to know which systems have RPM package X installed.

You could write an SSH script, hit each box and parse the results or you could have Vogeler tell you. Let's assume that the last run of "package inventory" was a week ago:

vogeler-runner -c rpms -n all

The architecture is already pretty clear: the runner pushes a message onto the broadcast queue, all clients see it ('-n all' means all nodes online) and in turn push their results onto another queue. The server pops those messages and dumps them into the CouchDB document for each node. You could then load up Futon (or a custom interface you wrote) and hit the CouchDB design doc that does the map/reduce for that information. You have your answer.
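
For the curious, that flow looks roughly like this in code. This is NOT Vogeler's actual source, just the shape of the pattern sketched with the pika and couchdb-python libraries; queue names and hosts are invented:

    import json
    import couchdb
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("rabbit.example.com"))
    channel = conn.channel()

    # Runner side: fan the command out to every listening client.
    channel.exchange_declare(exchange="vogeler.broadcast", exchange_type="fanout")
    channel.basic_publish(exchange="vogeler.broadcast", routing_key="",
                          body=json.dumps({"command": "rpms"}))

    # Server side: pop results off the response queue, upsert into CouchDB.
    db = couchdb.Server("http://couch.example.com:5984/")["vogeler"]

    def on_result(ch, method, properties, body):
        result = json.loads(body)
        doc = db.get(result["node"]) or {"_id": result["node"]}
        doc[result["key"]] = result["value"]  # e.g. doc["rpms"] = [...]
        db.save(doc)

    channel.queue_declare(queue="vogeler.results")
    channel.basic_consume(queue="vogeler.results", on_message_callback=on_result,
                          auto_ack=True)
    channel.start_consuming()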

Now let's try something of a more complicated example:

I need to know what JMX port all my JBoss instances are listening on in my network.

Well, I don't provide a "plugin" for you to get that information, a key for you to store it under in CouchDB or a design doc to parse it by default. But I don't need to. We take the Nagios approach: you define what command returns that information. A shell script, a python script, a ruby script, whatever works for you. All you need to tell me is what key you want to store it under and something about the basic structure of the data itself. Maybe your script emits JSON. Maybe it emits YAML. Maybe it's a single string. Maybe you run multiple JBoss instances per machine, each listening on a different JMX port (as opposed to aliasing IPs and using the standard one). I'll take that and create a new key with that data in the Couch document for that system. You can peruse it with a custom web interface or, again, just use Futon.
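
As an example of how thin such a plugin can be, here's a hypothetical one for that JBoss case. The process-scanning details are invented; the only contract that matters is that the script prints a structure under the key you chose:

    # Hypothetical plugin: map running JBoss instances to their JMX ports
    # and emit the result as JSON for Vogeler to store verbatim.
    import json
    import subprocess

    def jboss_jmx_ports():
        ports = {}
        ps = subprocess.Popen(["ps", "axww"], stdout=subprocess.PIPE)
        for line in ps.communicate()[0].splitlines():
            if "org.jboss.Main" not in line:
                continue
            # Made-up convention: each instance is started with a
            # -Djboss.jmx.port=NNNN argument on its command line.
            for arg in line.split():
                if arg.startswith("-Djboss.jmx.port="):
                    key = "instance-%d" % (len(ports) + 1)
                    ports[key] = int(arg.split("=", 1)[1])
        return ports

    if __name__ == "__main__":
        print(json.dumps({"jboss_jmx_ports": jboss_jmx_ports()}))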

Does that help?

 

Notes on #vogeler and #devops

UPDATE: There's some additional information about Vogeler in the follow-up post to this one.

Background

So I've been tweeting quite a bit about my current project, Vogeler. Essentially it's a basic configuration management database built on RabbitMQ and CouchDB. I had to learn Python for work, and we may or may not be using those two technologies, so Vogeler was born.

There's quite a bit of information on Github about it but essentially the basic goals are these:

  • Provide a place to store configuration about systems
  • Provide a way to update that configuration easily and scalably
  • Provide a way for users to EASILY extend it with the information they need

I'm not doing a default web interface or much else right now. There are three basic components - a server process, a client process and a script runner. The first two don't act as traditional daemons but instead monitor a queue server for messages and act on them.

In the case of the client, it waits for a command alias and acts on that alias. The results are stuck on another queue for the server. The server sits and monitors that queue. When it sees a message, it takes it and inserts it into the database with some formatting based on the message type. That's it. The server doesn't initiate any connections directly to the clients, and the clients never talk directly to the server. All messages that the clients see are initiated by the runner script only.

That's it in a nutshell.

0.7 release

I just released 0.7 of the library to PyPI (no small feat with a teething two-year-old and a 5-month-old) and, with it, what I consider the core functionality it needs to be useful for people who are really interested in testing it. Almost everything is configurable now. Server, Client and Runner can each specify where every component they need lives on the network. CouchDB and RabbitMQ running in different locations from the server process? No problem. Using authentication in CouchDB? You can configure that too. Want to use different RabbitMQ credentials? Got it covered.

Another big milestone was getting it working with Python 2.6. No distro out there that I know of is shipping 2.7, which is what I was using to develop Vogeler. I chose 2.7 because it was the version we standardized on at work and, since I was learning a new language anyway, it made a good bridge to 3. But when I started looking at trying the client on other machines at home, I realized I didn't want to compile a new Python and set up the whole virtualenv thing on each of them. So I got it working with 2.6, which is what Ubuntu is using. For CentOS and RedHat testing, I just used ActivePython 2.7 in /opt/.

Milestones

As I said, 0.7 was a big milestone release for me because of the above things. Now I've got to do some of the stuff I would have done earlier if I hadn't been learning a new language:

  • Unit Tests - These are pretty big for me. Much of my work on Padrino has been as the Test nazi. Your test fails, I'm all up in your grill.
  • Refactor - Once the unit tests are done, I can safely begin to refactor the codebase. I need to move everything out of a single .py with all the classes. This also paves the way for swappable messaging and persistence layers (see the sketch after this list). This is where unit tests shine, IMHO. Additionally, I'll finish up configuration file setup at this point.
  • Logging and Exception handling - I need to set up real loggers and stop using print statements. This is actually pretty easy. Exception handling may come as a result of the refactor, but I consider it a distinct milestone.
  • Plugin stabilization - I'm still trying to figure out the best way to handle default plugins and what basic document layout I want.
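
For the swappable-layers item above, here's one shape that refactor could take. The class names are illustrative, not Vogeler's real ones:

    # A small interface the server codes against; CouchDB becomes just one
    # implementation, and tests can swap in an in-memory fake.
    import couchdb

    class PersistenceBackend(object):
        """Everything the server needs from a storage layer."""
        def store(self, node, key, value):
            raise NotImplementedError

    class CouchBackend(PersistenceBackend):
        def __init__(self, url, dbname):
            self.db = couchdb.Server(url)[dbname]

        def store(self, node, key, value):
            doc = self.db.get(node) or {"_id": node}
            doc[key] = value
            self.db.save(doc)

    class MemoryBackend(PersistenceBackend):
        """Dict-backed fake for unit tests."""
        def __init__(self):
            self.docs = {}

        def store(self, node, key, value):
            self.docs.setdefault(node, {})[key] = value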

Once those are done, I should be ready for a 1.0 release. However, before I cut that release, I have one last test...

The EC2 blowout

This is the part I'm most excited about. When I feel like I'm ready to cut 1.0, I plan on spinning up a few hundred EC2 vogeler-client instances of various flavors (RHEL, CentOS, Debian, Ubuntu, Suse...you name it). I'll also stand up distinct RabbitMQ, CouchDB and vogeler-server instances.

Then I fire off the scripts. Multiple vogeler-runner invocations concurrently from different hosts and distros. I need to work out the final matrix but I'll probably use Hudson to build it.

While you might think this is purely for load testing, it's not. Load testing is part of it, but another part is seeing how well Vogeler works as a configuration management database - the intended usage. What better way than to build out a large server farm and see where the real gaps are in the default setup? Additionally, this will let me standardize some of the defaults based on the results.

At THAT point, I cut 1.0 and see what happens.

How you can help

What I really need help with now is feedback. I've seen about 100 or so total downloads on PyPI across releases but no feedback on Github yet. That's probably mostly due to the minimal functionality before now and the initial setup hurdle. I've tried to keep the Github docs up to date. I think if I convert the Github markdown to rst and load it on PyPI, that will help.

I also need advice from real Python developers. I know I'm doing some crazy stupid shit. It's all a part of learning. Know a way to optimize something I'm doing? Please tell me. Is something not working properly? Tell me. I've tried to test in multiple virtualenvs on multiple distros between 2.6 and 2.7 but I just don't know if I've truly isolated each manual test.

Check the wiki on github and try to install it yourself. Please!

I'm really excited about how things are coming along and about the project itself. If you have ANY feedback or comments whatsoever, please pass them on, even if they're negative. Feel free to tell me that it's pointless, but at least tell me why you think so. While this started out as a way to learn Python, I really think it could be useful to some people, and that's kept me going more than anything despite the limited time I've had to work on it (I can't work on it as part of my professional duties for many reasons). I've been trying to balance my duties as a father of two, husband and Padrino team member along with this, and I think my commitment (4AM...seriously?) is showing.

Thanks!