Showing posts with label agile. Show all posts

Tuesday, March 8, 2011

Ad-Hoc Configuration, Coordination and the value of change

For those who don't know, I'm currently in Boston for DevOps Days. It's been amazing so far and I've met some wonderful people. One thing that was REALLY awesome was the open space program that Patrick set up. You won't believe it works until you've tried it. It's really powerful.

In one of our open spaces, the topic of ZooKeeper came up. At this point I made a few comments, and at the additional prodding of everyone went into a discussion about ZooKeeper and Noah. I have a tendency to monopolize discussions around topics I'm REALLY passionate about so many thanks for everyone who insisted I go on ;)

Slaughter the deviants!
The most interesting part of the discussion about ZooKeeper (or at least the part I found most revealing) was that people tended to have trouble really seeing the value in it. One of the things I've really wanted to do with Noah is provide (via the wiki) some really good use cases about where it makes sense.

I was really excited to get a chance to talk with  Alex Honor (one of the co-founders of DTO along with Damon Edwards) about his ideas after his really interesting blog post around ad-hoc configuration. If you haven't read it, I suggest you do so.

Something that often gets brought up and, oddly, overlooked at the same time is where ad-hoc change fits into a properly managed environment (using a tool like puppet or chef).

At this point, many of you have gone crazy over the thought of polluting your beautifully organized environment with something so dirty as ad-hoc changes. I mean, here we've spent all this effort on describing our infrastructure as code and you want to come in and make a random, "undocumented" change? Perish the thought!

However, as with any process or philosophy, strict adherence without understanding WHEN to deviate will only lead to frustration. Yes, there is a time to deviate and knowing when is the next level of maturity in configuration management.

So when do I deviate
Sadly, knowing when it's okay to deviate is as much a learning experience as it was getting everything properly configured in the first place. To make it even worse, that knowledge is most often specific to the environment in which you operate. The whole point of the phrase ad-hoc is that it's..well...ad-hoc. It's 1 part improvisation/.5 parts stumbling in the dark and the rest is backfilled with a corpus of experience. I don't say this to sound elitist.

So, really, when do I deviate. When/where/why and how do I deviate from this beautifully described environment? Let's go over some use cases and point out that you're probably ALREADY doing it to some degree.

Production troubleshooting
The most obvious example of acceptable deviation is troubleshooting. We pushed code, our metrics are all screwed up and we need to know what the hell just happened. Let's crank up our logging.

At this point, by changing your log level, you've deviated from what your system of record (your CM tool) says you should be. Our manifests, our cookbooks, our templates all have us using a loglevel of ERROR but we just bumped up one server to DEBUG so we could troubleshoot. That system is now a snowflake. Unless you change that log level back to ERROR, you now have one system that, until you do a puppet run or chef-client run, is different from all the other servers of that class/role.

Would you codify that in the manifest? No. This is an exception. A (should be) short-lived exception to the rules you've defined.

Dynamic environments
Another area where you might deviate is in highly elastic environments. Let's say you've reached the holy grail of elasticity. You're growing and shrinking capacity based on some external trigger. You can't codify this. I might run 20 instances of my app server now but drop back down to 5 instances when the "event" has passed. In a highly elastic environment, are you running your convergence tool after every spin up? Not likely. In an "event" you don't want to have to take down your load balancer (and thus affect service to the existing instances) just to add capacity. A bit of a contrived example but you get the idea.

So what's the answer?
I am by far not the smartest cookie in the tool shed but I'm opinionated so that has to count for something. These "exception" events are where I see additional tools like Zookeeper (or my pet project Noah) stepping in to handle things.

Distributed coordination, dynamically reconfigurable code, elasticity and environment-aware applications.
These are all terms I've used to describe this concept to people. Damon Edwards provided me with the last one and I really like it.

Enough jibber-jabber, hook a brother up!
So before I give you the ability to shoot yourself in the foot, you should be aware of a few things:


  • It's not a system of record

Your DDCS (dynamic distributed coordination service as I'll call it because I can't ever use enough buzzwords) is NOT your system of record. It can be but it shouldn't be. Existing tools provide that service very well and they do it in an idempotent manner.


  • Know your configuration

This is VERY important. As I said before, much of this is environment specific. The category of information you're changing in this way is more "transient" or "point-in-time". Any given atom of configuration information has a specific value associated with it. Different levels of volatility. Your JDBC connection string is probably NOT going to change that often. However, the number of application servers might be at different amounts of capacity based on some dynamic external factor.


  • Your environment is dynamic and so should be your response

This is where I probably get some pushback. Just as one of the goals of "devops" was to deal with what Jesse Robbins described today as misalignment of incentive, there's an internal struggle where some values are simply fluctuating in near real time. This is what we're trying to address.


  • It is not plug and play

One thing that Chef and Puppet do very well is that you can, with next to no change to your systems, predefine how something should look or behave and have those tools "make it so".

With these realtime/dynamic configuration atoms your application needs to be aware of them and react to them intelligently.

Okay seriously. Get to the point
So let's take a walk through a scenario where we might implement this ad-hoc philosophy in a way that gives us the power we're seeking.

The base configuration

  • The application server (fooapp) uses memcached, two internal services called "lookup" and "evaluate" and a data store of some kind.
  • "lookup" and "evaluate" are internally developed applications that provide private REST endpoints for a dictionary service (lookup) and a business rule parser of some kind (evaluate).
  • Every component's base configuration (including the data source that "lookup" and "evaluate" use) is managed, configured and controlled by puppet/chef.


In a standard world, we store the ip/port mappings for "lookup" and "evaluate" in our CM tool and tag those. When we do a puppet/chef client run, the values for those servers are populated based on the ip/port information of our EXISTING "lookup"/"evaluate" servers.

This works. It's being done right now.
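
To make the static model concrete, here is a minimal Python sketch of what a CM run effectively does for fooapp: it renders the current inventory of "lookup" and "evaluate" servers into a config file. The property names, the inventory structure and the function name are assumptions made up for illustration. This is not puppet or chef syntax.

```python
from string import Template

# Hypothetical config file layout for fooapp (an assumption, not from any real tool).
CONFIG_TEMPLATE = Template(
    "lookup.servers=$lookup\n"
    "evaluate.servers=$evaluate\n"
)


def render_fooapp_config(inventory):
    """Render fooapp's config from the servers the CM tool knows about.

    `inventory` maps a role name to a list of (host, port) tuples.
    """
    def join(role):
        # Sort so repeated runs produce identical output (idempotent rendering).
        return ",".join(f"{host}:{port}" for host, port in sorted(inventory[role]))

    return CONFIG_TEMPLATE.substitute(
        lookup=join("lookup"),
        evaluate=join("evaluate"),
    )


if __name__ == "__main__":
    inventory = {
        "lookup": [("10.0.0.5", 8080)],
        "evaluate": [("10.0.0.7", 9090), ("10.0.0.6", 9090)],
    }
    print(render_fooapp_config(inventory))
```

The rendered values are frozen until the next client run, no matter how many servers have come or gone in the meantime.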

So where's the misalignment?
What do you do when you want to spin up another "lookup"/"evaluate" server? Well you would probably use a bootstrap of some kind and apply, via the CM tool, the changes to those values. However this now means that for this to take effect across your "fooapp" servers you need to do a manual run of your CM client. Based on the feedback I've seen across various lists, this is where the point of contention exists.

What about any untested CM changes (a new recipe for instance)? I don't want to apply that but if I run my CM tool, I've now not only pulled those unintentional changes but also forced a bounce of all of my fooapp servers. So as a side product of scaling capacity to meet demand, I've now reduced my capacity at another point just to make my application aware of the new settings.

Enter Noah
This is where making your application aware of its environment and allowing it to dynamically reconfigure itself pays off.

Looking at our base example now, let's do a bit of architectural work around this new model.


  • My application no longer hardcodes a base list of servers providing "lookup" and "evaluate" services.
  • My application understands the value of a given configuration atom.
  • Instead of the hardcoded list, we convert those configuration atoms to something akin to a singleton pattern that points to a bootstrap endpoint.
  • FooApp provides some sort of "endpoint" where it can be notified of changes to the number, ip addresses or urls available for a given one of our services. This can also be proxied via another endpoint.
  • The "bootstrap" location is managed by our CM tool based on some more concrete configuration - the location of the bootstrap server.


Inside our application, we're now:


  • Pulling a list of "lookup"/"evaluate" servers from the bootstrap url (i.e. http://noahserver/s/evaluate)
  • Registering a "watch" on the above "path" and providing an in-application endpoint to be notified when they change.
  • Validating at startup that the results of the bootstrap call provide valid information (i.e. doing a quick connection test to each of the servers provided by the bootstrap lookup or a subset thereof)
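
The startup steps above could be sketched roughly like this in Python. The call that actually talks to the bootstrap url is passed in as `fetch`, because the shape of the response is not described here and I don't want to guess at it. `tcp_probe`, `bootstrap_servers` and the (host, port) tuple format are all names and choices invented for this sketch. Registering the watch is left to whatever client you use.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Server = Tuple[str, int]  # (host, port); an assumed representation


def tcp_probe(server: Server, timeout: float = 1.0) -> bool:
    """Quick connection test: can we open a TCP socket to the server?"""
    try:
        with socket.create_connection(server, timeout=timeout):
            return True
    except OSError:
        return False


def bootstrap_servers(
    fetch: Callable[[], Iterable[Server]],
    probe: Callable[[Server], bool] = tcp_probe,
) -> List[Server]:
    """Pull the server list from the bootstrap source and keep only the
    entries that pass the connection test."""
    candidates = list(fetch())
    usable = [server for server in candidates if probe(server)]
    if not usable:
        # Refuse to start with an empty pool rather than fail later.
        raise RuntimeError("bootstrap returned no reachable servers")
    return usable


if __name__ == "__main__":
    # Stand-in for a call to the bootstrap url; the data is made up.
    def fake_fetch():
        return [("10.0.0.1", 8080), ("10.0.0.2", 8080)]

    # Pretend only the first server answers.
    print(bootstrap_servers(fake_fetch, probe=lambda s: s[0] == "10.0.0.1"))
```

Injecting `fetch` and `probe` keeps the sketch testable without a network, and lets you probe only a subset of servers if a full check is too slow.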


If we dynamically add a new transient "lookup" server, Noah fires a notification to the provided endpoint with the details of the change. The application will receive a message saying "I have a new 'lookup' server available". It will run through some sanity checks to make sure that the new "lookup" server really does exist and works. It then appends the new server to the list of existing (permanent) servers and starts taking advantage of the increase in capacity.
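
Here is one way the receiving side of that notification might look, as a hedged Python sketch. `ServicePool`, `on_server_added` and the (host, port) tuples are hypothetical names for illustration. How the notification reaches this method (an HTTP handler, a proxy, and so on) is deliberately left out.

```python
import threading
from typing import Callable, Iterable, List, Tuple

Server = Tuple[str, int]  # (host, port); an assumed representation


class ServicePool:
    """Holds the servers fooapp may use for one service, e.g. "lookup"."""

    def __init__(self, permanent: Iterable[Server], probe: Callable[[Server], bool]):
        self._lock = threading.Lock()
        self._servers: List[Server] = list(permanent)
        self._probe = probe  # the sanity check, e.g. a quick connection test

    def on_server_added(self, server: Server) -> bool:
        """Called when a "new server available" notification arrives.

        Returns True if the server passed the sanity check and is in the pool.
        """
        if not self._probe(server):
            return False  # announced, but it does not actually work yet
        with self._lock:
            if server not in self._servers:
                self._servers.append(server)
        return True

    def servers(self) -> List[Server]:
        """Snapshot of the current pool, safe to iterate over."""
        with self._lock:
            return list(self._servers)


if __name__ == "__main__":
    pool = ServicePool([("10.0.0.1", 8080)], probe=lambda s: s[1] == 8080)
    pool.on_server_added(("10.0.0.9", 8080))
    print(pool.servers())
```

The lock is there because the notification will usually arrive on a different thread than the one serving requests.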

That's it. How you implement the "refresh" and "validation" mechanisms is entirely language specific. This also doesn't, despite my statements previously, have to apply to transient resources. The new "lookup" server could be a permanent addition to my infra. Of course this would have been captured as part of the bootstrapping process if that were the case.

Nutshell
And that's it in a nutshell. All of this is available in Noah and ZooKeeper right now. Noah is currently restricted to http POST endpoints but that will be expanded. ZooKeeper treats watches as ephemeral. Once the event has fired, you must re-register that same watch. With Noah, watches are permanent.
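
The difference between the two watch styles is easier to see in a toy model. The Python sketch below is an in-memory simulation only. `WatchRegistry` is a made-up class and does not represent the API of either Noah or ZooKeeper.

```python
from collections import defaultdict


class WatchRegistry:
    """Toy model of watch semantics.

    one_shot=True  -> a watch is dropped after it fires once
    one_shot=False -> a watch stays registered until removed
    """

    def __init__(self, one_shot):
        self.one_shot = one_shot
        self._watches = defaultdict(list)

    def watch(self, path, callback):
        self._watches[path].append(callback)

    def fire(self, path, event):
        callbacks = list(self._watches[path])
        if self.one_shot:
            # Clear first, so a callback is free to re-register itself.
            self._watches[path] = []
        for callback in callbacks:
            callback(event)


if __name__ == "__main__":
    for one_shot in (True, False):
        seen = []
        registry = WatchRegistry(one_shot=one_shot)
        registry.watch("/s/lookup", seen.append)
        registry.fire("/s/lookup", "server added")
        registry.fire("/s/lookup", "server removed")
        print("one_shot" if one_shot else "permanent", seen)
```

With one-shot watches the second event is missed unless the callback registers again, which is the extra bookkeeping the application has to take on.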

Takeaway
I hope the above has made sense. This was just a basic introduction to some of the concepts and design goals. There are plenty of OTHER use cases for ZooKeeper alone. So the key take aways are:


  • Know the value of your configuration data
  • Know when and where to use that data
  • Don't supplant your existing CM tool but instead enhance it.


Links
Noah
ZooKeeper
Hadoop Book (which has some AMAZING detail around ZooKeeper, the technology and use cases)

Wednesday, September 22, 2010

Hiring for #devops - a primer

I've written about this previously as part of another post but I've had a few things on my mind recently about the topic and needed to do a brain dump.

As I mentioned in that previous post, I'm currently with a company where devops is part of the title of our team. I won't go into the how and why again for that use case. What I want to talk about is why organizations are using DevOps both as a title in hiring and as an enumerated skillset.

We know that what makes up DevOps isn't anything new. I tend to agree with what John Willis wrote on the Opscode blog about CAMS as what it means to him. The problem is that even with such a clear cut definition, companies are still struggling with how to hire people who approach Operations with a DevOps "slant". Damon Edwards says "You wouldn't hire an Agile" but I don't think that's the case at all. While the title might not have Agile, it's definitely an enumerated skill set. A quick search on monster in a 10 mile radius from my house turned up 102 results with "Agile" in the description such as:

  • experienced Project Manager with heavy Agile Scrum experience
  • Agile development methodologies 
  • Familiar with agile development techniques
  • Agile Scrum development team 

Yes, it's something of a misuse of the word Agile in many situations but the fact of the matter is that when a company is looking for a specific type of person, they tend to list that as a skill or in the job description. Of course Agile development is something of a formal methodology whereas DevOps isn't really. I think that's why I like the term "Agile Operations" more in that regard. But in the end, you don't have your "Agile Development" team and so you really wouldn't have your "Agile Operations" team. You have development and you have operations.

So what's a company to do? They want someone who "does that devops thing". How do they find that person? Some places are listing "tools like puppet, chef and cfengine" as part of skill sets. That goes a long way to helping job seekers key off of the mindset of an organization but what about the organization? How do they determine if the person actually takes the message of DevOps to heart? I think CAMS provides that framework.

Culture and Sharing

What kind of culture are you trying to foster? Is it one where Operations and Development are silos or one that, as DevOps promotes, tears down the artificial barriers between the groups? Ask questions of potential employees that attempt to draw that out of them. Relevance to each role is in parentheses.

  • Should developers have access to production? Why or why not? (for Operations staff)
  • Should you have access to production? Why or why not? (for Development staff)
  • Describe a typical release workflow at a previous company. What were the gaps? Where did it fail? (Both)
  • Describe your optimal release workflow. (Both)
  • Have you ever been to a SCRUM? (Operations)
  • Have you ever had operations staff in a SCRUM? (Development)
  • At what point should your team start being involved/stop being involved in a product lifecycle? (Both)
  • What are the boundaries between Development and Operations? (Both)
  • Do you have any examples of documentation you've written? (Both)
  • What constitutes a deployable product? (Both)
  • Describe your process for troubleshooting an outage? What's the most important aspect of an outage? (Both)

Automation and Metrics

This is somewhat equivalent to a series of technical questions. The key is to deduce the thought process a person uses to approach a problem. Some of these aren't devops specific but have ties to it. Obviously these might be tailored to the specific environment you 

  • Describe your process for troubleshooting an outage? What's the most important aspect of an outage? (Both)
  • Do you code at all? What languages? Any examples? Github repo? (Operations)
  • Do you code outside of work at all? Any examples? Github repo? (Development)
  • Using pseudo-code, describe a server. An environment. A deployable. (Operations)
  • How might you "unit test" a server? (Operations)
  • Have you ever exposed application metrics to operations staff? How would you go about doing that? (Development)
  • What process would you use to recreate a server from bare metal to running in production? (Operations)
  • How would you automate a process that does X in your application? How do you expose that automation? (Development)
  • What does a Dashboard mean to you? (Both)
  • How would you go about automating production deploys? (Both)

A few of these questions straddle both aspects. Some questions are "trick questions". I'm going to assume that these questions are also tailored to the specifics of your environment. I'm also assuming that basic vetting has been done.

So what are some answers I like to hear versus ones I don't ever want to hear? Anything that sounds like an attitude of "pass the buck" is a red-flag. I really like seeing an operations person who has some sort of code they've written. I also like the same from developers outside of work. I don't expect everyone to live, breathe and eat code but I've known too many people who ONLY code at work and have no interest in keeping abreast of new technologies. They might as well be driving a forklift as opposed to writing code.

I think companies will benefit more from a "technologist" than someone who is only willing to put in 9to5 and never step outside of a predefined box of responsibilities. I'm not suggesting that someone forsake family life for the job. What I'm saying is that there are people who will drag your organization down because they have no aspirations or motivations to make things better. I love it when someone comes in the door and says "Hey I saw this cool project online and it might be useful around here". I love it from both developers and operations folks.

Do with these what you will. I'd love to hear other examples that people might have.

Tuesday, July 13, 2010

No operations team left behind - Where DevOps misses the mark

I'm a big fan of the "DevOps" movement. I follow any and everyone on twitter who's involved. I've watched SlideShare presentation after presentation. I've pined for a chance to go to Velocity. I watched the keynote live. I've got "the book". These guys are my heroes, not because they did something new per se but because they put a name on it. Gave it a face from the formless mass. Brought it to the forefront.

Any operations guy worth his salt has been doing some part of what constitutes DevOps for a long time. We automated builds. If we had to do something more than once, we wrote a script to handle it. My favorite item from ThinkGeek was a sticker that said "Go away or I will replace you with a very small shell script". We pxe-booted machines from kickstart files. We were lazy and didn't want to have to deal with the same bullshit mistakes over and over. When I read the intro to the Web Operations book, I was shouting out loud because this was the FIRST book that accurately described what I've been doing for the past 15 years.

I tell you all that so you don't think I'm down on the "tribe" (as Tim O'Reilly called us). These are my people. We're on the same wavelength. I love you guys. Seriously. But just like any intervention, someone has to speak out. There's a "trend" that seems to be forming that's leaving some operations teams behind and those folks don't have a choice.

I mentioned in a previous post that I'm working for a new company. Because of legal restrictions and company security policy, among other things, I can't go into too many details. However, the same things I'm going to be talking about apply to more than just our company.

The company recently formed a dedicated group called "DevOps". The traditional SA/Operations team was reformed into a "DevOps Support" and a handful of other folks were formed into a "DevOps Architecture" team. Right now that second group consists of me and two of the senior staff who moved over from the original SA team. Now you might look at this and say "Yer doin' it wrong!" but there's some logic behind this thought process. Without breaking out a few people from the daily operational support issues, no headway could be really made on implementing anything. This isn't to imply anything about how the company operates or the quality of the product. It's simply a fact of trying to retrofit a new operational model on top of an already moving traditional business process. The same issues arose when teams started migrating from a waterfall to agile. Sure you could implement agile in the NEXT project but forget about upsetting the boat on the current product line. In addition to changing how developers operated, you had a whole host of other stakeholders who needed to be convinced.

I once had a manager who I really disliked but he had a saying - "It's like changing the tires on the race car while it's going around the track"

That's the position many traditional companies are in right now. Walking in the door and telling them they really should be doing X instead of Y is nice. Everyone with a brain knows it makes sense. It's obviously more efficient, reduces support issues, makes for a better work environment and cures cancer but it simply cannot be implemented by burning the boat. So, yes, some companies will have to form dedicated groups and work with stakeholders and go through the whole process that a DevOps mentality is trying to replace just to implement it.

But that's not the only roadblock.

Sarbanes-Oxley

Any publicly traded company regardless of industry has its hands tied by three letters - SOX. Excluding specific sector requirements - HIPAA for medical, PCI for financial, FCPA, GLBA or (insert acronym here) - Sarbanes-Oxley puts vague and onerous demands on public companies. Hell, you don't even have to be publicly traded. You could be a vendor to a publicly traded company and subject to it by proxy. Sarbanes-Oxley is notoriously ambiguous about what you actually have to DO to pass an audit. Entire industries have sprung up around it from hardware and software to wetware.

What's most amazing about it is that I personally think implementing a DevOps philosophy across the board would make compliance EASIER. All change control is AUTOMATICALLY documented. Traditional access rules aren't an issue because no human actually logs onto servers, for instance.

However in the end you have to convince the auditor that what you are doing matches with the script they have. In every company I've been at we've had the same workflow. It's like all the auditors went to the same fly by night school based on some infomercial: "Make big money as a SOX auditor. Call now for your free information packet!"
  • Change is requested by person W
  • Change is approved by X stake holders
  • Change is approved by Y executive
  • Change is performed by person Z
  • The person who requested the change can't approve it.
  • The person approving the change can't perform the actual work.
  • So on and so forth.
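
For what it's worth, the rules in that workflow are simple enough to express in a few lines, which is part of why automating them seems plausible. The Python sketch below is a hypothetical encoding of the list above. `ChangeRequest` and `separation_violations` are invented names, and this is not any auditor's actual checklist.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ChangeRequest:
    requester: str       # person W
    approvers: Set[str]  # stakeholders and the executive (X and Y)
    performer: str       # person Z


def separation_violations(change: ChangeRequest) -> List[str]:
    """Return the separation-of-duties rules this change breaks."""
    problems = []
    if not change.approvers:
        problems.append("change has no approvers")
    if change.requester in change.approvers:
        problems.append("requester cannot approve their own change")
    if change.performer in change.approvers:
        problems.append("approver cannot perform the work")
    return problems


if __name__ == "__main__":
    change = ChangeRequest(requester="wanda", approvers={"xavier", "yolanda"}, performer="xavier")
    print(separation_violations(change))
```

An empty list means the change satisfies the rules modeled here. The names in the usage example are placeholders.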

Continuous deployment? Not gonna happen. It can't be done with that level of handcuffing.

Security Controls

Moving past the whole SOX issue, there are also security concerns that prevent automation. It's not uncommon for companies to have internal VPNs that have to be used to reach the production environment. That means the beautiful automated build system you have is good up until, say, QA. Preproduction and beyond require manual access to even GET to the servers. This model is used in companies all over the world. Mandatory encryption requirements can further complicate things.

Corporate Hierarchy

I was recently asked what I found was the biggest roadblock to implementing the DevOps philosophy. In my standard roundabout way of thinking something through, I realized that the biggest roadblock is people. People with agendas. People who have control issues. People who are afraid of sharing knowledge for fear of losing some sort of role as "Keeper of the Knowledge". Those issues can extend all the way to the top of a company. There's also the fear of change. It's a valid concern and it's even MORE valid when a misstep can cost your company millions of dollars. You have no choice but to move slow and use what works because you know it works. It's not the most efficient but when it's bringing in the money, you can afford to throw bodies at the issue. You can afford to have 10 developers on staff focused on nothing but maintaining the current code base and another 10 working on new features.

The whole point of this long-winded post is to say "Don't write us off". We know. You're preaching to the choir. It takes baby steps and we have to pursue it in a way that works with the structure we have in place. It's great that you're a startup and don't have the legacy issues older companies have. We're all on the same team. Don't leave us behind.

Thursday, July 8, 2010

Locked down!

So I started at the new company today. I was supposed to start Tuesday but there was some delay in my on-boarding. 
But that's neither here nor there. Here's the interesting thing. The company is a publicly traded financial services company. It's not enough to be publicly traded but to also be in financial services is like taking that giant cake of government scrutiny and slapping on another layer for fun.

Did I mention they're also international?

Anyway, I get my laptop and get logged in. This thing is locked down TIGHT. The kicker is that it's running Windows XP. Because of corporate policy, the only tools I'm allowed to install are cygwin, putty and winscp3. If I want to use an IDE, it's got to be eclipse. Nothing else is approved. Boot-level disk encryption. OS level disk encryption. Locked....down.

So I pretty much spent my entire afternoon trying to get cygwin running in something resembling usefulness. Mind you I haven't used Cygwin in AGES. I haven't used a windows machine for work in at least 6+ years. I've been fortunate enough to work for companies that allowed me to wipe the corporate install and run Linux as long as I didn't bother to ask for help with it. Where I ran into the next problem was with dealing with random cygwin issues.

So I hit google and start searching. Click the first result:

KA-BLOCK as Kevin Smith is fond of saying on twitter.

Blocked because it's a blog. Next result. Same thing. Finally I get a mailing list archive that isn't blocked and get most of the issues resolved. Meanwhile I've set off probably 20 alerts not because of any malicious activity but because I couldn't be sure if the search result would be a proxy violation or not. Hell, half the mirrors for cygwin I tried were blocked in the category of "Software Downloads". Really frustrating.

I finally get some semblance of a working system but I find myself wondering how I'm going to manage my standard workflow with this machine. It's going to be a challenge to say the least. In talking with my peers, it's pretty clear that they all have the same concerns and issues. Most of the time they work entirely in windowed screen sessions on one of the internal servers. This is fine by me but it's a big change in my workflow. I've been using the same keybinds for the past 6 years or so. I pretty much have to unlearn ALL of them because I can't use them on Windows.  The upshot is that I got gnome-terminator installed via cygwin ports. The hardest part was the fact that the homepage for gnome-terminator was blocked, you got it, because it was a "blog".

The point of this post is not to disparage the company in any way, shape, form or fashion. It got me wondering though how in the world people accomplish anything in environments like this? 

Forget the standard employee who uses email and the standard MS Office suite. What about developers who are developing code that runs on an entirely different OS? How many bugs and delays have companies had because the developer was unable to use an OS that mirrors that of the production environment? This particular company is a java shop. Java is a little more lax in this area but you still have oddities like "c:/path/to/file" that are entirely different on the server side. More so, how many steps had to be injected in the workflow to get around that kind of issue?

While I really HATE working on OSX at least it's more posix compliant than windows. My biggest headaches are how services are managed differently and the fact that it's not quite unix-alike for my tastes. It's like the uncanny valley.

I guess I'm feeling some trepidation because in addition to having to learn a new workflow - and a slower one at that - I'm also going to be working in Python. I'm excited about the work I'll be doing (DevOps - see my previous post about DevOps as a title) and the impact it will have but I also feel like I'm doubly behind - new workflow and a new language. The only thing that could make me more nervous is if the entire backend were Solaris - my weakest unix ;)

Anyway, I'll be fine. One upshot is that I AM allowed (as far as I was told) to run VirtualBox in host-only mode. Using some guest/host shared folder magic, I should be able to minimize the impact of the slower workflow.

Tuesday, June 22, 2010

The opposite of DevOps

I thought about whether or not to write this post but I think it's an interesting example of the kinds of problems that the DevOps methodology is trying to solve.

I turned in my two weeks notice with the paper on Friday. There were several reasons but none of them are a negative reflection on my employer. The role I was originally hired for was no longer valid and there wasn't a transition path because of platform changes in the back office. I could have stayed (and was asked to stay) but there wasn't anything long term where my skill set was useful. As with any key team member, meetings are called to discuss any outstanding issues, transition responsibilities and the like.

In preparation for the meeting today, I drew up a list of my responsibilities at a macro level and then broke that down by task. As it's always been for me, that list spanned several "silos" in the traditional IT organization model. Here's a sample snippet:


Subversion: User Management, Repository Management
MySQL: User Management, Database Management, Performance Management
Linux: User Management, OS Configuration, Application Management, Code Management
Puppet: Configuration Management, Recipe Management
Ruby: VM Management, Gem Management

There were also entries for various internal applications that I've worked on (including some development) and supported from an operational perspective. Mind you, I was embedded as an operations guy with a specific development group in the organization. Sort of a DevOps-lite role.

What was really interesting about the meeting was how the lines were broken down. Literally lines were drawn as to where operations would stop supporting something and the group I was leaving would take over. Take the Linux example:

Application Management was intended to refer to software that was "standard" as a part of the distro. Things like Apache or MySQL. Code Management refers to our internally developed code (all Rails applications except for the sexy Sinatra webservice I wrote). In the end, the responsibilities were divided like so:

- Operations Team supported up to the installation and configuration of Apache (including vhosts) with the exclusion of Passenger configuration.
- Passenger configuration and internal code would now be managed by the Development group. They would handle deployments themselves via Webistrano.
- MySQL? Passed on to those currently managing the MSSQL database servers.

That's the very definition of a silo'd IT infrastructure. Everything is thrown over the wall. At least the deploys remain with the development team. There is nothing "wrong" with this model of IT governance. It's not agile but many companies use it. Contrast that to the position I'll be starting on the 6th of July.

An explicit group is being formed inside the company called "DevOps". I know that at this point everyone is incredulous. I can hear it now; "You're doing it wrong!". Interestingly enough, I asked the same question during the interview process. We all know that DevOps is a model and philosophy and not a title or department. You don't have two development teams - Agile and Waterfall. You have Developers and they use one methodology over the other. The same goes for Operations. The people I was interviewing with were cognizant of this fact as well. The reason the group is being called "DevOps" is strictly for organizational and political reasons. The goal of the team is to actually develop a set of operational and developmental processes, tools and guidelines that embody everything that DevOps represents.

This is being done inside of one division of the company for now with the Director of Development and Director of Operations essentially sharing a brain and heading things up. The work that this group does will establish and codify something that will be used throughout the rest of the company. We'll be doing development of tools, architecting systems and recommending/implementing solutions that will, among other things, define how the organization operates itself from an IT perspective. Additionally we'll be supporting this as any traditional operations team would but, for now, the group has a distinct title. My title is "Systems Architect" which is nice and generic enough to apply to both traditional groups ;)

The only remotely distressing part of the whole thing is that I'll probably have to learn some Python. (tongue firmly in cheek). There's a point to be made that distribution vendors have settled on Python as the Lingua Franca of OS management - excluding things like Chef, Puppet and Nagios which have their own respective DSLs that abstract away much of the language they were written in. Yes, my precious Ruby will still be there when it comes to extending, say, Puppet. I'll probably still write quite a few service checks for Nagios in shell (should Nagios be the best fit).

But language wars aside, I think this gives a clear picture of exactly the types of problems the DevOps methodology is trying to solve. Agility and flexibility across all branches of IT produce a leaner, stronger and faster organization that can spend less time "in the muck" (as John Willis likes to say) and start making the company more money.

Monday, March 29, 2010

DevOps and NoSQL - bad naming leads to confusion

I've recently started following a few new topics (where recently means over the past year). Both of them have the potential to be paradigm shifts and, unfortunately, both have somewhat vague names that evoke responses on both sides of the issue.

The one I'm going to focus on right now is DevOps. I intend on doing another post on NoSQL but that all depends on how much free time I can finagle between setting up the nest for baby number 2 and work projects.

Background
I should clarify my background because that plays a large part in how I perceive both of these issues. I'm a systems engineer. No, I don't have a degree in engineering but I wouldn't call the work I've done over the years any less than that. I've been the intermediary between DBAs and Developers. I've spent 20+ hours on my feet in a frigid datacenter racking servers. I've done high-level architecture of disparate system integration. I've done low-level implementation of disparate system integration. I've been up at 4AM to do deploys of new code during the 30 minute maintenance window. I've been the guy getting the pages and been the guy calling people who were supposed to get the pages.

I've been in big shops and small shops. I've been responsible for systems that pass millions of dollars and systems that are critical to education.

I don't say all this to toot my own horn. It's just background that is relevant to the discussion.

DevOps
So what's this DevOps thing that people keep throwing around? Well there are tons of opinions and all of them are like certain sphincter muscles. Not one is entirely on the money but the background work has been done here:


So what is it? I think at the core it's about closely integrating the "SysOp" silo with the "Developer" silo as a methodology. But why is this important?

SysOps have always been apart from the rest of the IT department in a sense. While many groups have frequent overlapping areas, the operations team has the final responsibility. As I like to put it, they're the folks getting the phone call. Unless the organization is small, most developers aren't even in the loop unless a bug report is filed after an outage. As it was put elsewhere, many times software is thrown "over the wall" to be deployed. But why is this? I think that's key to the whole issue.

Roles, Responsibilities and Titles
I'm not a stickler for titles. I've held many over the years, from Administrator to Director. In one interesting case, I was given a title (and the subsequent responsibility) simply for the purpose of interacting with a client who had firm opinions about only interfacing with someone at the same level. This didn't take away any responsibilities; it only added to them. Titles, roles and responsibilities are all different things.

In this way, the organizational title for "IT Operations" denotes a clear differentiator from "Developer". There are certain expectations from your operations team. Production stays stable, for instance. Many times the goals of the Operations team are in direct opposition to those of the Development team. Make no mistake, however. The developers are part of revenue generation while those in operations are not. Operations exists as fire fighters. If Operations is doing its job properly, they aren't actually doing much of their primary responsibility. They have quite a bit of downtime.

So why is there a need for a DevOps movement?
I think on one hand, there is an increasing frustration from the end-user (in this case development) in its interaction with operations. Development methodologies are changing rapidly. Some changes are for the better (fewer bugs, more testing) while others create friction with how a production environment operates (frequent releases). Another aspect is people transitioning from one role to the other. You have people moving into development from an operations background and vice versa. People change. They discover that they enjoy X more than Y. With each of these transitions, a mindset and attitude is brought along. An Ego.

The developer who moves into production operations laments the slow, sluggish pace at which things move. The operations guy who moves into development loves the fast and fluid nature of Agile development. Both feel the need to reconcile the two worlds, thinking they can impart some sort of wisdom from one side of which the other was not aware.

Additionally, in times when the leanest team that is first to market often wins, many people are wearing multiple hats. See the rise of IaaS (Infrastructure as a Service), Amazon Web Services, NoSQL and other technologies where traditional roles are eliminated.

Both sides have a lot to learn from each other and both sides need to understand the constraints each team has. This is where I feel DevOps has the most to offer as an ideal. Integrating operations into development and letting development be a part of operations. The specifics are still up in the air but I think there are some key areas that each side needs to understand about the other. I'll follow those up in the next post for logical grouping purposes.

As always, comments are welcome!