Lusis: March 2010

Monday, March 29, 2010

DevOps - Developers to Operations

I apologize in advance since this post is likely to be leaner than the previous one. My background is NOT in development and I've never officially held a development role. Having said that, I'm looking to the comments for people to provide something else I "missed". I need a guest blogger for this section ;)

Production

I understand that production is important. I understand that it sucks when bugs happen but would it kill you to give me a little more insight into HOW things operate so I can do my job better?

I mean seriously. I wouldn't have to bug you all the time if you would just give me the tools to get the information I need. I'm not interested in your job. I'm not going to break something. Give me a limited account on the servers so I can look at the logs in real time. Better yet, I heard of this nifty thing called Splunk. You think we can get the free version up and running somewhere so I can see aggregated logging?

Oh...you have to jump through 20 hoops to make that happen? Never mind, I'll just keep pestering you to email me the logs until the bug is fixed.

Change Control

Why the hell can't we push my bug fix out tonight? It's a minor fix and leaving it out there just means more data cleanup for you until we do fix it. Yes, it's already been through testing. Yes, there's some data cleanup but I've already written the SQL to do it. Look, the DBA even signed off on it.

Seriously, I'm handing you everything you need to get it done. Oh, you don't have the resources to do it tonight? How about cross-training one of us to do it for edge cases like these. I know that new features have to wait for major release but I've got the business unit breathing down my neck on this one bug. I guess I'll keep telling them we have to wait for you guys.

Bugs

I'm sorry but bugs will happen. Yes we write unit tests but those always have to be refined as new bugs are found. I don't enjoy having to revisit old code and fix the same thing over and over. I'm not intentionally putting these bugs out there for my health! If you would give us a proper preproduction environment we might catch these sooner. Yes, I know that the production dataset is much larger than our QA set. We asked for a refresh from production but you still haven't done it. We've got to get this stuff released now. No, I'm not being hyperbolic.

Look at it from my perspective. I'm trying to get requirements from 5 different groups of people. I need to understand the intricacies of revolving credit and interest payments. I have to find the BEST way to present a complex set of information in the most appealing manner because the only thing these people care about is how fast can I get it done and how pretty is it?

I'm sorry my solution isn't the most optimal but it's only a stopgap. We're trying to work out a better long-term solution.

Attitude

Look, we're good developers. We have a process that works. The business units work with us because we don't let them down. We're extremely agile. We give them what they want and we're ready to generate some income but we're ALWAYS stymied by all your stupid roadblocks. We all have the same goal so why not let us help out?

You always treat us like a bunch of idiots. We can't POSSIBLY understand the complex and rich interconnectedness of your universe despite the fact that we wrote the code that actually RUNS that universe. As I said, I have no interest in being a system administrator. I just want to get my bugs fixed. I want to get my code released because I have five other projects waiting on this one to finish. It really ISN'T as complicated as you want to make it out to be. In the end, you're just copying files somewhere. I know we have 10 servers serving the application and I took that into account. See? Totally stateless application.

Oh, you're impressed that I understand statelessness? Your condescension is not surprising.

Cooperation

I know you think we're stupid because we don't understand the interactions of your fiber channel card and the SAN. I know you think we write buggy software to keep ourselves in a job. I know you have certain regulations and rules that you have to work with to maintain the integrity of the production environment.

I know all of these things but would it KILL you to work with us on this stuff? What do you need from us? Did we not give you enough information? Let us know and we'll make sure we have that next time. Can you give us some sort of production access to help troubleshoot issues? I hate having to ask you for everything just as much as you hate being asked. There really are some things we can do on our own. Just give us the guidelines (we're good at working with those, you know) and we'll offload some of that stuff for you. Don't worry we're not going to change anything, we just need to look.

DevOps - Operations to Developers

This is part 2 in a general set of discussions on DevOps. Part 1 is here

Production

I have a general rule I've lived by that has served me well and it's NSFW. I learned it in these exact words from a manager many years ago:

"Don't f*** with production"

Production is sacrosanct. You can dick around with anything else but if it's critical to business operations, don't mess with it without good reason and without an audit trail. There is nothing more frustrating than trying to diagnose an outage because someone did something they THOUGHT was irrelevant (like a DNS change - I speak from experience) and caused a two-hour outage of a critical system. It's even more frustrating when there's no audit trail of WHAT was done so that it can be undone. Meanwhile, you've got 20 different concerned parties calling you every five minutes asking "are we there yet?". How much development work would get done if you operated under the same interrupt-driven environment?

Change Control

Yes, it's a hassle and boring and not very rockstar but it's not only critical but sometimes it's the law.

Side note: I pretty much hate meetings in general but they do serve a purpose. My main frustration is that meetings take away time where work could actually be getting done. They always devolve into a glorified gossip session. What should have taken 15 minutes to discuss ends up taking an hour as conversations that started while waiting for that last person to show up carry over into the meeting proper. Sadly the person who is late is usually Red Leader and we can't seem to stay on target. Everyone has something they would rather be doing and usually it's something that will actually accomplish something rather than the stupid meeting.

The exception for me has always been change control meetings. I typically enjoy those because that's when things happen. We're finally going to release your cool new feature into production that you've spent a month developing and fine tuning. Of course, this is when we find out that you neglected to mention that you needed firewall rules to go along with it. This is when we find out exactly what that new table is going to be used for and that we MIGHT want to put it in its own bufferpool. All of the things you didn't think of? We bring them to the surface in these meetings because these are pain points we've seen in the past. We think of these things.

Auditing

As mentioned in production, typically we don't have the benefit of looking over changes in source control. We can't check a physical object into SVN. Sure, there are amazing products like Puppet and Cfengine that make managing server configurations easier. We have applications that can track changes. We have applications that map our switch ports but it's simply not that easy for us to track down what changed.

Your application is encapsulated in that way. You know what changed, who changed it and (with appropriate comments) WHY it was changed.

Meanwhile a DNS change may have happened, a VLAN change, a DAS change...you name it. Production isn't just your application. It's all the moving parts underneath that power it. That application that you developed that is tested on a single server doesn't always account for the database being on a different machine or the firewall rules associated with it.

Yes, we'd love to have a preproduction environment that mimics production but that's not always an option. We have to have an audit trail. Things have to be repeatable. So no, we can't just change a line in a jsp for you to fix a bug that didn't get caught in testing. It would take us longer to do that on 10 servers than if we just pushed a new build.
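For the changes that can't live in source control, even a dumb append-only log beats nothing. Here's a minimal sketch of the idea, assuming a plain JSON-lines file; the function names, field names and log format are all made up for illustration and aren't from any particular tool. It records the same what/who/why you'd get from a commit message, plus a note on how to undo the change.

```python
import json
from datetime import datetime, timezone


def record_change(log_path, who, what, why, undo=None):
    """Append one change record to a JSON-lines audit log and return it."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "who": who,
        "what": what,
        "why": why,
        "undo": undo,  # how to back the change out, if known
    }
    # Append mode: earlier records are never rewritten.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def read_changes(log_path):
    """Return all recorded changes, oldest first."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

Something like record_change("changes.log", "jdoe", "DNS: repointed an A record", "server migration", undo="point it back at the old address") would have saved me a couple of hours during that DNS outage.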

Outages

Outages are bad, mmkay? You probably won't lose your job over a bug but I've had to deal with someone being fired because he didn't follow the process and caused an outage. It sucks but we're the ones who get the phone call at 2AM when something is amiss.

And even AFTER the outage, we have to fill out Root Cause Analysis reports sometimes after being up for 24 hours straight fixing a serious issue. You can either write a unit test for a bit of code or you can keep fixing the same bug after every release. We'd prefer you write the unit test, personally.

I know all of these things make us look like a slow, unmoving beast. I know you hate sitting in meeting after meeting explaining that the bug will be fixed just as soon as ops pushes the code. I know that we get pissy and blame you for everything that goes wrong with an application. We're sorry. We're just running on 2 hours of sleep in three days getting the new hardware installed for your application that someone thinks has to go online yesterday. Meanwhile, we're dealing with a full disk on this server and a flaky network connection on another. Cut us some slack.

DevOps and NoSQL - bad naming leads to confusion

I've recently started following a few new topics (where recently means over the past year). Both of them have the potential to be paradigm shifts and, unfortunately, both have somewhat vague names that evoke responses on both sides of the issue.

The one I'm going to focus on right now is DevOps. I intend on doing another post on NoSQL but that all depends on how much free time I can finagle between setting up the nest for baby number 2 and work projects.

Background

I should clarify my background because that plays a large part in how I perceive both of these issues. I'm a systems engineer. No, I don't have a degree in engineering but I wouldn't call the work I've done over the years any less than that. I've been the intermediary between DBAs and Developers. I've spent 20+ hours on my feet in a frigid datacenter racking servers. I've done high-level architecture of disparate system integration. I've done low-level implementation of disparate system integration. I've been up at 4AM to do deploys of new code during the 30 minute maintenance window. I've been the guy getting the pages and been the guy calling people who were supposed to get the pages.

I've been in big shops and small shops. I've been responsible for systems that pass millions of dollars and systems that are critical to education.

I don't say all this to toot my own horn. It's just background that is relevant to the discussion.

DevOps

So what's this DevOps thing that people keep throwing around? Well there are tons of opinions and all of them are like certain sphincter muscles. Not one is entirely on the money but the background work has been done here:

http://stochasticresonance.wordpress.com/2010/03/26/devops-misnamed/ (which, coincidentally prompted this post)

So what is it? I think at the core it's about closely integrating the "SysOp" silo with the "Developer" silo as a methodology. But why is this important?

SysOps have always been apart from the rest of the IT department in a sense. While many groups have frequent overlapping areas, the operations team has the final responsibility. As I like to put it, they're the folks getting the phone call. Unless the organization is small, most developers aren't even in the loop unless a bug report is filed after an outage. As it was put elsewhere, many times software is thrown "over the wall" to be deployed. But why is this? I think that's key to the whole issue.

Roles, Responsibilities and Titles

I'm not a stickler for titles. I've held many over the years, from Administrator to Director. In one interesting case, I was given a title (and the subsequent responsibility) simply for the purpose of interacting with a client who had firm opinions about only interfacing with someone at the same level. This didn't take away any responsibilities; only added to them. Titles, roles and responsibilities are all different things.

In this way, the organizational title for "IT Operations" denotes a clear differentiator from "Developer". There are certain expectations from your operations team. Production stays stable, for instance. Many times the goals of the Operations team are in direct opposition to those of the Development team. Make no mistake, however. The developers are part of revenue generation while those of operations are not. Operations exists as fire fighters. If Operations is doing its job properly, they aren't actually doing much of their primary responsibility. They have quite a bit of downtime.

So why is there a need for a DevOps movement?

I think on one hand, there is an increasing frustration from the end-user (in this case development) in its interaction with operations. Development methodologies are changing rapidly. Some changes are for the better (fewer bugs, more testing) while others create friction with how a production environment operates (frequent releases). Another aspect is people transitioning from one role to the other. You have people moving into development from an operations background and vice versa. People change. They discover that they enjoy X more than Y. With each of these transitions, a mindset and attitude is brought along. An Ego.

The developer who moves into production operations laments the slow sluggish pace at which things move. The operations guy who moves into development loves the fast and fluid nature of Agile development. Both feel the need to reconcile the two worlds thinking they can impart some sort of wisdom from one side of which the other was not aware.

Additionally, in times where the leanest team that is first to market often wins, many people are wearing multiple hats. See the rise of IaaS (Infrastructure as a Service), Amazon Web Services, NoSQL and other technologies where traditional roles are eliminated.

Both sides have a lot to learn from each other and both sides need to understand the constraints each team has. This is where I feel DevOps has the most to offer as an ideal. Integrating operations into development and letting development be a part of operations. The specifics are still up in the air but I think there are some key areas that each side needs to understand about the other. I'll follow those up in the next post for logical grouping purposes.

As always, comments are welcome!

Tuesday, March 23, 2010

Code on Github

I'm working on pushing all my code snippets I've developed over the years to Github. Right now I have the Ruby-EWS and Ruby-Downtime stuff uploaded. I'll get the rest over as I can sanitize it.

http://github.com/lusis

Ummm...thank you Google Voice?

I just got this wonderful bit of translation from Google Voice on a voicemail:

I just came across your resume on Monster. Here, and had a position available on the line area thought it'd be a good fit for you sexually a infrastructure design related positions in the offer an area love the chance to discuss with you at your earliest convenience

Mind you, I love infrastructure design but I'd hate to see how I could fit it into my sex life. The job WAS with Cox....

Tuesday, March 9, 2010

Usability and Performance testing in the Analog world

Right now, our offices are located downtown. The AJC has been in this building since 1972 (I think). We're soon to be moving outside the Perimeter to new offices. Trust me, this is an important fact.

I started as a contractor here back in November of last year. I had an option for parking but at the time I didn't know how long I would be here (contracting can be volatile) and didn't know if I would plan on switching to Marta. Combined with the fact that the company WAS planning on moving offices, I decided not to get a parking pass.

Instead, I opted to use the pay parking lots behind the office. These spots are $3 a day or a monthly option of $40. This was about the same as the AJC parking pass so it was something of a wash. I went permanent with the paper in the beginning of February. Our move is scheduled at the beginning of April and parking passes are no longer available so I *HAVE* to use the parking vendor out back.

Of course this is when the parking company, Central Parking Systems, decided to swap out the working payment terminals with new ones. The old ones, while a bit worn around the collar, actually worked well. People understood them and things moved along quickly. I'm trying to find a picture of the old ones but the new ones look like the picture above.

Since these have been put in place, the lines to get a parking pass have been super long. I thought this was a familiarity issue. The new terminals are smaller, the screen is less readable and from a usability perspective, it's a pain in the ass. People have to stoop down to get the money in the machine and GOD FORBID you pay with a credit card. It's one of the quick swipe methods and from the angle you can NEVER "quickly remove your card" without also pulling up as you pull it out. This obviously isn't quick enough so you have to start again.

Well this morning I found out there's ANOTHER problem. There were two parking attendants watching people use the machines. Finally one guy spoke up and said the following:

"Folks, this terminal is running really slow and overloading the computer system. If you use X terminal or Y terminal, they're running really fast"

Let me position those other two terminals for you in relation to this one. X terminal is closest but by the time I get to it, pay and get back to my car (which I parked near THIS terminal), I would already be done paying if I had stayed in line.

The OTHER terminal would require me to get in my car, drive to it and come back to park near where I actually work.

So how slow were these new machines running? They were running so slow that they wouldn't actually take money. When I got up there, I decided to ask the guy how I would pay for a monthly option here. This is where it gets good:

"You can't buy monthly passes here anymore. Just go online to blahblahblahblah.com and register as a vendor. You get much cheaper parking and other rewards."

Are you kidding me? I'm only going to be here for another month. I'm not a "vendor" and I don't want to go to a stupid website to pay for freaking parking when I could pay just fine up until recently.

Where do I start with the screwups?

If they had ANY business intelligence or metrics, they would have known that this particular kiosk is the busiest one in the deck. It's near the best parking and it's centrally located. The fact that this one terminal was causing these kinds of issues is unacceptable. Load testing isn't strictly digital. Look at any major downtown event. The same work goes into capacity planning there from parking to traffic that any major website would undergo preparing for the Christmas holiday. The kiosk was running so slow that it was actually refusing to accept dollar bills. I decided to pay with my credit card and my ticket had printed while the screen still said processing payment.
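To put some numbers on the capacity planning point, here's a back-of-the-envelope sketch. Every figure in it is hypothetical (I have no idea what the real arrival rate at that kiosk is) and the function names are mine. The idea is just that once people show up faster than one terminal can serve them, the line only grows.

```python
import math


def utilization(arrivals_per_hour, seconds_per_transaction):
    """Fraction of each hour one kiosk is busy; 1.0 or more means the line never clears."""
    service_rate_per_hour = 3600 / seconds_per_transaction
    return arrivals_per_hour / service_rate_per_hour


def kiosks_needed(arrivals_per_hour, seconds_per_transaction, target_utilization=0.8):
    """Smallest number of kiosks that keeps each one at or below the target utilization."""
    load = utilization(arrivals_per_hour, seconds_per_transaction)
    return max(1, math.ceil(load / target_utilization))


# Hypothetical morning rush: 90 cars an hour.
old_terminal = utilization(90, 30)  # 30 seconds per payment
new_terminal = utilization(90, 60)  # 60 seconds per payment
```

With those made-up numbers the old terminal is busy 75% of the time and keeps up, while the new one sits at 150% and falls further behind with every car that pulls in.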

But even IF the system could keep up with the load generated, having someone actually go onsite and see HOW users were using the damn things would have shown them the ergonomic bottlenecks.

Something as simple as a parking lot was foiled by lack of performance and usability testing. Amazing.

EDIT: I found a picture of what the old terminal looked like. It's not "pretty" but it worked and was ergonomically superior.

Friday, March 5, 2010

Finally getting to dig into Puppet

Well I'm finally getting to dig into Puppet at a professional level. I've been handed the keys to establishing the Puppet infrastructure. Mixed Linux/Solaris clients. Beyond the hassle of building up-to-date RPMs for CentOS/RHEL, everything else is going smoothly.

I've got all of my configuration in version control and a handy script to create new modules from templates. I'm already pushing out OS specific configurations as well as OS version specific stuff. Augeas is AMAZING. I don't know how I never found/saw it before.
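For the curious, the module script is nothing fancy. This isn't the actual script, just a minimal sketch of the idea in Python, assuming the conventional manifests/files/templates module layout; the function name and the init.pp stub are made up for illustration.

```python
from pathlib import Path

# Conventional Puppet module subdirectories (assumed layout; adjust to taste).
MODULE_DIRS = ("manifests", "files", "templates")

# Stub manifest; doubled braces are literal braces after str.format().
INIT_PP_TEMPLATE = """# Class: {name}
#
# Generated skeleton; fill in resources below.
class {name} {{
}}
"""


def create_module(modules_root, name):
    """Create a skeleton module under modules_root and return its path."""
    module_path = Path(modules_root) / name
    if module_path.exists():
        # Refuse to clobber a module that is already there.
        raise FileExistsError(f"module already exists: {module_path}")
    for sub in MODULE_DIRS:
        (module_path / sub).mkdir(parents=True)
    init_pp = module_path / "manifests" / "init.pp"
    init_pp.write_text(INIT_PP_TEMPLATE.format(name=name))
    return module_path
```

Calling create_module("modules", "ntp") gives you modules/ntp with an empty class in manifests/init.pp, ready to be checked into version control.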

We're also pushing out Splunk (Cox is a huge Splunk customer) across the board. I'm still undecided about its value but I haven't really looked at it since it first came out. It just seemed, at the time, like a huge overhead for a problem that was already tackled (syslogging).

My next step is deciding how to integrate everything into my Kickstart scripts. There's a TON of options out there. I'm really trying to avoid setting up Cobbler but I might be at the end of what can be done with Kickstart already.

I'm still frustrated with managing current versions of Ruby and gems with RPM but gem2rpm has made some of that easier. I'm not responsible for the Solaris boxen so I have no idea how those guys plan on managing those bad boys.