Thursday, May 14, 2009

Annoyances with OpenNMS

So my company has an existing install of OpenNMS and Cacti. I don't like to barrel through the gate and make changes; I'm trying to work within the system here. But I come from a Nagios world. If I need something monitored, I write a script for it and Nagios handles it. The status can be up, down, warning, critical, or unknown.
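
(For anyone who hasn't lived in that world: a Nagios plugin signals its status purely through the exit code, and whatever it prints becomes the status text.)

# Nagios plugin exit-code convention:
#   exit 0  -> OK       (up)
#   exit 1  -> WARNING
#   exit 2  -> CRITICAL (down)
#   exit 3  -> UNKNOWN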

In Nagios, I do the following to add a new service to a host:
1) Define a check command.
2) Add a service to the host using that check command.
3) Profit

About the check command:
Most likely, one is already defined but in the case of fairness I'm going with a brand new check. Let's use the one I'm beating my head against in OpenNMS - MySQL Replication status. So I have a script written. It runs some queries and gives me replication status. Additionally, I'm grabbing replication latency. I use this information to determine a more "fine grained" status. Being up or down is pointless if my slaves are 4 hours behind the master.

So I make my script take a few parameters: a warning level, a critical level, and the hostname. That gives me the following bit of configuration:


define command {
    command_name    check_mysql_replication
    command_line    $USER99$/check_mysql_replication -h $HOSTADDRESS$ -u $USER1$ -p $USER2$ -w $ARG1$ -c $ARG2$
}


What the above gives me is the most flexible way to monitor replication on a system. Each host or hostgroup can have different levels for warning or critical. I could even change the USER macros into ARG macros so that I can pass the credentials as arguments too. Oh, and that script can be written in any language I like.
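
For completeness, here's a rough sketch of what such a plugin might look like. The flags line up with the command definition above; the actual queries, messages, and perfdata label are just illustrative.

#!/bin/sh
# Sketch of a check_mysql_replication plugin for Nagios. It exits
# 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN and emits perfdata after the pipe.

while getopts "h:u:p:w:c:" opt; do
    case $opt in
        h) HOST=$OPTARG ;;
        u) USER=$OPTARG ;;
        p) PASS=$OPTARG ;;
        w) WARN=$OPTARG ;;
        c) CRIT=$OPTARG ;;
    esac
done

STATUS=`mysql -u "$USER" --password="$PASS" -h "$HOST" -e "show slave status\G"`
RUNNING=`echo "$STATUS" | awk '/Slave_IO_Running:/ {print $2}'`
LAG=`echo "$STATUS" | awk '/Seconds_Behind_Master:/ {print $2}'`

if [ -z "$RUNNING" ]; then
    echo "REPLICATION UNKNOWN - could not query $HOST"
    exit 3
elif [ "$RUNNING" != "Yes" ] || [ "$LAG" = "NULL" ]; then
    echo "REPLICATION CRITICAL - slave not running | lag=0s;$WARN;$CRIT"
    exit 2
elif [ "$LAG" -ge "$CRIT" ]; then
    echo "REPLICATION CRITICAL - ${LAG}s behind master | lag=${LAG}s;$WARN;$CRIT"
    exit 2
elif [ "$LAG" -ge "$WARN" ]; then
    echo "REPLICATION WARNING - ${LAG}s behind master | lag=${LAG}s;$WARN;$CRIT"
    exit 1
else
    echo "REPLICATION OK - ${LAG}s behind master | lag=${LAG}s;$WARN;$CRIT"
    exit 0
fi

The bit after the pipe is standard Nagios perfdata (label=value;warn;crit), which is what gives me graphs and trending essentially for free.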

Then I add the service to my host/hostgroup. Let's assume that all my slaves are in a hostgroup together:


define service {
    use                    production-service
    hostgroup_name         mysql-slaves
    service_description    MySQL Replication Status
    check_command          check_mysql_replication!200!400
}

There. Now replication is being monitored for that group of hosts. I get performance data. I get alerts based on warning vs. critical. All done.

EVIDENTLY, in OpenNMS I simply can't do this. To be fair, there's a different set of eyes that should be used with OpenNMS. It's a network monitoring system. Network. Network. Everything is centered around that.

So to accomplish the same thing in OpenNMS, I've gotten about this far:

1) Create a script. No hardship. I'm doing it anyway. This script looks a little different though. Let me paste the actual one here:


#!/bin/sh
#. /opt/sysops/lib/init

POLLHOST=""

# OpenNMS passes --hostname <address> to the script (among other
# arguments), so walk the argument list until we find it.
while [ "$1" != "" ]; do
    if [ "$1" = "--hostname" ]; then
        shift
        POLLHOST=$1
        break
    else
        shift
    fi
done

if [ "$POLLHOST" = "" ]; then
    echo "FAIL no host specified"
    exit 1
fi

# Pull Slave_IO_Running out of SHOW SLAVE STATUS; the --xml output makes
# the field easy to grab with grep/awk.
QUERYPFX="mysql -u USERNAME --password=PASSWORD -h ${POLLHOST} --xml -Bse"
SLAVESTATUS=`${QUERYPFX} "show slave status;" | grep Slave_IO_Running | awk -F'[<|>]' '{print $3}'`

# The poller matches stdout against the configured "banner" value, so
# print SUCCESS when the IO thread is running and FAIL otherwise.
if [ "$SLAVESTATUS" = "Yes" ]; then
    printf "SUCCESS\n"
    exit 0
else
    printf "FAIL\n"
    printf "Status check returned: ${SLAVESTATUS}\n" 1>&2
    exit 1
fi

#. /opt/sysops/lib/uninit

Next I have to edit an XML file (poller-configuration.xml). I did this as a separate package because I didn't want the check applied to ALL boxes where MySQL is discovered, only the slaves:


<package name="MySQL-Replication-Slaves">
  <filter>IPADDR IPLIKE *.*.*.*</filter>
  <include-url xmlns="">file:/opt/opennms/include/mysql-replication-slaves.cfg</include-url>
  <rrd step="300">
    <rra xmlns="">RRA:AVERAGE:0.5:1:2016</rra>
    <rra xmlns="">RRA:AVERAGE:0.5:12:1488</rra>
    <rra xmlns="">RRA:AVERAGE:0.5:288:366</rra>
    <rra xmlns="">RRA:MAX:0.5:288:366</rra>
    <rra xmlns="">RRA:MIN:0.5:288:366</rra>
  </rrd>
  <service name="MySQL-Replication" interval="300000" user-defined="false" status="on">
    <parameter key="script" value="/scripts/cacti/checkrepl.sh"/>
    <parameter key="banner" value="SUCCESS"/>
    <parameter key="retry" value="1"/>
    <parameter key="timeout" value="3000"/>
    <parameter key="rrd-repository" value="/opt/opennms/share/rrd/response"/>
    <parameter key="ds-name" value="replication-status"/>
  </service>
  <outage-calendar xmlns="">Nightly backup of atlsvrdbs03</outage-calendar>
  <downtime begin="0" end="60000" interval="30000"/>
  <downtime begin="60000" end="43200000" interval="60000"/>
  <downtime begin="43200000" end="432000000" interval="600000"/>
  <downtime begin="432000000" delete="true"/>
</package>


And additionally, the matching monitor entry further down in the same file:


<monitor service="MySQL-Replication" class-name="org.opennms.netmgt.poller.monitors.GpMonitor"/>


There. Oh but wait, I also have to add something to a file called capsd-configuration.xml:


<protocol-plugin protocol="MySQL-Replication" class-name="org.opennms.netmgt.capsd.plugins.GpPlugin" scan="on" user-defined="true">
  <property key="script" value="/scripts/cacti/checkrepl.sh" />
  <property key="banner" value="SUCCESS" />
  <property key="timeout" value="3000" />
  <property key="retry" value="1" />
</protocol-plugin>


I think that's it. Now I have to wait for the scanner to run (or force it) to tie that poll to the servers in the range I defined. One thing you'll note is this GpPlugin that's being used. That's called the General Purpose Poller. It's basically the scripting interface: as near as I can tell, it just runs your script and checks stdout against the "banner" value to decide up or down. If you want to poll some arbitrary data that isn't covered by a predefined plugin or by SNMP, that's the way you have to do it.

The limitation of this is that it handles the poll in binary only. Either it's up or down. This goes back to the origins as a network monitoring system. The port is up or the port is down. The host is up or the host is down.

Over the years it appears that they've added other "plugins" that can handle information differently. These plugins support thresholding for alarms but really only in the area of latency in polling the service. Additionally, they appear to be simple port opens to the remote service - basically check_tcp in the Nagios world. There are some exceptions. I think the DNS plugin actually does a lookup. Some of the L2 related plugins do things like threshold on bandwidth. There's also a disk usage plugin that thresholds on free space. The Radius plugin actually tries to authenticate against the Radius server.

Then there's probably my biggest gripe. These are all written in Java. I'm not a Java programmer. I don't want to have to write a godforsaken polling plugin in Java. If I need something faster than a Perl/Ruby/Bash script for my plugin, then I'll see about writing it in C, but I've yet to come across that case.

So now I'm sitting at a point where at least OpenNMS knows when replication isn't running. I can modify my script to check latency and throw a FAIL if it's over a certain point, but that's script-wide; I can't set it on a host-by-host basis. Replication is a bad example, but it's not hard to imagine a situation where two servers running the same service would have different thresholds. In the OpenNMS case, I'd have to account for all that logic in my script.
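
Just to illustrate, here's roughly what that looks like bolted onto the GP script above (so QUERYPFX and SLAVESTATUS are already set). The 600-second limit is made up, and the point is that it lives in the script, not in the host or service definition:

# Hypothetical addition to checkrepl.sh: fail on lag as well as on a
# stopped IO thread. The threshold is hard-coded, so every host polled
# by this package gets the same limit.
MAXLAG=600

LAG=`${QUERYPFX} "show slave status;" | grep Seconds_Behind_Master | awk -F'[<|>]' '{print $3}'`
[ "$LAG" = "" ] && LAG=999999    # an empty/NULL lag means replication is broken anyway

if [ "$SLAVESTATUS" = "Yes" ] && [ "$LAG" -le "$MAXLAG" ]; then
    printf "SUCCESS\n"
    exit 0
else
    printf "FAIL\n"
    exit 1
fi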

"But John, OpenNMS supports Nagios plugins," you might be thinking.

No they don't. They support NRPE. This means I have to have all my scripts installed on every single host I monitor AND I have to install the agent on the system itself. Why should I have to do that when, with other systems, I can do all the work from the script on the monitoring server?
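
For the uninitiated, NRPE means every monitored box runs the agent and carries something like this in its local nrpe.cfg - the path and thresholds here are illustrative, but the check script itself has to live on that box:

# /etc/nagios/nrpe.cfg on EVERY slave - the plugin lives on the slave,
# not on the monitoring server
command[check_mysql_replication]=/usr/local/nagios/libexec/check_mysql_replication -h localhost -w 200 -c 400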

"Oh, but you can use extTable.extEntry in snmpd.conf." Same problem as NRPE, minus the agent hassle: I still have to copy the scripts to every remote server.
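
Which, if I remember the net-snmp syntax right, amounts to something like this on each slave (assuming you've already copied the script over) - the exit status and output get exposed under UCD-SNMP-MIB::extTable:

# /etc/snmp/snmpd.conf on each slave
exec mysqlrepl /scripts/cacti/checkrepl.sh --hostname localhost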

So after spending all morning googling, I might be able to do what I want (i.e., thresholding) if I hook another program that's distributed with OpenNMS - send-event.pl - into my script. However, the documentation on send-event.pl is SORELY lacking. Maybe I don't speak OpenNMS well enough for my google-fu to work, but I can describe the exact same set of results I get every single time I look for ANYTHING related to send-event.pl:

1) Forcing the OpenNMS engine to discover a new node or restart a specific process within opennms using send-event.pl
2) Vague references with no actual code on using send-event.pl and SEC or swatch for log parsing. Of course these are all dated as OpenNMS has actually added a syslog plugin now.

And evidently, there's some XML editing that has to go on before I can actually do anything with send-event.pl. My experimentation this morning went like this:


./send-event.pl uei.opennms.org/internal/capsd/updateService localhost \
--interface 10.0.101.107 -n 107 --service MySQL-Replication --severity 6 --descr "Replication lagging"


Now I know I didn't get it entirely right. Let's forget for a moment the cryptic uei.opennms.org stuff. All I'm trying to do is send an event (send-event.pl sounds like the right tool) to OpenNMS for a given node. I want that to be the equivalent of an SNMP trap. I know the uei line is wrong, but I can't find a simple list of the valid uei strings so I can take a stab at it. There's a shell script called UEIList.sh, but it complains about missing dependent shell scripts AND fires up Java to do the work. Why is it so hard to have that list in the OpenNMS wiki with some notes about each one?

So after all these machinations, I'm still left with a half-assed implementation of replication monitoring in OpenNMS.

I would love for someone to point me to a rough outline of what I'm trying to accomplish in OpenNMS. I've hit every single link I could find. I've viewed EVERY SINGLE RESULT from google and the mailing lists that had send-event.pl in them. Maybe I'm going down the wrong track. Maybe I'm not looking at it the right way.

I asked my boss yesterday if there was any emotional attachment to OpenNMS and he said there wasn't. Right now they have Cacti and OpenNMS running - double polling on most things just to get graphs into Cacti for trending purposes. At least by switching to Nagios, I can have the perfdata dumped to an RRD and keep Cacti around purely for display purposes. Or I can have Cacti continue to do some of the polling and have Nagios check the RRD for data to alert on.
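
For reference, the perfdata-to-RRD part is mostly stock Nagios configuration plus a processing script. The script name below is hypothetical - in practice it'd be something like PNP or a small wrapper around rrdtool:

# nagios.cfg
process_performance_data=1
service_perfdata_command=process-service-perfdata

# commands.cfg - hand the perfdata off to whatever updates the RRDs
define command {
    command_name    process-service-perfdata
    command_line    /usr/local/nagios/libexec/perfdata2rrd.pl "$HOSTNAME$" "$SERVICEDESC$" "$SERVICEPERFDATA$"
}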

I'm giving OpenNMS another day and if I can't make any headway, I'm switching to Nagios.

Tuesday, May 12, 2009

Annoying sign in Roswell

So my work phone doesn't have a camera built in. Couldn't snap a picture but I saw a sign in downtown Roswell today at lunch that really pissed me off.

"Be patriotic. Stimulate the economy"

This was on the front of one of the local shops. My first reaction, which I dutifully followed through with, was to call my wife and bitch and moan.

I dread the day when my spending becomes the measure of my patriotism.

Monday, May 11, 2009

Icinga musings

So I just heard about the Icinga fork of Nagios. Looking at the motivation, I can't say that I disagree with the fork. Obviously there is a standard set of "things" that most people add to Nagios. Many times those "things" feel like they should be standard, and others are just fluff.

Looking over the Icinga project team (and the project goals), we can get a feel for the things they want to add on:

  • PNP
  • NagVis
  • Grapher
  • NagTrap
  • more NDO

I can get behind most of those. I think they're all wonderful add-ons. However, I have some reservations about one thing that isn't listed. This is from the Icinga page:

"The most significant modification and difference to Nagios is a completely new web interface based on PHP. This allows a wider circle of developers to contribute to the web interface and easier adjustments to be made by users to their individual environments."

I honestly just cannot get behind that in principle. I fully understand the concerns and problems people have. The Nagios interface is "ugly" in a sense but change for change's sake is just silly.

Why PHP? Why not Perl (see Groundwork) or Ruby or Python? It's an arbitrary decision. I like my Nagios installations slim, doing what they do best: monitoring and alerting. You don't even HAVE to have a web interface. I don't want to bog down my monitoring server with YAP (yet another package). For all its warts, I like the way the Nagios web interface works. There's nothing wrong with CGI scripts. They work. The Nagios CGIs work.

Are they a bitch to deal with? Of course, but moving to PHP isn't going to immediately make it better unless there's a framework or an API or standards to work against. I have full faith in the Icinga team to make an outstanding interface, but I'm wondering what sort of process is going to be in place to make sure the interface is "stable". One thing that can be argued in favor of the current setup is that it's not at the whims of a constantly changing language like PHP.*

I guess my feeling is that the Icinga folks want to make something MORE of Nagios. Make it more than what it is at the core - network monitoring. There's a valid argument to be made that an "enterprise" monitoring system should have an SNMP trap handler but I personally don't think snmptt is the way to go. If it's that important, it should be something NOT written in a scripting language. If handling traps is of the utmost importance, it should be able to handle whatever volume of traps per second you throw at it. I can't find any performance numbers for snmptt so I can't tell you.

I think the biggest problem I've had with Nagios is that it isn't modular enough. It lacks something we've all come to appreciate these days - the concept of plugins. Admittedly, it's one guy. If he doesn't see a need for it, then we probably won't ever see it. Nagios really needs a standard way for people to plug into it. Right now we have bolt-on solutions that never REALLY feel integrated. Maybe that's what Icinga wants to do. I can appreciate, however, the leanness that Nagios has had for this long. Maybe times have changed and monitoring doesn't just encompass monitoring anymore. I don't know, but in my mind, monitoring is still a distinct entity from trending. They go hand in hand, but Nagios has never billed itself as an all-in-one monitoring and trending solution. It monitors, and it alerts. Occasionally it "event handles", but long-term storage and analysis of the data is out of scope.

Anyway, much of this has been a ramble based on first blush. I'm sure I'll have more to say. I'll follow the project closely and see what it does. I fully expect a lot of people to switch over just for the "completeness" and "aesthetic" factor. Groundwork has clients, after all. The demand is there. However, I'm just not sure if I'll make the switch myself.

Maybe the whole thing will prompt Ethan to respond in a positive way and make my wish list come true ;)

- API into the monitoring system
- Native support for RRD storage of perfdata information

Those are my two biggest. I would LOVE to have an API into the live core of the engine to make changes to resources. One thing that I loved about Groundwork (I think it was Groundwork) was that it had a command-line API for adding and removing hosts. I'm really hoping that in the end we end up with Nagios as a framework: something with its own basic functionality, but that better allows solutions to be built on top of it. Want to build your own interface? Pull a list of hosts from the API. Pull a list of last known states for each host. Display it.
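
Just to make that wish concrete, something along these lines - completely hypothetical command names, nothing like this exists in stock Nagios today:

# Entirely hypothetical CLI - a sketch of what a command-line API into a
# running Nagios core could look like, not a real tool.
nagios-api add-host --name db04 --address 10.0.101.110 --hostgroup mysql-slaves
nagios-api list-hosts --hostgroup mysql-slaves
nagios-api host-status db04        # last known state, last check, current output
nagios-api remove-host db04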

* By constantly changing, I mean compared to traditional languages like C. PHP also has (and many developers will admit this) inconsistencies and other "gotchas" left over from years of backwards compatibility.