Thursday, May 14, 2009

Annoyances with OpenNMS

So my company has an existing install of OpenNMS and Cacti. I don't like to barrel through the gate and make changes. I'm trying to work within the system here. So I come from a Nagios world. If I need something monitored, I write a script for it. Nagios handles it. My status can be up, down, warning, critical or unknown.

In Nagios, I do the following to add a new service to a host:
1) Define a check command.
2) Add a service to the host using that check command
3) Profit

About the check command:
Most likely, one is already defined, but in the interest of fairness I'm going with a brand new check. Let's use the one I'm beating my head against in OpenNMS - MySQL replication status. So I have a script written. It runs some queries and gives me replication status. Additionally, I'm grabbing replication latency. I use this information to determine a more "fine grained" status. Being up or down is pointless if my slaves are 4 hours behind the master.

So I make my script take a few parameters - a warning level, a critical level and the hostname. So I have the following bits of configuration information:


define command {
    command_name    check_mysql_replication
    command_line    $USER99$/check_mysql_replication -h $HOSTADDRESS$ -u $USER1$ -p $USER2$ -w $ARG1$ -c $ARG2$
}


What the above gives me is the most flexible way to monitor replication on a system. Each hostgroup or host can have different levels for warning or critical. I could even change the USER macros into ARG so that I can pass the credentials. Oh and that script can be written in any language I like.
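
If I wanted the credentials to be per-host too, I could shift them into ARG macros. A quick sketch of that variation (the _args command name and the credentials below are made up for illustration):


define command {
    command_name    check_mysql_replication_args
    command_line    $USER99$/check_mysql_replication -h $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ -w $ARG3$ -c $ARG4$
}


The check_command for that variation would then look like check_mysql_replication_args!repluser!replpass!200!400.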

Then I add the service to my host/hostgroup. Let's assume that all my slaves are in a hostgroup together:


define service {
    use                  production-service
    hostgroup_name       mysql-slaves
    service_description  MySQL Replication Status
    check_command        check_mysql_replication!200!400
}

There. Now replication is being monitored for that group of hosts. I get performance data. I get alerts based on warning vs. critical. All done.
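
The performance data piece is just the standard plugin output convention: print a status line, a pipe, then label=value;warn;crit data, and exit 0/1/2. My script's output would look something like this (numbers made up):


MySQL Replication OK - slave is 12 seconds behind master | lag=12s;200;400;0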

EVIDENTLY, in OpenNMS I simply can't do this. To be fair, there's a different set of eyes that should be used with OpenNMS. It's a network monitoring system. Network. Network. Everything is centered around that.

So to accomplish the same thing in OpenNMS, I've gotten about this far:

1) Create a script. No hardship. I'm doing it anyway. This script looks a little different though. Let me paste the actual one here:


#!/bin/sh
#. /opt/sysops/lib/init

POLLHOST=""

# Walk the arguments looking for --hostname; that's how the poller hands us the node's address.
while [ "$1" != "" ]; do
    if [ "$1" = "--hostname" ]; then
        shift
        POLLHOST="$1"
        break
    else
        shift
    fi
done

if [ "$POLLHOST" = "" ]; then
    echo "FAIL no host specified"
    exit 1
fi

# Query the slave directly and pull Slave_IO_Running out of the XML output.
QUERYPFX="mysql -u USERNAME --password=PASSWORD -h ${POLLHOST} --xml -Bse"
SLAVESTATUS=`${QUERYPFX} "show slave status;" | grep Slave_IO_Running | awk -F'[<|>]' '{print $3}'`

# The poller only cares about the banner: SUCCESS (exit 0) or FAIL (exit 1).
if [ "$SLAVESTATUS" = "Yes" ]; then
    printf "SUCCESS\n"
    exit 0
else
    printf "FAIL\n"
    printf "Status check returned: ${SLAVESTATUS}\n" 1>&2
    exit 1
fi

#. /opt/sysops/lib/uninit

Next I have to edit an XML file (poller-configuration.xml). I did this as a package because I didn't want it applied to ALL boxes where MySQL is discovered - only the slaves:


<package name="MySQL-Replication-Slaves">
  <filter>IPADDR IPLIKE *.*.*.*</filter>
  <include-url xmlns="">file:/opt/opennms/include/mysql-replication-slaves.cfg</include-url>
  <rrd step="300">
    <rra xmlns="">RRA:AVERAGE:0.5:1:2016</rra>
    <rra xmlns="">RRA:AVERAGE:0.5:12:1488</rra>
    <rra xmlns="">RRA:AVERAGE:0.5:288:366</rra>
    <rra xmlns="">RRA:MAX:0.5:288:366</rra>
    <rra xmlns="">RRA:MIN:0.5:288:366</rra>
  </rrd>
  <service name="MySQL-Replication" interval="300000" user-defined="false" status="on">
    <parameter key="script" value="/scripts/cacti/checkrepl.sh"/>
    <parameter key="banner" value="SUCCESS"/>
    <parameter key="retry" value="1"/>
    <parameter key="timeout" value="3000"/>
    <parameter key="rrd-repository" value="/opt/opennms/share/rrd/response"/>
    <parameter key="ds-name" value="replication-status"/>
  </service>
  <outage-calendar xmlns="">Nightly backup of atlsvrdbs03</outage-calendar>
  <downtime begin="0" end="60000" interval="30000"/>
  <downtime begin="60000" end="43200000" interval="60000"/>
  <downtime begin="43200000" end="432000000" interval="600000"/>
  <downtime begin="432000000" delete="true"/>
</package>
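
That include-url file, by the way, is nothing fancy - just the slave addresses, one per line, so the package only applies to those boxes (the second address here is made up):


10.0.101.107
10.0.101.108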


And additionally:


<monitor service="MySQL-Replication" class-name="org.opennms.netmgt.poller.monitors.GpMonitor"/>


There. Oh but wait, I also have to add something to a file called capsd-configuration.xml:


<protocol-plugin protocol="MySQL-Replication" class-name="org.opennms.netmgt.capsd.plugins.GpPlugin" scan="on" user-defined="true">
  <property key="script" value="/scripts/cacti/checkrepl.sh" />
  <property key="banner" value="SUCCESS" />
  <property key="timeout" value="3000" />
  <property key="retry" value="1" />
</protocol-plugin>


I think that's it. Now I have to wait for the scanner to run (or force it) to tie that poll to the servers in the range I defined. One thing you'll note is the GpPlugin that's being used. That's called the General Purpose Poller - it's basically the scripting interface. If you want to poll some arbitrary data that isn't covered by a predefined plugin or by SNMP, that's the way you have to do it.
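
Since the GP poller just runs the script and matches the output against the configured banner, you can at least sanity-check the whole thing by hand (using one of the slave addresses):


$ /scripts/cacti/checkrepl.sh --hostname 10.0.101.107
SUCCESS
$ echo $?
0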

The limitation of this is that it handles the poll in binary only. Either it's up or down. This goes back to the origins as a network monitoring system. The port is up or the port is down. The host is up or the host is down.

Over the years it appears that they've added other "plugins" that can handle information differently. These plugins support thresholding for alarms but really only in the area of latency in polling the service. Additionally, they appear to be simple port opens to the remote service - basically check_tcp in the Nagios world. There are some exceptions. I think the DNS plugin actually does a lookup. Some of the L2 related plugins do things like threshold on bandwidth. There's also a disk usage plugin that thresholds on free space. The Radius plugin actually tries to authenticate against the Radius server.

Then there's probably my biggest gripe: these are all written in Java. I'm not a Java programmer. I don't want to have to write a godforsaken polling plugin in Java. If I need something faster than a perl/ruby/bash script for my plugin, then I'll see about writing it in C, but I've yet to come across that case.

So now I'm sitting at a point where at least OpenNMS knows when replication isn't running. I can modify my script to check latency and throw a FAIL if it's over a certain point, but that threshold is script-wide; I can't set it on a host-by-host basis. Replication may be a bad example, but it's not hard to imagine a situation where two servers running the same service would need different thresholds. In the OpenNMS case, I'd have to account for all of that logic in my script.
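
For the record, bolting latency onto the existing script would look roughly like this - note that the 400-second cutoff ends up hard-coded, which is exactly the problem:


# Pull Seconds_Behind_Master out of the same "show slave status" output.
LAG=`${QUERYPFX} "show slave status;" | grep Seconds_Behind_Master | awk -F'[<|>]' '{print $3}'`

# The threshold is baked into the script, with no per-host warning/critical
# levels like the Nagios version gets.
if [ "$SLAVESTATUS" = "Yes" ] && [ "$LAG" -lt 400 ] 2>/dev/null; then
    printf "SUCCESS\n"
    exit 0
else
    printf "FAIL\n"
    exit 1
fi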

"But John, OpenNMS supports Nagios plugins," you might be thinking.

No they don't. They support NRPE. This means I have to have all my scripts installed on every single host I monitor AND I have to install the agent on the system itself. Why should I have to do that when, with other systems, I can do all the work from the script on the monitoring server?
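
To be concrete, the NRPE route means every monitored slave needs the agent running plus an entry like this in its local nrpe.cfg (a sketch; the libexec path and arguments are whatever your install uses), with the check script sitting on that box:


command[check_mysql_replication]=/usr/local/nagios/libexec/check_mysql_replication -h localhost -w 200 -c 400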

Oh, but you can use extTable.extEntry in snmpd.conf, you say. That's the same problem as NRPE, just without the agent hassle: I still have to copy the scripts to the remote server.
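
That route boils down to something like this in each remote slave's snmpd.conf (the exec directive is what populates extTable), which still means the script lives on every box:


# in the slave's local snmpd.conf - the check script has to be on that machine
exec checkrepl /scripts/cacti/checkrepl.sh --hostname localhost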

So after spending all morning googling, I might be able to do what I want (i.e. thresholding) if I hook another program that ships with OpenNMS - send-event.pl - into my script. However, the documentation on send-event.pl is SORELY lacking. Maybe I don't speak OpenNMS well enough for my google-fu to work, but I can describe the exact same set of results I get every single time I look for ANYTHING related to send-event.pl:

1) Forcing the OpenNMS engine to discover a new node or restart a specific process within opennms using send-event.pl
2) Vague references with no actual code on using send-event.pl and SEC or swatch for log parsing. Of course these are all dated as OpenNMS has actually added a syslog plugin now.

And evidently, there's some XML editing that has to happen before I can actually do anything with send-event.pl. My experimentation this morning went like this:


./send-event.pl uei.opennms.org/internal/capsd/updateService localhost \
--interface 10.0.101.107 -n 107 --service MySQL-Replication --severity 6 --descr "Replication lagging"


Now I know I didn't get it entirely right. Let's forget for a moment the cryptic uei.opennms.org stuff. All I'm trying to do is send an event (send-event.pl sounds like the right tool) to OpenNMS for a given node. I want that to be the equivalent of an SNMP trap. I know the uei line is wrong, but I can't find a simple list of valid uei strings to take a stab at. There's a shell script called UEIList.sh, but it complains about missing dependent shell scripts AND fires up Java to do the work. Why is it so hard to have that list in the OpenNMS wiki with some notes about each one?
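
From what I can piece together - and I haven't gotten this working, so treat it as a rough guess - the XML editing involves defining your own event, with a UEI you make up, in eventconf.xml, something along these lines:


<event>
    <uei>uei.opennms.org/custom/mysqlReplicationLagging</uei>
    <event-label>MySQL replication lagging</event-label>
    <descr>Replication latency on %interface% exceeded the threshold</descr>
    <logmsg dest="logndisplay">MySQL replication lagging on %interface%</logmsg>
    <severity>Minor</severity>
</event>


Then the send-event.pl call above would presumably use that custom UEI instead of the capsd one.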

So after all these machinations, I'm still left with a half-assed implementation of replication monitoring in OpenNMS.

I would love for someone to point me to a rough outline of what I'm trying to accomplish in OpenNMS. I've hit every single link I could find. I've viewed EVERY SINGLE RESULT from google and the mailing lists that had send-event.pl in them. Maybe I'm going down the wrong track. Maybe I'm not looking at it the right way.

I asked my boss yesterday if there was any emotional attachment to OpenNMS and he said there wasn't. Right now they have Cacti and OpenNMS running, double-polling most things just to get graphs into Cacti for trending purposes. At least by switching to Nagios, I can have the perfdata dumped to an RRD and just keep Cacti around for display purposes. Or I can have Cacti continue to do some of the polling and have Nagios check the RRD for data to alert on.

I'm giving OpenNMS another day and if I can't make any headway, I'm switching to Nagios.

11 comments:

  1. I should add that replication is only the tip of the iceberg. I have a JIRA ticket with a list of things we want to get into OpenNMS that are currently either only being graphed in Cacti or are cronjobs running on every individual system.

  2. Thanks for detailing this stuff, it is much more useful than "OpenNMS sucks, use Nagios." :)

  3. This kind of complaint is common among folks who come to OpenNMS from a Nagios / Icinga background. It's perfectly reasonable that when you just want to do a simple test, you just write a simple script that obeys the interface convention and glue it in as a Nagios check. This model works well in an environment with a few dozen up to a few hundred managed nodes. Try to scale it up to thousands of nodes and you'll have twin headaches: the overhead of forking your check script thousands of times every few minutes will kill the server, and the overhead of managing the config files will kill you.

    OpenNMS was designed from the very beginning to manage tens of thousands of nodes from a single installation, and every decision about its design is made with that goal in mind. As developers, we're well aware that there is significantly more overhead involved in adding a service to OpenNMS than to a product like Nagios. It's an intentional trade-off between time spent up front and time saved down the road.

    At the end of the day, Nagios and OpenNMS are both great tools that can often be used to solve the same kinds of problems -- they're just designed to solve those problems on very different scales. You must have your head in the appropriate mode for the tool you're using.

  4. @phoenix rails - thanks. I'm not a fan of people who simply say "X" sucks and "Y" rules, myself.

    @jeffg -
    I've had Nagios installations that have monitored "thousands of nodes". At a previous company we had two Nagios installs (for administrative purposes not performance).

    The first install handled our backend stuff. Monitoring DB2 statistics. AIX. Linux. HP Network gear. Standard stuff.

    The second was much "larger". It monitored our retail environment. At the time that I left the company, we had 600ish retail locations.

    Each location had anywhere from 1-6 bits of network gear based on connectivity in the region:

    - A DSL or Cable modem
    - A Netscreen 5GT
    - 1 HP ProCurve AP 420
    - 1-4 HP JetDirect Print Servers

    As far as the DSL/Cable modem went, we simply did icmp checks. We did (I think), 4 or 5 snmp checks on the Netscreen. The access point had 1 or 2 checks and the print servers had 2 or 3 checks. We didn't bother to poll the workstations at the site as those were designed to be interchangeable.

    So taking our 600 store number and working with the smallest store (Netscreen 5GT, HP 420 AP and 2 Print Servers), that's 2400 endpoints being monitored. If we assume each endpoint had two checks (snmp and icmp), that's 4800 checks - or even more "nodes", depending on how you define a node.

    The system ran fine. Mind you, those checks were C-based, but in my experience custom monitor scripts usually aren't polled as heavily. My custom script that follows business logic in the database is only used against one host, for instance, while my icmp check is used against EVERY host.

    I'm sure others can back this up with statistics on Nagios installation sizes; however, I'm not in this for a measurement contest ;)

    You have a point though: large installs can be a pain to manage IF you don't follow some best practices for managing your config files. I have a series on just that topic that I haven't had time to finish publishing.

    I will, however, disagree that the method of configuration has any bearing on the number of nodes that OpenNMS can support. In both cases, the configuration is stored in a flat file (XML vs. Nagios' own configuration syntax). Nagios "compiles" that down at startup from all the config files into an optimized format for the engine to use.

    Again, the biggest concern/question I had is whether OpenNMS is the right tool for this task. I'm willing to work with it if it is.

    Based on my original post, I don't know what frame of mind I should be in for OpenNMS. Scaling isn't the question. It appears that OpenNMS was designed around SNMP at the core. Interface events. Up/Down. Handling traps from devices. I can't see how, short of learning Java, I can actually write a simple business logic check that has anything other than Yes or No as the answer.

  5. The two most important factors in deciding which tool to use for a job are:

    1. Which one is better suited over the life of the implementation?
    2. Which is more comfortable / familiar to you and the other people who will be maintaining the solution?

    There's a third question that's harder to answer, which is why your predecessors selected OpenNMS and Cacti. Generally speaking, the former is a superset of the latter's functionality, so there's probably a complicated explanation involving organizational changes and company politics :)

    I've got only your resentment to gain by convincing you the wrong way, so I'll leave you with a simple suggestion:

    Use what works for you.

  6. Jeff,

    Thanks for your honesty. I wasn't honestly looking for a fight when I made the post. I did tag my tweet hoping to get some feedback from some OpenNMS users.

    I did write a lot of the post out of frustration at wasting my entire morning trying to find an answer to my problem. I'd normally try the newsgroups/mailing list/IRC route, but I didn't feel like I had a lot of time to waste after this morning's delay.

    I'm going to give OpenNMS more time tomorrow. I'll try and get my head in the OpenNMS "mode" and see what I can "discover".

    What's interesting is that, while we're a PHP shop, several of our developers have Java backgrounds. I actually got a loose commitment from my boss to have one of the developers see what they can come up with plugin-wise in-house, once I have all the information in place to pass to them.

    Not many people can say that ;) heh.

  7. Been poking around OpenNMS for 3 days now. All I want to do is send an SNMP v3 trap from my own code to OpenNMS at a certain port [like send(pdu, target)], and hopefully OpenNMS can detect the traps as events/alarms somewhere and record them in its notifications/events or alarms tab. Been through all the config files in /etc but still no joy.
    I've searched every corner of Google but failed to find any guidance...

  8. We wanted to move off Nagios a while ago and eval'd OpenNMS and Zenoss. Frankly, one of our requirements was being able to set up hosts via the GUI... the other major requirement was being able to use all of the custom Nagios checks we've written. OpenNMS is just a pain to install and manage. Most monitoring systems don't get touched very much other than to view the stats... so every time you have to change/add a host you've got to re-wrap your mind around it. Zenoss is much better from that point of view: the benefits of OpenNMS with a decent configuration GUI and the ability to use Nagios plugins (not just via NRPE).

  9. I like nagios. I like opennms. But zenoss over opennms? c'mon :)

    Zenoss doesn't present data in intuitive ways. It's filled with data one has to sift through to find anything useful and it takes a lot of clicks to get one piece of information. I can't stand that GUI honestly. It's a waste of time to troubleshoot or baseline things with Zenoss.

    OpenNMS, on the other hand, may take a little more to configure but it presents data in a much more useful manner and the reports are far better than any of the other open source monitoring tools available. It has a good balance of meeting executive and engineer requirements. Nagios is good for the engineer. Zenoss is good for GUI users that want to avoid the command line and don't know what they're looking at anyways.

  10. I'm not sure if this was pointed out before but OpenNMS can handle Nagios agents/scripts/NRPE/NSClient/etc. with very few modifications. http://www.opennms.org/wiki/Discovery#NRPE

  11. Hey bro, was wondering what finally happened. Did you give up on OpenNMS, or are you still using it?!
