In Nagios, I do the following to add a new service to a host:
1) Define a check command.
2) Add a service to the host using that check command.
3) Profit
About the check command:
Most likely, one is already defined, but in the interest of fairness I'm going with a brand new check. Let's use the one I'm beating my head against in OpenNMS - MySQL replication status. So I have a script written. It runs some queries and gives me replication status. Additionally, I'm grabbing replication latency. I use this information to determine a more "fine-grained" status. Being up or down is pointless if my slaves are 4 hours behind the master.
So I make my script take a few parameters - a warning level, a critical level, and the hostname. That gives me the following bit of configuration:
define command {
    command_name    check_mysql_replication
    command_line    $USER99$/check_mysql_replication -h $HOSTADDRESS$ -u $USER1$ -p $USER2$ -w $ARG1$ -c $ARG2$
}
What the above gives me is an extremely flexible way to monitor replication. Each hostgroup or host can have different warning and critical levels. I could even change the USER macros into ARG macros so that I can pass the credentials per check. Oh, and that script can be written in any language I like.
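For completeness, here's a minimal sketch of what a plugin like that could look like. This is my illustration, not the actual script - the option names match the command definition above, but the lag query and parsing are assumptions; only the exit codes (0=OK, 1=WARNING, 2=CRITICAL) and the perfdata after the pipe are standard Nagios plugin conventions:

#!/bin/sh
# Hypothetical sketch of check_mysql_replication.
while getopts "h:u:p:w:c:" opt; do
    case $opt in
        h) HOST=$OPTARG ;;
        u) USER=$OPTARG ;;
        p) PASS=$OPTARG ;;
        w) WARN=$OPTARG ;;
        c) CRIT=$OPTARG ;;
    esac
done

# Seconds_Behind_Master comes straight out of SHOW SLAVE STATUS.
LAG=$(mysql -u "$USER" --password="$PASS" -h "$HOST" -Bse \
    "show slave status\G" | awk '/Seconds_Behind_Master/ {print $2}')

if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then
    echo "CRITICAL - replication not running or slave unreachable"
    exit 2
elif [ "$LAG" -ge "$CRIT" ]; then
    echo "CRITICAL - slave is ${LAG}s behind | lag=${LAG}s;${WARN};${CRIT}"
    exit 2
elif [ "$LAG" -ge "$WARN" ]; then
    echo "WARNING - slave is ${LAG}s behind | lag=${LAG}s;${WARN};${CRIT}"
    exit 1
fi
echo "OK - slave is ${LAG}s behind | lag=${LAG}s;${WARN};${CRIT}"
exit 0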
Then I add the service to my host/hostgroup. Let's assume that all my slaves are in a hostgroup together:
define service {
    use                 production-service
    hostgroup_name      mysql-slaves
    service_description MySQL Replication Status
    check_command       check_mysql_replication!200!400
}
There. Now replication is being monitored for that group of hosts. I get performance data. I get alerts based on warning vs. critical. All done.
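And if one particular slave is allowed to lag (a reporting slave, say), I just pass different arguments for that host. If I remember Nagios's precedence rules right, a host-specific definition of the same service_description wins over the hostgroup one - the host name here is hypothetical:

define service {
    use                 production-service
    host_name           reporting-slave-01   ; hypothetical lagging slave
    service_description MySQL Replication Status
    check_command       check_mysql_replication!3600!7200
}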
EVIDENTLY, in OpenNMS I simply can't do this. To be fair, OpenNMS needs to be looked at with a different set of eyes. It's a network monitoring system. Network. Network. Everything is centered around that.
So to accomplish the same thing in OpenNMS, I've gotten about this far:
1) Create a script. No hardship. I'm doing it anyway. This script looks a little different though. Let me paste the actual one here:
#!/bin/sh
#. /opt/sysops/lib/init

# Walk the argument list looking for --hostname; the GP poller passes
# its parameters on the command line.
POLLHOST=""
while [ "$1" != "" ]; do
    if [ "$1" = "--hostname" ]; then
        shift
        POLLHOST=$1
        break
    else
        shift
    fi
done

if [ "$POLLHOST" = "" ]; then
    echo "FAIL no host specified"
    exit 1
fi

# Pull Slave_IO_Running out of the XML-formatted SHOW SLAVE STATUS output.
QUERYPFX="mysql -u USERNAME --password=PASSWORD -h ${POLLHOST} --xml -Bse"
SLAVESTATUS=$(${QUERYPFX} "show slave status;" | grep Slave_IO_Running | awk -F'[<>]' '{print $3}')

# The GP poller matches on banner text, so print SUCCESS for up.
if [ "$SLAVESTATUS" = "Yes" ]; then
    printf "SUCCESS\n"
    exit 0
else
    printf "FAIL\n"
    printf "Status check returned: %s\n" "$SLAVESTATUS" 1>&2
    exit 1
fi
#. /opt/sysops/lib/uninit
Now I have to edit an XML file (poller-configuration.xml). I did this as a package because I didn't want it applied to ALL boxes where MySQL is discovered - only the slaves:
<package name="MySQL-Replication-Slaves">
    <filter>IPADDR IPLIKE *.*.*.*</filter>
    <include-url xmlns="">file:/opt/opennms/include/mysql-replication-slaves.cfg</include-url>
    <rrd step="300">
        <rra xmlns="">RRA:AVERAGE:0.5:1:2016</rra>
        <rra xmlns="">RRA:AVERAGE:0.5:12:1488</rra>
        <rra xmlns="">RRA:AVERAGE:0.5:288:366</rra>
        <rra xmlns="">RRA:MAX:0.5:288:366</rra>
        <rra xmlns="">RRA:MIN:0.5:288:366</rra>
    </rrd>
    <service name="MySQL-Replication" interval="300000" user-defined="false" status="on">
        <parameter key="script" value="/scripts/cacti/checkrepl.sh"/>
        <parameter key="banner" value="SUCCESS"/>
        <parameter key="retry" value="1"/>
        <parameter key="timeout" value="3000"/>
        <parameter key="rrd-repository" value="/opt/opennms/share/rrd/response"/>
        <parameter key="ds-name" value="replication-status"/>
    </service>
    <outage-calendar xmlns="">Nightly backup of atlsvrdbs03</outage-calendar>
    <downtime begin="0" end="60000" interval="30000"/>
    <downtime begin="60000" end="43200000" interval="60000"/>
    <downtime begin="43200000" end="432000000" interval="600000"/>
    <downtime begin="432000000" delete="true"/>
</package>
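For reference, the include file named in that include-url is, as far as I can tell, just a list of the slave IP addresses, one per line (the second address here is hypothetical):

# /opt/opennms/include/mysql-replication-slaves.cfg
10.0.101.107
10.0.101.108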
And additionally, in the monitors section of the same file:
<monitor service="MySQL-Replication" class-name="org.opennms.netmgt.poller.monitors.GpMonitor"/>
There. Oh but wait, I also have to add something to a file called capsd-configuration.xml:
<protocol-plugin protocol="MySQL-Replication" class-name="org.opennms.netmgt.capsd.plugins.GpPlugin" scan="on" user-defined="true">
<property key="script" value="/scripts/cacti/checkrepl.sh" />
<property key="banner" value="SUCCESS" />
<property key="timeout" value="3000" />
<property key="retry" value="1" />
</protocol-plugin>
I think that's it. Now I have to wait for the scanner to run (or force it) to tie that poll to the servers in the range I defined. One thing you'll note is the GpPlugin that's being used. That's the General Purpose Poller - basically the scripting interface. If you want to poll some arbitrary data that isn't covered by a predefined plugin or by SNMP, that's the way you have to do it.
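For what it's worth, forcing the scan appears to be a send-event.pl incantation itself. This is my best guess at the invocation, pieced together from mailing list posts - the forceRescan UEI and the node id are assumptions on my part:

./send-event.pl uei.opennms.org/internal/capsd/forceRescan localhost -n 107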
The limitation of the GP poller is that it handles the poll in binary terms only. Either the service is up or it's down. This goes back to the origins as a network monitoring system. The port is up or the port is down. The host is up or the host is down.
Over the years it appears that they've added other "plugins" that can handle information differently. These plugins support thresholding for alarms, but really only on the latency of polling the service itself. Additionally, most appear to be simple port opens against the remote service - basically check_tcp in the Nagios world. There are some exceptions: I think the DNS plugin actually does a lookup, some of the L2-related plugins threshold on bandwidth, there's a disk usage plugin that thresholds on free space, and the Radius plugin actually tries to authenticate against the Radius server.
Then there's probably my biggest gripe: these are all written in Java. I'm not a Java programmer. I don't want to have to write a god-forsaken polling plugin in Java. If I ever need something faster than a Perl/Ruby/Bash script for a plugin, I'll see about writing it in C, but I've yet to come across that case.
So now I'm sitting at a point where at least OpenNMS knows when replication isn't running. I can modify my script to check latency and throw a FAIL if it's over a certain point, but that threshold is script-wide. I can't set it on a host-by-host basis. Replication is a bad example, but it's not hard to imagine a situation where two servers running the same service would need different thresholds. In the OpenNMS case, I'd have to account for all that logic in my script.
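To make that concrete, here's roughly the hack I'd graft into checkrepl.sh just before the final status check. Note the hard-coded threshold - every host in the package gets the same limit, which is precisely the problem:

# Hypothetical addition: also FAIL when the slave lags too far.
MAXLAG=400   # script-wide; no per-host override possible here
LAG=$(${QUERYPFX} "show slave status;" | grep Seconds_Behind_Master | awk -F'[<>]' '{print $3}')
if [ -z "$LAG" ] || [ "$LAG" -gt "$MAXLAG" ]; then
    printf "FAIL\n"
    printf "Replication lag: %s\n" "$LAG" 1>&2
    exit 1
fi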
"But John, OpenNMS supports Nagios plugins," you might be thinking.
No, they don't. They support NRPE. This means I have to have all my scripts installed on every single host I monitor, AND I have to install the agent on the system itself. Why should I have to do that when, with other systems, I can do all the work from the script on the monitoring server?
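For anyone unfamiliar: with NRPE, the command definition and the script live on the monitored host, and the server just asks the agent to run them. Something like this, with hypothetical paths and host names:

# In nrpe.cfg on EVERY monitored slave:
command[check_mysql_replication]=/usr/local/nagios/libexec/check_mysql_replication -h localhost -u user -p pass -w 200 -c 400

# From the monitoring server:
./check_nrpe -H db-slave-01 -c check_mysql_replication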
Oh, but you can use extTable.extEntry in snmpd.conf. Same problem as NRPE, minus the agent hassle: I still have to copy the scripts to the remote server.
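That route looks roughly like this - net-snmp's exec directive runs the script on the monitored host and publishes its exit status and output under extTable:

# In snmpd.conf on the monitored host - the script still lives there:
exec checkrepl /scripts/cacti/checkrepl.sh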
So after spending all morning googling, I might be able to do what I want (i.e. thresholding) if I hook another program that's distributed with OpenNMS into my script - send-event.pl. However, the documentation on send-event.pl is SORELY lacking. Maybe I don't speak OpenNMS well enough for my google-fu to work, but I can describe the exact same set of results I get every single time I look for ANYTHING related to send-event.pl:
1) Forcing the OpenNMS engine to discover a new node or restart a specific process within OpenNMS using send-event.pl.
2) Vague references, with no actual code, on using send-event.pl with SEC or swatch for log parsing. Of course these are all dated, as OpenNMS has since added a syslog plugin.
And evidently, there's some XML editing that has to go on before I can actually do anything with send-event.pl. My experiments this morning went like this:
./send-event.pl uei.opennms.org/internal/capsd/updateService localhost \
    --interface 10.0.101.107 -n 107 --service MySQL-Replication --severity 6 --descr "Replication lagging"
Now I know I didn't get it entirely right. Let's forget for a moment the cryptic uei.opennms.org stuff. All I'm trying to do is send an event (send-event.pl sounds like the right tool) to OpenNMS for a given node. I want that to be the equivalent of an SNMP trap. I know the uei line is wrong, but I can't find a simple list of the valid uei strings so I can take a stab at it. There's a shell script called UEIList.sh, but it complains about missing dependent shell scripts AND fires up Java to do the work. Why is it so hard to have that list in the OpenNMS wiki with some notes about each one?
So after all these machinations, I'm still left with a half-assed implementation of replication monitoring in OpenNMS.
I would love for someone to point me to a rough outline of what I'm trying to accomplish in OpenNMS. I've hit every single link I could find. I've viewed EVERY SINGLE RESULT from Google and the mailing lists that had send-event.pl in them. Maybe I'm going down the wrong track. Maybe I'm not looking at it the right way.
I asked my boss yesterday if there was any emotional attachment to OpenNMS and he said there wasn't. Right now they have Cacti and OpenNMS running, double-polling most things just to get graphs into Cacti for trending purposes. At least by switching to Nagios, I can have the perfdata dumped to an RRD and keep Cacti around for display purposes only. Or I can have Cacti continue to do some of the polling and have Nagios check the RRD for data to alert on.
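The perfdata route is just a couple of directives in nagios.cfg plus a command that hands the data off to an RRD writer. The process_perfdata.pl path here is a stand-in for whatever tool (pnp4nagios, nagiosgraph) actually does the writing:

# nagios.cfg
process_performance_data=1
service_perfdata_command=process-service-perfdata

# commands.cfg - hypothetical handler
define command {
    command_name    process-service-perfdata
    command_line    /usr/local/nagios/libexec/process_perfdata.pl
}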
I'm giving OpenNMS another day and if I can't make any headway, I'm switching to Nagios.