Friday, April 4, 2008

Nagios Configuration Tips Part 1 - cfg_dir

One of the key problems I see with people using Nagios is that they put EVERYTHING into a single file for each type of object. That is fine when you have only a few systems to monitor, but it becomes unwieldy when you have 10, 20 or even 100 servers. This article shows what I consider to be a very flexible file system layout for your Nagios configuration. The end result is a structure that lets you jump straight to the source of a configuration problem and encourages the use of object templating. This is the first in a multi-part series of Nagios Configuration Tips.

Since we are only concerned about configuration files, I'm only going to paste the relevant lines from the nagios.cfg:
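A minimal sketch of what those lines could look like, wrapped in a shell snippet that saves them to a scratch file for comparison with your own nagios.cfg. The /etc/nagios/objects/ prefix and the scratch file name are assumptions; the four file names are the ones discussed in this post, and organization is a placeholder directory name.

```shell
#!/bin/sh
# Sketch only: every path below is an assumption, adjust to your installation.
SNIPPET="${SNIPPET:-./nagios-cfg-snippet.txt}"

cat > "$SNIPPET" <<'EOF'
# Explicitly listed files hold the global, shared objects.
cfg_file=/etc/nagios/objects/commands.cfg
cfg_file=/etc/nagios/objects/contacts.cfg
cfg_file=/etc/nagios/objects/contactgroups.cfg
cfg_file=/etc/nagios/objects/timeperiods.cfg

# Nagios reads every *.cfg file under a cfg_dir, including subdirectories.
cfg_dir=/etc/nagios/objects/organization
EOF

# Show the fragment that was written.
cat "$SNIPPET"
```

The point of the split is that cfg_file entries name individual files, while a single cfg_dir entry pulls in a whole directory tree, so new object files can be added later without touching nagios.cfg again.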


(organization is really just an arbitrary directory used to logically group a collection of objects.)

The explicitly defined files are where we keep the more "global" stuff: objects that are shared across all configs for all organizations/domains, or system-level stuff.

commands.cfg is a standard Nagios cfg file containing a list of command definitions. In this case, I'm only keeping the stuff that applies to the local system (check_local_*) and the notify-service-by-email/notify-host-by-email definitions.
contacts.cfg has a definition for a nagiosadmin account, contactgroups.cfg has a definition for a testgroup contactgroup, and timeperiods.cfg has a definition for 24x7.

And that's it for the base configuration files. Notice that there really isn't much in them. As you'll see, all of the heavy lifting will be done by the stuff in the cfg_dir.

So now let's look at what we have in our cfg_dir.

For this example, we're going to assume that we have two areas that we need to monitor: systems and processes. Let's also use the fictional company name of widgetcorp. Systems are exactly what they sound like. This is where we monitor things at the host level like reachability, loadavg and disk utilization. Processes are the things that we monitor at a higher level, like database locks, http connections, jvm usage or even specific business processes like user logins, outstanding order shipments or the date of the last warehouse load.

So let's create the following directory structure under /etc/nagios/objects/:
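As a rough sketch, the top-level layout could be created like this. It assumes widgetcorp is the organization directory and that it holds one directory for each of the two areas described above; OBJECTS_DIR is a hypothetical variable that defaults to a local demo directory so the commands can be tried without root.

```shell
#!/bin/sh
# Sketch only. On a real system OBJECTS_DIR would be /etc/nagios/objects;
# the default here is a local demo directory (an assumption for safe testing).
OBJECTS_DIR="${OBJECTS_DIR:-./nagios-objects-demo}"

# One directory per organization, split into the two areas we monitor.
mkdir -p "$OBJECTS_DIR/widgetcorp/systems" \
         "$OBJECTS_DIR/widgetcorp/processes"

# List what was created.
find "$OBJECTS_DIR" -type d | sort
```

To apply it to a live installation, run it with OBJECTS_DIR=/etc/nagios/objects as a user who can write there.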


Now before we write any configs, let's think about how we want to categorize these new directories. At widgetcorp, we have two classes of systems: database and application. Let's create those directories under systems:
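Creating the database and application directories might look like the following sketch. The widgetcorp/systems path and the OBJECTS_DIR variable (with its local demo default) are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only. OBJECTS_DIR defaults to a local demo directory; on a real
# system it would be /etc/nagios/objects and need suitable permissions.
OBJECTS_DIR="${OBJECTS_DIR:-./nagios-objects-demo}"
SYSTEMS_DIR="$OBJECTS_DIR/widgetcorp/systems"

# One subdirectory per class of system.
for class in database application; do
    mkdir -p "$SYSTEMS_DIR/$class"
done

# Show the class directories.
ls "$SYSTEMS_DIR"
```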


Being the sane company that they are, widgetcorp was smart enough to invest in a minimal level of high availability. This environment consists of 4 servers using round-robin DNS to balance between application servers and using Linux-HA to provide access to the database servers. Notice that I've not yet defined WHAT application server is running or what dbms is being used. These systems are named app01, app02, dbs01 and dbs02.


As far as the process monitoring goes, we have two types of "processes" we need to concern ourselves with - application response time and database server availability.


Sidebar: One thing you'll note is a particular attitude I have. I consider physical systems "interchangeable". I don't want to tie the fact that I run MySQL on a particular server to the status of the database service as a whole. What if we're operating using MySQL Proxy or operating a Linux-HA MySQL cluster or using HACMP on AIX for DB2? The availability of a single system is really quite independent from the higher level availability of the service that grouping of systems provides. Our application servers would never be configured to talk to JUST dbs01 but instead would use the name of the MySQL Proxy server or the VIP assigned to the HA cluster. Using service dependencies, you can still tie the polling process of db.widgetcorp to a specific server or uplink.

Back to the layout. We now get to discuss what programs are actually installed on each server, because the data we need to monitor comes from those programs.

In the case of app01 and app02, they are both running tomcat and apache with mod_jk. All traffic coming in from the internet is balanced between each apache server on port 80 talking to a localhost-listening tomcat instance on the jk connector port. These details aren't really important for the purposes of this document except to say that our customers don't go to app01.widgetcorp or app02.widgetcorp but instead

As for the database, the servers use Linux-HA and MySQL replication to provide access to the data. Each server has a VIP assigned, and the VIPs are aliased to dbrw.widgetcorp and dbro.widgetcorp. The current MASTER in the replication process is assigned the VIP for dbrw and the SLAVE is assigned the VIP for dbro. When one of the systems fails, the other assumes BOTH roles, since the application performs lookups against dbro while doing actual inserts and updates against dbrw.

All of the above means our directory structure now looks like this:
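A hedged sketch of one layout consistent with the description so far, using the two system classes and the two process types discussed above. The names under processes are assumptions, as is the OBJECTS_DIR variable and its local demo default; finer-grained directories per program or per VIP are left out because their naming is a matter of choice.

```shell
#!/bin/sh
# Sketch only. OBJECTS_DIR defaults to a local demo directory; on a real
# system it would be /etc/nagios/objects.
OBJECTS_DIR="${OBJECTS_DIR:-./nagios-objects-demo}"
ORG_DIR="$OBJECTS_DIR/widgetcorp"

# Host-level checks (reachability, load, disk), grouped by class of system.
mkdir -p "$ORG_DIR/systems/application" "$ORG_DIR/systems/database"

# Higher-level checks, grouped by the service they cover.
# These two directory names are assumptions, not taken from the post.
mkdir -p "$ORG_DIR/processes/application" "$ORG_DIR/processes/database"

# Print the resulting layout relative to the organization directory.
(cd "$ORG_DIR" && find . -type d | sort)
```

Keeping systems and processes in separate branches mirrors the idea that the health of one host is tracked independently of the service a group of hosts provides.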


And that's it for the first part of this post. The next post will get into the actual naming, location and content of the configuration files. Please feel free to leave comments and let me know your thoughts. Please also be aware that I'm intentionally trying to be generic in these examples. Don't get too caught up in the fictional implementation of the company. I'm aware of the limitations of both round-robin DNS as well as the MySQL implementation. I only picked these as high-level examples.

Thanks and I look forward to the comments!


Anonymous said...

What happened to part 2?

lusis said...

It's still sitting in draft form. It's on my todo list to publish it next week since I have a bit of free time.

Thanks for the nudge ;)

snake007uk said...

any updates on that part 2? :)