Getting state information out of Nagios to the people who need it

Nagios is essentially a big state machine. The states of actively and passively monitored hosts and services are checked against specific conditions, and actions are then taken based on how long a condition has lasted, how often it has occurred, when it occurred and so on.
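To make the "state machine" idea a little more concrete, here is a deliberately simplified sketch of how a problem goes from "soft" (still being retried) to "hard" (confirmed, so someone gets told). It is not Nagios's actual implementation: the class and method names are made up for illustration, the default of 3 attempts is an assumption, and only max_check_attempts corresponds to a real Nagios directive.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Toy model of one monitored service (hypothetical class, not Nagios code)."""

    max_check_attempts: int = 3  # mirrors the Nagios directive of the same name
    state: str = "OK"            # OK, WARNING, CRITICAL or UNKNOWN
    state_type: str = "HARD"     # SOFT while a problem is still being retried
    attempt: int = 1

    def record(self, result: str) -> bool:
        """Apply one check result; return True if someone should be notified."""
        confirmed_problem = self.state != "OK" and self.state_type == "HARD"

        if result == "OK":
            self.state, self.state_type, self.attempt = "OK", "HARD", 1
            # Only announce a recovery if the problem had been confirmed.
            return confirmed_problem

        if confirmed_problem:
            # Already notified; re-notification intervals are left out here.
            self.state = result
            return False

        # The first failure starts the count, later failures advance it.
        self.attempt = 1 if self.state == "OK" else self.attempt + 1
        self.state = result
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            return True
        self.state_type = "SOFT"
        return False
```

In this sketch a service that fails once and then recovers never bothers anyone, while one that fails max_check_attempts times in a row triggers a single notification, followed by another when it recovers.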

The commands that are run to check a host or service, the commands that are run to try to resolve a situation and the commands that are launched to notify a relevant party about the condition at hand are all configurable... but not everything comes with Nagios out of the box.

So that an operator doesn't need to be continually polling the web interface, or going there when they think there's something wrong, Nagios needs to be configured to be proactive in its notifications. It needs to tell people there's a problem, ideally before they know it's happened, so they can fix it before even more people notice.

In two of my prior jobs, we used a combination of:

Web - the manual checking just to get an overall picture
Email - although use of this was later deprecated
Firefox plugin - Nagios Checker is a very useful plugin for Firefox that scrapes status information from the Nagios web interface and filters it to show only what you need to know, all in one place
XMPP and Windows Live Messenger

The Nagios Web Interface

First off, I'll start by saying that whilst Nagios is an excellent, free, open source monitoring system, its web interface is atrocious. It hasn't received any significant attention in quite a few years, and it would appear that this is in part because the interface is where a lot of organisations are seeking to commercialise Nagios. That is, to get a better interface at the moment, you have to pay.

Email is OK when you expect reports from scheduled tasks that don't need an immediate response. However, once a backlog of emails builds up (say, for non-critical services that still need to be checked in on now and then), it becomes wholly unmanageable.

One issue I often had to deal with was that many of the monitored services and hosts at my old employer were transient by design: in use one day but not another, not always on 24x7 and not on any easily scheduled pattern. I implemented some workarounds involving passive checks, with active checks as a fallback when the passive results were no longer fresh, but this left me with a lot of services in a non-OK but acceptable state. As a result, the web interface was always very cluttered and it was hard to get an "all green" state.
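The freshness fallback boils down to a simple age test. Nagios does this itself when a service has check_freshness enabled and a freshness_threshold set, running the service's active check command once the last passive result is too old. The sketch below only illustrates that test; the class, function and service names are hypothetical, and the thresholds are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PassiveService:
    name: str
    last_result_at: datetime        # when the last passive result arrived
    freshness_threshold: timedelta  # how long that result counts as fresh


def stale_services(services, now):
    """Return the services whose last passive result is older than allowed."""
    return [s for s in services if now - s.last_result_at > s.freshness_threshold]


# Example: one result is 30 hours old, the other only 2 hours old.
now = datetime(2010, 8, 3, 12, 0)
services = [
    PassiveService("nightly-build", now - timedelta(hours=30), timedelta(hours=24)),
    PassiveService("render-node", now - timedelta(hours=2), timedelta(hours=24)),
]
needs_active_check = [s.name for s in stale_services(services, now)]
```

Here only "nightly-build" would fall back to an active check. For a transient host that is simply switched off, that active check fails, which is exactly how you end up with services sitting in a non-OK but acceptable state.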

Nagios Checker Firefox Plugin

Enter the Nagios Checker plugin. It lets you filter out things like acknowledged issues and services in specific states. If you are using a Nagios system, have access to the web interface and use Firefox as your browser, I actively encourage you to give this plugin a go. I will say, however, that the default alert noise is something I often disabled.

Sitting in the lower right-hand corner of your Firefox window, it actively polls multiple configured Nagios web interface installations for status information and quickly displays, in one place, things like how long a service has been in a non-OK state, how many times it's been checked, whether the state is "hard" or not, and the check output, amongst other things.
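The kind of filtering described above can be pictured roughly like this. This is not the plugin's code (Nagios Checker is a Firefox extension, and I haven't reproduced its internals); the record fields and option names are assumptions chosen to match the things it displays.

```python
from dataclasses import dataclass


@dataclass
class StatusEntry:
    host: str
    service: str
    state: str          # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    hard: bool          # True once the problem has been confirmed
    acknowledged: bool  # True if someone has already acknowledged it
    output: str         # text returned by the check


def worth_showing(entries, hide_acknowledged=True, hide_soft=False,
                  states=("WARNING", "CRITICAL", "UNKNOWN")):
    """Keep only the entries an operator actually needs to look at."""
    shown = []
    for entry in entries:
        if entry.state not in states:
            continue  # not in a state we care about
        if hide_acknowledged and entry.acknowledged:
            continue  # someone is already dealing with it
        if hide_soft and not entry.hard:
            continue  # still being retried, may clear on its own
        shown.append(entry)
    return shown
```

With hide_acknowledged left on, acknowledging a known, acceptable problem is enough to get it out of your face without pretending the service is OK.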

Besides using it myself, I also installed it on several end users' systems, configured to log in with their own Nagios credentials instead of those used by me and the rest of the admin team. This meant that they only saw information in the plugin about services that they were allowed to see or be notified about.

XMPP (Jabber) and Windows Live Messenger (MSN) Integration

I've seen a few different ways of getting Nagios to send instant messages, but a lot of them tended to be limited to a specific protocol. As open as the XMPP protocol is, not everyone uses an IM account which handles this protocol. Also, to keep things simple, I like the idea of not adding a separate program for every single protocol.

The solution I came up with, after asking around about IM options, was to use CenterIM. CenterIM is a text-based chat program that uses line-drawing characters to draw windows and so on. It has a basic menu system and supports multiple protocols: ICQ, Yahoo!, AIM TOC, IRC, MSN, Gadu-Gadu and XMPP/Jabber, though only one account per protocol at this time. Work is underway on a rewrite using libpurple to support a wider variety of protocols. The really handy feature with respect to Nagios is that you can instruct it, from the command line, to queue messages to be sent out via IM.

The downside to CenterIM is that it's not designed to be a daemon, so some additional magic needs to be worked to coax it into running in the background. Enter GNU Screen. In their words, it's:

... a full-screen window manager that multiplexes a physical terminal between several processes, typically interactive shells...
... Programs continue to run ... even when the whole screen session is detached from the user's terminal.

So, I invoke CenterIM within screen and then immediately ask screen to detach itself such that CenterIM is effectively daemonised but believes it's running in a normal, interactive VT100 terminal.
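As a rough sketch of that step, the snippet below builds the screen invocation: -d -m starts screen already detached and -S names the session. The session name "centerim" is just my choice for illustration, and I'm assuming the CenterIM binary is called centerim and is on the PATH.

```python
import subprocess


def detached_screen_command(session_name, program):
    """Build the argv for running `program` inside a detached screen session."""
    # -d -m starts screen already detached; -S gives the session a name
    # so it can be reattached later with `screen -r <name>`.
    return ["screen", "-d", "-m", "-S", session_name, program]


def start_detached(session_name, program):
    """Launch it for real; raises CalledProcessError if screen exits non-zero."""
    subprocess.run(detached_screen_command(session_name, program), check=True)
```

Calling start_detached("centerim", "centerim") as the IM user would leave CenterIM running in the background, and naming the session means you can reattach to it later to see what it's doing.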

Unfortunately, when invoking CenterIM from the command line to send an instant message, the executable needs to be invoked within the login environment of the user CenterIM is being run as.

As a rule, I always try to avoid running anything as root. Unfortunately, when running su, sudo or a setuid executable, the full login environment of the target user isn't necessarily loaded, which means the executable can't find its configuration files and message spool.

This bit's a bit hackish, but what I settled on was to allow the user the Nagios daemon runs as to invoke a shell script via sudo, without a password, as the user CenterIM runs as. That script queues the instant message. The shell script is needed because when sudo is instructed to load the target user's login environment (-i), it passes the command supplied to it directly to that user's login shell, so the shell expects to be able to interpret the command as a shell script. Wrapping everything in a script also makes debugging easier.
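To illustrate the shape of that call from the Nagios side, here is a minimal sketch. The account name, the script path and the idea that the wrapper takes a protocol and recipient as arguments and reads the message on standard input are all assumptions for the example, not my actual setup; only the sudo options (-n, -i, -u) are real.

```python
import subprocess


def queue_im_command(im_user, wrapper_script, protocol, recipient):
    """Build the sudo invocation that runs the wrapper as the IM user."""
    # -n: fail instead of prompting for a password (nobody is there to type it)
    # -i: run via the target user's login shell so their environment is loaded
    # -u: the account CenterIM runs under
    return ["sudo", "-n", "-i", "-u", im_user, wrapper_script, protocol, recipient]


def queue_im(im_user, wrapper_script, protocol, recipient, message):
    """Run the wrapper, feeding it the message text on standard input."""
    cmd = queue_im_command(im_user, wrapper_script, protocol, recipient)
    subprocess.run(cmd, input=message, text=True, check=True)
```

Using -n means a misconfigured sudoers entry shows up as an immediate failure rather than a notification command that hangs waiting for a password.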

Resources

If there's interest, I'll post up some of my configs here...

Welcome to HOL

Tuesday, 3 August 2010
