Nagios w/ Incinerator, revisited
The Problem
After a couple of weeks of testing, Nagios has worked very well. Alarms have been sent consistently and to the right addresses. But soon after my first real Incinerator problem, I noticed a flaw in my setup: for each incident where the incinerator nodes crashed, I got 8 separate emails. And after I fixed the issue, I got another 8 emails telling me that the problem was fixed.
I thought I could work around the problem by creating a servicegroup with the incinerator nodes in it and enabling alarms only for the group as a whole. But you cannot assign alarms to servicegroups, only to individual services. Nagios comes with check_cluster, but setting it up seemed like quite a bit of work, with wrappers etc.
The Solution
After some searching (and getting a Nagios-specific book, Nagios: System and Network Monitoring, 2nd Edition), I came across check_multi. It is a simple plugin that I installed on the Lustre mediaserver. I moved the commands that check the nodes from nrpe.cfg to a separate .cmd file and added a new command to nrpe.cfg that runs check_multi with that .cmd file. Then I just added this as an NRPE command on the Nagios server.
Now Nagios shows the status neatly under one service, and I can still check the individual status of each node.
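To make the idea concrete, here is a minimal Python sketch of what rolling several node checks up into one service could look like. This is not check_multi's actual code, just an illustration of the principle. The function name, the result format, and the severity ordering (CRITICAL worst, then WARNING, then UNKNOWN, then OK) are my own assumptions; only the exit codes 0 to 3 follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed severity ranking used to pick the "worst" state (hypothetical choice).
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def aggregate(results):
    """Combine per-node check results into a single status.

    results maps a check name to a (exit_code, message) tuple.
    Returns (overall_exit_code, multi_line_text): the first line is a
    summary, followed by one line per individual check.
    """
    codes = [code for code, _ in results.values()]
    # With no checks at all there is nothing to report on, so fall back to UNKNOWN.
    worst = max(codes, key=lambda c: SEVERITY[c], default=UNKNOWN)
    not_ok = sum(1 for c in codes if c != OK)

    summary = f"{STATUS_NAMES[worst]} - {len(codes)} checks, {not_ok} not OK"
    details = [
        f"[{name}] {STATUS_NAMES[code]}: {message}"
        for name, (code, message) in results.items()
    ]
    return worst, "\n".join([summary] + details)
```

A wrapper script would print the returned text and exit with the returned code, so Nagios sees a single service (and sends a single email) while the per-node lines stay visible in the service details.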
I have also been testing Cacti. My main interest is to get more data from switches and routers (especially since QLogic is asking 10000€ for the software to monitor their Fibre Channel switches with the newest firmware). I will write something about that later.
- Lustre service status
- Nodes service status

