
Lustre Frameserver from Nagios
We have been looking at setting up an open source monitoring solution at the office for quite some time (I remember having a discussion about Nagios on my first day at work), but looking at the Nagios docs made me think that setting up all those .cfg files was going to really suck, so I looked at alternatives.
Over the Christmas holidays I installed Zenoss (http://www.zenoss.com/), mainly because it promised to crawl through our networks and be a breeze to set up using the web GUI. It did do the crawling as promised, but configuring anything beyond the basic settings was really, _really_ painful. Sure, it was nice with the couple of Windows machines it recognised immediately, but otherwise it was pretty useless.
It took me several months to bite the bullet, but last week I finally got around to installing Nagios at the office. After setting up the basic checks for the localhost (a RHEL server we had reserved for this use) in a couple of hours, I was quickly feeling pretty proficient with all the different .cfg files, and started venturing into the unknown…
RAID-Chassis
Getting information from our ten-odd RAID subsystems would be quite important. I first thought I had struck gold when I found check_promise_vtrak from http://www.consol.com/apple/nagios-plugins/check-promise-vtrak/ . Setup was a breeze, and I quickly got it working from the CLI. But the plugin refuses to function from Nagios: it only returns (null), which ends up as a critical error and an email in my inbox. Probably a small fix is needed in the plugin itself, because it is plainly not returning anything readable by Nagios.
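One way to narrow down a "works from the CLI, fails from Nagios" problem is to look at exactly what Nagios reads from a plugin: the exit code (0, 1, 2 or 3) and the first line of standard output. The following is only a rough sketch of such a helper, not part of any of the plugins mentioned here; the function name run_check and the 10 second timeout are my own assumptions.

```python
import subprocess

# Standard Nagios plugin exit codes and the states they map to.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=10):
    """Run a plugin command and return (state, first stdout line, stderr).

    run_check is a hypothetical helper for debugging; it is not part of Nagios.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    # Exit codes outside 0-3 are treated as UNKNOWN in this sketch.
    state = STATES.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.splitlines()
    # An empty first line means the plugin gave Nagios nothing to display.
    first_line = lines[0] if lines else ""
    return state, first_line, proc.stderr.strip()
```

Running this as the nagios user, with the same arguments Nagios uses, and getting back an empty first line together with some text in stderr would suggest that the plugin fails in that environment rather than in Nagios itself.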
Mac OS X Servers
Using the basic plugins and nrpe (http://nagios.sourceforge.net/docs/1_0/addons.html) I was able to check all the basic data on our servers. I would like to monitor the actual services themselves (usually AFP and SMB, with some DNS and OD on some machines). As above, I thought check_osx_services (http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1497.html;d=1) was going to fix all my problems. Again I was foiled at the start: this plugin wouldn’t work properly even from the CLI.
Autodesk Lustre w/ Incinerator
Since this is the system that is keeping me the busiest in normal times, I wanted Nagios to help me out here.
Lustre w/ Incinerator is a complex system, consisting of a workstation, a frameserver, 8 rendering nodes, an Ethernet network for commands and an InfiniBand network for moving those frames around. With so many moving parts, there are way too many failure points here. Lately we have had a lot of issues with the renderd service (which handles the rendering and contact with the server) crashing on the nodes by itself. I have simple scripts that allow me to restart the service on all the nodes at once, and they take just seconds to run. The difficult part is finding out when the nodes have crashed.
Installing the normal checking tools on the nodes was not an option for a couple of reasons. Firstly, I don’t want to have too many extra things installed on the nodes. Secondly, the nodes are not actually on any network other than the Incinerator network, so there is no access to them from the Nagios server.
I have installed the Nagios plugins and nrpe on the frameserver. These check the normal things on the server (root disk space, CPU load, processes etc.). I also created a specific check for Browsed, the process that handles serving the frames to the workstation and the nodes. After some searching I discovered a plugin called check_process_by_ssh (http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F2013.html;d=1), which allowed me to formulate a suitable nrpe command to execute (from nrpe.cfg on the frameserver):
command[check_node1]=/usr/local/nagios/libexec/check_process_by_ssh -H node1 renderd
I then added the check to my Linux definitions:
define service{
    use generic-service
    host_name frameserver
    service_description Incinerator Node 1
    check_command check_nrpe!check_node1
    }
This worked fine from the CLI, but Nagios didn’t get through and reported the status as critical. After some thought I realized the problem: Nagios executes all scripts as the user nagios, and I was doing my testing as root. The nagios user didn’t have the needed SSH authentication settings, so I copied the needed key file (id_rsa) to a suitable folder, and modified the command on the frameserver:
command[check_node1]=/usr/local/nagios/libexec/check_process_by_ssh -H node1 -k /usr/local/nagios/keys/id_rsa -u root renderd
Now the checks work without a hitch, and I get an email about the nodes being down before an operator has been wondering what is wrong for an hour…
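With eight rendering nodes, the same nrpe command and service definition have to be repeated for each one. Here is a minimal sketch of generating them instead of typing them out, assuming the nodes are named node1 to node8 (only node1 appears above, the rest is my guess at the naming) and reusing the plugin and key paths from the command above. The function names are made up for the example.

```python
# Assumed hostnames node1..node8; only node1 is confirmed by the config above.
NODES = [f"node{i}" for i in range(1, 9)]
PLUGIN = "/usr/local/nagios/libexec/check_process_by_ssh"
KEY = "/usr/local/nagios/keys/id_rsa"


def nrpe_command(node):
    """Return the nrpe.cfg line for one node, mirroring the command above."""
    return (f"command[check_{node}]={PLUGIN} -H {node} "
            f"-k {KEY} -u root renderd")


def service_definition(node, number):
    """Return the Nagios service definition for one node."""
    return "\n".join([
        "define service{",
        "    use generic-service",
        "    host_name frameserver",
        f"    service_description Incinerator Node {number}",
        f"    check_command check_nrpe!check_{node}",
        "    }",
    ])


def render():
    """Return (nrpe.cfg lines, service definitions) for all nodes."""
    commands = "\n".join(nrpe_command(n) for n in NODES)
    services = "\n\n".join(
        service_definition(n, i) for i, n in enumerate(NODES, start=1)
    )
    return commands, services


if __name__ == "__main__":
    commands, services = render()
    print(commands)
    print()
    print(services)
```

The first part of the output would go into nrpe.cfg on the frameserver and the second into the service definitions on the Nagios server.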