Quick links for November

November 3rd, 2009

ZFS has full deduplication features:

http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup

This is extremely cool, and means that I will most definitely finally get that OpenSolaris server up and running. 10 Gb Ethernet with NFS is starting to look like a viable option instead of expanding our SAN. The cost of the NICs and a switch can be offset quickly by what expanding the FC switches and the SAN software would cost…

Some more discussion on the subject:

http://tech.slashdot.org/story/09/11/02/2117206/ZFS-Gets-Built-In-Deduplication?from=rss

Filipp over at Unflying Object has some ideas on integrating the authentication for the Wiki and a Forum on a Mac OS X Server:

http://unflyingobject.com/blog/posts/1014

Quite straightforward, but not working for me yet, probably due to my “golden triangle”-setup with our AD-server.

Oh, and Spooks must be the best TV show in a long, long time. Been watching this on the iPod in the bus while commuting. Just brilliant, and quite realistic. Quite similar to A Song of Ice and Fire in that they seem pretty happy to kill off any of the leads at any time…

A small script for BRU

November 3rd, 2009

One day I noticed that our operators had been very busy making archives from the Autodesk systems and had forgotten to inform me. This meant that about a terabyte of data was sitting on our nearline storage, in about 60 different archives, just waiting to be run onto tape and deleted.

Archiving these Autodesk archives is boring and manual work. Each archive usually consists of two files: project_name and project_name_1.seg, and I just copy the name of the project to be the name of the backup-job, then choose the two files and run the job. And all this takes time because the GUI for BRU is not the most agile one available. So I decided to write a little script I could run, just giving the folder where the archives are as a parameter.

This is probably not very efficient, and there are many ways to make it better. But it saves quite a lot of time for me.

#!/usr/bin/env bash
# 2009-11-03
# juso@iki.fi

PASSWD="passwd"
SERVER="bru.servername"

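# A directory argument is required
if [ -z "$1" ]; then
 echo "usage: $0 directory" >&2
 exit 1
fi

# Use an absolute path: it is pasted into the /localhost/... paths below
folder=$1
cd "$folder" || exit 1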

for file in *_1.seg
do
 file2=${file%_1.seg}
 upload=`(echo ${PASSWD}
 sleep 5
 echo 'backup -j "'$file2'" -t "Full" -D "destination" -o "append" -v
["/localhost'$folder'/'$file2'", "/localhost'$folder'/'$file'"]'
 sleep 5
 exit) | bru-server.cmd username $SERVER`
 # Quoted so the server's reply keeps its line breaks
 echo "$upload"
done

Some disclaimers: I have not actually had time to test the script exactly as is (because our BRU server is quietly chugging away at the 25 backup jobs I fed to it a while ago), but it should work. It is also my very first stab at BRU automation.

Dreambox and the zen of discovery

August 20th, 2009

For a while I have been miffed at my PVR, wondering where my Samba mounting has gone. Finally, after testing several different images for the Dreambox, I discovered the problem: Mac OS X Leopard (and apparently Snow Leopard) does not by default support connections to SMB shares that allow guests to authenticate.

Adding a file called

~/Library/Preferences/nsmb.conf

and adding the following lines:

[default]
minauth=none

on the workstation fixed the problem.
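The same step can be done from Terminal. This is only a rough sketch: write_nsmb_conf is a made-up helper name, and it deliberately refuses to touch a file that already exists without the setting, since merging into someone's existing [default] section is better done by hand.

```shell
#!/usr/bin/env bash
# Sketch: create nsmb.conf with the guest-authentication override.
# write_nsmb_conf is a hypothetical helper name, not an Apple tool.

write_nsmb_conf() {
  # Defaults to the per-user file; another path can be passed for a trial run.
  local conf=${1:-"$HOME/Library/Preferences/nsmb.conf"}

  if [ -f "$conf" ]; then
    if grep -q '^minauth=none$' "$conf"; then
      echo "already set: $conf"
      return 0
    fi
    # Do not guess how to merge into an existing file; edit it by hand.
    echo "exists without minauth=none, edit manually: $conf" >&2
    return 1
  fi

  mkdir -p "$(dirname "$conf")" || return 1
  printf '[default]\nminauth=none\n' > "$conf" && echo "created: $conf"
}
```

Calling write_nsmb_conf with no argument writes to ~/Library/Preferences/nsmb.conf; passing a path such as /tmp/nsmb.conf is a safe way to try it first.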

All this after midnight while watching Liverpool beat Stoke. Thanks to Google and Ubuntu forums:

http://ubuntuforums.org/archive/index.php/t-917156.html

Monitoring which folders are wasting space

June 3rd, 2009

Most of my work seems to be freeing up space on the different shared HDs we have around the facility. After trying several apps that show how folders use space graphically (GrandPerspective used to be my favourite) and testing different scripts, I have found out that Disk Inventory X suits my workflow the best. It allows me to graphically see where the space is used, and shows me when the folder in question has been created and modified. Helps in finding all those ancient DPX-sequences that have been forgotten in the miasma that is our SAN.

Nagios /w incinerator, revisited

April 7th, 2009

The Problem

After a couple of weeks of testing, Nagios has worked very well. Alarms have been sent consistently and to the right addresses. Pretty soon after my first real Incinerator problem, I noticed the real problem with my setup: for each incident where the incinerator nodes crashed, I got 8 separate emails. And after I fixed the issue, I got another 8 emails telling me that the problem was fixed.

I thought I could work around the problem by creating a servicegroup with the incinerator nodes in it, and only enable alarms for the group as a whole. But you cannot assign alarms for servicegroups, only individual services. Nagios comes with check_cluster, but setting it up seemed like quite a bit of work, with wrappers etc.

The Solution

After some searching (and getting a Nagios-specific book, Nagios: System and Network Monitoring, 2nd Edition), I came across check_multi. It is a simple plugin that I installed on the Lustre mediaserver. I moved the commands that check the nodes from nrpe.cfg to a separate .cmd file, and added a new command to nrpe.cfg that uses the .cmd file to run check_multi. Then I just added this as an nrpe command on the Nagios server.

Now I get the status to Nagios neatly under one service, but I can still check the individual status of each node.
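To make the idea concrete, here is a small bash illustration of rolling several check results into one status. It is not check_multi and does not use its configuration format; the function name and the output text are made up, and it assumes the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), simply treating the highest code as the worst.

```shell
#!/usr/bin/env bash
# Not check_multi: a small illustration of rolling several check results
# into one Nagios-style status. summarize_checks is a hypothetical name.

# Arguments are name=exitcode pairs, e.g. node1=0 node2=2.
# Prints one summary line and returns the highest exit code seen.
summarize_checks() {
  local worst=0 failed="" pair name code
  for pair in "$@"; do
    name=${pair%%=*}
    code=${pair##*=}
    # Remember every check that did not return OK
    if [ "$code" -ne 0 ]; then
      failed="$failed $name"
    fi
    # Simplification: a higher exit code counts as a worse state
    if [ "$code" -gt "$worst" ]; then
      worst=$code
    fi
  done
  if [ "$worst" -eq 0 ]; then
    echo "OK - all $# checks passed"
  else
    echo "PROBLEM -$failed"
  fi
  return "$worst"
}
```

With eight nodes this gives one line and one exit code for the whole group, which is why a single service (and a single email) is enough, while the names of the failing nodes are still visible in the output.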

I have also been testing Cacti. My main interest is to get more data from switches and routers (especially since QLogic is asking 10000€ for the software to monitor their Fibre Channel switches with the newest firmware). I will write something about that later.

Nagios /w incinerator

March 23rd, 2009
Lustre Frameserver from Nagios


We have been looking at setting up an open source monitoring solution at the office for quite some time (I remember having a discussion about Nagios on my first day at work), but looking at the Nagios docs made me think that setting up all those .cfg files was going to really suck, so I looked at alternatives.

Over the Christmas holidays I installed Zenoss (http://www.zenoss.com/), mainly because it promised to crawl thru our networks and be a breeze to set up using the webGUI. It did do the crawling as promised, but setting anything further than the basic settings was really, _really_ painful. Sure, it was nice with the couple of Windows machines it recognised immediately, but otherwise it was pretty useless.

It took me several months to finally bite the bullet, but last week I got around to installing Nagios at the office. After setting up the basic checks for the localhost (a RHEL server we had reserved for this use) in a couple of hours, I was quickly feeling pretty proficient with all the different .cfg files, and started venturing into the unknown…

RAID-Chassis

Getting information from our ten-odd RAID-subsystems would be quite important. I first thought I had struck gold when I found check_promise_vtrak from http://www.consol.com/apple/nagios-plugins/check-promise-vtrak/ . Setup was a breeze, and I quickly got it working from the CLI. But the plugin refuses to function from Nagios: it only returns (null), which ends up as a critical error and an email in my box. Probably a small fix is needed in the plugin itself, because plainly it is not returning anything readable by Nagios.

Mac OS X Servers

Using the basic plugins and nrpe (http://nagios.sourceforge.net/docs/1_0/addons.html) I was able to check all the basic data on our servers. I would like to monitor the actual services themselves (usually AFP and SMB, with some DNS and OD on some machines). As above, I thought check_osx_services (http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F1497.html;d=1) was going to fix all my problems. Again I was foiled at the start: this plugin wouldn’t work properly even from the CLI.

Autodesk Lustre /w Incinerator

Since this is the system that is keeping me the busiest in normal times, I wanted Nagios to help me out here.

Lustre /w Incinerator is a complex system, consisting of a workstation, frameserver, 8 rendering nodes, an ethernet network for commands and an Infiniband network for moving those frames around. With so many moving parts, there are way too many failure points here. Lately we have had a lot of issues with the renderd service (that handles the rendering and contact with the server) crashing on the nodes by itself. I have simple scripts that allow me to restart the service on all the nodes at once, and it takes just seconds to run. The difficult part is getting the info when the nodes have crashed. Installing the normal checking tools on the nodes was not an option for a couple of reasons. Firstly: I don’t want to have too many extra things installed on the nodes and secondly: the nodes are not actually on any other network than the incinerator network, so there is no access to them from the Nagios server.
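A restart script of that kind can be very short. The following is only a sketch, not the actual script: the node names node1 to node8 and the restart command are assumptions, and it has to be run from a machine that can reach the incinerator network (such as the frameserver) with key-based SSH login already in place.

```shell
#!/usr/bin/env bash
# Sketch only: node names node1..node8 and the restart command are assumptions.

NODES="node1 node2 node3 node4 node5 node6 node7 node8"
RESTART_CMD="/etc/init.d/renderd restart"   # hypothetical; use the site's real command

# With DRY_RUN=1 the ssh commands are printed instead of executed.
restart_renderd() {
  local node
  for node in $NODES; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh root@$node $RESTART_CMD"
    else
      # BatchMode makes ssh fail instead of waiting at a password prompt
      ssh -o BatchMode=yes "root@$node" "$RESTART_CMD" \
        || echo "restart failed on $node" >&2
    fi
  done
}
```

Running it as DRY_RUN=1 restart_renderd first shows exactly which commands would be sent, which is a cheap sanity check before touching eight machines at once.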

I have installed nagios plugins and nrpe on the frameserver. These check the normal things on the server (root disk space, CPU loads, processes etc.). I also created a specific check to handle Browsed, the process that handles serving the frames to the workstation and the nodes. After some searching I discovered a plugin called check_process_by_ssh (http://www.nagiosexchange.org/cgi-bin/page.cgi?g=Detailed%2F2013.html;d=1), which allowed me to formulate a suitable nrpe command to execute (from nrpe.cfg on the frameserver):

command[check_node1]=/usr/local/nagios/libexec/check_process_by_ssh -H node1 renderd

I then added the check to my Linux definitions:

define service{
        use generic-service
        host_name frameserver
        service_description Incinerator Node 1
        check_command check_nrpe!check_node1
        }

This worked fine from the CLI, but nagios didn’t get thru and said the status was critical. After some thought I realized the problem: nagios executes all scripts as the user nagios, and I was doing my testing as root. The nagios user didn’t have the needed SSH authentication settings, so I copied the needed file (id_rsa) to a suitable folder, and modified the command on the frameserver:

command[check_node1]=/usr/local/nagios/libexec/check_process_by_ssh -H node1 -k /usr/local/nagios/keys/id_rsa -u root renderd

Now the checks work without a hitch, and I get an email about the nodes being down before an operator has been wondering what is wrong for an hour…
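With 8 nodes, typing the same nrpe.cfg line eight times is an easy place for typos. A small sketch that prints the lines instead, assuming the other nodes are named node2 to node8 in the same pattern as node1 (the function name is made up):

```shell
#!/usr/bin/env bash
# Sketch: print one nrpe.cfg command line per render node.
# Node names beyond node1 are assumed to follow the same pattern.

PLUGIN=/usr/local/nagios/libexec/check_process_by_ssh
KEY=/usr/local/nagios/keys/id_rsa

# Usage: nrpe_node_commands [number_of_nodes]   (defaults to 8)
nrpe_node_commands() {
  local count=${1:-8} i=1
  while [ "$i" -le "$count" ]; do
    echo "command[check_node$i]=$PLUGIN -H node$i -k $KEY -u root renderd"
    i=$((i + 1))
  done
}
```

The output can be reviewed and then pasted into nrpe.cfg on the frameserver; each command still needs its own service definition on the Nagios server, like the one for Incinerator Node 1.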

What I Learned…

January 20th, 2009

…during the last week of last year.

Due to some illnesses in our support staff, I ended up doing a big upgrade job short-handed. This is what I learned:

  • When Autodesk support say they have shortened support hours, it also means that they are quite understaffed, which is reflected in both the reply times and even the quality of some answers
  • The Lustre upgrade scripts couldn’t handle the straight upgrade to 2009SP2 from 2007. Several things got broken on the way, like the Incinerator Manager website, which could no longer start or stop the nodes.
  • The Lustre/Incinerator licensing scheme is quite difficult to understand at times, and you have to make sure your temporary license contains licenses for the nodes as well.
  • It is hard to test systems you have no idea of how to use
  • Working 17-hour shifts is not fun
  • A good remote desktop & SSH remote connection to the office is very nice at times
  • Always upgrade Autodesk-workstations from the local screen, else you will miss out on some essential settings, and get no warning
  • Writing bash scripts would be a nice skill to have, thanks Filipp ;)
  • Lustre 2009 is a lot more picky about settings. We used to use our SAN with the frames mounted directly from workstation-SAN-mount, but in 2009 we need to use the frames over Infiniband from the Frameserver in order to get the nodes to work

Film to Digital, now in essay format

December 2nd, 2008

Film to Digital – essay

New Red cameras – resolution glut?

November 25th, 2008

The new cameras from RED were announced last week. There has been a lot of anticipation, especially towards the cheaper camera, named Scarlet. The announcement does bring up some questions, though.

http://www.red.com/epic_scarlet/

New cameras

I guess most people were happy with the announcement: 8 different cameras (“brains”) and a whole slew of attachments, accessories and modules. The “brains” come in two different flavors: Scarlet is the smaller, “lowend” (called professional) unit, and Epic is the “master professional”, high end unit. Both come in several models, with different lens mounts and sensor sizes.

If you have not acquainted yourself with RED-technology, look at the presentation I made earlier this month: http://www.thingamagic.net/jussi/?p=7 Here is a recap: the RED One is a digital cinema camera, that captures images in up to 4K resolution (4096×2580) using a proprietary “raw” format, called Redcode or r3d. The camera is very cheap compared to other professional digital motion picture cameras, with the price starting at 17500 USD.

At the same time I found this interesting article, Redfacts. It is especially interesting considering that it is the first piece I have read that really heavily criticizes the RED camera. Not being too familiar with any of the systems discussed in the paper (Red One or Sony F23), I will not go into the details of the paper too heavily. It did however raise some valid points, which I would like to ponder.


Film to Digital

November 11th, 2008

filmwork_2

RED – Camera

I was planning on concentrating on the RED part of the presentation, but decided I had to go thru the basics first, else there wouldn’t really be a point to the whole exercise.

The actual presentation will be on Thursday the 13th of November.

/jussi