So I said, narrow the focus.
Your "use case" should be, there's a 22 year old college student
living in the dorms.
How will this software get him laid? - jwz
We’ve been using munin at WooMe for a while now and we’re hitting a wall.
I’ve been a big fan of munin since I first stumbled across it, for several reasons. It’s massively simple to set up: even a lazy sysadmin like me can get it up and running on a small network in minutes. The basic setup is also pretty much install and forget, which is just what you want.
Here, we’ve adopted it wholeheartedly. We ops guys started out writing very small scripts, then slowly larger and more complex sets of them, to plot additional graphs for our own uses. This brings me to the second reason I’ve liked munin so much: it’s just so easy to extend.
This sort of simple script (a weakly coded example):
#!/bin/bash
function print_data
{
echo "graph_title Active Accounts M/F Last $days days"
echo "graph_vlabel Number of Active Accounts"
echo "graph_category Woome users"
echo "T_active.label Total"
echo "T_active.info (Total Active Accounts)"
echo "T_active.type GAUGE"
echo "T_active.value `expr $1 + $2`"
echo "M_active.label Males"
echo "M_active.info (Active Male Accounts)"
echo "M_active.type GAUGE"
echo "M_active.value $1"
echo "F_active.label Females"
echo "F_active.info (Active Female Accounts)"
echo "F_active.type GAUGE"
echo "F_active.value $2"
exit 0
}
function get_data
{
# Pixie magic to get the count of males and females online out of the db.
# ORDER BY makes the row order deterministic: F first, then M.
query=`echo " select count (distinct id ), gender from webapp_person p \
where p.id in ( select distinct person_id \
from report_user_online_status_log r \
where r.online > 0 and \
r.change_time > now() -'$days days'::interval) \
and p.userstate_id not in ('BANNED','FULL SUSPEND') \
group by p.gender order by p.gender;"| \
psql -A -F, -U woome -d pridb -h localhost -p 1234 |\
tr , ' '|grep -E 'M|F'|sed -e 's/[ M| F]//g'|tr '\n' ,`
females=`echo $query|awk -F, '{print $1}'`
males=`echo $query|awk -F, '{print $2}'`
print_data $males $females
}
case `basename $0` in
active_accounts_30)
days=30
get_data
;;
active_accounts_60)
days=60
get_data
;;
*)
echo "`basename $0` is not valid usage of the active_accounts munin plugin"
echo "USAGE: ./active_accounts_[30|60]"
;;
esac
(Please - if you plan to use munin - read how you should write plugins instead)
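For contrast, here is a minimal sketch of the shape munin expects: the plugin is called with `config` to describe the graph, and with no argument to report values, so the two outputs are kept apart. The `fetch_counts` function is a hypothetical stand-in for the database query above, the numbers it prints are placeholders rather than real data, and the fallback to 30 days is my own assumption.

```bash
#!/bin/bash
# Sketch of a munin plugin that separates "config" output from value output.
# ASSUMPTION: fetch_counts is a hypothetical stand-in for the psql query.
# It must print "<males> <females>" on a single line.
fetch_counts() {
    echo "0 0"   # placeholder values only, not real data
}

# Printed when munin calls the plugin with the "config" argument.
print_config() {
    days=$1
    echo "graph_title Active Accounts M/F Last $days days"
    echo "graph_vlabel Number of Active Accounts"
    echo "graph_category Woome users"
    echo "T_active.label Total"
    echo "M_active.label Males"
    echo "F_active.label Females"
}

# Printed when munin calls the plugin with no argument.
print_values() {
    set -- $(fetch_counts)
    males=$1
    females=$2
    echo "T_active.value $((males + females))"
    echo "M_active.value $males"
    echo "F_active.value $females"
}

# The trailing number of the invoked name (active_accounts_30) picks the window.
plugin_main() {
    name=$1
    mode=$2
    days=${name##*_}
    case "$days" in
        ''|*[!0-9]*) days=30 ;;   # ASSUMPTION: fall back to 30 days for other names
    esac
    if [ "$mode" = "config" ]; then
        print_config "$days"
    else
        print_values
    fi
}

plugin_main "$(basename "$0")" "${1:-}"
```

Symlinking the same file as `active_accounts_30` and `active_accounts_60` gives two graphs from one script, as in the original. Keeping `config` separate means munin can ask for the graph description without triggering the expensive query.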
So as our infrastructure grew, and the range of things we wanted to watch grew with it, it became easier and easier to monitor more. With the aid of puppet, our current deployment system, it also became easier to get those scripts onto all the relevant boxes.
Where we are now is that, for munin to be useful, we really need updates every 5 minutes. We’re now updating over 10000 rrds in that window, and the disk subsystem on this box is permanently hammered.
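To put a rough number on that load: 10000 rrds in a 300 second window works out to about 33 file updates every second, each one a small scattered write. Below is a quick sketch for measuring the same thing on your own box. The function name `rrd_update_rate` is made up for this example, and the `/var/lib/munin` path in the comment is an assumption (check `dbdir` in your munin.conf).

```bash
#!/bin/bash
# Count the .rrd files under a directory and print the average number of
# updates per second needed to touch each one once per interval.
rrd_update_rate() {
    dir=$1
    interval=${2:-300}   # seconds; 300 is the 5 minute cycle
    # tr strips the padding some wc implementations add around the number
    count=$(find "$dir" -type f -name '*.rrd' | wc -l | tr -d ' ')
    echo "$count rrds / ${interval}s = $((count / interval)) updates per second (rounded down)"
}

# Example call (ASSUMPTION: the rrds live under /var/lib/munin):
#   rrd_update_rate /var/lib/munin 300
```

This is only an average. In practice the updates arrive in a burst when the run starts, so the peak rate the disks see is higher.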
So we find that this doesn’t scale. There are some fixes around: improvements here or there, suggestions about faster disks, putting your rrds on a RAM disk to speed things up (but then you have to protect against crashes and data loss...). But when push comes to shove, it’s time to move to a different system, and this is where the real cost becomes apparent.
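For what it’s worth, the RAM disk workaround usually looks something like the sketch below: live rrds on tmpfs, with a periodic copy back to real disk. This is not something we run. Both function names are made up for this example, and the mount point and size in the comments are hypothetical.

```bash
#!/bin/bash
# Sketch: keep the live rrds on tmpfs and copy them back to disk on a schedule.
# ASSUMPTIONS: the paths and size are hypothetical. Mounting needs root, e.g.
#   mount -t tmpfs -o size=512m tmpfs /var/lib/munin
# A crash loses everything written since the last snapshot.

# Copy the RAM-backed tree to persistent storage (run this from cron).
snapshot_rrds() {
    ram_dir=$1
    disk_dir=$2
    mkdir -p "$disk_dir" || return 1
    # -p keeps timestamps and modes, -R recurses; "/." copies the contents
    cp -pR "$ram_dir"/. "$disk_dir"/
}

# Copy the last snapshot back into RAM (run once at boot, after mounting
# the tmpfs and before munin starts).
restore_rrds() {
    disk_dir=$1
    ram_dir=$2
    mkdir -p "$ram_dir" || return 1
    cp -pR "$disk_dir"/. "$ram_dir"/
}
```

The snapshot interval is the trade-off: the longer it is, the less disk I/O you do and the more history you lose in a crash. A plain copy can also catch an rrd in the middle of an update, so this only reduces the risk of data loss.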
We’re looking at Ganglia now. It seems to have some of the scaling concepts nicely rounded off, and I am told the plugin strategy is nice and simple. But in a fast-moving business, I could really have done with recognising earlier the investment we were making in supporting scripts, and its longer-term migration cost.