Sat, 10 Mar 2007
Daemon Logging Percentages and Playing with Ruby Idioms
While digging into some large log files recently I needed to work out
which daemons were causing the most noise, so I wrote a little perl
script called daemon_percentages.pl. It was short, ran quickly and did
what I wanted.
And then my lunch plans were cancelled due to rain.
With nothing but boredom, a newly compiled version of ruby and the google homepage at my side, I decided to write a version in ruby. And then I realised how long it had been since I last looked at ruby. After slightly longer than the perl version took, and with a couple of false starts, I ended up with daemon_percentages.rb.
I had forgotten how much I disliked ri. It feels slower
than perldoc and I find it awkward to use. Then I hit the
lack of a post-increment operator; while I understand the reasons for its
omission I've got used to having it, so that took a couple of minutes to
debug. And then the biggie for me, a lack of hash key autovivification.
I'd forgotten how much of a perlism it is and so I spent a little while looking at different ways to do it (and got some good pointers from Will Jessop). In the end I tried the following:
# option 1
if tally.has_key? daemon
  tally[daemon] += 1
else
  tally[daemon] = 1
end

# option 2
tally[daemon] = 0 unless tally.has_key? daemon
tally[daemon] += 1

# option 3
tally[daemon] = (tally.has_key? daemon) ? tally[daemon] + 1 : 1
Option 1 felt too long. I didn't like option 2 when I reread it, as the code seemed to imply I'd decided something and then immediately changed my mind, so I settled on option 3. Although it's a little more complex (and denser), it's such a common thing for me to use that I'd rather have it on a single line and gloss over the syntax as it becomes more familiar.
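For comparison, here is a minimal sketch of a fourth approach that is not one of the three options above: giving the hash a default value with Hash.new(0), which is part of core ruby. The list of daemon names is made-up sample data, and the percentage step is only a guess at what a script like daemon_percentages.rb might do with the tally.

```ruby
# Made-up sample of daemon names, standing in for names parsed from a log.
daemons = %w[sshd cron sshd postfix sshd cron]

# Hash.new(0) hands back 0 for any key that is not present yet,
# so += 1 works the first time a daemon is seen.
tally = Hash.new(0)
daemons.each { |daemon| tally[daemon] += 1 }

# Turn the counts into percentages of all lines seen.
total = daemons.size
percentages = tally.map do |daemon, count|
  [daemon, (count * 100.0 / total).round(1)]
end.to_h

# Print the noisiest daemons first.
percentages.sort_by { |_, pct| -pct }.each do |daemon, pct|
  puts format('%-10s %5.1f%%', daemon, pct)
end
```

The default only applies to lookups of missing keys; has_key? still reports false for a daemon that has never been counted.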
While I had some small teething problems I do like the look of the ruby code and apart from the missing perlisms it felt quite natural to write. I'm not willing to jump ship just yet (CPAN is still too useful) but I think I'll be writing more of my personal tools in ruby.
Posted: 2007/03/10 23:19 | /perl
FRDNS Revisions - now with added ping checks!
I originally wrote frdns to
find and warn about inconsistencies in forward and reverse DNS records.
At the time I was also using a tool called hawk to show both IPs that
didn't have a reverse record and reverse records that didn't have a
responding IP address associated with them (we had a lot of orphaned
records).
While hawk did the job it required a MySQL instance, a daemon process
and an apache server to function - which was a PITA when it had to be moved
to another server. So I improvised. The first step was adding a
-p option to frdns that makes the program ping each IP
specified and flag any address that responds but doesn't have a reverse
record. This points out live IPs that don't have DNS records. As for the
no-longer-needed records, I've got a different tool for that - but that's
for another blog post.
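The ping check described above can be sketched roughly as follows. This is not the frdns source; the method names are hypothetical, the ping invocation assumes a Unix-style ping that accepts -c, and the reading that only responding addresses get flagged is my interpretation. The checks are passed in as arguments so the logic can be tried without touching the network.

```ruby
require 'resolv'

# Assumed default ping check: send one packet, true if a reply came back.
def responds_to_ping?(ip)
  system('ping', '-c', '1', ip, out: File::NULL, err: File::NULL) == true
end

# Assumed default reverse lookup: the PTR name, or nil when there is none.
def reverse_name(ip)
  Resolv.getname(ip)
rescue Resolv::ResolvError
  nil
end

# Return the addresses that answer a ping but have no reverse record.
def live_without_reverse(ips, ping: method(:responds_to_ping?),
                         reverse: method(:reverse_name))
  ips.select { |ip| ping.call(ip) && reverse.call(ip).nil? }
end

# Usage with stubbed checks and made-up addresses:
fake_ping    = ->(ip) { ip != '10.0.0.3' }
fake_reverse = ->(ip) { ip == '10.0.0.1' ? 'host1.example.com' : nil }
flagged = live_without_reverse(%w[10.0.0.1 10.0.0.2 10.0.0.3],
                               ping: fake_ping, reverse: fake_reverse)
puts flagged.inspect
```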
I've also made frdns log both run time and how many issues it flags to syslog. The ping check can take a while so I added this to help me keep an eye on its performance. I did think about using one of the asynchronous DNS libraries to improve performance but we're only running it once a day to pick up mistakes so a long runtime isn't a huge issue.
Posted: 2007/03/10 22:46 | /tools/commandline
Importance Levels - A Simple Example
When you're first introduced to an environment you'll have the ever-fun
task of working out which machines should get the most time; and
that order seldom matches which machines actually need the most
attention. To help me prioritise I've worked out a simple
importance rating system to decide where I spend my time.
Below is a simplified version. I use it to assign a single importance number to each machine, and then I allocate a certain amount of time each day to work on the issues, requests and improvements I've got in my todo list for that level. When I've run out of time I move down a number and start working on anything related to machines rated at that importance. The amount of time I put aside for each level decreases as I work towards one.
5: Customer facing systems that generate revenue.
This is my no-brainer. Pretty much everything is secondary to keeping the money coming in.
Examples: customer database, webservers and databases related to customer payments.
4: Internal Money Makers and Customer Visible Systems.
I normally put customer facing systems that don't make money in this bracket. An online presence and reputation for availability have been important to most of the companies I've worked at. It sounds horrible but it's a lot easier to save face and beg forgiveness after a five day internal outage than a one day external one; well, sorta. If you're a blogger-watched company then this is even more important.
I also put internal money makers at importance 4. "Cash is king" should be true in all departments, including those where Sysadmins dwell. I've only ever had simultaneous problems with both types of importance four systems a couple of times. Each time had circumstances that made the priorities clear.
Examples: corporate website, company blog, invoicing systems, time-sheets at month end.
3: Systems that stop a number of staff working.
I typically put machines that don't directly contribute to the bottom line, but are required for the company to continue, in this bracket. An outage of any machine at this level can be survived for a little while but it'll slow a lot of staff down, cause frustration and (after a while) cause major damage.
Examples: internal request tracking/ticketing, bug tracking, build machines, version control
2: Systems that hinder small numbers of staff.
This is another level that I use to cover two types of machine. The first type slows or hinders a number of staff but can be lived without. You can think of these as convenience or favour systems that make tasks easier or more pleasant. You'll often get a disproportionate number of queries when one of these goes down. This is a good sign and shows you understand what your users care about.
When I'm asked to help with desktop support I lump single user problems here. Although it's frustrating to have a single person unable to work it's often not as bad as any of the higher levels. I put a lot of special cases and caveats here (sales people on presentation days, QA engineers before a release) but the most sensible workaround is to separate desktop and sysadmin roles. You can typically hire desktop support staff cheaper than a sysadmin and give them the opportunity to train with the sysadmins when things are quiet.
Examples: Web front-end to a version control server, centralised log shares for debugging, departmental wikis, individual laptops or desktops.
1: Play / scratch machines that no one really cares about.
Not much lives in this level; if no one cares if it's up or not then you should seriously consider turning it off. The smaller and simpler your environment the easier it'll be to manage.
Examples: sysadmin "play" lab environment, company jukebox
And now some warnings - these categories are (obviously) not perfect. The ratings are host-centric (but can be pulled up a level and applied to clusters or groups of identical machines) and they don't factor in office politics (some systems are loved by certain members of management and should be treated like one of their loved ones).
It's also worth noting that some systems rise in importance at certain times; examples are month end batch reports, time-sheet systems when invoices are due etc. It shouldn't be too hard to work out most of these (typically cyclical) requirements after speaking to the other staff. Asking about their requirements is always a good way to help build bridges and show you do understand that the systems are there for a reason: to be used.
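To make the scheme concrete, here is a small sketch of how the ratings, the shrinking time budget and the cyclical rises might be written down. Everything in it is hypothetical: the machine names, the minutes per level and the month-end window are invented for illustration, and only the idea of working from five down to one with less time at each step comes from the description above.

```ruby
# Invented daily time budget in minutes for each importance level.
BUDGET_MINUTES = { 5 => 120, 4 => 90, 3 => 60, 2 => 30, 1 => 15 }.freeze

# raised_to and raised_on are optional and describe a temporary rise,
# for example a time-sheet system at month end.
Machine = Struct.new(:name, :importance, :raised_to, :raised_on)

# Importance of a machine on a given day of the month.
def importance_on(machine, day)
  return machine.raised_to if machine.raised_on&.include?(day)

  machine.importance
end

# Group machines by level, highest first, with the time set aside for each.
def work_order(machines, day)
  machines.group_by { |m| importance_on(m, day) }
          .sort_by { |level, _| -level }
          .map { |level, group| [level, BUDGET_MINUTES.fetch(level), group.map(&:name)] }
end

machines = [
  Machine.new('payments-db', 5),
  Machine.new('timesheets', 3, 4, 28..31),
  Machine.new('bugtracker', 3),
  Machine.new('jukebox', 1)
]

work_order(machines, 30).each do |level, minutes, names|
  puts "#{level}: #{minutes} min - #{names.join(', ')}"
end
```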
Posted: 2007/03/10 16:05 | /sysadmin
The Cron Commandments - part 1
Although it's a rare Unix machine that doesn't run at least a couple of
custom cronjobs it's an even more special snowflake that does them
properly. Below are some of the more common problems I've seen and my
thoughts on them.
Always use a script, never a bare command line
A parenthesis-wrapped command line in a crontab sends shivers down my spine. Nothing says "I didn't really think this through" and "I've done the bare minimum to make it work" in quite the same way.
Don't shout about success
A cronjob that completes successfully shouldn't post anything to
stdout or stderr. Most developers have no idea
how annoying it is to get a single line email every minute proclaiming
all's well. It also trains people to delete messages with certain subject
lines without reading them, which'll catch you out when a real
problem occurs.
Caveat 1: Logging that the script finished, and adding some timing information, can often be useful. It's good to have an audit trail of what actually ran and how long it took. By logging to syslog you gain the benefits of centralised logs (you are centralising your log files, right?) and, because it's passive, the sysadmin doesn't get notified about expected completions unless she looks for them.
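A minimal sketch of that pattern in ruby: stay silent on success, write one timing line to syslog, and let any failure reach stderr so cron mails it. The job name and method names are hypothetical, and the default sink assumes ruby's syslog library is installed and that the platform defines the cron facility.

```ruby
# Assumed default sink: one info line to syslog under the cron facility.
def syslog_sink(message)
  require 'syslog'
  Syslog.open('nightly-report', Syslog::LOG_PID, Syslog::LOG_CRON) do |log|
    log.info('%s', message)
  end
end

# Run the job body, print nothing on success, and record how long it took.
# An exception skips the log line and propagates, so cron sees the error.
def run_quietly(name, sink: method(:syslog_sink))
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  sink.call(format('%s finished in %.2fs', name, elapsed))
end

# Usage with an in-memory sink in place of syslog:
messages = []
run_quietly('nightly-report', sink: ->(line) { messages << line }) do
  sleep 0.01 # stand-in for the real work
end
```

Swapping the sink for the default sends the same line to syslog instead of the array.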
Debug information should be an option
A script invoked via cron has a different environment than one run from the command line, so it'll work (and break) in different ways - which you'll want to see. It should be possible to enable additional debug output without making any changes to the script itself. A command-line flag or environment variable should be enough to trigger it. Often all you'll get is an email with the error and the debug information, so ensure you can diagnose from your own output.
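One way to wire that up, as a sketch only: the --debug flag and the CRON_DEBUG variable are names I've made up for the example, and either one switches the extra output on without editing the script.

```ruby
# True when the (hypothetical) --debug flag or CRON_DEBUG=1 is present.
def debug_enabled?(argv = ARGV, env = ENV)
  argv.include?('--debug') || env['CRON_DEBUG'] == '1'
end

# Write a debug line to stderr, which cron will include in its email.
def debug(message, enabled: debug_enabled?, out: $stderr)
  out.puts("DEBUG: #{message}") if enabled
end

# Cron's environment is usually much smaller than a login shell's,
# so the search path is a typical thing to dump.
debug("PATH=#{ENV['PATH']}")
```

In a crontab the variable can be set on the line itself, so debug can be turned on for one job without touching the others.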
Beware overrunning jobs
Almost all your cronjobs should check to ensure that another instance isn't already running and exit if it is - after logging the issue. I've lost track of the number of difficult-to-track bugs caused by a cronjob starting, taking longer to finish than the interval between runs, and then having another job follow it. This often causes deadlocks, resource conflicts, maxed out database connections and corrupted data. Some very simple cronjobs don't need this but when in doubt put it in. And log the fact; this can help pick up growth trends ("it took 2 minutes until we added the extra users").
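A common way to do this check is a non-blocking exclusive lock on a lock file, sketched below with ruby's File#flock. The lock file name and method name are hypothetical, and the warn call is only a stand-in for whatever logging you already use.

```ruby
require 'tmpdir'

# Run the block only if no other instance holds the lock.
# Returns true if the block ran, false if a previous run is still active.
def with_exclusive_lock(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
    # LOCK_NB makes flock return false straight away instead of waiting.
    unless lock.flock(File::LOCK_EX | File::LOCK_NB)
      warn "#{File.basename($PROGRAM_NAME)}: previous run still active, exiting"
      return false
    end
    yield
    true
  end
end

lock_path = File.join(Dir.tmpdir, 'nightly-report.lock')
ran = with_exclusive_lock(lock_path) do
  # the real work of the job goes here
end
exit 1 unless ran
```

The lock goes away when the file is closed, including when the process dies, so a crashed run doesn't leave a stale lock behind the way a bare pid file can.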
Beware /dev/null redirects in crontabs
Any cronjob that redirects stdout, stderr or
(worse) both to /dev/null is going to cause you headaches
and will need some attention. People typically add these when something
is wrong and they lack either the skill or the time to fix it. The
presence of these redirects shows a lack of confidence in the script and
should be treated as a red flag. On the plus side they point you at
potential trouble.
Avoid running as root
As in most things, using root is bad. Try writing your cronjobs so they
can run under a non-privileged user, with a little sudo mixed
in if you need it. It'll save you a lot of hassle when something goes wrong
and the script tries to eat your file system.
Closing Comments
And to close, a couple of quick points: test your cronjobs from cron,
not just interactively. /etc/ is often backed up,
/var/spool/cron/crontabs/ is often missed so think about your
deployment locations. Make sure your admins know about any cronjobs your
packages add. And finally, if you generate your crontabs always add a
newline at the end.
If you at least know why you're breaking some of these rules (and they'd better be good reasons) then you'll be a good few steps above most developers I've worked with. And we'll get on a lot better.
Posted: 2007/03/10 15:59 | /sysadmin
First Puppet London Users Meet - Thursday, March 22
I'm a lurker on the
Puppet mailing
list and after some discussions John Arundel has stepped up and done the
organising for the first
Puppet London Users Meet - Thursday, March 22. I'm not using Puppet
yet but I'm thinking of heading along to hear people's adoption stories.
I've also been thinking about the lack of a sysadmin community in London since GLLUG became a lot more newbie friendly and SAGE-WISE faded out. If you're a sysadmin in London interested in meeting some of your peers, come along and say "Hi"; this might be the start of a beautiful friendship^Wuser group.
Posted: 2007/03/10 00:30 | /events

