Small Mosaic



Sat, 21 Apr 2007

Deferring Defects - Autonomics
Autonomics refer to the ability of computer systems to be self-managing. -- autonomics.ca

Here's one that has been bothering me. Suppose you have a recurring problem that your "autonomic solution" can handle every time it occurs without anyone knowing. At what point does the fact that there is a treatable issue propagate up to a real person?

While an automatic "fix and tell me later" approach helps change your work from fire fighting to planned tasks, what classifies a temporary problem as important enough to warrant investigating it? It's hard enough to justify preventive maintenance with the current systems; if a problem fixes itself then you may never be given the time to investigate further.

If a problem fixes itself before anyone notices, or before a sysadmin can look at it, is it a problem?
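
One possible answer to "at what point" is a simple threshold: let the automation fix things quietly, but count the fixes and tell a human once the same problem has been fixed too often in a given period. The sketch below is only an illustration of that idea; the class name, the limit of three fixes and the 24 hour window are all assumptions, not anything a particular autonomic product does.

```python
from collections import deque


class AutoFixTracker:
    """Count automatic fixes per problem and flag when a human should look."""

    def __init__(self, max_fixes=3, window_seconds=24 * 60 * 60):
        # Both limits are illustrative assumptions, not recommendations.
        self.max_fixes = max_fixes
        self.window_seconds = window_seconds
        self._fixes = {}  # problem name -> deque of fix timestamps

    def record_fix(self, problem, timestamp):
        """Record one automatic fix; return True if it should be escalated."""
        times = self._fixes.setdefault(problem, deque())
        times.append(timestamp)
        # Forget fixes that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_fixes
```

A fix that returns True would still be applied automatically; the only difference is that a ticket or email gets raised as well, so the recurring defect stops being invisible.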

Posted: 2007/04/21 17:41 | /sysadmin


Handling Requests: Three Simple Rules
I'm a sysadmin and half my working life seems to be spent handling other people's requests (which is why I'm trying to move over to infrastructure work - where I can hopefully concentrate on something for three whole minutes). While I was chatting with a junior admin at a tech talk in the week, the following three tips came up:

Use a ticketing system. This one comes up a lot but it's true: never dropping someone's request is well worth the time spent setting it up.

Customers sending requests to individuals is a BAD THING. People go on holiday, they get dragged into meetings, they work on projects. Which of those do you think someone who's been waiting for a request will accept as an excuse? None of them. And telling them that it's their own fault is a great way of annoying them even more - even if it is true. Training your users to reply to all (so follow-ups also get tagged by the ticketing system), and not to send a "Just a quick question" mail to their favourite sysadmin, helps you keep an eye on the workload while ensuring that things can't drop between the cracks. Even if it's an often repeated uphill struggle.

There is a caveat to this one. If you've got the resources it's often helpful to assign a sysadmin to a new employee for their first couple of days. Asking those awkward new starter questions is a lot easier face to face than on a mailing list of who knows how many. Any requests can then be added in to the ticketing system while the sysadmin is present, showing the starter how to use it, and that the admins actually pay attention to and process tickets. Nothing beats a good first impression.

Lastly, people have an expectation of how long something should take. If you break this unwritten rule, even for a good reason, then they'll notice and it'll be used against you at some future point. While it's not ideal for concentration, quickly completing short tasks like password changes can make a huge difference in how your team is perceived.

Posted: 2007/04/21 17:28 | /sysadmin


Tue, 17 Apr 2007

No one likes a whinger - The systems fight back
After my little whine I logged in to do my last checks for the evening and discovered that one of our webservers had died due to a hard drive going bang, our production environment Nagios box had lost one of its network connections and a chunk of our SAN kit was complaining about power issues. It turns out that most of these were due to a power surge that killed a network switch and three of the racks' power strips. On the very plus side no one outside of the systems team noticed. Resilience is a wonderful thing when you get it right.

Woke up this morning, checked the Nagioses^WNagii and found out that one of our other product's database servers had gone boom (my fellow sysadmins were fixing that one) and the failover had mostly worked. No interesting logs, no hardware problems and a three hour gap in syslog (and only syslog) to help explain the outage.

What have I learned? That the production servers read my blog. And they hate me.

Posted: 2007/04/17 21:32 | /sysadmin


Wed, 28 Mar 2007

Bonded | Teamed Network Interface Challenge
Here is another one for the sysadmins in the audience:

How ...

... many of your servers have multiple network ports in the back?

... many of them have bonding (teaming for the Windows people) enabled?

... do you know when one interface goes down if the machine stays connected?

... long does it take for you to be notified?

... do you know if they start flapping?

... many have their bonded interfaces plugged in to different switches?

... do you know if someone mistakenly plugs both into one switch?
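
For the "do you know when one interface goes down" and flapping questions, a rough starting point on Linux is the status file the bonding driver exposes under /proc/net/bonding/. The sketch below parses the text of that file; it assumes the usual "Slave Interface", "MII Status" and "Link Failure Count" lines, and the function names are made up for this example. It says nothing about which switch each port is plugged into.

```python
def parse_bond_status(text):
    """Map each slave interface to its link state and failure count."""
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = slaves.setdefault(value, {"up": False, "failures": 0})
        elif current is not None and key == "MII Status":
            # The bond-level "MII Status" appears before any slave and is skipped.
            current["up"] = (value == "up")
        elif current is not None and key == "Link Failure Count":
            current["failures"] = int(value)
    return slaves


def degraded_slaves(slaves):
    """Return the names of slaves whose link is currently down."""
    return sorted(name for name, info in slaves.items() if not info["up"])
```

Feeding it the contents of /proc/net/bonding/bond0 (the bond name is an assumption) from a monitoring check gives you the "one leg is down" alert, and a failure count that keeps rising between runs is a reasonable hint that a link is flapping.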

I've got a fun week ahead.

Posted: 2007/03/28 00:06 | /sysadmin


Mon, 26 Mar 2007

Monolithic Config Files Considered Harmful^WAwkward to Manage
This came up in conversation with a developer at the Google OpenSource Jam so I thought I'd mention it while it is fresh in my mind (update: at which point I forgot to move it to the published directory. Doh). Breaking up config files isn't done just to annoy people; it's done to make automated and mass management easier.

A solid practical example is the Debian Apache configs. Historically most distros (and too many current ones) used a single config file for Apache. While this made interactive editing easier by presenting all the options in a single place (and in a sensible order) it made it very hard for the package management software (or automation aficionados) to add a module or virtual host without some hairy scripting. Removing settings when a package is removed or updating a small chunk of the config in an upgrade is even more painful; as for preserving local changes - haha.

Breaking the config out into a number of files and directories, and combining them at run time, makes the addition of a new vhost or module config just a file drop and possibly a symlink (often used if the configs are order dependent). This is easier for third party packages to perform and makes provisioning additional apps a lot easier.
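
To show how little scripting the split layout needs, here is a minimal sketch of the "file drop and a symlink" step, using the sites-available / sites-enabled directory names from the Debian Apache packages. The function names and the idea of passing in a config root are assumptions for illustration; a real package would also need to ask Apache to reload.

```python
import os


def add_vhost(conf_root, name, config_text):
    """Drop a vhost config in sites-available and enable it with a symlink."""
    available = os.path.join(conf_root, "sites-available")
    enabled = os.path.join(conf_root, "sites-enabled")
    os.makedirs(available, exist_ok=True)
    os.makedirs(enabled, exist_ok=True)

    target = os.path.join(available, name)
    with open(target, "w") as handle:
        handle.write(config_text)

    link = os.path.join(enabled, name)
    if not os.path.islink(link):
        # Relative link so the tree can be moved or backed up intact.
        os.symlink(os.path.join("..", "sites-available", name), link)
    return link


def remove_vhost(conf_root, name):
    """Removing a package's config is just as mechanical: delete two paths."""
    for subdir in ("sites-enabled", "sites-available"):
        path = os.path.join(conf_root, subdir, name)
        if os.path.lexists(path):
            os.remove(path)
```

Compare that with editing a stanza in and out of the middle of a single large file while preserving whatever the local admin changed around it.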

So what's the main downside? Debugging. An "Error on line 50" is harder to track when line 50 could be in one of twelve files. But with a little forethought debug messages can be updated to show all the useful information. So next time you're writing an app of many parts please spare a thought for the sysadmins and make it easily manageable.
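
The "little forethought" can be as simple as remembering where each line came from when the fragments are combined. The sketch below assumes fragments are plain text files matched by a *.conf pattern and read in sorted order; the function names and that pattern are illustrative, not how any particular application does it.

```python
import glob
import os


def load_fragments(conf_dir, pattern="*.conf"):
    """Return (filename, line_number, text) for every line, in sorted file order."""
    lines = []
    for path in sorted(glob.glob(os.path.join(conf_dir, pattern))):
        with open(path) as handle:
            for number, text in enumerate(handle, start=1):
                lines.append((path, number, text.rstrip("\n")))
    return lines


def describe_line(lines, combined_number):
    """Translate a 1-based line number in the combined config to its origin."""
    path, number, text = lines[combined_number - 1]
    return "%s line %d: %s" % (path, number, text)
```

With the origin kept alongside each line, an error can name the fragment and the line within it rather than a position in a combined file that no one can open.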

Posted: 2007/03/26 22:02 | /sysadmin


Sat, 10 Mar 2007

Importance Levels - A Simple Example
When you're first introduced to an environment you'll have the ever fun task of working out which machines should get the most time; and that order seldom matches which machines actually need the most attention. To help me prioritise I've worked out a simple importance rating system to show where I spend my time.

Below is a simplified version. I use it to assign a single importance number to each machine, and then I allocate a certain amount of time each day to work on the issues, requests and improvements I've got in my todo list for that level. When I've run out of time I move down a number and start working on anything related to machines rated at that importance. The amount of time I put aside for each level decreases as I work towards one.

5: Customer facing systems that generate revenue.

This is my no brainer. Pretty much everything is secondary to keeping the money coming in.

Examples: customer database, webservers and databases related to customer payments.

4: Internal Money Makers and Customer Visible Systems.

I normally put customer facing systems that don't make money in this bracket. An online presence and reputation for availability have been important to most of the companies I've worked at. It sounds horrible but it's a lot easier to save face and beg forgiveness from a five day internal outage than a one day external one; well, sorta. If you're a blogger-watched company then this is even more important.

I also put internal money makers at importance 4. "Cash is king" should be true in all departments, including those where Sysadmins dwell. I've only ever had simultaneous problems with both types of importance four systems a couple of times. Each time had circumstances that made the priorities clear.

Examples: corporate website, company blog, invoicing systems, time-sheets at month end.

3: Systems that stop a number of staff working.

I typically put machines that don't directly contribute to the bottom line but are required for the company to continue in this bracket. A short outage of any machines at this level can be survived for a little while but it'll slow a lot of staff down, cause frustration and (after a while) cause major damage.

Examples: internal request tracking/ticketing, bug tracking, build machines, version control

2: Systems that hinder small numbers of staff.

This is another level that I use to cover two types of machine. The first type slows or hinders a number of staff but can be lived without. You can think of these as convenience or favour systems that make tasks easier or more pleasant. You'll often get a disproportionate number of queries when one of these goes down. This is a good sign and shows you understand what your users care about.

When I'm asked to help with desktop support I lump single user problems here. Although it's frustrating to have a single person unable to work it's often not as bad as any of the higher levels. I put a lot of special cases and caveats here (sales people on presentation days, QA engineers before a release) but the most sensible workaround is to separate desktop and sysadmin roles. You can typically hire desktop support staff cheaper than a sysadmin and give them the opportunity to train with the sysadmins when things are quiet.

Examples: Web front-end to a version control server, centralised log shares for debugging, departmental wikis, individual laptops or desktops.

1: Play / scratch machines that no one really cares about.

Not much lives in this level. If no one cares whether it's up or not then you should seriously consider turning it off. The smaller and simpler your environment, the easier it'll be to manage.

Examples: sysadmin "play" lab environment, company jukebox

And now some warnings - these categories are (obviously) not perfect. The ratings are host-centric (but can be pulled up a level and applied to clusters or groups of identical machines) and they don't factor in office politics (some systems are loved by certain members of management and should be treated like one of their loved ones).

It's also worth noting that some systems rise in importance at certain times; examples are month end batch reports, time-sheet systems when invoices are due etc. It shouldn't be too hard to work out most of these (typically cyclical) requirements after speaking to the other staff. Asking about their requirements is always a good way to help build bridges and show you do understand that the systems are there for a reason; to be used.
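
The daily routine described above (work through level 5 until its time runs out, then drop to level 4, and so on) can be sketched in a few lines. Everything concrete here is an assumption made for the example: the tuple layout, the host names and the minutes per level.

```python
def plan_day(tasks, minutes_per_level):
    """Pick tasks for today, highest importance first, within each level's budget.

    tasks: list of (host, importance, estimated_minutes) tuples.
    minutes_per_level: dict mapping importance (5 down to 1) to minutes available.
    """
    plan = []
    for level in sorted(minutes_per_level, reverse=True):
        remaining = minutes_per_level[level]
        for host, importance, minutes in tasks:
            # Only take work for this level that still fits in its time slot.
            if importance == level and minutes <= remaining:
                plan.append((host, minutes))
                remaining -= minutes
    return plan
```

Called with a budget such as {5: 120, 4: 60, 3: 45, 2: 20, 1: 0}, the plan spends most of the day on the revenue systems and only reaches the lower levels with whatever time was set aside for them. Temporarily raising a host's number at month end covers the cyclical cases.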

Posted: 2007/03/10 16:05 | /sysadmin


The Cron Commandments - part 1
Although it's a rare Unix machine that doesn't run at least a couple of custom cronjobs it's an even more special snowflake that does them properly. Below are some of the more common problems I've seen and my thoughts on them.

Always use a script, never a bare command line.

A parenthesis-wrapped command line in a crontab sends shivers down my spine. Nothing says "I didn't really think this through" and "I've done the bare minimum to make it work" in quite the same way.

Don't shout about success

A cronjob that completes successfully shouldn't post anything to stdout or stderr. Most developers have no idea how annoying it is to get a single line email every minute proclaiming all's well. It also trains people to delete messages with certain subject lines without reading them, which'll catch you out when a real problem occurs.

Caveat 1: Logging that the script finished, and adding some timing information, can often be useful. It's good to have an audit trail of what actually ran and how long it took. By logging to syslog you gain the benefits of centralised logs (you are centralising your log files, right?) and, because it's passive, the sysadmin doesn't get notified about expected completions unless she looks for them.
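
A minimal sketch of "quiet on success, but leave an audit trail", using Python's standard syslog module. The wrapper name and the message format are assumptions for illustration; the log argument exists only so the behaviour can be tried without a syslog daemon.

```python
import syslog
import time


def run_quietly(job, name, log=None):
    """Run job(); print nothing on success, but log completion and timing."""
    if log is None:
        # Default to the local syslog daemon; cron never sees this output.
        log = lambda message: syslog.syslog(syslog.LOG_INFO, message)
    started = time.monotonic()
    job()  # An exception here propagates, so cron still mails real failures.
    elapsed = time.monotonic() - started
    log("%s finished in %.1f seconds" % (name, elapsed))
```

Nothing is written to stdout or stderr when the job succeeds, so cron has nothing to mail, while the timing line ends up wherever your syslog is shipped.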

Debug information should be an option

A script invoked via cron has a different environment from one run at the command line, so it'll work (and break) in different ways - which you'll want to see. It should be possible to enable additional debug output without making any changes to the script itself. A command-line flag or environment variable should be enough to trigger it. Often all you'll get is an email with the error and the debug information, so ensure you can diagnose from your own output.
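
One way to offer both switches without editing the script is sketched below. The --debug flag and the CRON_DEBUG variable are names invented for this example; pick whatever fits your own conventions.

```python
import argparse
import sys


def debug_enabled(argv, environ):
    """True if --debug was passed or CRON_DEBUG is set to something other than 0."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true")
    # parse_known_args leaves the script's other arguments alone.
    args, _unknown = parser.parse_known_args(argv)
    return args.debug or environ.get("CRON_DEBUG", "") not in ("", "0")


def debug(enabled, message):
    """Write debug output to stderr so it lands in cron's email."""
    if enabled:
        sys.stderr.write("DEBUG: %s\n" % message)
```

A script would call debug_enabled(sys.argv[1:], os.environ) once at start-up. Setting the variable on the crontab line, or for a single run, then turns the extra output on without touching the code.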

Beware overrunning jobs

Almost all your cronjobs should check that another instance isn't already running and, if one is, log the issue and exit. I've lost track of the number of difficult-to-track bugs caused by a cronjob starting, taking longer to finish than the interval between runs, and then having another job follow it. This often causes deadlocks, resource conflicts, maxed out database connections and corrupted data. Some very simple cronjobs don't need this, but when in doubt put it in. And log the fact; this can help pick up growth trends ("it took 2 minutes until we added the extra users").
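
A common way to do the "is another instance running?" check is a non-blocking lock on a file, which the kernel releases even if the job crashes. This is a sketch using fcntl.flock from the Python standard library; the function name and the choice of lock path are assumptions.

```python
import fcntl
import os


def acquire_lock(path):
    """Return an open, locked file descriptor, or None if another run holds it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes this fail immediately instead of waiting for the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd  # Keep this open for the life of the job; closing it unlocks.
```

When acquire_lock returns None the script should log that the previous run is still going and exit, which also gives you the overrun record mentioned above.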

Beware /dev/null redirects in crontabs

Any cronjob that redirects stdout, stderr or (worse) both to /dev/null is going to cause you headaches and will need some attention. People typically add these when something is wrong and they lack either the skill or the time to fix it. The presence of these redirects shows a lack of confidence in the script and should be treated as a red flag. On the plus side they point you at potential trouble.
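
Since these redirects are useful pointers to trouble, it can be worth listing them. The sketch below scans the text of a crontab for active entries that send output to /dev/null; the function name is made up and the pattern is deliberately simple, so treat the results as a list of things to review rather than a verdict.

```python
import re

# Matches ">/dev/null", "> /dev/null", "2>/dev/null", ">>/dev/null" and similar.
DEVNULL_REDIRECT = re.compile(r">\s*/dev/null")


def find_devnull_redirects(crontab_text):
    """Return (line_number, line) for each active entry that discards output."""
    hits = []
    for number, line in enumerate(crontab_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # Skip blank lines and comments.
        if DEVNULL_REDIRECT.search(stripped):
            hits.append((number, stripped))
    return hits
```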

Avoid running as root

As in most things, using root is bad. Try writing your cronjobs so they can run under a non-privileged user, with a little sudo mixed in if you need it. It'll save you a lot of hassle when something goes wrong and the script tries to eat your file system.

Closing Comments

And to close, a couple of quick points: test your cronjobs from cron, not just interactively. /etc/ is often backed up, /var/spool/cron/crontabs/ is often missed so think about your deployment locations. Make sure your admins know about any cronjobs your packages add. And finally, if you generate your crontabs always add a newline at the end.

If you at least know why you're breaking some of these rules (and they better be good reasons) then you'll be a good few steps above most developers I've worked with. And we'll get on a lot better.

Posted: 2007/03/10 15:59 | /sysadmin


Fri, 09 Mar 2007

Disk Delving - 2 Good Papers and a Blog
"The Google team found that 36% of the failed drives did not exhibit a single SMART-monitored failure. They concluded that SMART data is almost useless for predicting the failure of a single drive."
-- StorageMojo - Google's Disk Failure Experience

There have been two excellent papers on disk drive failures released recently, the Dugg and Dotted Google paper - Failure trends in a large disk drive population (warning: PDF) and the also excellent but less hyped Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?.

Both papers make very interesting reading (the comparisons of SCSI to SATA disks alone should turn some heads) but they are a little dry, so once you've worked your way through them it's worth looking at the summarised highlights over at StorageMojo, a top notch blog that was recommended to me by Kim Hawtin. StorageMojo covered both papers and I've linked to them in the quotes above and below.

"Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times. If I'm an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive."
-- StorageMojo - Everything You Know About Disks Is Wrong

Posted: 2007/03/09 09:21 | /sysadmin


Sysadmin Challenge - Disk Usage
Here's one for the sysadmins in the crowd; if you were asked to show the following, how long would it take you to gather the information?

If you use Nagios you can cheat and work out the full drive size from the free space and percentage used reported by the disk checks, but that's... icky. You get bonus points for having prediction built in to your usage graphs.
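
For the record, the "cheat" is one line of arithmetic: free space is the total multiplied by the fraction not used, so dividing gives the total back. The sketch below assumes you have the free space in some unit and the percentage used as a number; it does not parse any real Nagios output.

```python
def total_size(free, percent_used):
    """Recover total capacity from free space and percentage used.

    free = total * (1 - percent_used / 100), so divide to get total.
    """
    if not 0 <= percent_used < 100:
        raise ValueError("percent_used must be in [0, 100)")
    return free / (1 - percent_used / 100.0)
```

Part of what makes it icky is that the percentage is usually rounded to a whole number, so on a nearly full disk a one point rounding error moves the answer a long way.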

Posted: 2007/03/09 08:48 | /sysadmin



Copyright © 2000-2005 Dean Wilson