Small Mosaic


Sat, 15 Jan 2005

The Hidden Curse of High Uptime
A number of Unix/Linux people pride themselves on achieving the highest uptime they can. While this may seem like harmless fun, in production environments (which are mostly fun-free places) it can hide a number of problems that later become major issues.

At some point the machine will have to come down for a power off or reboot, and it's then expected to come back up. This is where the problems can start. In almost any environment, no matter how simple, changes get made to the running system and given some testing time; then they are forgotten about and never made persistent, so they won't survive a reboot. The problem gets worse as more complexity and more people are involved.

Whether it's a firewall rule that's never written to the config file, an unsaved routing table entry or a service nobody remembered to enable in rc.local, on any machine with a high uptime there is a chance that something won't come up. And if it's a remote box, it'll be the something that stops you getting in to fix it; Murphy ensures this.
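One way to catch this kind of drift before the reboot does is to compare what the box is doing right now with what it will load at boot. The following is only a rough sketch: the function names are made up for illustration, and it assumes you can get both the live state and the saved config as lines of text in the same format (firewall rules in the example, but routes would work the same way).

```python
import re

# Packet/byte counters such as "[123:4567]" change constantly on a live
# system, so strip them before comparing.
_COUNTERS = re.compile(r"\[\d+:\d+\]")


def normalise(lines):
    """Return the set of meaningful lines: no comments, blanks or counters."""
    cleaned = set()
    for line in lines:
        line = _COUNTERS.sub("", line).strip()
        if line and not line.startswith("#"):
            # Collapse runs of whitespace so spacing differences don't matter.
            cleaned.add(" ".join(line.split()))
    return cleaned


def unsaved_entries(running, saved):
    """Entries active on the live system but absent from the saved config.

    Both arguments are iterables of text lines (hypothetical inputs: the
    caller decides where they come from).
    """
    return sorted(normalise(running) - normalise(saved))
```

In practice `running` might be the captured output of `iptables-save` and `saved` the contents of whichever file your boot scripts restore from; where that file lives depends on your distribution, so treat that part as an assumption. Anything the function returns is a change that will vanish at the next reboot.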

My recommendation? Pick a schedule (once a month, maybe once a quarter), take the machines offline and then see what doesn't come up (you do have monitoring in place, don't you?). If you have the opportunity, you should combine this with your UPS testing (and you'd better be testing those!). If you can't afford to take a server down for testing, then you've got a resilience problem and a single point of failure that needs addressing.
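If you don't have proper monitoring yet, even a crude post-reboot check beats finding out from your users. Here's a minimal sketch, assuming you keep a list of the TCP services each machine is supposed to offer; the service names, hosts and ports are placeholders, not anything from a real inventory.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def services_down(expected, probe=port_is_open):
    """Return the names of expected services that fail the probe.

    `expected` maps a service name to a (host, port) pair. `probe` is
    swappable so the logic can be exercised without touching the network.
    """
    return [
        name
        for name, (host, port) in sorted(expected.items())
        if not probe(host, port)
    ]


# Hypothetical usage after a scheduled reboot:
#   expected = {"ssh": ("192.0.2.10", 22), "smtp": ("192.0.2.10", 25)}
#   for name in services_down(expected):
#       print("DOWN after reboot:", name)
```

A successful TCP connect only tells you that something is listening, not that the service is healthy, so this is a floor rather than a replacement for real monitoring.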


Posted: 2005/01/15 15:55 | /serversmells



Copyright © 2000-2005 Dean Wilson