Tue, 17 Apr 2007
No one likes a whinger - The systems fight back
After my little
whine, I logged in to do my last checks for the evening and discovered that
one of our webservers had died due to a hard drive going bang, our
production environment Nagios box had lost one of its network
connections and a chunk of our SAN kit was complaining about power
issues. Turns out that most of these were due to a power surge that
killed a network switch and three of the racks' power strips. On the very
plus side no one outside of the systems team noticed. Resilience is a
wonderful thing when you get it right.
Woke up this morning, checked the Nagioses Nagii and found
out that one of our other products' database servers had gone boom (my
fellow sysadmins were fixing that one) and the failover had mostly worked.
No interesting logs, no hardware problems and a three-hour gap in syslog
(and only syslog) to help explain the outage.
What have I learned? That the production servers read my blog. And they hate me.
Posted: 2007/04/17 21:32 | /sysadmin

