Tue, 08 Jul 2008
More Memory Than Sense
My recent bugbear: servers with inaccessible memory.
You go and spec a nice new server with, say, 8GB of RAM (a little box),
you install Debian, you start adding applications to the machine, and then a
couple of months later some anal sysadmin comes along, runs free -m
and mutters about under-specced virtualization servers when he sees:
             total       used       free     shared    buffers     cached
Mem:          3287        225       3062          0         24        149
For those of you not paying attention: free sees only about 3.2GB of the 8GB installed, so the machine isn't using over half of its memory. So, first of all, how do you spot this, and secondly, how do you fix it?
If you're on Debian then the spotting is easy (for some hardware):
apt-get install lshw
lshw -class memory | grep -A 4 '\-memory'
If the size is bigger than the total from free then you've got
wasted resources.
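If you'd rather not do the subtraction in your head, the comparison can be sketched as a couple of shell functions. This is a rough sketch, not a polished tool: the names usable_mb and report_gap are made up for illustration, the installed figure is one you read off the lshw output yourself and type in as megabytes, and the 10% allowance for memory the kernel and firmware reserve is an arbitrary assumption.

```shell
#!/bin/sh
# usable_mb: read `free -m` output on stdin and print the Mem: total column.
usable_mb() {
    awk '/^Mem:/ { print $2 }'
}

# report_gap INSTALLED_MB USABLE_MB
# INSTALLED_MB is the size you read from lshw by hand (an assumption of
# this sketch; it does not parse lshw output itself).
# Returns non-zero when more than 10% of the installed memory is missing.
report_gap() {
    installed=$1
    usable=$2
    missing=$((installed - usable))
    # The 10% threshold is an arbitrary allowance for reserved memory.
    if [ "$missing" -gt $((installed / 10)) ]; then
        echo "WARNING: ${missing}MB of ${installed}MB installed is not visible to the kernel"
        return 1
    fi
    echo "OK: ${usable}MB of ${installed}MB usable"
}
```

On the box above you would run something like report_gap 8192 "$(free -m | usable_mb)" and get the warning.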
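The fix? Install the right bigmem kernel (on 32-bit Debian that's the PAE-enabled build, which lets the kernel address memory above 4GB). And then recompile VMware Server for the new kernel. Dammit.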
Posted: 2008/07/08 20:55 | /serversmells
Sat, 15 Jan 2005
The Hidden Curse of High Uptime
A number of Unix/Linux people seem to pride themselves on obtaining the
highest uptime they can. While this may seem like a little harmless fun,
in production environments (which are mostly fun-free places) it can
hide a number of problems that will later become major issues.
At some point the machine will have to come down and face a power off or reboot, and then it's expected to come back up. This is where the problems can start. In almost any environment, no matter how simple, a number of changes will be made to the running system and given some testing time; then they will be forgotten about and never made persistent, so they won't survive a reboot. The problem gets worse as more complexity and more people are involved.
Whether it's the simple addition of a firewall rule that's never written to the config file, an unsaved routing table entry or forgetting to enable a service in rc.local, on any machine with a high uptime there is a chance that something won't come up. And if it's a remote box it'll be something that stops you getting in to fix it; Murphy ensures this.
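The firewall case is the easiest one to check for before the reboot finds it for you. Below is a minimal sketch that compares a dump of the running rules (from iptables-save) with whatever file you restore from at boot. The function names are made up for illustration, and the saved-rules path in the usage note is an assumption; use wherever your rules actually live.

```shell
#!/bin/sh
# normalise_rules: strip the parts of iptables-save output that change on
# every run: comment lines (they carry a timestamp) and the
# [packets:bytes] counters on the chain policy lines.
normalise_rules() {
    sed -e '/^#/d' -e 's/ *\[[0-9]*:[0-9]*]//'
}

# rules_drift RUNNING_DUMP SAVED_DUMP
# Returns 0 when the two dumps describe the same rules, 1 otherwise.
rules_drift() {
    running=$(normalise_rules < "$1")
    saved=$(normalise_rules < "$2")
    if [ "$running" = "$saved" ]; then
        echo "running firewall rules match $2"
        return 0
    fi
    echo "running firewall rules differ from $2; they will not survive a reboot"
    return 1
}
```

Typical use would be iptables-save > /tmp/running.rules followed by rules_drift /tmp/running.rules /etc/iptables.rules, where /etc/iptables.rules is a hypothetical location. The same idea works for routes or anything else you can dump as text.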
My recommendation? Pick a schedule (every month, or maybe once a quarter), take the machines offline and then see what doesn't come up (you do have monitoring in place, don't you?). If you have the opportunity you should combine this with your UPS testing (and you'd better be testing those!). If you can't afford to take a server down for testing then you've got a resilience problem and a single point of failure that needs addressing.
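If your monitoring isn't up to telling you what went missing, a before-and-after list will do in a pinch. Here's a small sketch, assuming you capture one name per line into a file before the reboot and again afterwards; the function name missing_after and the way you build the lists are both assumptions, not a recommendation of any particular tool.

```shell
#!/bin/sh
# missing_after BEFORE_FILE AFTER_FILE
# Prints every line present in BEFORE_FILE but absent from AFTER_FILE,
# i.e. the things that were running before the reboot and did not return.
missing_after() {
    before_sorted=$(mktemp) || return 2
    after_sorted=$(mktemp) || { rm -f "$before_sorted"; return 2; }
    # comm needs sorted input; -u also drops duplicate names.
    sort -u "$1" > "$before_sorted"
    sort -u "$2" > "$after_sorted"
    # -23 hides lines unique to the second file and lines common to both.
    comm -23 "$before_sorted" "$after_sorted"
    rm -f "$before_sorted" "$after_sorted"
}
```

One way to build the lists is ps -eo comm= | sort -u > before.txt ahead of the reboot and the same into after.txt once the box is back; missing_after before.txt after.txt then prints the casualties.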
Posted: 2005/01/15 15:55 | /serversmells

