Simon Lyall grabbed me at registration yesterday morning and asked if I'd give a lightning talk in the sysadmin track. Apparently I get 10-15 minutes to talk about whatever I want, though presumably they want to hear about system administration at Weta.
I hate talking off the top of my head, so here are the beginnings of an outline:
Statistics Overview
- ~4000 render procs (mostly IBM Blade servers)
- ~100 TB online storage (mostly Network Appliance)
- ~700 TB near-line
- ~85 racks of gear
- Very simple network: big L2 segments separated by a single L3 core
Lessons
- Automate everything you can. Time-consuming tasks that repeat will kill you, so spend the time up front.
- Lights-out management is basically required; blade servers rule.
- OpenLDAP doesn't scale as well as you'd hope (we have big hopes for the Fedora Directory Server), but we'll probably go back to flat files since we don't have a large number of users.
- We only run one copy of Linux: all boxes rsync against a golden image at boot time to stay in sync and get updates.
- Monitor everything by default. By the time you know you need to monitor something, it's too late; you NEED history to see problems.
- Vendor lock-in sucks; avoid it at all costs.
- Our single biggest technical problem is the 16-group RPC limitation (AUTH_SYS credentials carry at most 16 group IDs). Bring on NFSv4 ...
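To make the 16-group point concrete, here's a rough sketch (not what we actually run) of a check for users who would trip over the limit. It assumes flat /etc/group-style lines as input and treats the limit as 16 supplementary groups per user; whether the primary group counts against that depends on how your clients fill in the credential, so adjust the threshold to taste. The function names are made up for illustration.

```python
from collections import defaultdict

# AUTH_SYS RPC credentials carry at most 16 group IDs.
# Assumption: we compare this against supplementary groups only.
AUTH_SYS_MAX_GROUPS = 16


def supplementary_counts(group_lines):
    """Count distinct GIDs per user from /etc/group-style lines.

    Each line looks like: name:password:gid:member1,member2,...
    Lines that don't fit that shape (comments, blanks) are skipped.
    """
    members = defaultdict(set)
    for line in group_lines:
        fields = line.strip().split(":")
        if len(fields) < 4 or not fields[2].isdigit():
            continue
        gid = int(fields[2])
        for user in fields[3].split(","):
            user = user.strip()
            if user:
                members[user].add(gid)
    return {user: len(gids) for user, gids in members.items()}


def over_limit(group_lines, limit=AUTH_SYS_MAX_GROUPS):
    """Return {user: group_count} for users in more than `limit` groups."""
    counts = supplementary_counts(group_lines)
    return {user: n for user, n in counts.items() if n > limit}
```

Feeding it an open file works, since a file iterates line by line: `over_limit(open("/etc/group"))`. If your users live in LDAP or NIS rather than flat files, you'd need to dump the group map to the same format first.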
Moral
- So long as you keep things simple, you can get away with an amazing amount of stupid.