LCA2006: Giving a Lightning Talk at 9:40am Tuesday

Simon Lyall grabbed me at registration yesterday morning and asked if I'd give a lightning talk in the sysadmin track. Apparently I get 10-15 minutes to talk about whatever I want, though presumably they want to hear about system administration at Weta.

I hate talking off the top of my head, so here are the beginnings of an outline:

Statistics Overview

  • ~4000 render procs (mostly IBM Blade servers)
  • ~100 TB online storage (mostly Network Appliance)
  • ~700 TB nearline
  • ~85 racks of gear
  • Very simple network: big L2s separated by a single L3 core

Lessons

  • Automate everything you can. Time-consuming tasks that repeat will kill you, so spend the time up front.
  • Lights-out management is basically required; blade servers rule.
  • OpenLDAP doesn't scale as well as you'd hope. We have big hopes for the Fedora Directory Server, but we'll probably go back to flat files since we don't have a large number of users.
  • We only run one copy of Linux: all boxes rsync against a golden image at boot time to stay in sync and get updates.
  • Monitor everything by default. By the time you know you need to monitor something, it's too late; you NEED history to see problems.
  • Vendor lock-in sucks; avoid it at all costs.
  • Our single biggest technical problem is the 16-group RPC limitation. Bring on NFSv4 ...
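
For anyone wondering what the 16-group problem looks like in practice: the classic AUTH_SYS (AUTH_UNIX) RPC credential only carries 16 supplementary group IDs, so a user in more groups than that silently loses access over NFSv3. Below is a rough sketch of how one might find the affected users, assuming group data lives in a flat /etc/group-style file. The function names and the sample data are made up for illustration and are not anything Weta actually runs.

```python
from collections import defaultdict

# Supplementary gids that fit in an AUTH_SYS RPC credential.
AUTH_SYS_MAX_GROUPS = 16


def groups_per_user(group_file_text):
    """Map each user to the set of groups that list them as a member.

    Expects text in /etc/group format: name:password:gid:member1,member2
    """
    membership = defaultdict(set)
    for line in group_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines rather than guess
        group_name, members = fields[0], fields[3]
        for user in members.split(","):
            user = user.strip()
            if user:
                membership[user].add(group_name)
    return membership


def users_over_limit(group_file_text, limit=AUTH_SYS_MAX_GROUPS):
    """Return {user: group_count} for users listed in more than `limit` groups."""
    return {
        user: len(groups)
        for user, groups in groups_per_user(group_file_text).items()
        if len(groups) > limit
    }
```

Feeding it the contents of a group file, for example `users_over_limit(open("/etc/group").read())`, gives a dictionary of the users who are over the limit. It only counts memberships listed in the file, so a primary group that is set only in the passwd file is not included.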

Moral

  • So long as you keep things simple, you can get away with an amazing amount of stupid.