Service Interruption – 5/27/12
We’re currently experiencing a service interruption related to a power overload in the datacenter. It began at roughly 9:30pm ET. As of about 10:45pm ET, web services have been restored, but we still need to redistribute some of the power load. Updates to follow.
Update 11:05pm ET: Mail services should be back now.
Post mortem analysis:
In our datacenter, our servers span two racks. In rack 14, we have a handful of important servers and a small group of UPSs to keep them online. What happened tonight was that ‘umbrella’, the NFS server backing the mail system, exceeded the capacity of its attached UPS, throwing it into overload. Normally this would not have caused any sort of cascading issue. However, the switches connecting rack 14 to our other rack were also on this UPS, so connectivity to all of rack 14 was lost, expanding the outage.
Fortunately this suboptimal layout has now been fixed: the switches are on one UPS, and each NFS server in rack 14 is on its own UPS. That should prevent the sort of cascading outage we had tonight. I’ll also be talking to the datacenter crew about expanding our power allotment so we can continue to grow.
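For anyone curious, the kind of check that would have flagged the old layout can be sketched in a few lines of Python. This is only an illustration, not a tool we actually run: the UPS names, wattages, and capacities below are made-up placeholders, and the `audit` function is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    watts: int  # hypothetical power draw, not a measured value
    role: str   # "nfs" or "switch"


def audit(layout, capacity_watts):
    """Return a list of warnings for overloaded or risky UPS assignments."""
    warnings = []
    for ups, devices in layout.items():
        load = sum(d.watts for d in devices)
        if load > capacity_watts[ups]:
            warnings.append(f"{ups}: overloaded ({load}W > {capacity_watts[ups]}W)")
        roles = {d.role for d in devices}
        # A switch sharing a UPS with anything else means that an overload
        # caused by the other equipment also takes down connectivity.
        if "switch" in roles and len(roles) > 1:
            warnings.append(f"{ups}: switches share a UPS with other equipment")
    return warnings


# All numbers here are invented for illustration.
capacity = {"ups-a": 1000, "ups-b": 1000}

# Old layout: the NFS server and the inter-rack switches on one UPS.
before = {
    "ups-a": [
        Device("umbrella", 900, "nfs"),
        Device("switch-1", 100, "switch"),
        Device("switch-2", 100, "switch"),
    ],
    "ups-b": [],
}

# New layout: switches on one UPS, the NFS server on its own.
after = {
    "ups-a": [Device("umbrella", 900, "nfs")],
    "ups-b": [Device("switch-1", 100, "switch"), Device("switch-2", 100, "switch")],
}

for warning in audit(before, capacity):
    print(warning)
```

With these placeholder numbers, the old layout produces both an overload warning and a shared-UPS warning, while the new layout produces none.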
My apologies for the inconvenience this caused.