11 May
Minor interruption
We had a few machines reboot just now. Still investigating…
UPDATE 1 – 12:45pm ET – All services are restored. Investigation continuing.
UPDATE 2 – 1:10pm ET – As best I can tell, this was caused by a cascading power issue in the datacenter. I believe a batch task, combined with our other normal activity, pushed our peak power draw above the capacity of the UPS devices those servers are connected to. Because our environment is cross-connected, that load was then shifted onto another UPS, which in turn exceeded its capacity.
We’re now working with our datacenter provider to plan a migration of some servers so that we can spread the power load more evenly across UPS devices and better isolate servers from each other.