Major Outage – NFS server forced migration
Outage duration: approximately 6 hours, from Saturday 4/7 @ 7pm ET through 4/8 @ 1am ET. Power issues affecting our primary NFS server appear to have caused some sort of boot failure. Emergency NFS migration performed.
All services restored, all data should be intact. More info to come.
Update – 10:15am 4/8 – MySQL InnoDB issues reported, likely due to an inadvertent config file change. Investigating now. (Resolved @ 10:35am: the InnoDB log file size had been changed inadvertently. It has been reverted, but I may go through the proper procedure to increase it, since the old size is a bit small; see below.)
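For the curious, the “proper procedure” on the MySQL versions of that era (pre-5.6) is roughly the following sketch. The paths, service name, and target size here are illustrative, not our actual configuration:

    # Shut MySQL down cleanly first; the old redo logs can only be
    # discarded after a clean shutdown (innodb_fast_shutdown of 0 or 1)
    service mysql-server stop

    # Move the old InnoDB log files aside rather than deleting them outright
    mv /var/db/mysql/ib_logfile0 /var/db/mysql/ib_logfile0.old
    mv /var/db/mysql/ib_logfile1 /var/db/mysql/ib_logfile1.old

    # In my.cnf, under [mysqld], set the new size, e.g.:
    #   innodb_log_file_size = 256M

    # On restart, InnoDB creates fresh log files at the new size
    service mysql-server start

Newer MySQL releases (5.6.8 and later) resize the log files automatically on restart, so this dance is only needed on older versions.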
Update 2 – More details.
This outage was caused by a variety of factors. I’ll proceed chronologically in the interest of full disclosure.
First, some background. At GeekISP we strive for reliability above almost everything else, since a stable platform is a prerequisite for building anything bigger. To that end, we try to have redundancy for every critical component, preferably of the auto-failover variety. For instance, there is a redundant pair of firewalls, each connected to a separate power controller, and each power controller connected to a separate UPS. This way it would take a simultaneous failure of at least two components to knock both firewalls offline. Other servers at GeekISP have redundant power supplies for the same reason: a typical deployment has one power supply connected to a UPS and the other to utility power, or sometimes both power supplies connected to UPSes. Either way, we have a reasonable level of protection against single-component power problems (and in some cases multi-component failures, though not all).
With the above in mind, the first failure was observed at approximately 7pm ET on 4/7. Both firewalls suddenly went offline, cutting off connectivity to the datacenter. Some of the other machines connected to the same power controllers as the firewalls also went offline at this time, but those machines rebooted cleanly; the firewalls did not. The monitoring station external to GeekISP noticed the problem and paged me, but unfortunately I did not notice: at the time I was in a loud environment and simply did not hear my phone.
At approximately 9:15pm ET I heard my pager go off again and responded immediately. I recognized the problem right away, contacted our datacenter provider for support, and had someone in front of the main GeekISP rack within 30 minutes. The firewalls were back up at that point, but it quickly became clear that there were larger issues.
Whatever knocked both firewalls offline at the same time also affected a number of other machines. Most critically, GeekISP’s primary NFS server was knocked offline, and it did not recover cleanly on boot. Normally, if power is yanked on a FreeBSD server, it will boot up and do a deferred fsck, allowing the machine to resume its normal duties while the disks are checked slowly in the background. Instead, we observed the machine booting normally but hanging right after reporting its plan to defer all filesystem checks. The front LEDs showed no sign of disk activity, and a ctrl-c on both a real and a virtual keyboard did not allow the boot to proceed. Booting to single-user mode exhibited the same behavior.
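For reference, the deferred fsck behavior described above is controlled by a pair of rc.conf knobs on FreeBSD; the values shown are the stock defaults, not necessarily what this server had set:

    # /etc/rc.conf
    background_fsck="YES"        # check filesystems in the background after booting multi-user
    background_fsck_delay="60"   # seconds to wait after boot before the background check starts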
After several failed attempts to boot the machine in creative ways, we made some progress: we were able to boot it from a FreeBSD 9.0 live CD and bring it up on the network. The live CD environment didn’t exhibit any problems, and I was able to see the data on the disk just fine. This was at approximately 1am ET.
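For those wondering what “bring it up on the network” looked like from a live CD, it was a handful of manual steps along these lines; the interface name, addresses, and disk device are placeholders, not our real values:

    # Configure networking by hand (the live CD knows nothing about this box)
    ifconfig em0 inet 192.0.2.10 netmask 255.255.255.0 up
    route add default 192.0.2.1

    # Mount the data filesystem read-only to confirm the data is intact
    mount -o ro /dev/da0p2 /mnt
    ls /mnt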
For a few months prior to this event, I had been planning to upgrade the main NFS server at GeekISP. Life and other work prevented me from ever making the final switchover, but I had the replacement machine about 90% configured and seeded with about 85% of the data. It was clear that the migration was now happening immediately, so I began making a checklist and rsync’ing the data over.
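The syncing itself was nothing exotic; conceptually, each pass was something like the following, with the paths as placeholders for wherever the old data was reachable and wherever the new export lives:

    # One pass of the sync: old data (here mounted at /mnt/old) into the new
    # server's export area. Repeated until the delta was small, with a final
    # pass made after services were stopped so nothing changed mid-copy.
    rsync -aH --numeric-ids --delete /mnt/old/ /export/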
By about 2am ET web traffic and mail had been fully restored (mail was up a bit earlier actually – there is a small, separate cluster for it) and I was tidying up some of the many loose ends. At 3am, literally moments before I was going to turn in for the night, my terminals hung. No response. Both firewalls had disappeared again.
It took about 20 more minutes to get a member of the datacenter team in front of the rack, but the story was largely the same, except more widespread. This time we had lost power to 2 of the 3 power controllers in our main rack, and also to the power controller in our auxiliary rack. Three separate power controllers, each connected to a dedicated UPS (none of which were operating at more than ~50% capacity), all rebooted at approximately the same time. One power controller stayed online throughout both events. Fortunately, this time we didn’t have any boot hangs, and I was able to restore order quickly.
Running low on both brain power and phone battery, I had the datacenter crew help me reroute some power cables on the theory that we were somehow overrunning the capacity of the UPSes. The datacenter team and I were (and still are) skeptical that this is the actual cause, but there just isn’t much else that could produce the events we observed. We concluded the response at approximately 3:45am ET.