One day an erring electrician cut the power to the entire server room at the ISP I was running.
Bringing it all back up was a bitch.
At the time, we had, oh, at least 24 ISDN-capable dual-T1 modem boxes (the Ascend Max 4048), serving 48 simultaneous customers each - and costing 20k each! Even at that cost we were desperate for the Max TNT, which they kept promising to ship RSN.
There was a pair of dns servers, a pair of radius servers, a db server, a couple netnews servers, a monitoring box and a ton of boxes providing web sites for the booming website business.
We had been running flat out, facing enormous growth, tripling in size every month for 9+ months running, and were totally unable, really, to do anything other than keep slamming hardware in and hanging on for our lives.
So the room rebooted… every box checked its hard disks… the dns servers came up, the netnews servers came up… but what came up first was the outward-facing modems, and therein lies this story.
At the time we were running at 100% - every modem port utilized for 18 hours out of the day - returning a totally unknown number of busy signals to our customers because, in part, we had used up all the T1 capacity BellSouth had planned for the year, in January… and it was, like, March!
… and partially, because people were really, really digging the Internet.
Like BellSouth, we'd made a gross misestimate of the number of minutes our users would spend online, and had got the ratio of active ports to active customers terribly, terribly wrong. BellSouth used 3 minutes in their calculations for buying ISDN switch hardware, as that was the average length of a normal phone call. We started with an estimate of about 40 minutes per call, which was off by, I don't remember how much, I think a factor of 3, the wrong way. We had also got the ratio between customers and ports wrong: we had planned on 20, or even 40, to 1, and it had turned out to be closer to 10 to 1. The marketing department wasn't going to stop selling new customers, BellSouth wasn't going to add more capacity, and we were screwed.
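The back-of-envelope math here is worth spelling out. Using the rough numbers in this story (all half-remembered, so treat them as illustrative), a sketch of how badly the ratio misestimate bites:

```python
# Illustrative port math using the hedged numbers from the story above.
# None of these figures are exact; they're the rough values as recalled.

customers = 12000            # roughly the subscriber base mentioned
planned_ratio = 20           # planned customers per port (20:1, or even 40:1)
actual_ratio = 10            # what it turned out to be, closer to 10:1

planned_ports = customers / planned_ratio        # ports we thought we'd need
actual_ports_needed = customers / actual_ratio   # ports the real ratio demanded

# The installed fleet: 24 Ascend boxes * 48 ports each
installed_ports = 24 * 48

print(planned_ports, actual_ports_needed, installed_ports)
# → 600.0 1200.0 1152
```

At a 20:1 plan, 600 ports would have been plenty; at the real 10:1 behavior, even 1152 installed ports fell short - which is why every port was full 18 hours a day.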
So the server room rebooted. And the access gateways all came up first - and all our customers dialed in at approximately the same time. 1152 people (out of, like, 12,000) would win the battle to get that precious sweet sound of a modem going brssssshheeeewinebeepbeepbeep, and then they'd attempt to log in.
The database engine collapsed. Never before had it been asked to handle that many queries in so few minutes. Worse, that box took longer to boot (by many minutes) than anything else.
So the first to collapse were the radius servers. I seem to recall a load average of something like 211 as they endlessly retried to get a response from the db server. You couldn't log in; you couldn't do anything to see what was really going on with them. And then the db server would come up - and be instantly slammed with a thousand requests, and go down. And then the load average on the radius servers would climb again…
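That loop - everyone retrying in lockstep, knocking the db server back over the moment it recovered - is a classic retry storm. The standard cure is exponential backoff with jitter, so clients desynchronize instead of hammering a recovering server simultaneously. A minimal sketch in Python (the function is hypothetical, not anything from the actual radius code):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=6):
    """Exponential backoff with full jitter: each retry waits a random
    amount up to base * 2**attempt (capped), so a fleet of clients
    spreads its retries out instead of retrying in lockstep."""
    delays = []
    for attempt in range(attempts):
        delays.append(random.uniform(0, min(cap, base * 2 ** attempt)))
    return delays

# Each client draws its own schedule; two clients that failed at the
# same instant will almost never retry at the same instant again.
print(backoff_delays())
```

Had the radius retries backed off like this, the db server would have seen a trickle it could absorb on recovery, rather than a thousand simultaneous requests.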
Fixing the db server took some time (we had cruddy filesystems then too), and the interdependency of the radius servers with the db was misunderstood, so there were hours and hours where stuff got power cycled, nothing worked, everything had to be fixed again, and so on.
I was on vacation at the time.
In the end… I called in and had to describe exactly which boxes, in which order, to manually power cycle: bring up a box, wait for the modem ports to fill up, wait for the load on the radius and db servers to go down, then bring up the next box. Once we worked out the right procedure, I think it took about 40 minutes to reboot the whole place.
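The procedure we converged on amounts to a staged bring-up: dependencies first, then the load sources one at a time, gating each step on the previous one settling. A sketch, assuming hypothetical box names and a stand-in health check (the real check was a human watching load averages and port fill):

```python
import time

# Dependencies first, then the access gateways one at a time.
# All names here are illustrative stand-ins, not the real hostnames.
BOOT_ORDER = [
    "db-server",                              # up before anything queries it
    "radius-1", "radius-2",                   # auth, once the db can answer
    "dns-1", "dns-2",
    *[f"max-{i}" for i in range(1, 25)],      # 24 gateways, staged
]

def is_healthy(box):
    # Placeholder: in reality, "ports filled up and load came back down."
    return True

def staged_bringup(boxes, settle_seconds=1):
    for box in boxes:
        print(f"power-cycling {box}")
        # power_cycle(box)  # hypothetical remote power control
        while not is_healthy(box):
            time.sleep(settle_seconds)
        # Each gateway's dial-in surge drains before the next adds load.

staged_bringup(BOOT_ORDER)
```

The key design point is the gate between steps: each box's login surge is absorbed by the already-healthy db and radius servers before the next surge starts, instead of 1152 modems slamming a cold database at once.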