I had a very tense morning on Friday, what my coworker refers to as a Sphincter Moment.
On Thursday, I was doing a server migration from UNIX to Windows. The UNIX server was getting really old, and the hardware was no longer reliable.
The data transfers over without conversion, so I could simply send the files via FTP. For convenience, I usually collect the files into a TAR or gzip archive before sending them. I got people out of the system at 5 PM, made sure the previous night's backup had run OK, then created my TAR files for the transfer.
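The packaging step above can be sketched as a short script. This is a hypothetical reconstruction, not the actual commands used; the directory and archive names are illustrative, and the checksum is an extra precaution so the copy can be verified after the FTP transfer.

```shell
set -eu

SRC_DIR="appdata"            # directory to migrate (assumed name)
ARCHIVE="migration.tar.gz"

mkdir -p "$SRC_DIR"          # stand-in data so the sketch runs on its own
echo "sample record" > "$SRC_DIR/records.txt"

tar -czf "$ARCHIVE" "$SRC_DIR"       # collect and compress in one step
cksum "$ARCHIVE" > "$ARCHIVE.cksum"  # portable checksum to re-run after transfer

ls -l "$ARCHIVE"
cat "$ARCHIVE.cksum"
```

Running `cksum` again on the far side and comparing the two numbers catches a truncated or corrupted FTP transfer before you delete anything on the source.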
The FTP download wouldn't complete, so I investigated. The UNIX time was off by 80 minutes (no one had applied the new timezone patch), and restarting TCP didn't seem to help. System uptime was 287 days, so I figured, "Well, I have two backups, I'll correct the time, reboot and see if that helps."
I flushed the disk controller's cache to disk via the sync command, then issued a shutdown and restart command. Twenty minutes later, I still couldn't log in and couldn't ping it locally... and my on-site contact answered the phone but had just left (argh).
The next morning, the client discovered that the server room was locked and nobody had the key. Once the vice president who had the key unlocked the server room, he turned the machine back on, followed some onscreen instructions, and we were back in business.
I went to my backup folder to try a transfer before they got busy... and my files were gone. The timezone was still wrong, and some other files and changes I had made the night before were missing, too. I checked the MAC address to make sure I was on the right system, but it was as if someone had erased all my work. "Weird," I thought, "someone deleted my changes? Oh well, I'll re-do the backup."
Then came the call, "Hey, Lee, we have nothing in our dispatch board. No appointments, period." Then the call from accounting that the fiscal year was wrong. And as I looked at the system logs, I noted that there was a huge gap between 09/25/2008 and 03/27/2009, as if the server had been off for six months. Even our databases were the same way... no data was entered after the morning of 09/25/2008.
I restored the backup from tape (as my TAR backups disappeared with everything else), and the client lost a day of work. Fortunately, it was a light day and they could re-input their data quickly. Of the IT director, the CIO, the CTO, my manager, and myself, I was the most bent out of shape over the whole thing.
We pieced it together after the fact. The IT director was new; the previous one had left without warning and without leaving any passwords or configuration information. He knew the UNIX server was having problems, but not what problems.
Nobody was monitoring the RAID array, a RAID-1 on an LSIL on-board controller. The LSIL has a neat feature: if the array is broken, both the online drive and the offline drive retain their data, and the array can be rebuilt from either one. This lets you build an array from existing drives without losing the data on the drive you copy from.
Well, DISK 0 probably went offline in 09/2008, while DISK 1 kept working. When the server restarted, it either 1) asked which drive to set as primary and defaulted to the old data on DISK 0, or 2) defaulted to DISK 0 without asking. In either case, the drive from 09/2008 became the primary, thus causing a six-month black hole.
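The failure sat unnoticed for six months because nothing was watching the array. The LSIL controller's own tooling varies, so as an illustration of the principle, here is a minimal sketch against Linux software RAID's `/proc/mdstat` format instead: parse the member-status field and alert when a disk has dropped out. The function and file names are hypothetical, and the demo feeds it captured snippets rather than a live `/proc/mdstat`.

```shell
check_mdstat() {
    # A healthy two-disk RAID-1 shows "[UU]"; an underscore marks a
    # failed or missing member, e.g. "[U_]".
    if grep -q '\[U*_' "$1"; then
        echo "DEGRADED"
    else
        echo "OK"
    fi
}

# Demo with captured snippets instead of the live /proc/mdstat:
printf 'md0 : active raid1 sda1[0] sdb1[1]\n      [UU]\n' > healthy.txt
printf 'md0 : active raid1 sda1[0]\n      [U_]\n' > degraded.txt

check_mdstat healthy.txt
check_mdstat degraded.txt
```

Dropped into a daily cron job that mails on "DEGRADED", a check like this turns a silent six-month gap into a same-day page.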
The moral of the story? Make sure you have backups, make sure you have OFFLINE backups, and make sure those backups can be accessed and restored -- test your backups! Thank God Almighty that the backup restored OK, or that client would be out of business within months. *whew*
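The "test your backups" moral can be made concrete with a trial restore: extract the archive into a scratch directory and compare it file-by-file against the source before you trust it. A minimal sketch, with illustrative names throughout:

```shell
set -eu

mkdir -p live
echo "dispatch board data" > live/dispatch.txt   # stand-in for real data

tar -czf backup.tar.gz live              # take the backup

mkdir -p restore_test
tar -xzf backup.tar.gz -C restore_test   # trial restore into a scratch area

# diff -r exits non-zero if any file differs or is missing
if diff -r live restore_test/live >/dev/null; then
    echo "backup verified"
else
    echo "backup FAILED verification" >&2
fi
```

For tape, the same idea applies: periodically restore to spare disk and diff, since a backup that has never been restored is only a hope.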