What a day.. I’ve had enough scares for one day!
May 25, 2008 at 10:55 am: Leif (Keymaster)
😯
My morning began with two e-mails from my server, sent during the night:
20080525034127 – Controller 0
WARNING – Sector repair completed: port=7, LBA=0x16700DF3

and

20080525040225 – Controller 0
WARNING – Sector repair completed: port=10, LBA=0xBD99964

This is my server:
It’s a fairly standard Pentium 4 at 3.0 GHz with 2 GB of RAM, and a fairly non-standard 21 (twenty-one) hard drives. Two RAID-5 arrays on two separate 3ware controllers: Array 0 (Controller 0) is 12x250 GB and Array 1 is 8x500 GB. The boot drive is a standard standalone SATA drive. The 8x500 GB array is new (only 2 months old), but the 12x250 GB array (11 drives + 1 hot spare) has been up and running mostly non-stop since 2005, save for the move to Thailand, and for shutting down and unplugging the server whenever I hear thunder outside.
A sector repair is something a hard drive does when it has trouble (but still succeeds in) reading a sector. It reallocates the data to a spare area it keeps for just this purpose, makes a note of it, and goes on with its business. This is basically normal behaviour – it happens once in a while; hard drives (like all electronics) are analog once you look deep enough, and they’re not perfect.
On a normal computer, an event like this would probably go completely unnoticed (unless you check your hard drives’ S.M.A.R.T. info every day), but a hardware RAID controller lets the administrator know.
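(If you want to keep an eye on this yourself without a hardware RAID controller nagging you by email, something along these lines works – a minimal sketch, assuming smartmontools is installed and the drives show up as plain /dev/sd* devices. The device list below is made up for illustration, and drives sitting behind a 3ware controller would actually need smartctl’s 3ware device-type option, which I’m leaving out here:)

#!/usr/bin/env python3
# Minimal sketch: warn when a drive's SMART reallocated-sector count is non-zero.
# Assumes smartmontools is installed; the DRIVES list is hypothetical.
import subprocess

DRIVES = ["/dev/sdb", "/dev/sdc"]  # hypothetical device list

def reallocated_sectors(device):
    """Return the raw Reallocated_Sector_Ct value from 'smartctl -A'."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])   # raw value is the last column
    return 0                               # attribute not reported

for dev in DRIVES:
    count = reallocated_sectors(dev)
    if count:
        print("WARNING: %s has %d reallocated sectors" % (dev, count))

Run something like that from cron and have it mail you, and you get a poor man’s version of what the 3ware controller does for me automatically.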
A single error (or two) like this on a drive is nothing to be worried about. I had started copying a large amount of data to the server right before going to bed, so I figured the copy simply ran across those sectors during the night.
However, a few minutes later, another email arrived, again saying it had repaired a sector on Port 7. And another. And another.
Then it went quiet for a couple of hours, and then another one came – again from Port 7. At this point it’s starting to freak me out. Sure, RAID-5 is fault tolerant (I can lose one drive and lose no data), but if I lose two, ALL the data is gone. I’m starting to consider retiring Drive 7 and switching in the hot spare drive instead. What’s got me worried is that single report from Drive 10 – one of the ones that arrived during the night. What if Drive 10 suddenly crashed while the array was rebuilding to the hot spare (Drive 11)? The server would be dead in the water, along with all the data on that array.
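(For anyone wondering why one dead drive is survivable but two is game over: RAID-5 keeps one parity block per stripe, the XOR of the data blocks. Here’s a toy illustration in Python – grossly simplified, and not how the 3ware controller actually stripes data across 12 drives:)

# Toy RAID-5 parity illustration: the parity block is the XOR of the data
# blocks, so any ONE missing block can be rebuilt, but TWO cannot.
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"DRV7", b"DRV8", b"DRV9"]   # pretend: one stripe block per data drive
parity = xor_blocks(data)            # the block a fourth drive would hold

# Drive 7 dies: rebuild its block from the surviving data blocks + parity.
rebuilt = xor_blocks([data[1], data[2], parity])
assert rebuilt == data[0]
print("Rebuilt Drive 7's block:", rebuilt)

# If Drive 8 died too, only data[2] and parity would remain; their XOR is
# data[0] ^ data[1], and neither lost block can be separated out again.

With one block missing, the XOR of the survivors gives it back exactly; with two missing, there’s simply not enough information left to recover either one.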
So, as a precaution, I copy the most important things (source code, personal documents) to the other array, disable Drive 7, and start the rebuild to the hot spare.
Everything goes well, it chugs along for 3 or 4 hours or so, until at 88%:
20080525140422 – Controller 0
ERROR – Drive timeout detected: port=1
20080525140417 – Controller 0
WARNING – Sector repair completed: port=1, LBA=0x850180D
20080525140654 – Controller 0
ERROR – Drive timeout detected: port=1
20080525141339 – Controller 0
WARNING – Sector repair completed: port=1, LBA=0x850180E

FOUR warning emails out of the blue, from a completely different drive, one that has never had a problem before. And a SERIOUS problem too – at this point the array is rebuilding, and vulnerable. Another loss and I’d lose a mountain of work, despite having the most important things backed up. Most things aren’t backed up – how do you back up a 6-terabyte server?
Major bad adrenaline rush, but at this point there’s nothing to do but let the rebuild finish – it is still running, and even though it complained, it didn’t mention any data loss. I guess it managed to read the sectors after all.
While the server finishes up, I drive into town and buy a blower (to blow out some of the dust that has accumulated in the server, since I’m gonna have to take things apart to replace the failing Drive 7 anyway) as well as a new 250 GB SATA hard drive for 2100 baht ($70). Not a bad price at all for the small town I’m in.
Coming back, the server has completed the rebuild, but is still showing an error condition on Drive 1. I restart the server once to make sure nothing new comes up, and everything is fine – even Drive 1. I figure it was a fluke, so I shut down, replace Drive 7, and run the blower to blow out some of the dust buildup.
Putting the machine together again and starting it up, I hear a couple of extra beeps during the POST (BIOS self test) which I’ve never heard before, so I turn it off and grab a monitor (this system is normally headless) to see what’s up. The next time, though, it starts without the beeps, and both RAID controllers report all drives accounted for and OK during the bootup process, so I figure everything’s fine, unplug the monitor, return it to where I usually use it, and go back upstairs to my main workstation.
However, I can’t reach the server over the network. It responds to ping, but not VNC, not SSH, and last but not least, not SMB (network drive mapping). I figure it’s something simple, so I go downstairs and plug the monitor in again.
"ext2 superblock not found" ??????????
What the bleeping bleep??
Never seen anything like this before. The server had not been turned off improperly, there was no reason for a logical failure of this kind, and the RAID controllers are both reporting "all OK"!
I reboot with Ctrl-Alt-Del. The computer (running Ubuntu Linux) starts the shutdown process but doesn’t finish. Ctrl-Alt-Del again – more things happen, but it still doesn’t finish. Finally I press reset.
Starts up again, Linux kernel gives me a cryptic hacker error message to the effect of "Uhhmm, I just got unknown NMI 21 – are you using some strange power saving mode? Confused, but trying to continue." (Yes, the message literally said something like this, including the Uhhmm).
It takes a few seconds, but after that it starts booting again.
Finally, "ext2 superblock not found" again. Oh crap.
Near panicking, I reset once again.
This time, however, I hear those extraneous beeps again — and I happen to have a monitor plugged in and ready to see the error message.
"NMI – PCI PARITY ERROR"
Wow! That’s new. Never ever had that before.
At this point, it hits me.. Could it really be…?
Days like this can drive a man to alcohol, so I hop on my scooter and drive down to the local pharmacy and buy a bottle of 70% rubbing ethyl alcohol.
However, instead of drinking it, I take apart the server completely, get a bunch of Q-tips (cotton swabs), and with those and the alcohol I clean off the cakes of moist dust that come from running a server in 90% humidity at 30 to 45 degrees Celsius, with a large fan blowing into the open case.
I’ve never had a problem before, but I figure it could be my hack job of cleaning with the blower that actually made things worse.
An hour or so later (it takes a while – a server with 21 hard drives requires at least 42 cables, half of which MUST go to exactly the right place), I put the server back together again, plug it in, turn it on, and hold my breath.
And, it boots up without a hitch!!
Thank you, oh applicable divinity!
Please, let it be a while before I have a day like this again.
Anyway, long story short – the RAID-5 systems performed just like they should – no data lost despite the imminent failure of Drive 7, and it’s back to redundant status.
Life (and work) goes on.
///Leif
P.S. Not exactly an applicable forum post, but you guys are cool, and I just had to write this story down and get it out. If I had more days like this, I’d have a blog, but thank the applicable divinity I don’t!
May 25, 2008 at 8:03 pm: Lane (Member)
I love a happy ending!
but I’d still find a way to also back up at least the essentials. 🙂
May 25, 2008 at 8:44 pm: paul.oconnell.uk@gmail.com
A day to change into the brown trousers! 😯