pc issues

Danika has both a laptop and a desktop computer she uses at the house. Mostly she uses her laptop and when she needs to use her desktop she remotely logs in to it. This is partly because the monitor I hooked up to her desktop quit working shortly after I hooked it up. The other day she noticed that she could no longer connect to her desktop, so she checked it out and it was making some odd noises and appeared to just be rebooting and rebooting. She asked if I would hook a working monitor up to it so we could see what the problem was. I lugged in the large CRT from the storage area under the house and hooked it up.

When it started booting, I saw something I had never seen before -- the SMART monitoring system on the hard drive was reporting errors and the BIOS was not letting the system boot without user intervention. There was a message on the screen that said "SMART Failure Predicted on Primary Master : WARNING: Immediately back-up your data and replace your hard disk drive. A failure may be imminent." Pressing F1 (to continue) caused Windows to start booting, but it eventually hit a blue screen and rebooted.

I have done some hard drive restoration in the past, so I thought I would do some diagnostics to see if I could figure out how to get her PC up and running again. The SMART messages the BIOS was reporting led me to believe the hard drive was in bad shape. I booted the box with a Parted Magic live CD (see my other post about what a great tool this is) and started up the SMART diagnostic tools. I ran an extended self-test and it reported raw read errors. At that point I decided I should just try to copy what I could from the drive to another drive and see if I could get the system to boot from the replacement drive.
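For anyone who prefers the command line to the graphical tools, the same kind of check can be sketched with smartctl from Smart Monitoring Tools. This is only a rough outline, not a record of what I ran: the function names are made up for illustration, and the device name you pass in (for example /dev/hda) depends on your system.

```shell
# Hypothetical helpers wrapping smartctl; pass the drive as the first
# argument, e.g. /dev/hda (the device name is an assumption).

smart_start_long_test() {
    # Start the extended (long) self-test; the drive runs it in the background.
    smartctl -t long "$1"
}

smart_report() {
    # Overall health verdict (PASSED / FAILED).
    smartctl -H "$1"
    # Vendor attribute table, which includes the raw read error rate.
    smartctl -A "$1"
    # Log of completed self-tests and where any of them failed.
    smartctl -l selftest "$1"
}
```

Run smart_start_long_test first, wait for the estimated time smartctl prints, then run smart_report to see the result. Both usually need root.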

The bad drive was one that I had originally purchased along with three others just like it (which I had used to build a RAID array). I found one of the other drives that was not in use and ran some diagnostics on it to determine its health (I figured there was no sense copying the bad drive to a drive that was also about to fail). I connected the replacement drive to my desktop (via a USB-to-IDE adapter) and started running badblocks on it. I was not too happy when badblocks reported errors. However, I had seen badblocks report errors through this same USB-to-IDE adapter before on drives that further diagnostics showed to be fine, so I figured I should run some more diagnostics on the drive. I removed the bad drive from Danika's desktop and put the replacement drive in it. I fired up the Parted Magic live CD again and ran an extended self-test on the replacement drive. The extended self-test reported a PASS when it completed. At that point I decided the replacement drive was healthy enough to start copying over the data from the bad drive.

I did some searching and found a package called g4u (ghosting for unix). g4u can be used to create an image of a hard drive and send it to another system on the network that has an FTP server running on it. It looks to be a collection of software and scripts that wrap dd, ftp, ssh, etc. and make it easy to save drive images somewhere on the network. I got everything set up and started a backup of the bad drive. I left it to run overnight, and when I checked on it in the morning, it had failed when dd got to the bad sectors on the drive. I hadn't used dd enough to realize that by default it stops when it encounters I/O errors. I did a quick scan of the dd man page to figure out how to tell it to "ignore" errors and did not see an option for it. Next, I did a quick Google search and found a program called ddrescue. ddrescue appears to have been created because of the stop-on-error limitation of dd. Later, I checked the dd man page again and discovered how to get dd to continue past errors: you add the conv=sync,noerror option.
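Here is a small sketch of what that dd invocation looks like. To keep it safe to try, it copies between two scratch files instead of real drives, so no read errors actually occur; the file names and the 512-byte block size are my assumptions, not something g4u uses.

```shell
# Scratch files stand in for the drives. On real hardware SRC and DST
# would be block devices (and DST would be overwritten completely).
SRC=$(mktemp)
DST=$(mktemp)

# Eight 512-byte "sectors" of sample data.
head -c 4096 /dev/urandom > "$SRC"

# noerror: keep going after a read error instead of stopping.
# sync:    pad a short or failed input block with zeros up to bs,
#          so everything after it stays at the same offset.
dd if="$SRC" of="$DST" bs=512 conv=noerror,sync 2>/dev/null

cmp "$SRC" "$DST" && echo "copy matches source"
```

A small block size matters here: with sync, a block that cannot be read is replaced by a whole block of zeros, so a large bs throws away more good data around each bad sector.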

I checked the Parted Magic documentation and discovered ddrescue was already part of the live CD. I shut down the system, connected the replacement drive (leaving the bad drive connected as well), and booted Parted Magic. I did a quick search for the ddrescue man page and determined that all I needed to run was: ddrescue /dev/hda /dev/hdb, where /dev/hda was the bad drive and /dev/hdb was the replacement drive. It started up and showed its status as it ran. It ran for just over an hour and went all the way to completion. It reported 6 errors and 37,888 bytes of bad data. I shut the PC down, disconnected the bad drive, changed the replacement drive's jumper to master, took a deep breath, and powered on the PC.
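To put the 37,888 bytes in perspective, here is a quick bit of shell arithmetic. The 512-byte sector size is my assumption (it is the usual size for IDE drives of this vintage); ddrescue itself only reported the byte count.

```shell
bad_bytes=37888
sector_size=512   # assumption: 512-byte sectors

# Whole sectors lost, plus any leftover bytes that do not fill a sector.
echo "$((bad_bytes / sector_size)) sectors, remainder $((bad_bytes % sector_size))"
# prints "74 sectors, remainder 0"
```

So under that assumption the unreadable data comes out to exactly 74 sectors, spread over the 6 error areas ddrescue reported.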

On boot up the BIOS no longer reported any errors and Windows started booting. Windows booted into the "Last time I flaked out, so do you want to boot me into safe mode this time" screen. I booted into safe mode and noticed that it copied a bunch of drivers as it booted (I don't use Windows enough to know whether or not this is normal) and eventually brought up Windows. When it finished booting, I rebooted it into "normal mode" and everything appeared to be working. I checked that the network connection was up and then let Danika know she could try to remotely connect to it again. She was able to, so at that point I considered it a success. I suggested she run chkdsk (the Windows version of fsck) on both of the partitions to let Windows attempt to fix any of the holes in the data. So far she has not run into any issues with the system.

I gotta say, open source software really kicks ass!

References:
ddrescue - www.gnu.org/software/ddrescue/ddrescue.html
g4u - fehu.org/~feyrer/g4u/
Parted Magic - partedmagic.com
Smart Monitoring Tools - smartmontools.sourceforge.net

The SMART errors being reported on the BIOS screen

The output after running ddrescue