May 19, 2003

The Shark Tank

I enjoyed reading the Shark Tank archives this morning. They remind me of Dilbert, when it was still funny. True stories of IT screw-ups.
Posted by bret at 12:25 PM | Comments (0)

May 01, 2003

Disk Drive Woes and Backup Software

Two nights ago, my laptop started acting up. Processing slowed to a crawl and the hard drive seemed to be the culprit. The hard drive light was on almost constantly and the drive didn't sound right. I was up late trying to figure out the source of the problem and see what i could do to fix it. One reason for my worry was that i hadn't backed up recently.

At one point i decided to see if my hard drive was seated properly, so i removed it and reinstalled it. For a while thereafter, the system was working correctly. And then the problem came back.

I'd seen a similar problem years ago on a laptop running NT (my current laptop runs 2000). Back then, i eventually corrected the problem by uninstalling zip disk drivers for a parallel port zip disk. They seemed to interfere with the atapi controller when the zip disk wasn't present.

After a late and depressing night with little luck, my laptop worked fine the next morning. I was now able to check the event log, which contained numerous instances of this error:

Event Type:	Error
Event Source:	atapi
Event Category:	None
Event ID:	9
Date:		4/30/2003
Time:		1:49:15 AM
User:		N/A
Computer:	LAKATOS
Description:
The device, \Device\Ide\IdePort0, did not respond within the timeout period.

Although this looks like the zip disk error, i've decided that the problem must have been due to overheating. It hasn't happened since, and i'm supposing that removing the hard drive briefly helped because it opened up the case and helped cool the system down. I do remember it being rather hot. I haven't made any configuration changes recently.
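When an error like this shows up "numerous" times, it can help to tally the records instead of eyeballing them. The following is a minimal sketch, assuming the events have been copied out of the Event Viewer as text in the "Key: value" layout shown above. The function names `parse_event` and `count_by_source` are made up for illustration and are not part of any Windows tool.

```python
from collections import Counter

# One record in the layout shown above (raw string keeps the backslashes intact).
SAMPLE = r"""Event Type: Error
Event Source: atapi
Event Category: None
Event ID: 9
Date: 4/30/2003
Time: 1:49:15 AM
User: N/A
Computer: LAKATOS
Description:
The device, \Device\Ide\IdePort0, did not respond within the timeout period."""


def parse_event(text):
    """Parse one copied event record into a dict of field name -> value."""
    fields = {}
    lines = text.strip().splitlines()
    for i, line in enumerate(lines):
        # Split on the first colon only, so "1:49:15 AM" stays in one piece.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "Description":
            # Everything after the "Description:" line is the message body.
            fields[key] = " ".join(l.strip() for l in lines[i + 1:])
            break
        fields[key] = value.strip()
    return fields


def count_by_source(records):
    """Count records per (source, event ID) pair."""
    return Counter((r.get("Event Source"), r.get("Event ID")) for r in records)


# Stand-in for a log with several copies of the same error.
events = [parse_event(SAMPLE) for _ in range(3)]
print(count_by_source(events))
```

A tally like this does not explain the timeouts, but it makes it easier to see whether they cluster around one source and one time window.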

So when the system was working again, i first did my normal backup. This uses Retrospect to back up my data partition, which holds all of my data files: my email, bookmarks, notes and drafts of presentations and articles, client notes, and various coding projects. I also backed up a client project that was on a different partition.

The night before this error, i'd been up researching Ghost 2003. I'd used Ghost 2002 before and had been unhappy with its support for peripheral devices. But i found a report that 2003 apparently had much better support, including support for USB.

So i went to the store and bought Ghost 2003 and an 80 GB USB hard drive to back up to. (Normally i back up to another hard drive on my network, but it doesn't have enough space for a full system image.)

The hard drive hooked up without a problem. But then i ran into a problem with Ghost. When i told it to do a backup, it told me that it was going to reboot the system, and then the following error appeared instead:

Unable to successfully defragment the Virtual Partition file. This is probably due to your disk being too fragmented, we recommend that you run the Windows Disk Defragmenter and run this task again. See your Windows documentation for more details.

I didn't understand this message. I know how to defragment a partition, but which partition? Of course the Windows documentation wasn't going to tell me. Eventually i decided to check the Symantec website, and i soon found a technical bulletin that provided more information about this error. Note that it details eight possible causes, only one of which actually pertains to defragmenting.

Some of these didn't apply to my situation. I tried the ones that did apply and they didn't help. Remember, i'm worried that my hard disk problem may recur at any moment and am eager to get the entire disk imaged before that happens.

Earlier i had called IBM support about my problem with my Thinkpad. The tech i'd spoken to confirmed i was still covered by warranty. He said that if it was a heat problem, the cause was more likely the fan than the hard drive, and he showed me where to get a hard drive test if i wanted to check it out (i did: it passed). He said that he couldn't help me much since the problem wasn't currently occurring. But i knew that once it recurred, there'd be little they could do, as the system would be unusable. He did let me know that they were available 24 hours a day. And he warned that they might need me to send my laptop in. That'd take three to five days, which i really didn't want to happen.

I needed to get my image and get back to work, so i decided to spring for another $30 to call Symantec tech support. After i handed over my credit card number and told the tech what i'd done so far, he proceeded to go through the tech bulletin step by step with me, making sure i'd tried each bit. This was frustrating for me. One of the items said that we had to make sure i didn't have a dynamic drive. What is that? I don't think he knew, but i read him some information from the device manager and we went on. (Later i found a description in Ghost's manual: it is a RAID or other compound disk. Not what i have. Symantec was seriously negligent to charge me $30 to speak to a tech who hadn't read their manual. "Dynamic disk" appears in both the table of contents and the index.)

Near the end of the list, he was trying to give me a list of things to try and then call back, and i was trying to keep him on the line while i did them. I think both of us were pretty sure they weren't going to work. But his will prevailed. When the other items didn't work, i called back, steered myself through the phone maze and got another tech. (This was part of the same case, so i didn't have to pay again; the first tech had made this point clear. At first i was annoyed; but then i realized it was his way to get me to someone more knowledgeable.)

I summarized what i had done to the new tech. One of the items said that you could get this error if you already had four primary partitions. I'd gone through this with the first tech. This was actually one of the items that i'd wanted some advice on. I did have four partitions, but some weren't primary. Or at least that is what i'd thought. I told the new tech about this and that i had pulled up Partition Magic with the previous tech. That was it, she said. "Ghost wasn't compatible with Partition Magic." OK, that explains it. But why did i have to pay $30 to learn this? I'd read the doc and this wasn't stated anywhere. "Let me double check that." Then she got back to me and said that wasn't really true. I did tell her that i had been able to create a Ghost boot disk and run the DOS version successfully. She encouraged me to try that, and assured me that it had the same features as the Windows version.
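On the four-primary-partitions question: on a classic MBR disk, the partition table has exactly four slots. Logical partitions live inside an extended partition and do not use slots of their own, but the extended container itself does take one. Here is a minimal sketch of counting free slots from the first sector of a disk, assuming the standard MBR layout (four 16-byte entries starting at byte 446, type byte at offset 4, signature 0x55AA at byte 510). It is not how Ghost or Partition Magic inspects a disk, and `demo_sector` is a made-up helper that builds fake sectors so the example runs without touching a real drive.

```python
import struct

# Partition type IDs that mark an extended (container) partition.
EXTENDED_TYPES = {0x05, 0x0F, 0x85}


def read_slots(sector):
    """Return (index, kind, start_lba, sector_count) for the four MBR slots."""
    if len(sector) != 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a 512-byte MBR sector with the 0x55AA signature")
    slots = []
    for index in range(4):
        entry = sector[446 + 16 * index: 446 + 16 * (index + 1)]
        type_id = entry[4]
        # Bytes 8-11: starting LBA; bytes 12-15: sector count (little-endian).
        start_lba, sector_count = struct.unpack_from("<II", entry, 8)
        if type_id == 0x00:
            kind = "empty"
        elif type_id in EXTENDED_TYPES:
            kind = "extended"
        else:
            kind = "primary"
        slots.append((index, kind, start_lba, sector_count))
    return slots


def free_slots(sector):
    """Count the unused entries in the partition table."""
    return sum(1 for _, kind, _, _ in read_slots(sector) if kind == "empty")


def demo_sector(type_ids):
    """Hypothetical fixture: build a fake MBR sector with the given type IDs."""
    sector = bytearray(512)
    for index, type_id in enumerate(type_ids):
        if type_id:
            base = 446 + 16 * index
            sector[base + 4] = type_id
            struct.pack_into("<II", sector, base + 8, 63 + index * 1000, 1000)
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


# Three primaries plus an extended container: every slot is taken.
full = demo_sector([0x07, 0x07, 0x0B, 0x0F])
# One primary plus an extended container holding the logical partitions.
roomy = demo_sector([0x07, 0x0F, 0x00, 0x00])
print(free_slots(full), free_slots(roomy))
```

So "four partitions, some not primary" can still leave free slots, or not, depending on how many of them are logical partitions inside the extended one.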

After talking with her, i'd decided that Partition Magic must have been part of the problem despite her waffling on the matter. Clearly, it would put them in an embarrassing position to admit incompatibility. PowerQuest, the publisher of Partition Magic, also makes Drive Image, Ghost's principal competition. Admitting incompatibility could be the basis for an antitrust lawsuit. Nonetheless, she'd explained more about how Ghost worked. The virtual partition was Ghost's method of transferring control to the DOS program that actually did the heavy lifting. I could use a boot disk to do the same thing. I looked again at how Partition Magic had allocated the space on my drive. Everything was allocated to one partition or another. I thought that maybe if i could make some unallocated space, Ghost could make use of it. I was working on the supposition that there actually was some unadmitted incompatibility between Ghost and Partition Magic, but that it might be restricted to the default usage of Partition Magic.

So i went in to Partition Magic and tried to free up some space. And then i realized my mistake.

Partition Magic rebooted and then started moving my partitions. This was going to take a while -- over an hour. And if my hard drive started acting up, i was going to be doomed. The partition would be scrambled. Of course Partition Magic recommends that you back up your system first. But then i was repartitioning as an attempt to get my backup software to work....

I was lucky, and nothing bad happened. But the repartitioning didn't fix the Ghost problem. So should i use Ghost from DOS or should i return it (it has a 60-day money back guarantee) and get Drive Image, which surely is compatible?

I read the reviews for Ghost and Drive Image at CNET.com. Both had a lot of griping, although Drive Image got more. Many Ghost reviewers said that the DOS usage was a strength. My original reason for getting it was not so much for backing up my laptop, but to help with setting up test configurations on some lab machines i'm getting. So i'm going to stick with it for now.

I tell this long tale partly to get it off my chest, but also because i think that good testers can learn from bugs, and can learn how to find similar problems in other products.

The first bug is the one i remembered from long ago: the incompatibility with the zip disk driver. Sadly it wasn't the first time i'd seen a critical zip disk bug. If you press the eject button and then insert a different zip disk, it won't realize you've changed disks and will write the file catalog of the first to the second, causing both to be worthless. Their advice: don't press the eject button. The eject button controls the firmware on the drive; there is no reason why they couldn't have disabled the button in situations like this. This second problem led me to give up on the zip disk; i haven't used it since. What good is a backup system that corrupts data and incapacitates your system? (Anyone want a parallel port 100 MB zip drive? I have two sitting in a box.)

Back to the tester. Imagine you're a tester working for Iomega. I suspect that the eject bug was found in testing and that it was a poor management decision that led it to be shipped the way it was. But i can imagine that the incompatibility was missed. Running a drive off a parallel port is a bit of a hack, but they should have known that and focussed testing effort on it. But how much testing would they have done in a configuration that had the drivers installed, without the zip disk attached? And i don't know what the contributing factors were for the problem. It happened when i was connected to a client's network: this may have been a contributing factor. At the same time, NT wasn't really designed for removable peripherals, yet the Zip Drive was sold as such. So it's a weak spot that their testers could have reasonably focussed on. I guess the general rule that testers could learn from this is to test incomplete-but-valid systems.
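One way to make "incomplete-but-valid" concrete is to enumerate the configuration space and pick out the combinations a user can legitimately end up in even though a piece is missing. The sketch below is a made-up example: the three factors (driver installed, device attached, on a network) come from the story above, but the classification rules and names are assumptions for illustration, not Iomega's actual test matrix.

```python
from itertools import product

FACTORS = ("driver_installed", "device_attached", "on_network")


def configurations():
    """Yield every on/off combination of the three factors."""
    for values in product([False, True], repeat=len(FACTORS)):
        yield dict(zip(FACTORS, values))


def classify(cfg):
    """Label a configuration (assumed rules, for illustration only)."""
    if cfg["driver_installed"] and cfg["device_attached"]:
        return "complete"
    if cfg["driver_installed"]:
        # Driver present, device unplugged: a normal state for a removable drive.
        return "incomplete-but-valid"
    if cfg["device_attached"]:
        return "unsupported"  # device plugged in with no driver
    return "not-installed"


# The cases that are easy to skip when testing only the "happy" setup.
easy_to_miss = [c for c in configurations() if classify(c) == "incomplete-but-valid"]
for cfg in easy_to_miss:
    print(cfg)
```

With three factors this yields two such configurations (with and without the network), which is a small enough list to run by hand.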

(The fact that Iomega may be a source of buggy software raises another concern: Ghost's DOS boot disk uses Iomega drivers.)

I wish i knew the cause of the atapi problem that i ran into the other night. But i don't, so it's hard to conjecture as to what the source of the problem may have been and thus what a tester could have done to find the problem.

The Ghost problem is something else. I suspect that if the testers had tested Ghost with Partition Magic, they would have been able to find this problem. Or they'd be able to at least advise customers regarding how to make them work together. The documentation for Ghost makes little mention of other utility software: all i could find was a reference to a potential conflict with GoBack. The fact that some of their techs had been told that they weren't compatible means that they have reason to test further.

From what i have seen, this problem isn't necessarily Ghost's fault. Both Ghost and Partition Magic are doing unusual things with partitions. Unlike the Ghost documentation, Partition Magic (7.0) includes an appendix on compatibility with other utility software, including GoBack. It even has a section on Norton Utilities; but it makes no mention of Ghost. Did they test it and not find problems? The advice to use Ghost only in DOS mode is similar to the kinds of workarounds they mention for other products; so if they knew it, i would think they would have documented it.

Maybe i don't really understand the source of this problem with Ghost after all. I can say, however, that i've learned that that error message isn't accurate. It says the problem is with fragmentation, yet the technical note acknowledges that there are many other potential causes. I'm guessing that the message really indicates that it can't create a virtual partition. But i shouldn't have to guess. Perhaps error messages should only indicate what the software can't do without speculating as to the reasons why.
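As a rough illustration of that last idea, an error type can keep "what could not be done" separate from "why", and label any causes it did not actually verify as guesses. This is a minimal sketch with made-up names (`OperationFailed`, `known_cause`, `possible_causes`); it is not how Ghost builds its messages.

```python
class OperationFailed(Exception):
    """States what could not be done; causes are reported separately."""

    def __init__(self, operation, known_cause=None, possible_causes=()):
        self.operation = operation
        self.known_cause = known_cause
        self.possible_causes = tuple(possible_causes)
        super().__init__(self.render())

    def render(self):
        # Lead with the failed operation, which the software knows for certain.
        lines = [f"Unable to {self.operation}."]
        if self.known_cause:
            lines.append(f"Cause: {self.known_cause}.")
        elif self.possible_causes:
            # Unverified causes are listed as possibilities, not as the diagnosis.
            lines.append("The cause could not be determined. Possible causes:")
            lines.extend(f"  - {cause}" for cause in self.possible_causes)
        return "\n".join(lines)


error = OperationFailed(
    "create the virtual partition",
    possible_causes=[
        "the disk is heavily fragmented",
        "the disk already has four primary partitions",
        "the disk is a dynamic disk",
    ],
)
print(error)
```

A message in this shape would have pointed straight at the virtual partition instead of sending the reader off to defragment something.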

From this, testers can learn that they should not only try to trigger all error messages (a standard technique), but also see if there are alternate ways to trigger a message. Is the error message accurate in its description of the error condition? It would help me now if i had a better description. Note that the technical bulletin lists several possible causes (and solutions) but does not provide a better explanation of quite what the error message should be understood to indicate.
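A table-driven check is one way to put this into practice: list every known way to trigger a failure, then verify that the message shown for each trigger actually mentions the condition that caused it. The sketch below uses two toy stand-ins for a product under test. `start_backup` and `misleading_backup` are hypothetical functions and their messages are invented; only the three trigger conditions are taken from the story above.

```python
def start_backup(fragmented=False, primary_partitions=1, dynamic_disk=False):
    """Toy product: returns an error message naming the real cause, or None."""
    if dynamic_disk:
        return "Unable to create the virtual partition: dynamic disks are not supported."
    if primary_partitions >= 4:
        return "Unable to create the virtual partition: no free primary partition slot."
    if fragmented:
        return "Unable to create the virtual partition: the disk is too fragmented."
    return None


def misleading_backup(**conditions):
    """Toy product that blames fragmentation for every failure."""
    if start_backup(**conditions) is not None:
        return "Unable to defragment the virtual partition file: the disk is too fragmented."
    return None


# Each row: conditions that trigger a failure, and a keyword an accurate message must contain.
TRIGGERS = [
    ({"fragmented": True}, "fragmented"),
    ({"primary_partitions": 4}, "primary partition"),
    ({"dynamic_disk": True}, "dynamic"),
]


def inaccurate_messages(product, triggers):
    """Return the triggers whose message is missing or does not mention the real cause."""
    failures = []
    for conditions, keyword in triggers:
        message = product(**conditions)
        if message is None or keyword not in message:
            failures.append((conditions, message))
    return failures


print(len(inaccurate_messages(start_backup, TRIGGERS)))
print(len(inaccurate_messages(misleading_backup, TRIGGERS)))
```

The keyword match is crude, but it is enough to flag a message that gives the same explanation for unrelated causes.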

A final note. I used to test backup software -- as a tester for a product that is no longer on the market. I probably made all of the testing mistakes discussed here. I'm also a bit of a backup software geek and probably use the software in atypical situations.

Posted by bret at 02:12 PM | Comments (3)