Thomas Hawk is trying a tech support experiment, in which he posts problems with his PC and then requests help. Instead of posting in his comments, I’m going to cover one of his problems here:
Problem number 1. My computer seems to be inexplicably freezing up (yes it’s a Windows machine, I know, I know, get a Mac) periodically. These are really bad freeze ups. Control-alt-delete does not return my PC. I can’t alt tab. Total freeze up. The only way to get my computer back is to restart. The last time it happened I had Pandora on in the background (but this is probably just coincidence) the music even stops and stutters as the freeze happens. The most recent thing I’ve installed is Windows new Live One Care. My next step is going to be to uninstall Live One Care and see if that helps me out at all.
By a curious coincidence, the same thing has happened to me within the past two weeks. Based on the symptoms, Thomas’s problem has nothing to do with software and everything to do with hardware. Here’s my story, and how I resolved it.
I have a Dell PowerEdge 600SC server running Windows Server 2003. It’s about 3-1/2 years old, and it has been running nonstop with virtually no problems for all that time. Over the years, I’ve added some big hard drives, and about a year and a half ago I replaced the original 2GB of RAM with 4GB so I could run multiple virtual machines on this box.
For the past month or so, this system has been responding slowly on some activities, especially file copies over the network. Then, about two weeks ago, the server froze up one day. Simply stopped responding. The power was still on, but the screen was black and the system didn’t respond to mouse input. I pressed the power button to restart, and when it came back on, I checked the System log in Event Viewer to see if there were any events captured there that might shed light on the error. Nope. Every recorded system event up until the crash was perfectly normal.
(Note to Thomas: Be sure to check Event Viewer. From Control Panel’s Classic view, double-click Administrative Tools, then double-click Event Viewer.)
The fact that there were no events listed is actually a crucial piece of troubleshooting information. It means that whatever happened came as a complete surprise to the Windows code that runs in kernel mode and supervises the whole system. Essentially, it means Windows was mugged.
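To make that reasoning concrete, here is a minimal Python sketch of the same check done by hand in Event Viewer: look for warnings or errors logged shortly before the freeze. Everything in it is an assumption for illustration. The record layout (`time`, `level`, `source`), the function name `suspicious_events`, the 30-minute window, and the sample timestamps are made up; it does not read a real event log.

```python
from datetime import datetime, timedelta

def suspicious_events(events, crash_time, window_minutes=30):
    """Return Warning/Error events logged in the window before crash_time.

    `events` is a list of dicts with hypothetical keys 'time' (datetime),
    'level' (str) and 'source' (str), e.g. built from an exported log.
    """
    start = crash_time - timedelta(minutes=window_minutes)
    bad_levels = {"Warning", "Error"}
    return [
        e for e in events
        if e["level"] in bad_levels and start <= e["time"] <= crash_time
    ]

# Made-up sample data: nothing but routine Information events before the freeze.
events = [
    {"time": datetime(2006, 1, 1, 9, 0), "level": "Information", "source": "Service Control Manager"},
    {"time": datetime(2006, 1, 1, 9, 40), "level": "Information", "source": "Service Control Manager"},
]
crash_time = datetime(2006, 1, 1, 9, 45)

found = suspicious_events(events, crash_time)
if not found:
    # An empty result is itself a clue: the operating system saw nothing coming.
    print("No warnings or errors before the freeze; suspect hardware.")
else:
    for e in found:
        print(e["time"], e["level"], e["source"])
```

If the list comes back empty, as it did on my server, that points away from software and toward hardware.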
A few days later, it happened again. This time, when I restarted, I booted into Dell’s Diagnostic Utilities partition and ran its comprehensive series of diagnostics. They showed no hardware problems. I also ran a quick memory test that showed no problems. Baffled, I restarted the system. Maybe it’s a failing motherboard, I thought, or a system that’s overheating.
When it happened again the next day, I decided to run a more comprehensive memory test. And sure enough, when I ran the full suite of memory tests included with Dell’s diagnostic suite, I found that one of the server’s error-correcting code (ECC) memory modules was producing unrecoverable errors. Now, an unrecoverable memory error is bad news, and it would completely explain why (1) the system was locking up and (2) the lockups had no apparent relation to any software running.
Using another diagnostic tool, I ran a different suite of tests, which showed that the fault was in the memory module in DIMM slot A. This particular system has four slots, each with a 1GB stick of RAM in it. The RAM is installed in pairs. I wasn’t sure which slot was DIMM slot A, so I took out the modules on either end and then reseated the other two DIMMs in the remaining slots.
I restarted and ran another memory diagnostic. This time the system passed with flying colors. I am now highly confident that one of the two modules I removed is defective. They’re still under warranty, so I should be able to return them for replacement.
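The elimination step I used, pulling one pair of modules and retesting with the rest, can be sketched in a few lines of Python. This is only an illustration of the logic, not a real diagnostic: the module names, the pair labels, and the `fake_test` function standing in for an actual memory test are all hypothetical.

```python
def suspect_pairs(pairs, passes_memory_test):
    """Remove one pair at a time; if the remaining modules pass, suspect the removed pair.

    `pairs` maps a label to the list of modules in that pair.
    `passes_memory_test` takes the list of installed modules and returns True or False.
    """
    suspects = []
    for name in pairs:
        # Everything except the pair currently pulled from the machine.
        remaining = [m for other, mods in pairs.items() if other != name for m in mods]
        if passes_memory_test(remaining):
            suspects.append(name)
    return suspects

# Hypothetical layout: four 1GB modules installed as two pairs.
pairs = {"outer": ["dimm1", "dimm4"], "inner": ["dimm2", "dimm3"]}

def fake_test(installed):
    # Stand-in for a real memory diagnostic: pretend dimm4 is the bad stick.
    return "dimm4" not in installed

print(suspect_pairs(pairs, fake_test))
```

Because the RAM goes in as pairs, this narrows the fault to a pair rather than a single stick, which is exactly where I ended up.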
Lessons learned:
Most system and application failures are fairly easy to identify. Random failures often indicate hardware problems.
Bad RAM, overheating, and defective hard disks, in order, are the most common hardware failures in my experience.
Hardware can fail over time. Most people assume that the problem is software because they haven’t changed any hardware lately.
Hope that helps, Thomas!