Successfully fixing a BSOD problem

This is the very picture of an unhappy, unhealthy Windows system:

[Image: Reliability Monitor chart]

In this Reliability Monitor chart, the two rows to focus on are at the bottom. That string of Windows failures (and related Miscellaneous failures) indicates the appearance of the dreaded Blue Screen of Death. I’ve highlighted the last appearance, on October 14. Note that the Failure Detail section below lists the error code, 0x000000D1, and examining the blue screen itself shows that the error is triggered in (although not necessarily caused by) a Windows system file called Afd.sys, which is related to networking.
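For readers who want to turn a stop code like this one into its symbolic name, here is a minimal Python sketch. It is only an illustration: the table holds a handful of well-known bug check codes rather than the full list, and the function name `describe_stop_code` is made up for this example.

```python
# A few well-known Windows bug check (stop) codes and their symbolic names.
# This table is deliberately tiny; it is not a complete list.
BUG_CHECKS = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",
    0x000000D1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}


def describe_stop_code(text):
    """Return the symbolic name for a stop code string such as '0x000000D1'."""
    code = int(text, 16)  # accepts '0x000000D1', '0xD1', or 'D1'
    return BUG_CHECKS.get(code, "unknown (not in this small table)")


# The code reported in the Failure Detail section of the chart.
print(describe_stop_code("0x000000D1"))
```

The name points at a driver-level fault, which is consistent with the error surfacing in a system file without that file necessarily being the cause.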

As you can see from the chart, this string of errors started on October 1, and these hard crashes were occurring with alarming frequency by October 10. The crashes were not triggered by any particular activity but seemed random. To troubleshoot, I did the following:

  • Ran hardware diagnostics. Thorough testing showed no problems with memory or hard disk, and I was able to rule out heat and power problems as well.
  • Installed a BIOS update and a hotfix from the system manufacturer’s website. The readme files for each download mentioned that they were designed to fix BSOD problems that sounded similar.
  • Uninstalled several third-party programs that I had installed around the time the crashes began occurring. This included an antivirus package and a display utility.
  • Checked for unsigned drivers using the Verifier tool. Found none.

And still the crashes continued. The data set shown above is from a relatively new system, an HP m9300t that I purchased last month. After exhausting everything in my troubleshooting bag of tricks, I was ready to send the system back but decided to first contact HP support. I used the online chat feature and connected with a tech support rep within two minutes, and we walked briskly through the problem.

After about five minutes of research, he came back with the answer. I needed to replace the storage driver for my onboard Intel ICH8R/ICH9R SATA RAID controller. In fact, I had inadvertently caused the problem by replacing the HP-supplied OEM driver with a newer Intel Matrix Storage Manager driver downloaded from Intel’s site. The Intel driver was version 8.5.0.1032. HP’s recommended driver was version 7.6.9.1002. After confirming that the BIOS was up to date, I installed the recommended driver, reinstalled the QFE 955252 hotfix (using the x64 version here), and restarted the system.

The results? Well, you can see for yourself in the screen shot above. This system has been running nonstop for exactly a week, without a trace of the previous problems. (The red X’s in the second row are application crashes related to a beta program I’m testing.) I’m confident that this BSOD problem is now cured.

And the moral? The latest driver is not necessarily the best, especially for critical components like a network card or a storage controller. Think twice before replacing an OEM-supplied driver with one from a different source, and be ready to roll back that driver at the first hint of trouble.
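To make that check concrete, here is a small Python sketch that compares an installed driver version against the one the PC maker recommends. The two version strings come from this post; the function names `parse_version` and `check_driver` are hypothetical, and how you obtain the installed version (Device Manager, for instance) is left out.

```python
def parse_version(text):
    """Split a dotted version string like '8.5.0.1032' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def check_driver(installed, recommended):
    """Compare the installed driver version with the OEM-recommended one."""
    a, b = parse_version(installed), parse_version(recommended)
    if a == b:
        return "matches OEM recommendation"
    if a > b:
        return "newer than OEM recommendation"
    return "older than OEM recommendation"


# Intel's downloadable driver versus the version HP recommended.
print(check_driver("8.5.0.1032", "7.6.9.1002"))
```

Comparing tuples of integers instead of raw strings matters because a plain string comparison would rank "10.0" below "9.0". A "newer than OEM recommendation" result is not proof of a problem, only a flag that the driver is a rollback candidate if crashes begin.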

I have to give props to HP for its support as well. The entire support interaction lasted less than 20 minutes. The tech was knowledgeable and polite. And most important of all, he was able to diagnose and fix the problem the first time.

15 thoughts on “Successfully fixing a BSOD problem”

  1. I think tech support has gotten much better since companies like HP and Dell have implemented chat. I used to hate calling tech support for anything.

    I haven’t looked at my Reliability monitor in a while. I checked it today and to my surprise I was at a perfect 10 for a month until a sidebar problem a week ago.

  2. Interesting. My self-built, year-old Vista box blue screened last week for the first time shortly after increasing the RAM to 4 GB. That turned out to be the Intel Matrix Storage Driver as well. In my case, I was using not only an old driver, but one of the first Intel released for that motherboard. Upgrading to the newest driver seems to have cured it. (I was an early adopter for this motherboard and ran into a lot of issues with the early drivers and BIOSs).

    This was my first chance to try out the Crash Analyzer Wizard from the Microsoft Desktop Optimization Pack. I was quite impressed by it – you just install and run it (after also installing the debugging tools), and it comes up with the probable cause of the crash. MDOP comes with SA, but it’s also available on TechNet and probably MSDN.

  3. That’s a lot of software failures. While it is understandable that a driver could wreak that much havoc, I hope you’ll remember this next time the issue comes up and you claim that most of the problems are hardware related.

  4. The problem is, it’s a delicate balancing act between the latest driver and the manufacturer’s drivers – most manufacturers stop updating their drivers and applications, leaving you on your own. I had a constant problem with a Dell Optiplex machine in Vista, related to iaStor constantly crashing:

    http://www.pdsys.org/blog/2007/04/10/DellAndIntelScrewUpBigTimeIaStorAndWindowsVista.aspx

    It could only be fixed by installing the latest drivers or applying a registry hack. Dell didn’t help with the issue at all; instead, they wanted me to do a system restore (which I had done, and still had the problem).

  5. Amazing! I have had such the opposite experience with HP. That was about a year and a half ago though. Perhaps they’ve changed since then?

  6. Jon, the software failures you see there are all caused by two programs. One is an early beta, which should be expected to crash at this point in its development cycle. The other is a particularly poorly written piece of software that I need to use; it hangs occasionally and needs to be restarted. This system has easily 50 software programs installed on it. The fact that two of them (one of which is a beta) occasionally hang doesn’t seem to be worth making a fuss over.

  7. I’ve had a similar problem to Ed with an Acer 5920G laptop running Vista 32-bit. The only Intel Matrix Storage driver versions that don’t cause random freezes are 8.2.3.1001 or the version which Acer supplies (7.6.0.1011).

    The most annoying thing about this problem was that there was absolutely nothing to indicate that it was the Storage Driver causing the freezing.

  8. I didn’t even know the Reliability and Performance Monitor was there. Thanks for the tip!

  9. You’re really on to something with this, Ed. With any given piece of hardware, there are usually three sources for a driver: one from the PC maker, one from Windows Update and one from the hardware manufacturer. As you describe, these sources may or may not offer the same driver and there’s no way other than trial-and-error to determine which one is best.

    Case study: On a Dell XPS 410, the Sigmatel integrated audio driver suggested by Dell for the system causes problems (and even prevented SP1 from installing on Vista), but the generic audio driver from Windows Update works. On the same system’s ATI 1600x pro graphics card, the driver from Windows Update doesn’t work; this time, the best driver comes directly from ATI’s Web site.

    Making things even more confusing, if you use Windows’ Device Manager and choose “Update Driver Software” for a device currently using a manufacturer-issued driver, Windows will overwrite the driver with one that might be older and may not work as well.

    I’ve heard some recommend using manufacturer-issued drivers for graphics cards and TV tuners, the PC makers’ drivers for disk controllers and things integrated in the motherboard, and then finally Windows Update drivers for most other things. I suppose this rule-of-thumb works, but it seems to put much of the work in the users’ hands on something that should be in the background and invisible for the most part.

    Seems to me the answer would be to require hardware makers to certify their drivers and make them immediately available at Windows Update. There’s a central catalog for drivers available, so why not use it?

  10. My motherboard’s network card (nVidia) will only work with the original NIC driver that comes with Vista. nVidia’s drivers are very flaky with this board (a version 1.0; I was an early adopter of this particular motherboard.) Not at all surprised.

  11. Ed, I think you are referring to the application failures? I suppose “beta” means different things to different people – I’d call software that crashes every couple days more like “alpha” software – though I suppose if you were using it constantly, maybe a crash every 24-48 hours isn’t too bad.

    I run the Debian “testing” suite, which arguably, isn’t really “testing”, but still contains fairly new software, in my opinion. I have a friend who runs the “unstable/nightly builds” suite, and he does run into a crash or malformed behavior every couple years.

    And I did have my xserver on my home computer flake out the other day, using up most of the CPU, so I had to restart X – I am not saying that Linux is perfect (and hopefully taking myself out of the fan-boy category), but as a user of both Windows and Linux for a decade or two, I don’t see how anyone can say that Windows is more reliable by any definition.

    Looking back at your first paragraph, I find it interesting that you uninstalled a display “utility” (unless it included a driver) and anti-virus software – why would those applications cause an OS crash? That is the sort of difference I am talking about.

    I don’t argue with your driver problem – that can happen to any OS (though as the other commenters point out – it is sometimes hard for even technically minded folks to figure out which driver is the correct one to use – and remember that decision when it gets “upgraded” by Windows Update a year later – an experience I share with them).

  12. Jon, antivirus software hooks into the operating system at the very lowest level, with kernel-mode drivers and file system filter drivers. Those are capable of creating BSODs, and because I installed that program the day the crashes started happening, I uninstalled it to confirm that it wasn’t the source of the problem. Troubleshooting 101.

    As for display drivers, same story. This particular display utility did indeed include a kernel-mode driver, and I had upgraded it to a new version the same day as I installed the AV software. So I uninstalled it just to confirm.

    The beta software in question is a web browser and indeed it is one I use full-time, for as long as the OS is running. The crashes are infrequent (four times in three weeks) and are easily recovered from with no loss of data. The Application Failure list also includes entries for processes I kill using Task Manager, because I just don’t want to wait for a process that is hung on a remote server or something.

  13. “Windows and Linux for a decade or two, I don’t see how anyone can say that Windows is more reliable by any definition.”

    Except Windows is stable and reliable, what is not are the applications that run on top. And there is no operating system that can make applications stable and reliable. But the underlying system is.

    To say that Windows must be unstable and unreliable because program X crashes N times per day is a false statement, and we both know it.

  14. Hrm, I’m now confused when you say “windows is stable and reliable”. What exactly are you referring to? I am referring to blue screens. You’ll see in all of my comments that I wasn’t counting the application failures as part of my windows instability comments, though I have less of a tolerance for “beta” applications crashing than Ed does.

    I am not enough of a kernel guy to know if BSODs are exactly equivalent to kernel panics, but I believe they are. And the number of times I have seen a kernel panic is probably somewhere around 4 in the last ten years, due to driver or motherboard issues. The number of times I have seen BSODs is uncountable, for various reasons. XP has certainly been much better than prior versions of Windows, but BSODs are not completely gone.

    I don’t remember if I have seen BSODs in Vista or not. I don’t recall any. I have seen memory dumps and automatic restarts (which seem more like a kernel panic, if there is a difference) and times when the OS and all applications are completely unresponsive and require yanking out the power cord (or holding the power button) to continue. That doesn’t fit my definition of reliable.

    How are you defining Windows as stable, and how many days/months/years has your OS been running without requiring a reboot?

Comments are closed.