[Discuss] ATA Access Errors For Spinning Disk

Steve Litt slitt at troubleshooters.com
Sun Dec 17 21:05:39 EST 2023


jbk said on Sun, 17 Dec 2023 10:13:36 -0500

>I periodically get access errors for a specific spinning 
>disk that I have done these things to diagnose:
>Changed Sata Cable
>Switched Sata bus on MB
>Run E2fsck on the 3 formatted ext4 partitions w/ no errors found
>Run smartctl -a: all results within norms
>Run smartctl -t short: No errors found
>
>Disk operation age is about 7.5 years with around a couple 
>hundred starts. It has been in continuous operation for over 
>8 years except during vacations. On occasion the disk 
>partitions will become unmounted and a mount -a will remount 
>the partitions as a different device from lets say sda to 
>sdd. I've not lost any data and I do regular backups to 
>another device that's rotated out of system.
>
>I seem to have always had these errors present on this MB 
>that is maybe 4 or 5 years in operation. Any thoughts on the 
>cause of this issue? Do others see this behavior on occasion 
>on systems they manage?
>
>On this same system my Rocky OS on an SSD is showing no 
>issues at all. Same operation age as the spinner.

I really like the troubleshooting strategy you've pursued in trying to
find the root cause of this intermittent problem. As we all know,
intermittents are much more difficult to diagnose than reproducible
symptoms. If you look at the Universal Troubleshooting Process (UTP) on
Troubleshooters.Com, you'll see that UTP step 5, Corrective
Maintenance, is extremely powerful and necessary with intermittent
problems. I have some suggestions for Corrective Maintenance and
further diagnostic tests...

* You get occasional disk errors, any of which could cause data
  corruption. To prevent things from getting worse, boot a rescue
  distro and ddrescue your current disk to a larger disk, and if you
  ever mount that backup disk, mount it read-only.

* Lubricate all electronic contacts for all cables, daughter cards, RAM
  sticks, switches with associated cables, and jacks and plugs for all
  peripherals. Apply the lubricant to conductive surfaces on both plug
  (male) and jack (female), then insert and remove twenty times to bust
  off all corrosion. Please take 10 minutes to read this 20-year-old
  discussion of electronic lubrication:
  http://troubleshooters.com/tpromag/200310/200310.htm
  I've used transmission fluid, WD-40, Lube-Job electronics lubricant,
  Breakthrough CLP, Deoxit Gold, Superslick Slick Stuff, and CRC QD
  Contact Cleaner, and was very satisfied with all of them. I
  currently use mostly Superslick Slick Stuff. The important thing is
  that there's residual lubrication to prevent build-up of Fretting
  Corrosion. Stabilant 22 and Deoxit Gold are the safest to prevent
  damage to non-metals and prevent conduction between non-mating
  surfaces, but they're pretty expensive. My experience has been that
  the cheaper products work fine as long as I carefully limit
  application to the mating conductors.
  Lubricating all mating electronic contacts takes 2 or 3 hours, but
  doing so can save you weeks of frustration if an intermittent is
  being caused by fretting corrosion between electronic contacts. I do
  complete electronic contact lubrication during the initial build of
  all my computers. Because you've observed this intermittent since you
  bought the mobo several years ago, lubricating the RAM stick contacts
  is especially important, as it's likely those sticks have been in
  place since you bought the mobo.

* Run a complete RAM test overnight by booting a memtest86 CD or thumb
  drive. Get rid of any sticks with errors. Intermittents are too
  expensive to try to limp along with RAM errors. Note that if you're
  not using UEFI, you'll need an older version of memtest86.

* Temporarily swap in a known good power supply, use for several days,
  and see whether the problem has gone away. If so, use the known good
  power supply or a known good newly purchased power supply. If the
  problem persists, put back the original power supply at the
  conclusion of troubleshooting.

* Power switches and reboot switches can go intermittent and cause
  hangs and spontaneous reboots. If I have suspicions of these things,
  I disconnect the reboot switch (you can always unplug the computer
  for an abrupt shutdown), and temporarily disconnect the power switch,
  starting and stopping the computer by CAREFULLY shorting the power
  switch pins with a screwdriver. I then run the machine for about 3
  days to see if the problem really went away. If the problem appears
  to be the power switch, I replace it with a cheap, wired, no light, 2
  contact doorbell switch, available at home warehouse stores. If you
  can't find it there (most doorbell switches are now lighted), I'm
  pretty sure that this is what you need:
  https://www.ebay.com/itm/155929670486 . You might need extra wire so
  your front panel can be removed enough to service the front parts
  without needing to disconnect the power button leads and fish them
  around the motherboard and through the chassis.

* If you're overclocked, roll it back to the non-overclocked
  frequencies. Often simply telling the BIOS to reset to its factory
  state is a great way to rule out a whole bunch of BIOS-caused
  problems. As always, test for several days to make sure the
  intermittent symptom really went away.

* Use various sensor programs to check various CPU temperatures and
  disk temperatures. If temperatures even begin to approach maximum
  specs, take 

* Try to observe whether this intermittent symptom occurs significantly
  more when running a specific set of software, and act accordingly.

* Boot a radically different distro, use for several days, and see if
  the intermittent symptom still occurs. If so, you've for the most
  part ruled out your distro, software, and config settings. If not,
  investigate your software and configs.

* If none of the preceding works and you have a known good spinning
  rust hard disk bigger than the current one, you could ddrescue the
  current one onto the new, bigger one and test for a few days. If the
  symptom doesn't recur, the hard disk had a problem not detected by
  smartctl.

* If none of the preceding works, you need to consider how much time,
  money and energy you're willing to throw at this intermittent
  problem. Personally, at this point, I'd bite the bullet and buy a new
  motherboard, RAM and processor and processor heat sink. Be sure to
  use high quality thermal heat sink compound between processor and heat
  sink, be sure to remove any labels the manufacturer stupidly put on
  the processor where it should be mating with the heat sink, and clean
  all label adhesive residue before applying heat sink compound. Don't
  cheap out on the heat sink: A lot of times the heat sink packaged
  with the processor is great for email and light web browsing, but
  allows overheat in intense operations like compiling a kernel.
  Remember, you want this new setup to last for many years.

* If you're going to buy a new mobo, CPU and RAM anyway, it costs you
  nothing to take the very risky step of updating your BIOS. Who knows,
  it might work. Because of risks involved in BIOS updates, I don't
  recommend them except in cases where your symptom is a well known
  effect of your specific BIOS version, or else when you're about to
  throw the mobo in the trash anyway. Be sure to run the computer on a
  known good uninterruptible power supply when updating your BIOS so
  your electric company's problems don't brick your computer.
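
The temperature check suggested above can be scripted. What follows is
only a minimal Python sketch, assuming a Linux kernel that exposes its
sensors under /sys/class/hwmon, where each temp*_input file holds
thousandths of a degree Celsius. The function names and the 70 C limit
are placeholders made up for illustration; take the real limits from
the spec sheets for your own CPU and disk. Disk temperatures show up
here only if a kernel driver such as drivetemp is loaded.

```python
from pathlib import Path


def read_temps_celsius(hwmon_root="/sys/class/hwmon"):
    """Return {label: degrees C} for every temp*_input file found."""
    temps = {}
    for input_file in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millidegrees = int(input_file.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable right now, so skip it
        chip_dir = input_file.parent
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        sensor = input_file.name[: -len("_input")]  # e.g. "temp1"
        temps[f"{chip_dir.name}:{chip}/{sensor}"] = millidegrees / 1000.0
    return temps


def too_hot(temps, limit_c):
    """Return only the readings at or above limit_c."""
    return {label: deg for label, deg in temps.items() if deg >= limit_c}


if __name__ == "__main__":
    LIMIT_C = 70.0  # assumed placeholder, NOT a value from any spec sheet
    readings = read_temps_celsius()
    for label, deg in sorted(readings.items()):
        print(f"{label}: {deg:.1f} C")
    for label, deg in sorted(too_hot(readings, LIMIT_C).items()):
        print(f"WARNING: {label} is at {deg:.1f} C")
```

Run it from cron every few minutes and append the output to a file, and
you'll have a temperature record to compare against the times the disk
errors occur.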

I'm very aware of the time and energy the preceding steps require. Your
computer is now 8 years old and probably anemic by today's standards.
If your current computer has enough capability for your needs, you
could probably buy a whole new computer of equal capability for under
$700. If you want to replace it with a modern computer with huge
capacity, you can probably do it for between $1500 and $2300. Remember,
the alternative is all the troubleshooting steps I listed (and probably
other people can think of even more).
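
If you do go the troubleshooting route, the ddrescue clone is the one
step where a typo can destroy data, so it may be worth a sanity check
before running it. Here's a minimal Python sketch, assuming Linux
sysfs, where /sys/block/<disk>/size gives the disk size in 512-byte
units. The function names are hypothetical, and the script only prints
the command; it never runs ddrescue itself. The -f option lets ddrescue
overwrite a device, and -n skips the slow scraping phase on the first
pass.

```python
import sys
from pathlib import Path


def device_bytes(name, sys_block="/sys/block"):
    """Size in bytes of a whole disk such as 'sda', read from sysfs."""
    sectors = int((Path(sys_block) / name / "size").read_text().strip())
    return sectors * 512  # sysfs reports size in 512-byte units


def ddrescue_command(source, dest, mapfile, sys_block="/sys/block"):
    """Build a ddrescue argument list after basic safety checks."""
    if source == dest:
        raise ValueError("source and destination are the same disk")
    src_bytes = device_bytes(source, sys_block)
    dst_bytes = device_bytes(dest, sys_block)
    if dst_bytes < src_bytes:
        raise ValueError(
            f"{dest} ({dst_bytes} bytes) is smaller than "
            f"{source} ({src_bytes} bytes)"
        )
    # Usage is: ddrescue [options] infile outfile [mapfile]
    return ["ddrescue", "-f", "-n", f"/dev/{source}", f"/dev/{dest}", mapfile]


if __name__ == "__main__":
    # Example: python3 clonecheck.py sda sdb rescue.map
    if len(sys.argv) != 4:
        sys.exit("usage: clonecheck.py SOURCE_DISK DEST_DISK MAPFILE")
    print(" ".join(ddrescue_command(*sys.argv[1:4])))
```

Because your disk has been changing names between sda and sdd, confirm
which name belongs to which physical disk immediately before running
the printed command.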

HTH,

SteveT

Steve Litt 

Autumn 2023 featured book: Rapid Learning for the 21st Century
http://www.troubleshooters.com/rl21

