WD40EFAX drives - IDNF when resilvering ZFS array

stoatwblr · March 24, 2020, 3:50pm

I’ve just purchased 3 WD REDs to replace aging drives in a ZFS array

ALL THREE are failing during resilvering with IDNF (sector ID not found) errors:

Here’s a typical example:

After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 00 00 00 0a 10 40 00 Error: IDNF at LBA = 0x00000a10 = 2576

Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 08 00 20 00 00 03 7f 31 40 40 00 17:55:26.267 WRITE FPDMA QUEUED
61 00 08 00 18 00 00 03 7f 2d 80 40 00 17:55:26.267 WRITE FPDMA QUEUED
61 00 10 00 10 00 01 d1 c0 74 10 40 00 17:55:26.267 WRITE FPDMA QUEUED
61 00 10 00 08 00 01 d1 c0 76 10 40 00 17:55:26.267 WRITE FPDMA QUEUED
61 00 10 00 00 00 00 00 00 0a 10 40 00 17:55:26.267 WRITE FPDMA QUEUED
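For anyone who wants to read these lines without counting hex columns, here is a rough Python sketch that decodes them. It assumes the column layout shown in the smartctl headers above (ER, reserved, ST, two COUNT bytes, six LBA bytes, DV, DC), the standard ATA meanings of the error and status bits, and the NCQ convention that the sector count travels in the FEATURES field while the tag sits in the upper bits of COUNT. The function and table names are made up for illustration; this is not part of smartmontools.

```python
# Bit names for the ATA error and status registers (a subset, not the full list).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF",
               0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}


def flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_error_line(line):
    """Decode the 'registers were' line.

    Assumed token layout: ER, reserved, ST, COUNT (2), LBA_48 (6), DV, DC.
    """
    t = line.split()
    return {
        "error": flags(int(t[0], 16), ERROR_BITS),
        "status": flags(int(t[2], 16), STATUS_BITS),
        "lba": int("".join(t[5:11]), 16),
    }


def decode_ncq_write(line):
    """Decode one WRITE FPDMA QUEUED line.

    Assumed token layout: CR, FEATR (2), COUNT (2), LBA_48 (6), DV, DC, ...
    """
    t = line.split()
    return {
        "opcode": int(t[0], 16),
        "sectors": int(t[1] + t[2], 16),  # NCQ: sector count is in FEATURES
        "tag": int(t[4], 16) >> 3,        # NCQ: tag is bits 7:3 of COUNT
        "lba": int("".join(t[5:11]), 16),
    }


err = decode_error_line("10 -- 51 00 00 00 00 00 00 0a 10 40 00")
last = decode_ncq_write(
    "61 00 10 00 00 00 00 00 00 0a 10 40 00 17:55:26.267 WRITE FPDMA QUEUED"
)
print(err)
print(last)
```

Read this way, the failing command is the last one in the list: a 16-sector queued write (tag 0) starting at LBA 2576, the same LBA the drive reports in the IDNF error.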

End-to-end sequential writes of the drives show no problem, smartctl shows no issues, and ATA secure erase also shows no problem.

Does anyone have any ideas what’s going on?

pazuwu · March 25, 2020, 2:41pm

I’ve got the same situation over here; just purchased 2 x 4TB WD Red WD40EFAX and I’m trying to use them with ZFS in Linux (zfs 0.8.3 and kernel 4.17.14), on a HP Gen8 Microserver.

The SMART info looks clean, except for the IDNF errors in the extended log. They show up during ZFS resilvering. I ran a complete badblocks test, as well as the short / long SMART tests, and no further errors occurred.

My complete SMART output + kernel errors: pastebin log.

I will move them to a Windows machine and use WD’s test tool on them, but until then I’m doing a couple more tests with a new ZFS pool, to see if I can reproduce the problem. I also plan on testing btrfs (although I don’t think this looks like a software - ZFS - issue).

stoatwblr · March 26, 2020, 2:24am

Argh!

Who on earth sells SMR drives disguised as normal drives and then badges them as NAS devices?

Of all the stupendously DUMB things I can think of this comes near to taking the biscuit.

pazuwu · March 26, 2020, 5:25am

Hmm, I had no idea they were SMR drives (didn’t even know about this technology). Are you saying that this could be the problem?

I’ve tested the drive in a Windows box. Other than the fact that WD’s own tool (Data Lifeguard) doesn’t see the drive, it’s fine: no kernel error messages, SMART checks out OK in various tools, and a full surface test completed with no issues.

stoatwblr · March 26, 2020, 8:06pm

It certainly looks that way.

Benchmark testing is fairly conclusively pointing to the drives being SMR.
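The post doesn’t say which benchmark was used. One signature often associated with drive-managed SMR is that sustained write throughput starts out normal and then collapses once the drive’s internal staging area fills. As a rough sketch under that assumption, the hypothetical helper below takes per-interval throughput samples (however you collected them) and looks for that kind of cliff. The window size, the 50% threshold, and the sample numbers are all arbitrary placeholders, not measurements from any drive.

```python
from statistics import median


def find_throughput_cliff(samples_mb_s, window=5, drop_ratio=0.5):
    """Return the start index of the first window whose median throughput
    is below drop_ratio * the initial baseline, or None if there is none.

    The baseline is the median of the first `window` samples.
    """
    if len(samples_mb_s) < 2 * window:
        return None  # not enough data to compare a baseline with a later window
    baseline = median(samples_mb_s[:window])
    for i in range(window, len(samples_mb_s) - window + 1):
        if median(samples_mb_s[i:i + window]) < drop_ratio * baseline:
            return i
    return None


# Made-up numbers purely to exercise the function.
steady = [150, 148, 152, 149, 151, 150, 147, 150, 149, 151]
cliff = [150, 148, 152, 149, 151, 150, 40, 12, 10, 11, 9, 10]

print(find_throughput_cliff(steady))
print(find_throughput_cliff(cliff))
```

A cliff on its own isn’t proof of SMR (thermal throttling or a busy controller can look similar), so treat a hit as a reason to dig further rather than a verdict.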

I’m talking to OpenZFS experts at the moment, including a couple of vendors who recommend REDs in their products. They’re alarmed to find that SMRs are in the channel without being differentiated and will be testing the EFAXs in the next few days to try and verify my findings.

Of course one of the more painful problems with WD drives is the inability to upgrade their firmware in Linux or BSD…

*HINT HINT - Some of us don’t actually HAVE windows boxes - even in the work environment.

stoatwblr · March 26, 2020, 8:22pm

Does the WD Windows tool report the FW version as current? (Firmware Version: 82.00A82 here)

I’m assuming that as the drives are brand new into the channel there’s no update but you never know.

My issues are on a home ZFS setup on Linux/ZFS 0.8.3 too.

We’re about to pull the trigger at $orkplace on a (LARGE) setup to replace a 400TB TrueNAS that’s been running flawlessly for the last 5 years, so I communicated my concerns/experience to the vendor I’ve been dealing with. We had “lots and lots and lots” of problems with previous products, and I really don’t want the users to find any hint of latencies or instabilities which can be attributed to the fileserver. The vendor has a blanket policy that SMR is best kept “far far away” from performance arrays, and they really don’t like the idea that SMR drives are being submarined into the marketplace like this, as it’s the kind of thing that drives up warranty/support costs rapidly.

pazuwu · March 26, 2020, 9:24pm

I did a bit of googling and it seems it’s a known fact (?) that WD?0EFAX are SMR. I should have suspected there was something fishy when I saw they were cheaper than the WD40EFRX while also having a larger cache.

My drives have the same firmware, 82.00A82 (it’s in the pastebin of my original post too). WD’s Windows tool (WD Data Lifeguard 1.36) didn’t even detect the drive, so I don’t know if there’s the possibility of an upgrade.

I’m returning the drives tomorrow. They “seem” fine, they just don’t want to work with ZFS in my setup. I’ll probably try my luck with some IronWolfs again (internet says they’re not SMR, but I didn’t find any official info), although last time I ordered, 1 out of 2 was DoA…

stoatwblr · March 31, 2020, 5:26pm

This issue is starting to get traction in a few forums, and it’s been confirmed the drives cannot be used to rebuild RAID6 arrays either.

EU laws are quite tough on false advertising. Changing the underlying characteristics of drives advertised as suitable for NAS and RAID use isn’t going to go down at all well with regulators.

pazuwu · March 31, 2020, 6:16pm

So you had this confirmed, that all such drives have similar issues and it’s not just our bad luck?

stoatwblr · April 1, 2020, 4:23pm

Confirmed inasmuch as: EFAX appear to be SMR whilst EFRX are “CMR” (conventional).

Best not to use “PMR” as a term for the older drives - SMR (shingling) is an extension on top of PMR technology and I just had a WD regional marketing manager latch onto “PMR” to claim “the drives are PMR” and therefore there isn’t an issue.

The consensus is that in this particular instance the issue is rotten firmware, and there’s no good reason why the drives should be returning these codes.

I’m also getting feedback that it’s difficult-to-impossible to rebuild RAID5/RAID6 arrays using EFAX drives, not just RAIDZ/Z2/Z3 arrays.

It’s not just WD pulling this silliness. Examples have been cited of disguised DM-SMR units from Seagate too (eg: ST3000DM-007 and some IronWolf models have been confirmed).

I’ve sent a heads-up to the smartmontools developer list to let them know what’s going on.
Hopefully ways will be quickly developed to flag disguised DM-SMR drives

Just to add more fun: TDMR (Two Dimensional Magnetic Recording) is a way of describing the zoning and block reassigning (indirection) functions necessary in an SMR drive, and you essentially can’t have one without the other. There’s no need for this functionality in a CMR drive. That means the implications for issues are intertwined if you see drives described using either SMR or TDMR.

pazuwu · April 6, 2020, 2:07pm

Replaced the drives with WD40EFRX. Can confirm they work properly (in my setup at least).

stoatwblr · April 6, 2020, 4:19pm

My suppliers (insight) can’t obtain them anymore. Where did you get yours? What’s their manufacture date?

pazuwu · April 6, 2020, 5:43pm

Manufactured in Thailand, May 3rd, 2019. I bought them from an online retailer in Romania. They can still be found in various online shops here and across Europe too ( Western Digital WD Red Plus 4TB, SATA 6Gb/s (WD40EFRX) ab € 145,00 (2023) | Preisvergleich Geizhals EU )

stoatwblr · April 13, 2020, 4:12pm

About a year old then. I figured you’d gotten old stock from somewhere.

I’ve been checking Skinflint too, but more often than not the listings are out of date

Thanks

stevenc80 · April 14, 2020, 10:32pm

Thank you for raising this issue. This article suggests that all Reds up to 6TB use SMR. Do you have any info to confirm that EFRX models are still CMR?

stoatwblr · April 24, 2020, 7:05pm

stevenc80 · April 24, 2020, 8:59pm

Thank you. It seems only the 2-6TB EFAX drives use SMR, the 2-6TB EFRX drives still use CMR.
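The rule of thumb stated in this thread (2-6TB EFAX is SMR, 2-6TB EFRX is CMR) is simple enough to write down as a check. The sketch below encodes only that rule and returns "unknown" for everything else, including other capacities and other vendors. The function name and regular expression are illustrative assumptions, and model numbers outside what the thread covers should be checked against the vendor’s own documentation.

```python
import re

# Matches WD Red style model numbers such as WD40EFAX or WD40EFRX.
# The digits before the trailing 0 are taken as the capacity in TB.
_RED_MODEL = re.compile(r"WD(\d+)0(EFAX|EFRX)")


def classify_wd_red(model):
    """Classify a model string using only the rule stated in this thread."""
    m = _RED_MODEL.search(model.upper())
    if not m:
        return "unknown"
    capacity_tb, suffix = int(m.group(1)), m.group(2)
    if not 2 <= capacity_tb <= 6:
        return "unknown"  # the thread only makes a claim about 2-6TB models
    return "SMR" if suffix == "EFAX" else "CMR"


for name in ("WD40EFAX", "WDC WD40EFRX", "WD80EFAX", "ST3000DM007"):
    print(name, "->", classify_wd_red(name))
```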

stoatwblr · May 4, 2020, 5:58pm

iXsystems (makers of TrueNAS and FreeNAS) have confirmed my findings that this is a firmware bug:

“At least one of the WD Red DM-SMR models (the 4TB WD40EFAX with firmware rev 82.00A82) does have a ZFS compatibility issue which can cause it to enter a faulty state under heavy write loads, including resilvering. This was confirmed in our labs this week during testing, causing this drive model to be disqualified from our products. We expect that the other WD Red DM-SMR drives with the same firmware will have the same issue, but testing is still ongoing to validate that assumption.
In the faulty state, the WD Red DM-SMR drive returns IDNF errors, becomes unusable, and is treated as a drive failure by ZFS. In this state, data on that drive can be lost. Data within a vdev or pool can be lost if multiple drives fail.”
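Since the statement above names one specific model and firmware combination, a quick check of a drive’s identity text against it might look like the sketch below. It assumes the text uses "Device Model:" and "Firmware Version:" labels in the style of `smartctl -i` output for ATA drives; the sample strings are trimmed placeholders rather than real captures, and the helper names are hypothetical. Only the one combination named in the quote is listed.

```python
# The single combination named in the quoted statement.
CONFIRMED_AFFECTED = {("WD40EFAX", "82.00A82")}


def parse_identity(text):
    """Turn 'Key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_confirmed_affected(text):
    """True if the model and firmware match the combination named above."""
    fields = parse_identity(text)
    model = fields.get("Device Model", "")
    firmware = fields.get("Firmware Version", "")
    return any(m in model and firmware == fw for m, fw in CONFIRMED_AFFECTED)


# Placeholder identity text in the assumed format.
efax = "Device Model:     WDC WD40EFAX\nFirmware Version: 82.00A82\n"
efrx = "Device Model:     WDC WD40EFRX\nFirmware Version: 82.00A82\n"

print(is_confirmed_affected(efax))
print(is_confirmed_affected(efrx))
```

A False result only means the drive isn’t the one combination confirmed here. The quote itself says other WD Red DM-SMR drives with the same firmware are expected to behave the same way, with testing still ongoing.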