WD20EZRX unusual S.M.A.R.T values

Haidube · January 27, 2013, 7:07pm

Hello there,

A new WD Green WD20EZRX with only 275 power-on hours suddenly shows SMART logs with some alarming entries I’ve never seen before. I wonder why the drive decided to abort commands during boot of a Linux system, although no other SMART parameters look disturbing.

Could some experienced expert please comment? Many thanks!

smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EZRX-00DC0B0
Serial Number: WD-WMCXXXXXXXXX
Firmware Version: 80.00A80
User Capacity: 2'000'398'934'016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 9
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Jan 27 13:42:23 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: (26580) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities: (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability: (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70b5)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
  3 Spin_Up_Time 0x0027 219 178 021 Pre-fail Always - 4016
  4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 42
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
  7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
  9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 275
 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 42
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 17
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 34
194 Temperature_Celsius 0x0022 125 107 000 Old_age Always - 25
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
ATA Error Count: 4
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4 occurred at disk power-on lifetime: 274 hours (11 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 53 4f c2 20 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  b0 d8 00 00 4f c2 20 00 00:00:12.088 SMART ENABLE OPERATIONS
  ef 03 46 00 00 00 00 00 00:00:12.088 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 20 00 00:00:12.087 IDENTIFY DEVICE
  ec 00 00 00 00 00 20 00 00:00:12.086 IDENTIFY DEVICE

Error 3 occurred at disk power-on lifetime: 274 hours (11 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 53 4f c2 20 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  b0 d8 00 00 4f c2 20 00 00:03:25.482 SMART ENABLE OPERATIONS
  ef 03 46 00 00 00 00 00 00:03:25.482 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 20 00 00:03:25.482 IDENTIFY DEVICE
  ec 00 00 00 00 00 20 00 00:03:25.480 IDENTIFY DEVICE
  b0 d8 00 00 4f c2 20 00 00:02:44.579 SMART ENABLE OPERATIONS

Error 2 occurred at disk power-on lifetime: 274 hours (11 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 53 4f c2 20 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  b0 d8 00 00 4f c2 20 00 00:02:44.579 SMART ENABLE OPERATIONS
  ef 03 46 00 00 00 00 00 00:02:44.578 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 20 00 00:02:44.578 IDENTIFY DEVICE
  ec 00 00 00 00 00 20 00 00:02:44.577 IDENTIFY DEVICE
  b0 d8 00 00 4f c2 20 00 00:00:06.630 SMART ENABLE OPERATIONS

Error 1 occurred at disk power-on lifetime: 274 hours (11 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 53 4f c2 20 Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------------- --------------------
  b0 d8 00 00 4f c2 20 00 00:00:06.630 SMART ENABLE OPERATIONS
  ef 03 46 00 00 00 00 00 00:00:06.630 SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 20 00 00:00:06.630 IDENTIFY DEVICE
  ec 00 00 00 00 00 20 00 00:00:06.591 IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Conveyance offline Completed without error 00% 274 -
# 2 Extended offline Completed without error 00% 227 -
# 3 Extended offline Completed without error 00% 149 -
# 4 Extended offline Completed without error 00% 87 -
# 5 Extended offline Completed without error 00% 77 -
# 6 Extended offline Completed without error 00% 48 -
# 7 Extended offline Completed without error 00% 21 -
# 8 Extended offline Completed without error 00% 6 -
# 9 Conveyance offline Completed without error 00% 1 -
#10 Short offline Completed without error 00% 1 -

SMART Selective self-test log data structure revision number 1
 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Not_testing
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

fzabkar · January 28, 2013, 3:14am

I have rearranged the command sequences (see end of post) to try to make sense of what is happening. The fifth command in each group of 5 looks out of place chronologically, so I’ve marked each with asterisks. This leaves a consistent group of 4 commands.

Assuming I have interpreted the Powered_Up_Time correctly, then the fact that the first set of commands was received 6.6 seconds after power-on and the last at 12 seconds, would suggest that they were issued by BIOS. The other two sets would suggest that the system was warm booted or reset twice in between.

It appears that BIOS first sends the IDENTIFY DEVICE command in PIO mode. It then determines the capabilities of the drive and upgrades the transfer mode to UDMA 6. Next it attempts to enable SMART. It is here that the error occurs.
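As a rough illustration of how the "UDMA 6" reading follows from the log, here is a minimal Python sketch that decodes the Sector Count byte (46 in the lines "ef 03 46 ...") of a SET FEATURES "Set transfer mode" command. The function name is hypothetical; the bit layout (upper five bits select the mode type, lower three bits the mode number) is my reading of the ATA8-ACS draft and should be checked against the standard.

```python
def decode_transfer_mode(sector_count: int) -> str:
    """Decode the Sector Count byte of SET FEATURES subcommand 0x03.

    Assumed layout: bits 7..3 = transfer mode type, bits 2..0 = mode number.
    """
    mode_type = sector_count >> 3
    mode_num = sector_count & 0x07
    if mode_type == 0b01000:
        return f"UDMA {mode_num}"
    if mode_type == 0b00100:
        return f"Multiword DMA {mode_num}"
    if mode_type == 0b00001:
        return f"PIO flow control mode {mode_num}"
    if mode_type == 0b00000:
        return "PIO default mode"
    return f"reserved/unknown (0x{sector_count:02x})"


# SC = 0x46 is the value logged in every SET FEATURES line of the error log.
print(decode_transfer_mode(0x46))
```

For the logged value 0x46 (binary 01000 110) this yields UDMA mode 6.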

I notice that smartctl is reporting that “SMART support is: Enabled”, so it looks like the drive powers up with SMART enabled by default, and then aborts subsequent SMART ENABLE OPERATIONS commands. According to the ATA8 standard, successive SMART ENABLE OPERATIONS commands should execute without error, so this looks like it could be a firmware bug. I notice that smartctl reports that the “ATA Version is 9”, so maybe something changed in this version. FWIW, the standard does specify that, if SMART is disabled, a subsequent SMART DISABLE OPERATIONS command should return an ABORT error.

See Sections 7.53.2 and 7.53.4 of the following document:

Working Draft AT Attachment 8 - ATA/ATAPI Command Set (ATA8-ACS):
http://www.t13.org/documents/UploadedDocuments/docs2008/D1699r6a-ATA8-ACS.pdf
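The register values in the error log can be decoded by hand. The following Python sketch shows one way to do it; the helper names are hypothetical, and only the bits I am reasonably sure of are listed (the remaining Error register bits are command dependent and are left out).

```python
# Status register bits (ST) and the one Error register bit (ER) relevant here.
STATUS_FLAGS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_FLAGS = {0x04: "ABRT"}


def decode_flags(value: int, table: dict) -> list:
    """Return the names of all flags in `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def is_smart_enable(cr: int, fr: int, cl: int, ch: int) -> bool:
    """True for SMART ENABLE OPERATIONS: command 0xB0, feature 0xD8,
    with the SMART signature 0x4F / 0xC2 in Cylinder Low / Cylinder High."""
    return cr == 0xB0 and fr == 0xD8 and cl == 0x4F and ch == 0xC2


# Values taken from the log: ER = 04, ST = 61, command line "b0 d8 00 00 4f c2 20 00".
print(decode_flags(0x04, ERROR_FLAGS))
print(decode_flags(0x61, STATUS_FLAGS))
print(is_smart_enable(0xB0, 0xD8, 0x4F, 0xC2))
```

ST = 61 has the DRDY, DF and ERR bits set and ER = 04 has ABRT set, which matches the "Device Fault; Error: ABRT" text that smartctl prints next to the registers.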

00:00:06.591 IDENTIFY DEVICE
00:00:06.630 IDENTIFY DEVICE
00:00:06.630 SET FEATURES [Set UDMA 6 transfer mode]
00:00:06.630 SMART ENABLE OPERATIONS → error

*** 00:00:06.630 SMART ENABLE OPERATIONS

00:02:44.577 IDENTIFY DEVICE
00:02:44.578 IDENTIFY DEVICE
00:02:44.578 SET FEATURES [Set UDMA 6 transfer mode]
00:02:44.579 SMART ENABLE OPERATIONS → error

*** 00:02:44.579 SMART ENABLE OPERATIONS

00:03:25.480 IDENTIFY DEVICE
00:03:25.482 IDENTIFY DEVICE
00:03:25.482 SET FEATURES [Set UDMA 6 transfer mode]
00:03:25.482 SMART ENABLE OPERATIONS → error

00:00:12.086 IDENTIFY DEVICE
00:00:12.087 IDENTIFY DEVICE
00:00:12.088 SET FEATURES [Set UDMA 6 transfer mode]
00:00:12.088 SMART ENABLE OPERATIONS → error
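
The rearrangement above was done by hand. A minimal Python sketch of the same idea follows: smartctl lists the commands newest first, so each group is reversed, and any command whose Powered_Up_Time lies well before the failing command is flagged as a leftover from an earlier boot. The function names and the 1 second threshold are my own assumptions, not anything smartctl provides.

```python
import re

# Matches Powered_Up_Time such as "00:03:25.482" or "1d+02:03:04.005".
TIME_RE = re.compile(r"(?:(\d+)d\+)?(\d+):(\d+):(\d+)\.(\d+)")


def parse_line(line):
    """Return (milliseconds since power-on, timestamp text, command name)."""
    m = TIME_RE.search(line)
    days, hh, mm, ss, ms = (int(g or 0) for g in m.groups())
    total_ms = (((days * 24 + hh) * 60 + mm) * 60 + ss) * 1000 + ms
    return total_ms, m.group(0), line[m.end():].strip()


def chronological(lines, gap_ms=1000):
    """Reverse one error's command list into oldest-first order and flag
    entries issued more than gap_ms before the failing (last) command."""
    entries = [parse_line(line) for line in reversed(lines)]
    latest = entries[-1][0]
    return [(stamp, name, latest - t > gap_ms) for t, stamp, name in entries]


# Command lines of "Error 3", copied from the smartctl output.
ERROR_3_LINES = [
    "b0 d8 00 00 4f c2 20 00 00:03:25.482 SMART ENABLE OPERATIONS",
    "ef 03 46 00 00 00 00 00 00:03:25.482 SET FEATURES [Set transfer mode]",
    "ec 00 00 00 00 00 20 00 00:03:25.482 IDENTIFY DEVICE",
    "ec 00 00 00 00 00 20 00 00:03:25.480 IDENTIFY DEVICE",
    "b0 d8 00 00 4f c2 20 00 00:02:44.579 SMART ENABLE OPERATIONS",
]

for stamp, name, stale in chronological(ERROR_3_LINES):
    print("***" if stale else "   ", stamp, name)
```

For Error 3 this flags only the 00:02:44.579 entry, which is the command that failed in the previous group.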

Haidube · January 28, 2013, 11:01am

Thank you very much for your analysis and your educated guess that it might be a firmware error.
Your interpretation of the timings is correct and I also thought that there was no convincing reason to make the disk firmware abort and write to the internal smart log.

We had bought 2 new drives at the same time and the other one also showed worrying smart data:
During the first few days of burn-in, which included approx. 6 power-on cycles, it always showed a spin-up time of 0, which is physically impossible.

After this burn-in under Linux using smartctl, we once measured throughput under Windows 8 using HD Tune Pro, where the SMART data immediately showed a reasonable value for spin-up time. The situation is even more miraculous because from then on that drive always showed a reasonable spin-up time under Linux too!

The magic behaviour of one drive and the SMART errors on the other make me wonder whether we got green bananas to be matured by the client…

There remains the disconcerting question: Which one, if any, of those drives can be trusted enough to be used in my home server, which holds highly valuable personal data and media?

fzabkar · January 28, 2013, 10:53pm

If it were my drive, I wouldn’t be concerned at all. Assuming that the error is indeed a firmware bug, then it appears to be benign. Its only undesirable consequence seems to be clutter in the error log.

In fact, here is a recent thread where the error log appears to have recorded another bug, but this time the bug is in WD’s diagnostic software rather than the firmware:

http://community.wdc.com/t5/Desktop-Portable-Drives/Odd-Errors-on-New-WD-Black-1TB-Drive/m-p/534596#M12130

The abovementioned thread illustrates that not all entries in the log are the fault of the drive. In that particular case the drive aborted an illegal command. Ironically it was WD’s own Vendor Specific Command (VSC), not a standard ATA command, that caused the error.

As for the Spin Up Time SMART attribute, I have noticed that some SMART attributes take a while to settle down. This is to be expected for those attributes whose values reflect a rolling average or lifetime average. In these cases the drive needs to record a certain amount of activity before the SMART data can be considered to be statistically significant.

For example, what can you say about a drive that records a seek error or read error on its very first seek or read? In Seagate’s case, a drive needs to record 1 million seeks before its Seek Error Rate attribute settles down. It begins with a normalised value of 100 and then immediately drops to 60 when the target is reached. Although this counterintuitive behaviour might ring alarm bells, the data actually reflect a perfect score. I haven’t examined WD’s attributes to the same extent, but I would expect that all is not as it would appear.

Here is another thread where I have collated Spin Up Time data for a WD drive:

http://community.wdc.com/t5/Desktop-Portable-Drives/WD1002FAEX-Rumble-noise/m-p/535979#M12200

Assuming that the raw value of the Spin Up Time attribute represents the time in milliseconds, then ISTM that a normalised value of 200 corresponds to 3.00 seconds, and each point corresponds to 50ms. That is, a drive that spins up in 4.00 sec would lose 20 points (180 = 200 - 20).
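
The arithmetic can be written out as a small Python sketch. The function name and default parameters are hypothetical and simply encode the guess above (200 points at 3.00 seconds, one point per 50 ms); this is not WD's documented formula.

```python
def expected_normalised(raw_ms, ref_ms=3000, ref_value=200, ms_per_point=50):
    """Normalised Spin_Up_Time predicted by the guessed linear model:
    ref_value at ref_ms, minus one point per ms_per_point of extra time."""
    return ref_value - (raw_ms - ref_ms) / ms_per_point


print(expected_normalised(4000))  # the 4.00 sec example from the text
print(expected_normalised(4016))  # the raw value reported for this drive
```

A 4.00 sec spin-up gives 180 as stated, and a raw value of 4016 gives roughly 179.7.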

Your data do not appear to corroborate my findings, but then the difference between current and worst values is quite large, so maybe the data are still settling down.

Spin_Up_Time - 219 178 021 - 4016