WD 2TB Red went offline

Tomaszg · March 10, 2015, 8:20pm

I’ve been running a Supermicro board with two identical WD20EFRX-68EUZN0 drives.

The OS is Linux with a 3.2.0 kernel.

Each disk has 6 partitions, and every partition except swap is a member of a separate mdadm RAID1 (mirror) array.

The partitions are correctly 4k-aligned. Both disks use the deadline scheduler.

The mirrors had been fine for a few months.

About a day ago my server sent me a notification saying 2 arrays had gone into degraded mode:

A Fail event had been detected on md device /dev/md/2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] 
md4 : active raid1 sdc8[0] sda8[2]
      927402816 blocks super 1.2 [2/2] [UU]
      
md3 : active raid1 sdc7[0] sda7[1]
      664057280 blocks super 1.2 [2/2] [UU]
      
md2 : active raid1 sda5[3](F) sdc5[2]
      258788216 blocks super 1.2 [2/1] [_U]
      
md1 : active raid1 sda2[3](F) sdc2[2]
      48827320 blocks super 1.2 [2/1] [_U]
      
md0 : active raid1 sda1[3] sdc1[2]
      390132 blocks super 1.2 [2/2] [UU]

… a few hours later it took down the other partitions, too:

A FailSpare event had been detected on md device /dev/md/0.

It could be related to component device /dev/sda1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] 
md4 : active raid1 sdc8[0] sda8[2](F)
      927402816 blocks super 1.2 [2/1] [U_]
      
md3 : active raid1 sdc7[0] sda7[1](F)
      664057280 blocks super 1.2 [2/1] [U_]
      
md2 : active raid1 sda5[3](F) sdc5[2]
      258788216 blocks super 1.2 [2/1] [_U]
      
md1 : active raid1 sda2[3](F) sdc2[2]
      48827320 blocks super 1.2 [2/1] [_U]
      
md0 : active raid1 sda1[3](F) sdc1[2]
      390132 blocks super 1.2 [2/1] [_U]
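As a side note, a dump like the one above can be checked mechanically. The following is only a minimal sketch, assuming the two-line-per-array layout shown in these mails; the function name `degraded_arrays` and the shortened sample text are made up for illustration and are not part of mdadm.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array: [members flagged (F)]} for arrays with a missing mirror."""
    result = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        # Header line, e.g. "md2 : active raid1 sda5[3](F) sdc5[2]"
        header = re.match(r"^(md\d+) : (\w+) (\w+) (.*)$", line)
        if header:
            current = header.group(1)
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(4))
            continue
        # Status line, e.g. "... super 1.2 [2/1] [_U]"; "_" marks a missing member
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                result[current] = failed
            current = None
    return result

# Shortened sample in the same layout as the notification mail
SAMPLE = """\
Personalities : [raid1]
md4 : active raid1 sdc8[0] sda8[2]
      927402816 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda5[3](F) sdc5[2]
      258788216 blocks super 1.2 [2/1] [_U]
"""

print(degraded_arrays(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample string.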

So /dev/sda has failed, which was confirmed by:

[Mon Mar 9 06:25:12 2015] md/raid1:md3: redirecting sector 880806144 to other mirror: sdc7
[Mon Mar 9 06:25:12 2015] disk 0, wo:0, o:1, dev:sdc7
[Mon Mar 9 06:25:12 2015] disk 0, wo:0, o:1, dev:sdc7
[Mon Mar 9 06:25:13 2015] md/raid1:md4: redirecting sector 450275584 to other mirror: sdc8
[Mon Mar 9 06:25:13 2015] disk 0, wo:0, o:1, dev:sdc8
[Mon Mar 9 06:25:14 2015] disk 0, wo:0, o:1, dev:sdc8
[Mon Mar 9 06:25:19 2015] md/raid1:md0: redirecting sector 1028 to other mirror: sdc1

… ending with something like:

[Mon Mar 9 18:50:20 2015] sd 0:0:0:0: [sda] Unhandled error code
[Mon Mar 9 18:50:20 2015] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Mon Mar 9 18:50:20 2015] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 24 b7 c0 c8 00 00 08 00
[Mon Mar 9 18:50:20 2015] end_request: I/O error, dev sda, sector 616022216
[Mon Mar 9 18:50:20 2015] sd 0:0:0:0: [sda] Unhandled error code
[Mon Mar 9 18:50:20 2015] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Mon Mar 9 18:50:20 2015] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 24 b7 c0 d0 00 00 08 00
[Mon Mar 9 18:50:20 2015] end_request: I/O error, dev sda, sector 616022224
[Mon Mar 9 18:50:20 2015] end_request: I/O error, dev sda, sector 0
[Mon Mar 9 18:50:20 2015] end_request: I/O error, dev sda, sector 0
[Mon Mar 9 18:55:31 2015] end_request: I/O error, dev sda, sector 0
[Mon Mar 9 18:55:31 2015] end_request: I/O error, dev sda, sector 0
[Mon Mar 9 19:00:31 2015] end_request: I/O error, dev sda, sector 0

… in a continuous loop.
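The CDB bytes in the kernel log above can be cross-checked against the reported sector numbers. In a 10-byte SCSI read or write command, bytes 2 to 5 hold the big-endian starting LBA and bytes 7 to 8 the transfer length in blocks. A minimal sketch; `decode_rw10` is an illustrative helper name, not an existing tool:

```python
def decode_rw10(cdb_hex):
    """Decode a SCSI Read(10)/Write(10) CDB given as space-separated hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a Read(10)/Write(10) CDB")
    opcode = "Read(10)" if cdb[0] == 0x28 else "Write(10)"
    lba = int.from_bytes(cdb[2:6], "big")      # starting logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # transfer length in blocks
    return opcode, lba, blocks

# The two CDBs from the log above
print(decode_rw10("2a 00 24 b7 c0 c8 00 00 08 00"))  # ('Write(10)', 616022216, 8)
print(decode_rw10("2a 00 24 b7 c0 d0 00 00 08 00"))  # ('Write(10)', 616022224, 8)
```

Both LBAs match the "I/O error, dev sda, sector …" lines, and 8 blocks of 512 bytes is one 4 KiB write, so the failing commands themselves look ordinary.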

While trying to read the serial number of the disk so I could remove it, I was surprised by this report:

smartctl --all /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /0:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more ‘-T permissive’ options.

where it should be:

User Capacity: 2,000,398,934,016 bytes [2,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
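The bogus numbers can be taken apart a little further. This is only a sketch under an assumption: that smartctl computed the capacity as (last LBA + 1) × block length from a READ CAPACITY(10)-style reply, which carries two big-endian 32-bit fields. The helper name `capacity_as_ascii` is made up for illustration.

```python
import struct

def capacity_as_ascii(capacity_bytes, block_size):
    """Rebuild the assumed 8-byte READ CAPACITY(10) payload and show it as text."""
    blocks, remainder = divmod(capacity_bytes, block_size)
    if remainder:
        raise ValueError("capacity is not a whole number of blocks")
    last_lba = blocks - 1
    # Two big-endian 32-bit fields: last LBA, then block length in bytes
    raw = struct.pack(">II", last_lba, block_size)
    return raw.decode("ascii", errors="replace")

# Values printed by smartctl for the failed /dev/sda
print(capacity_as_ascii(600332565813390450, 774843950))  # ../../..
```

Under that assumption the 8 bytes spell "../../..", which looks like a fragment of a path rather than disk geometry. That would suggest smartctl was shown stale buffer contents and the drive never answered at all, which fits the DID_BAD_TARGET errors.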

The SATA port seems to be OK, because a working drive was connected to it during the replacement.

I’m now zeroing the failed disk (it’s at 5%). It has around 3000h of uptime. Should I RMA it?

Trancer · March 10, 2015, 11:24pm

Welcome to the Community.

I’d recommend a free device replacement under warranty. Additional information is available in the following link:

http://wdc.custhelp.com/app/answers/detail/a_id/8/

Tomaszg · March 11, 2015, 9:51pm

Well,

I’m not so sure that the device is failing, that’s why I’m asking. I’m running Debian 64-bit.

It looks more like a Linux kernel bug, according to:

http://ubuntuforums.org/showthread.php?t=1470970

http://ubuntuforums.org/showthread.php?p=11420126

http://serverfault.com/questions/438535/write-error-on-swap-device-result-hostbyte-did-bad-target-driverbyte-driver-ok

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=625922

and the facts that:

1. the device completes a full zeroing with the command:

dd if=/dev/zero bs=1M |pv| dd of=/dev/sde bs=1M

in around 5 hours:

copied 2000398934016 bytes (2,0 TB), 18031,9 s, 111 MB/s

2. it reports no errors during that

3. there are only 3 SMART errors (Error: ABRT), none related to reallocated sectors, all logged at power-on hours much earlier than the current value (correction: it’s 6000hrs, not 3000)

4. the device passes SMART tests

5. both servers are running the same kernel:

Linux NAME_HERE 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux
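The dd figures quoted above are self-consistent, which a quick calculation confirms. A small sketch; `dd_rate` is an illustrative helper, and MB here means 10^6 bytes, as in dd's own summary line:

```python
def dd_rate(total_bytes, elapsed_s):
    """Return (throughput in MB/s, duration in hours) for a finished dd run."""
    rate_mb_s = total_bytes / elapsed_s / 1e6
    hours = elapsed_s / 3600
    return rate_mb_s, hours

rate, hours = dd_rate(2000398934016, 18031.9)
print(f"{rate:.1f} MB/s over {hours:.2f} h")  # 110.9 MB/s over 5.01 h
```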

More details about the disk:

=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EFRX-68EUZN0
Serial Number: WD-WMCxxxxxxxxxx
LU WWN Device Id: xxxxxxxxxxxxxxxxxx
Firmware Version: 80.00A80
User Capacity: 2,000,398,934,016 bytes [2,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical

If this is caused by the Linux kernel and is not limited to one distribution, it could affect many servers at once whenever a new kernel is installed (possibly thousands of Linux machines with multiple large disks!).

Is there any diagnostic software provided by WD, just to make sure the drive is OK?

EDIT:

I see there is a large difference between 2 SMART values:

ID 193, attribute name Load_Cycle_Count = 774

while:

ID 4, attribute name Start_Stop_Count = 150

Can you explain what this software does?

http://support.wdc.com/product/download.asp?groupid=619&sid=201&lang=pl

WD Red SMART Load/Unload

Tomaszg · March 13, 2015, 8:29pm

So… this isn’t the solution, because:

sudo ./wd5741x64 -D4
WD5741 Version 1
Update Drive
Copyright (C) 2013 Western Digital Corporation

WDC WD20EFRX-68EUZN0 80.00A80 Drive update not needed

Any ideas so far?

Tomaszg · March 15, 2015, 11:24am

Summary:

this isn’t a real hardware problem
this is a Linux kernel issue
no RMA needed
the HDD has been overwritten a few times and read a few times (each pass takes around 5hrs) without issues.

Tomaszg · June 12, 2015, 4:54pm

3 months later…

… update on this:

That was a pre-failure event. Even though the drive had been OK until now, it is now failing tests:

Extended offline: read failure, remaining 70%

Errors below:

[Sat July 6 21:28:01 2015] ata5.00: exception Emask 0x0 SAct 0x7c03ffff SErr 0x0 action 0x0
[Sat July 6 21:28:01 2015] ata5.00: irq_stat 0x40000008
[Sat July 6 21:28:01 2015] ata5.00: failed command: READ FPDMA QUEUED
[Sat July 6 21:28:01 2015] ata5.00: cmd 60/80:d0:c0:84:88/00:00:3b:00:00/40 tag 26 ncq 65536 in
[Sat July 6 21:28:01 2015] res 41/40:00:c0:84:88/00:00:3b:00:00/40 Emask 0x409 (media error)
[Sat July 6 21:28:01 2015] ata5.00: status: { DRDY ERR }
[Sat July 6 21:28:01 2015] ata5.00: error: { UNC }
[Sat July 6 21:28:01 2015] ata5.00: configured for UDMA/133
[Sat July 6 21:28:01 2015] ata5: EH complete
[Sat July 6 21:28:04 2015] ata5.00: exception Emask 0x0 SAct 0x7ff00fff SErr 0x0 action 0x0
[Sat July 6 21:28:04 2015] ata5.00: irq_stat 0x40000008
[Sat July 6 21:28:04 2015] ata5.00: failed command: READ FPDMA QUEUED
[Sat July 6 21:28:04 2015] ata5.00: cmd 60/80:c0:c0:84:88/00:00:3b:00:00/40 tag 24 ncq 65536 in
[Sat July 6 21:28:04 2015] res 41/40:00:c0:84:88/00:00:3b:00:00/40 Emask 0x409 (media error)
[Sat July 6 21:28:04 2015] ata5.00: status: { DRDY ERR }
[Sat July 6 21:28:04 2015] ata5.00: error: { UNC }
[Sat July 6 21:28:04 2015] ata5.00: configured for UDMA/133
[Sat July 6 21:28:04 2015] sd 4:0:0:0: [sdc] Unhandled sense code
[Sat July 6 21:28:04 2015] sd 4:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sat July 6 21:28:04 2015] sd 4:0:0:0: [sdc] Sense Key : Medium Error [current] [descriptor]
[Sat July 6 21:28:04 2015] Descriptor sense data with sense descriptors (in hex):
[Sat July 6 21:28:04 2015] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Sat July 6 21:28:04 2015] 3b 88 84 c0
[Sat July 6 21:28:04 2015] sd 4:0:0:0: [sdc] Add. Sense: Unrecovered read error - auto reallocate failed
[Sat July 6 21:28:04 2015] sd 4:0:0:0: [sdc] CDB: Read(10): 28 00 3b 88 84 c0 00 00 80 00
[Sat July 6 21:28:04 2015] end_request: I/O error, dev sdc, sector 998802624
[Sat July 6 21:28:04 2015] ata5: EH complete
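The hex sense data in that log can be decoded by hand to confirm what the kernel printed. A minimal sketch, assuming the standard SCSI descriptor-format sense layout (response code 0x72, sense key in byte 1, ASC/ASCQ in bytes 2 and 3, followed by descriptors); `decode_descriptor_sense` is an illustrative name and only the Information descriptor is handled:

```python
def decode_descriptor_sense(sense_hex):
    """Decode descriptor-format SCSI sense data given as space-separated hex."""
    sense = bytes.fromhex(sense_hex)
    if sense[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    result = {
        "sense_key": sense[1] & 0x0F,  # 0x3 = Medium Error
        "asc": sense[2],               # additional sense code
        "ascq": sense[3],              # additional sense code qualifier
        "information": None,
    }
    offset = 8
    end = min(len(sense), 8 + sense[7])  # byte 7 = additional sense length
    while offset + 2 <= end:
        desc_type = sense[offset]
        desc_len = sense[offset + 1]
        # Information descriptor: type 0x00, length 0x0A, VALID bit in byte 2
        if desc_type == 0x00 and desc_len == 0x0A and offset + 12 <= end:
            if sense[offset + 2] & 0x80:
                result["information"] = int.from_bytes(
                    sense[offset + 4:offset + 12], "big")
        offset += 2 + desc_len
    return result

# Bytes copied from the two hex lines of the log above
print(decode_descriptor_sense(
    "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 3b 88 84 c0"))
```

The sense key is 3 (Medium Error), ASC/ASCQ 0x11/0x04 is the "Unrecovered read error - auto reallocate failed" code the kernel reports, and the information field is 998802624, the same sector as in the end_request line and in the Read(10) CDB (3b 88 84 c0).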