I’ve been running a Supermicro board with two identical WD20EFRX-68EUZN0 drives.
The OS is Linux with a 3.2.0 kernel.
Each disk carries 6 partitions; every partition except swap is one half of a separate mdadm RAID1 (mirror).
The partitions are correctly 4k-aligned, and both disks use the deadline I/O scheduler.
The mirrors had been fine for a few months.
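For context, this is roughly how I check the alignment and the active scheduler (a sketch; device names are from my setup):

parted /dev/sda unit s print   # partition start sectors divisible by 8 are 4k-aligned
parted /dev/sdc unit s print
cat /sys/block/sda/queue/scheduler   # the active scheduler is shown in brackets, e.g. [deadline]
cat /sys/block/sdc/queue/scheduler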
About a day ago the server notified me that two arrays had gone into degraded mode:
A Fail event had been detected on md device /dev/md/2.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md4 : active raid1 sdc8[0] sda8[2]
927402816 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sdc7[0] sda7[1]
664057280 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sda5[3](F) sdc5[2]
258788216 blocks super 1.2 [2/1] [_U]
md1 : active raid1 sda2[3](F) sdc2[2]
48827320 blocks super 1.2 [2/1] [_U]
md0 : active raid1 sda1[3] sdc1[2]
390132 blocks super 1.2 [2/2] [UU]
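To see which member had actually been kicked out, mdadm’s detail view is more informative than mdstat; for example (a sketch, md2 and md1 being the degraded arrays here):

mdadm --detail /dev/md2   # the dropped component is listed with state "faulty"
mdadm --detail /dev/md1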
… a few hours later the remaining arrays were taken down as well:
A FailSpare event had been detected on md device /dev/md/0.
It could be related to component device /dev/sda1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid1]
md4 : active raid1 sdc8[0] sda8[2](F)
927402816 blocks super 1.2 [2/1] [U_]
md3 : active raid1 sdc7[0] sda7[1](F)
664057280 blocks super 1.2 [2/1] [U_]
md2 : active raid1 sda5[3](F) sdc5[2]
258788216 blocks super 1.2 [2/1] [_U]
md1 : active raid1 sda2[3](F) sdc2[2]
48827320 blocks super 1.2 [2/1] [_U]
md0 : active raid1 sda1[3](F) sdc1[2]
390132 blocks super 1.2 [2/1] [_U]
… meaning /dev/sda has failed, which the kernel log confirms:
[pon mar 9 06:25:12 2015] md/raid1:md3: redirecting sector 880806144 to other mirror: sdc7
[pon mar 9 06:25:12 2015] disk 0, wo:0, o:1, dev:sdc7
[pon mar 9 06:25:12 2015] disk 0, wo:0, o:1, dev:sdc7
[pon mar 9 06:25:13 2015] md/raid1:md4: redirecting sector 450275584 to other mirror: sdc8
[pon mar 9 06:25:13 2015] disk 0, wo:0, o:1, dev:sdc8
[pon mar 9 06:25:14 2015] disk 0, wo:0, o:1, dev:sdc8
[pon mar 9 06:25:19 2015] md/raid1:md0: redirecting sector 1028 to other mirror: sdc1
… ending with something like:
[pon mar 9 18:50:20 2015] sd 0:0:0:0: [sda] Unhandled error code
[pon mar 9 18:50:20 2015] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[pon mar 9 18:50:20 2015] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 24 b7 c0 c8 00 00 08 00
[pon mar 9 18:50:20 2015] end_request: I/O error, dev sda, sector 616022216
[pon mar 9 18:50:20 2015] sd 0:0:0:0: [sda] Unhandled error code
[pon mar 9 18:50:20 2015] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[pon mar 9 18:50:20 2015] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 24 b7 c0 d0 00 00 08 00
[pon mar 9 18:50:20 2015] end_request: I/O error, dev sda, sector 616022224
[pon mar 9 18:50:20 2015] end_request: I/O error, dev sda, sector 0
[pon mar 9 18:50:20 2015] end_request: I/O error, dev sda, sector 0
[pon mar 9 18:55:31 2015] end_request: I/O error, dev sda, sector 0
[pon mar 9 18:55:31 2015] end_request: I/O error, dev sda, sector 0
[pon mar 9 19:00:31 2015] end_request: I/O error, dev sda, sector 0
… repeating in a continuous loop.
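Before physically pulling the drive, the failed members have to be removed from each array; roughly the following, assuming all sda members are already marked faulty as in the mdstat above (a sketch, partition numbers from my layout):

mdadm --manage /dev/md0 --remove /dev/sda1
mdadm --manage /dev/md1 --remove /dev/sda2
mdadm --manage /dev/md2 --remove /dev/sda5
mdadm --manage /dev/md3 --remove /dev/sda7
mdadm --manage /dev/md4 --remove /dev/sda8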
While trying to read the serial number of the disk in order to remove it, I was surprised by this report:
smartctl --all /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright © 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: /0:0:0:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more ‘-T permissive’ options.
where it should read:
User Capacity: 2,000,398,934,016 bytes [2,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
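As far as I know, smartctl can sometimes still be coaxed into talking to a misbehaving SATA drive by forcing the SAT pass-through or by adding the permissive option it suggests; a sketch of what can be tried (with no guarantee of sane output when the drive answers garbage like the above):

smartctl -a -T permissive /dev/sda
smartctl -a -d sat /dev/sda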
The SATA port itself seems to be fine, because the working drive was connected to it during the replacement.
I’m now zeroing the failed disk (currently at 5%). The drive has around 3000 hours of power-on time. Should I RMA it?
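For completeness, the zeroing and the serial lookup are along these lines (a sketch; I’m running something equivalent, and the serial can also be read from the drive label or /dev/disk/by-id):

smartctl -i /dev/sda                  # model and serial number, once the drive responds
ls -l /dev/disk/by-id/ | grep sda     # by-id names embed model and serial

dd if=/dev/zero of=/dev/sda bs=1M &   # overwrite the whole disk with zeros
kill -USR1 $(pidof dd)                # GNU dd prints progress on SIGUSR1

badblocks -wsv /dev/sda               # alternative: destructive write/read test that surfaces bad sectors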