Drive failed, Volume degraded (My Cloud Mirror 6 TB)

This is the result after running smartctl -t short /dev/sdb.

I entered smartctl -a /dev/sdb to get the results printed. Is that correct?
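
For reference, that is the usual sequence: start the test, wait out the recommended polling time, then read the results back. Roughly:

smartctl -t short /dev/sdb      # start a short self-test (runs inside the drive in the background)
smartctl -a /dev/sdb            # after ~2 minutes, print all SMART data including the test result
smartctl -l selftest /dev/sdb   # or just the self-test log, if you only want the pass/fail history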

=== START OF INFORMATION SECTION ===
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N3DXUEFJ
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 9
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 11 11:59:19 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 241) Self-test routine in progress...
                                        10% of test remaining.
Total time to complete Offline
data collection:                (40080) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2481
  3 Spin_Up_Time            0x0027   188   179   021    Pre-fail  Always       -       5583
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4800
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       6
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       3768
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4806
194 Temperature_Celsius     0x0022   098   094   000    Old_age   Always       -       52
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host                 40%       3768         -
# 2  Extended offline    Aborted by host                 90%       3764         -
# 3  Extended offline    Aborted by host                 90%       3764         -
# 4  Short offline       Aborted by host                 10%       3764         -
# 5  Extended offline    Completed without error         00%       3734         -
# 6  Extended offline    Aborted by host                 90%       3727         -
# 7  Extended offline    Aborted by host                 10%       3727         -
# 8  Short offline       Completed without error         00%       3716         -
# 9  Short offline       Completed without error         00%          0         -
#10  Short offline       Completed without error         00%          0         -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

When I run dd if=/dev/zero of=/dev/sdb, I see no LEDs blinking on the device.
In the terminal, when I press Enter, it goes to the next line and the cursor blinks, but nothing is shown.
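
As an aside, dd prints nothing at all until it finishes or hits an error, which is why the terminal just sits there with a blinking cursor. A more talkative invocation, assuming a GNU dd from coreutils 8.24 or later, would be:

# WARNING: this zeroes the entire disk -- be certain /dev/sdb is the drive you mean to wipe
dd if=/dev/zero of=/dev/sdb bs=1M status=progress   # prints bytes written as it goes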

Tony

It appears to have worked and the drive repaired the faulty sectors. If you scan back to your earlier post you will see that Current_Pending_Sector was 68 and now it is 0. The drive must have repaired them without needing to reallocate them. This is good.
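
If you want to keep an eye on those counters yourself, something like this pulls out just the sector-health attributes from the table above:

smartctl -A /dev/sdb | egrep 'Reallocated|Current_Pending|Offline_Uncorrectable'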

If you power down the system and reboot, you may get the option to rebuild the array. Go ahead and let it do this; after it completes, a final long test will verify all is well again.
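
These boxes run Linux software RAID (md) underneath the web UI, so if you have SSH access enabled you can also watch the array state and rebuild progress directly. The md device name below is a guess - check /proc/mdstat for the real one on your unit:

cat /proc/mdstat            # array state, plus a progress bar during a rebuild
mdadm --detail /dev/md1     # per-array detail (device name is an assumption)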

Pops

Ah yes?
That's good news, but why do I still have the same notifications?

Hi Pops,

I powered down the device and rebooted.
The same notifications appear:

Volume degraded
Drive failed

Do I need to do something else before running this long test?

Can I do a manual rebuild or an auto rebuild?

In disk status I have the following now:

Drive1
3 TB
47 °C
Good
S.M.A.R.T. Data

Drive2
3 TB
48 °C
Good
S.M.A.R.T. Data

So both drives appear to be good, although I still have these notifications.
Is it safe to do a manual rebuild now, without losing the data on disk A?

Tony

Yup, good to go ahead. The errors can be cleared and they will stop once the array has been brought back online. If you have the manual rebuild option, you can continue.

Pops

Ok, because if I click manual rebuild I get this message:

Warning: The Volume is degraded now. Remember to rebuild the RAID for data integrity.

Auto-Rebuild Configuration allows you to enable or disable the Auto-Rebuild feature. You can also manually rebuild by clicking Next. Please note that rebuilding will erase all data on the newly inserted drive.

This worries me a bit :slight_smile:

Funny thing is that I can now access the device from my mobile again through the app (this was not possible before).

I think it’s merely saying that your data is “unsafe” until a rebuild has taken place and the array has been brought back online. Rebuilding is an intensive process, but RAID 1 arrays are the least burdened by it. If you are concerned (naturally enough), make sure you have a separate backup just in case; a sketch follows below.

Other than that, it’s a simple matter of following the steps in the manual. A manual rebuild is the way to go at this stage.
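
If you do want that safety copy first, a one-way rsync to an external USB disk is enough. The paths below are placeholders - substitute wherever your share and backup disk are actually mounted:

rsync -a --progress /mnt/share/ /mnt/usb-backup/   # hypothetical mount points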

Pops

I'm updating now, and I'm monitoring in Finder that the data remains :slight_smile:
So far so good!!
The red LED has already disappeared.

Hi Pops,

It looks like everything is going perfectly so far. The update takes quite some time, but I can still access my data, and I see that it's updating only drive 2.
I want to thank you very much for all your help, I really appreciate it!
If you ever need a photographer and I'm in your town, I will do it for free for you :slight_smile:

Cheers

Tony

Excellent news, and really not a problem. Information on the forums is a little sparse on how to repair the drives, so hopefully your experience will be of use to others.

Just be sure not to touch the data until the replication is complete. In theory, rebuilding a RAID array does allow you to keep working in the meantime, but WD's implementation is not enterprise grade - I am unclear how the box will react when there is new data that needs to be synced across the array to keep the mirror consistent.

Should you ever be in London, feel free to look me up and we can grab a pint.

Pops

I’m not working or adding stuff while it’s rebuilding, I’ll let it finish first.
Btw, do you have any idea why this drive became faulty? Is it just random bad luck or something else?

Cool, I am in London quite often for work, so count on a message from me to grab one!

:thumbsup:

Could be caused by anything really: power cuts, knocks, or factory defects. What’s interesting (if I recall this correctly) is that the drive didn’t flag a SMART error, yet the WD firmware still read the disk as degraded, so it is possible (also very hypothetical!) that the defect was caused by the WD software (incidentally, what version are you running?). Another scenario is that the drive was lagging: the data was written to sda but the write to sdb failed, knocking the drive out of the array. Or there was a write operation at the same moment as a jolt that knocked the heads of sdb hard enough for the write to fail and the drive to flag a bad sector. In any event, it was a minor error, because the sector was still readable and was then zeroed out on the format.

Hardware RAID setups have a separate memory cache that only empties once the drive verifies the data was correctly written to it. I do not know how WD configures these boxes to ensure that minor write/read failures of this nature are mitigated - perhaps they don’t, and it takes very little for an array to fail? Hopefully someone will come along to fill in the gaps.

If you can spare the time (tonight, say), a final long test would be a good measure of things once the array is back online.
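
For the record, the long test is started the same way as the short one; going by the 255-minute polling time in the output above, expect it to take a good few hours:

smartctl -t long /dev/sdb       # start the extended (long) self-test
smartctl -l selftest /dev/sdb   # check back later; "Completed without error" is the goal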

Pops

Hi Pops,

Rebuild is finished… all looks perfect.
I'm running a full test now and will copy that here so you can have a final check if you want.

Regarding the WD software, where can I find the version number?

Excellent news. Yes, let’s see what both disks say before we close this off. The firmware version should be under Settings/Firmware in the web admin.

Pops

Aaaah… the firmware, that I can find! I thought you meant something else.
Version: 2.30.165