Drive failed, Volume degraded MCmirror 6tb

I have a My Cloud Mirror 6TB.
I get the messages ‘Volume Degraded - Drive Failed’ and ‘Network link down’ on my dashboard.
If I connect locally I still can access my files, but remotely (on an app or website) I cannot connect.
The bottom light on the device is red.
Reading posts about the same issue on the internet doesn’t really help me: some say the drive should be replaced, some say the RAID mode can be changed, but I don’t want to do anything until I’m sure I won’t lose my data.

Has anyone had the same issue, or does someone know what to do?

Thanks in advance

Tony

What RAID mode are you using (presumably R1)? I suspect two different issues here. In Settings → Network does it confirm Internet Access? What do the logs say in Notifications? Also check under Settings → Utilities → System Logs.

To check the drives, run SMART self-tests with smartctl. There is a short and a long test, and both can be started from the console or the web interface. To begin, log in to Settings → Utilities and run a Quick Test. This performs a short test, the results of which you can view there (pop-up box) or via the console (smartctl -l selftest /dev/sda for drive 1 and smartctl -l selftest /dev/sdb for drive 2). Copy/paste the results here to see what they say.

After the short test, if you do smartctl -A /dev/sda (and then the same for /dev/sdb) you can verify the drive health. Paste the results here for each drive again if you need them interpreted. You may actually need to do long tests across each of the drives to verify if there is any actual damage to the sectors to throw the RAID off. This is done via the web interface as before but selecting Full Test or via the console with smartctl -t long /dev/sda (and for sdb too). This will take several hours and you can verify the results with smartctl -l selftest /dev/sda (and again for sdb) so run both at the same time and they should ideally finish at the same time.
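For anyone who prefers the console, the test sequence described above could be sketched roughly as below. This is only a sketch: the function names start_tests and show_results and the SMARTCTL/DRIVES variables are made up for illustration, and the drive names assume the two-bay layout (sda and sdb) used in this thread. Setting SMARTCTL=echo gives a dry run that only prints the arguments.

```shell
#!/bin/sh
# Sketch only. SMARTCTL and DRIVES are illustrative, overridable variables,
# so the sequence can be dry-run (SMARTCTL=echo) before touching real disks.
SMARTCTL="${SMARTCTL:-smartctl}"
DRIVES="${DRIVES:-/dev/sda /dev/sdb}"

# Start a self-test of the given type (short or long) on every drive.
start_tests() {
    for d in $DRIVES; do
        $SMARTCTL -t "$1" "$d"
    done
}

# Print the self-test log followed by the attribute table for every drive.
show_results() {
    for d in $DRIVES; do
        $SMARTCTL -l selftest "$d"
        $SMARTCTL -A "$d"
    done
}

# Typical use on the NAS:
#   start_tests short   # wait a few minutes
#   show_results
#   start_tests long    # wait several hours
#   show_results
```

Starting the long test on both drives in one go matches the advice above to run them at the same time.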

Final note, make sure Settings → General → Energy Saver is off or the drives will go into sleep mode and the tests will fail.

Pops

Hi Pops

Thank you for your reply.

I use RAID 1 yes
Settings/network tells me I have internet access

There is no system log in utilities, so I cannot see that.

This is the smartctl test for drive 2 (the one with the fault):
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Short offline Completed without error 00% 3716 -

2 Short offline Completed without error 00% 0 -

3 Short offline Completed without error 00% 0 -

So no errors shown here.

The alarm clock icon at the top is red and gives the following:
Critical icon: Volume degraded
Critical Icon: Drive Failed
Warning icon: Network link down

In Storage - RAID
In RAID Profile:
RAID Health: Degraded
Auto Rebuild = off

RAID Volume
Volume_1 RAID 1 Degraded

Disk status:
Drive 1 3tb 51 C Good
Drive 2 3tb 53 C Bad

Running a long test now, this will take some hours.

Meanwhile I wanted to give you this extra info

Thanks

Tony

Ok, focus on sdb. When the long tests are complete, do a smartctl -a /dev/sdb and copy/paste everything here. You are looking at circa 3 to 4 hours for this long test to complete. If there is a disk error, there are steps to ‘freshen up’ the failing logical block or mark it as damaged and move on. When rebuilding the array, it will then skip over it.

One of the biggest flaws with the majority of WD’s firmware family (or is it design?) is that they don’t do these short/long tests via a schedule. It is a very easy job for their devs to incorporate a method of scheduling tests AND notify the admin via email if there is a SMART error detected.

Pops

Hi Pops

This is the result from the test:

=== START OF INFORMATION SECTION ===
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N3DXUEFJ
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 9
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Sep 10 04:41:43 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (40080) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2481
3 Spin_Up_Time 0x0027 188 179 021 Pre-fail Always - 5583
4 Start_Stop_Count 0x0032 096 096 000 Old_age Always - 4799
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 6
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 3737
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 8
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 4804
194 Temperature_Celsius 0x0022 109 094 000 Old_age Always - 41
196 Reallocated_Event_Count 0x0032 194 194 000 Old_age Always - 6
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 68
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Completed without error 00% 3734 -

2 Extended offline Aborted by host 90% 3727 -

3 Extended offline Aborted by host 10% 3727 -

4 Short offline Completed without error 00% 3716 -

5 Short offline Completed without error 00% 0 -

6 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@CloudMirror2 root #

It appears the drive is indeed failing. The SMART attributes to monitor are 5, 197 and 198. When the drive fails to read a sector, it marks it as bad and moves on. This triggers a Current_Pending_Sector event and the drive firmware will eventually attempt to remap it. If it does so successfully, it triggers a Reallocated_Event_Count event. In your case, this has happened 6 times. This is considered normal, and remapping the sectors allows the drive to continue functioning.

However, if the drive hits a bad sector but has not yet remapped it, the sector stays in the Current_Pending_Sector count. Your stats indicate that 68 sectors are still waiting to be remapped. The drive will deal with them eventually, so at this moment we don’t know whether it will mark them as reallocated or trigger an Offline_Uncorrectable event, which would be a sign that the drive is faulty. If the latter, the drive needs to be returned under RMA. Was this purchased less than a year ago?
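To keep an eye on just those attributes without reading the whole table each time, something like the following could work. It is a minimal sketch: the function name key_attrs is made up, 196 is included alongside 5, 197 and 198 because it is discussed above, and it assumes the usual smartctl -A layout where the ID is the first column and the raw value the last of ten columns.

```shell
#!/bin/sh
# Reads `smartctl -A` output on stdin and prints "ID NAME RAW_VALUE" for
# attributes 5, 196, 197 and 198. Assumes ID is the first field and the raw
# value the last; NF >= 10 skips unrelated short lines that start with 5.
key_attrs() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) {
             print $1, $2, $NF
         }'
}

# Typical use on the NAS:
#   smartctl -A /dev/sdb | key_attrs
```

Fed the attribute table posted above, this should report 6 for attribute 5, 6 for 196, 68 for 197 and 0 for 198.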

I would low-level reformat (zero-fill) the entire drive before bringing it back into the RAID (make sure you have backups). During the reformat the drive has to write every sector, which lets it mark the failing ones and map them to the spare sectors it has available. Then you can run a further long smartctl test to compare the results. I would expect the sectors counted in attribute 197 to shift to 196; some may even become usable again, so fewer may move across. Finally, if the drive returns to health, you can rebuild the mirror.

Pops

Thank you very much for your help and explanation
I think I will return it under RMA. I bought it in February this year and honestly, even if it eventually goes back to a ‘healthy’ state, I don’t trust it to work well in the future, and I will have a problem when it falls out of warranty.
I’m a photographer and all my work is on these drives; I cannot afford to lose all that.
You talk about low-level reformatting the entire drive; do you mean both drives? And how do I do a low-level format? Making a backup is also a problem: there is about 1TB of data on it now, and the reason I bought this device was to have a backup of my data, so I have no way to back up the drive at this moment. Even a cloud solution (like Amazon or Backblaze) will cost extra money and take a lot of time, I guess.
I never tried rebooting the drive; do you think that might help too?

Anyway, I thank you very much for your time in helping me!
Very much appreciated!

Tony

Tony

I can see your concern. In fairness, not all hard drives are perfect. They are designed so that these very situations are covered and mitigated. I will admit that it is odd to see failing sectors after such a short period but I have had both failing drives which were recovered like this and have gone on to be reliable and drives which have not exhibited any failures after years of continuous service. Luck of the draw.

Be careful: I am proposing low-level formatting only the single physical drive (sdb) to zero out all its sectors and start afresh with it. The RAID 1 policy means that your other drive (sda) still has the data intact. To do this, you would enter the command dd if=/dev/zero of=/dev/sdb. The drive then gets a chance to heal itself (the format takes several hours) and, if recoverable, update the SMART stats, which we can then verify. Rebuilding the mirror will be the next step, and you should be back to where you left off before the problem started.
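To get a feel for "several hours", the arithmetic can be sketched as below. The capacity is the 3,000,592,982,016 bytes reported by smartctl for this drive; the 100 MB/s write speed is purely an assumption for illustration (the real figure on the NAS is unknown and may well be lower), and wipe_eta is a made-up helper name.

```shell
#!/bin/sh
# Rough duration estimate for zero-filling a drive.
# $1 = capacity in bytes, $2 = ASSUMED sustained write speed in MB/s.
# Needs a shell with 64-bit integer arithmetic.
wipe_eta() {
    secs=$(( $1 / ($2 * 1000000) ))
    echo "$(( secs / 3600 ))h $(( secs % 3600 / 60 ))m"
}

# 3 TB drive from the smartctl output, at an assumed 100 MB/s:
wipe_eta 3000592982016 100    # prints: 8h 20m
```

Halve the assumed speed and the estimate doubles, so treat the result as an order of magnitude rather than a promise.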

As it stands, there is data on both the drives. sda has a good copy and sdb may have a similar copy but without knowledge where the damage to the sectors is, the integrity of that data is unclear at this stage. You can still plug in a USB3 drive (to a PC on a wired LAN or directly to the Cloud box) and copy the data across from the mirror to serve as a further offline backup. You can then either use this as the means to restore the data again or maintain as a fallback.

Personally, I would perhaps not reboot the drive until you can verify you have a copy of the data as backup. Once the array is flagged as degraded, it will remain like that after boot until the rebuild command is given (or if set to auto-rebuild). It’s clearly a hardware fault and not a glitch in the software.

Cheers
Pops

Thanks Pops!

I will try running dd if=/dev/zero of=/dev/sdb on the faulty drive.
When this is finished, how do I update the SMART stats and rebuild the mirror again?

Cheers

Tony

Not a problem Tony.

Formatting with dd actually forces the drive to reallocate the bad sectors when it encounters them in turn so when you next run a short selftest, the SMART stats are updated accordingly. It’s worth running a long test just after too so that we can verify whether the drive is indeed healthy across the platters. Only then will we know whether the drive spotted, repaired/marked the sectors and then mapped them. Doing a smartctl -a /dev/sdb will reveal the results. Feel free to paste them in full here (also, use the preformatted text button in the html markup bar in your reply to preserve the text formatting).
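One way to compare the results before and after the wipe is to save each smartctl -A output to a file and list only the attributes whose raw value changed. The following is a rough sketch, not a tested tool for this NAS: smart_diff, before.txt and after.txt are hypothetical names, and it assumes the ten-column attribute table layout seen in the output posted earlier.

```shell
#!/bin/sh
# Prints "ID NAME old -> new" for every attribute whose raw value differs
# between two saved `smartctl -A` outputs ($1 = before, $2 = after).
# Assumes ID is the first field and the raw value the last of 10+ fields.
smart_diff() {
    awk 'NF >= 10 && $1 ~ /^[0-9]+$/ {
             if (FNR == NR) before[$1] = $NF          # first file: remember
             else if (($1 in before) && before[$1] != $NF)
                 print $1, $2, before[$1], "->", $NF  # second file: compare
         }' "$1" "$2"
}

# Typical use on the NAS (file names are only examples):
#   smartctl -A /dev/sdb > before.txt    # before the wipe
#   smartctl -A /dev/sdb > after.txt     # after the wipe and a new test
#   smart_diff before.txt after.txt
```

A drop in Current_Pending_Sector with no matching rise in attributes 5 or 196 would suggest the sectors were rewritten in place rather than remapped.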

If this works out, the My Cloud box will detect a new drive that is available for use and allow you to manually rebuild the mirror. Pages 132 onwards of the manual explain this a little further.

Best of luck.
Pops

Thanks again Pops!

Just a quick question
I entered the command in the terminal as shown below. Is it correct that nothing is printed while this task is being performed? I only see the cursor blinking on the next line.
Is there no way to see the progress on this?

root@CloudMirror2 root # dd if=/dev/zero of=/dev/sdb

Thanks

Tony

Only when it completes, I’m afraid. I believe there is a way, and I just googled it to remind myself. If you open another console session and enter kill -USR1 $(pgrep ^dd), it will dump the progress to the original ssh session. However, I am not confident how the BusyBox version on the NAS will respond to this command.

Frankly, since this is a wipe command, I suppose little harm would be done even if dd is terminated!

Pops

UPDATE: I tested this command on my MyBookLive and it worked as intended.

Hi Pops

I get this message when I enter the command:
root@CloudMirror2 root # kill -USR1 $(pgrep ^dd)
-sh: pgrep: not found
sh: you need to specify whom to kill

That’s ok, the command was not found so no harm came about. The NAS uses a different underlying version of Unix from the other WD products and they didn’t include this command. Better to leave dd to do its thing. It could take about as long as a long smartctl test, IMHO.
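If curiosity wins anyway, the PID can be found without pgrep by filtering ps output. The sketch below is untested on the My Cloud Mirror: dd_pids is a made-up helper, it assumes ps prints the PID in the first column, and whether this particular dd build reports progress on USR1 is unknown. A dd that does not handle the signal will simply be terminated by it.

```shell
#!/bin/sh
# Reads `ps` output on stdin and prints the PID (first column) of every
# process that has a word exactly equal to "dd" on its line.
# Assumes PID is column 1; a dd started via a full path will not match.
dd_pids() {
    awk '{ for (i = 2; i <= NF; i++) if ($i == "dd") { print $1; break } }'
}

# Possible use from a second SSH session:
#   ps | dd_pids              # check that exactly one PID comes back
#   kill -USR1 <that PID>     # only if this dd build handles USR1
```

Checking the PID by eye before sending the signal avoids hitting the wrong process.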

Pops

Perfect, thanks.
I’ll get back to you once everything has been done and I have the test results again

Cheers

Tony

Hi Pops,

Is it correct that during this process you cannot connect to the device anymore?

Tony

No reason why it should stop. Is there disk activity via the front LEDs? Is it responding to pings?
Pops

Hi Pops,

No idea either, I just couldn’t access it anymore. Not in Finder, and the Dashboard didn’t work either.
Now I think it has finished, because I can access both Finder and the Dashboard again, but the status is still the same; nothing changed.
Also, the Terminal cursor is still blinking in the same position, with no output showing that it did or finished anything. Is it correct that there is no indication when the low-level formatting has finished?

Anyway, I still have the same Notifications as before, I did a quick Self test on Dashboard and in Terminal
Dashboard gives me this info:
Disk1 Passed - Quick disk test completed successfully.
Disk2 Failed - Quick disk test failed. Please backup your data and replace this disk. If you need further assistance, contact WD Support.

Terminal gives me this info:

root@CloudMirror2 root # smartctl -l selftest /dev/sdb

smartctl version 5.38 [arm-marvell-linux-gnueabi] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Completed without error 00% 3734 -

2 Extended offline Aborted by host 90% 3727 -

3 Extended offline Aborted by host 10% 3727 -

4 Short offline Completed without error 00% 3716 -

5 Short offline Completed without error 00% 0 -

6 Short offline Completed without error 00% 0 -

I will run a long test again and paste the results here again.
But I’m not sure this low level formatting actually worked…

Thanks

Tony

It could have been interrupted and crashed (energy saver?). Is there any reference to dd running if you type dmesg in the terminal? Look towards the bottom.

That SMART log doesn’t show the short test that was just run (no change from the earlier posted results). Can you do a smartctl -a /dev/sdb and paste the entire log? If there is no reference to a short offline test then run smartctl -t short /dev/sdb and paste the results after a few minutes.
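To check quickly whether the newest self-test actually completed, the top entry of the log can be pulled out as sketched below. This is an illustration only: latest_selftest is a made-up name, and it assumes the newest entry is numbered 1 (with or without a leading #) and that the test description contains the word "offline", as in the logs posted in this thread.

```shell
#!/bin/sh
# Reads `smartctl -l selftest` (or -a) output on stdin, prints the newest
# self-test entry, and returns 0 only if it says "Completed without error".
# Assumes entry 1 is the newest and its description contains "offline".
latest_selftest() {
    entry=$(awk '{ sub(/^# */, "") }
                 $1 == 1 && /offline/ { $1 = ""; sub(/^ +/, ""); print; exit }')
    echo "$entry"
    case "$entry" in
        *"Completed without error"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use on the NAS:
#   smartctl -l selftest /dev/sdb | latest_selftest || echo "re-run the test"
```

An "Aborted by host" entry at the top means the test needs to be started again, ideally with Energy Saver off.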

The WD firmware would still detect a failed drive until the error is cleared by inserting a new one (and this requires powering it down). This may be the next step in fact since it may allow us to access sdb freely. You may be able to repeat the disk wipe again with better results (check if drive 2’s LED lights are blinking as a cursory observation).

Pops

Hi Pops,

If I run dmesg I get a very long list; I’m not sure where to look or what it means.

This is the result of smartctl -a /dev/sdb

=== START OF INFORMATION SECTION ===
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N3DXUEFJ
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 9
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 11 11:53:00 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 20) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: (40080) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2481
3 Spin_Up_Time 0x0027 188 179 021 Pre-fail Always - 5583
4 Start_Stop_Count 0x0032 096 096 000 Old_age Always - 4800
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 6
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 3768
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 8
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 4806
194 Temperature_Celsius 0x0022 098 094 000 Old_age Always - 52
196 Reallocated_Event_Count 0x0032 194 194 000 Old_age Always - 6
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Aborted by host 40% 3768 -

2 Extended offline Aborted by host 90% 3764 -

3 Extended offline Aborted by host 90% 3764 -

4 Short offline Aborted by host 10% 3764 -

5 Extended offline Completed without error 00% 3734 -

6 Extended offline Aborted by host 90% 3727 -

7 Extended offline Aborted by host 10% 3727 -

8 Short offline Completed without error 00% 3716 -

9 Short offline Completed without error 00% 0 -

#10 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.