I have a My Cloud Mirror 6 TB.
I get these messages on my dashboard: ‘Volume Degraded - Drive Failed’ and ‘Network link down’.
If I connect locally I can still access my files, but remotely (via the app or website) I cannot connect.
The bottom light on the device is red.
Reading posts about the same issue online doesn’t really help me: some say the drive should be replaced, some say the RAID mode can be changed, but I don’t want to do anything until I’m sure I won’t lose my data.
Has anyone had the same issue, or does someone know what to do?
What RAID mode are you using (presumably R1)? I suspect two different issues here. In Settings → Network does it confirm Internet Access? What do the logs say in Notifications? Also check under Settings → Utilities → System Logs.
To check the drives, run SMART self-tests with smartctl. There are short and long tests, and they can be run from either the console or the web interface, so log in to Settings → Utilities and run a Quick Test. This performs a short test, the results of which you can view there (pop-up box) or via the console (smartctl -l selftest /dev/sda for drive 1 and smartctl -l selftest /dev/sdb for drive 2). Copy/paste the results here to see what they say.
After the short test, smartctl -A /dev/sda (and then the same for /dev/sdb) lets you verify the drive health. Paste the results here for each drive if you need them interpreted. You may actually need to run long tests on each drive to verify whether there is any actual damage to the sectors that would throw the RAID off. This is done via the web interface as before but selecting Full Test, or via the console with smartctl -t long /dev/sda (and for sdb too). This will take several hours, and you can verify the results with smartctl -l selftest /dev/sda (and again for sdb). Run both at the same time and they should ideally finish at about the same time.
Final note, make sure Settings → General → Energy Saver is off or the drives will go into sleep mode and the tests will fail.
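As a rough sketch of the console check described above, the snippet below filters the output of smartctl -l selftest so that only entries which did not finish cleanly are shown. The function name failed_selftests is made up for illustration, and the pattern assumes the log lines look like the ones smartctl normally prints (an optional "#", the entry number, then the test type).

```sh
# Hypothetical helper: reads "smartctl -l selftest" output on stdin and
# prints every self-test entry whose status is anything other than
# "Completed without error" (aborted, read failure, still in progress...).
failed_selftests() {
    awk '
        /^#? *[0-9]+ +(Short|Extended|Conveyance|Selective)/ {
            if ($0 !~ /Completed without error/) print
        }
    '
}
```

It would be used as smartctl -l selftest /dev/sdb | failed_selftests. No output means every logged test completed without error, which on its own does not prove the drive is healthy.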
I use RAID 1 yes
Settings/network tells me I have internet access
There is no system log in utilities, so I cannot see that.
This is the smartctl test for drive 2 (the one with the fault):
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
1    Short offline     Completed without error  00%        3716             -
2    Short offline     Completed without error  00%        0                -
3    Short offline     Completed without error  00%        0                -
So no errors shown here.
The alarm clock at the top is RED and gives the following:
Critical icon: Volume degraded
Critical Icon: Drive Failed
Warning icon: Network link down
In Storage - RAID
In RAID Profile:
RAID Health: Degraded
Auto Rebuild = off
RAID Volume
Volume_1 RAID 1 Degraded
Disk status:
Drive 1 3tb 51 C Good
Drive 2 3tb 53 C Bad
Running a long test now, this will take some hours.
Ok, focus on sdb. When the long tests are complete, do a smartctl -a /dev/sdb and copy/paste everything here. You are looking at circa 3 to 4 hours for this long test to complete. If there is a disk error, there are steps to ‘freshen up’ the failing logical block or mark it as damaged and move on. When rebuilding the array, it will then skip over it.
One of the biggest flaws with the majority of WD’s firmware family (or is it design?) is that they don’t do these short/long tests via a schedule. It is a very easy job for their devs to incorporate a method of scheduling tests AND notify the admin via email if there is a SMART error detected.
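A minimal sketch of what such a scheduled check could look like from the console is shown below. The function names health_passed and check_drives are invented for illustration; only smartctl -H is a real command here. Whether the NAS firmware keeps a cron entry across reboots, and how an email would be sent from it, are open assumptions, so the sketch only prints a warning.

```sh
# Hypothetical helper: succeeds if the "smartctl -H" output on stdin
# reports a PASSED overall-health result.
health_passed() {
    grep -q 'self-assessment test result: PASSED'
}

# Check both drives of the mirror and warn about any that do not pass.
# Returns non-zero if at least one drive did not pass.
check_drives() {
    status=0
    for dev in /dev/sda /dev/sdb; do
        if ! smartctl -H "$dev" | health_passed; then
            echo "SMART health check did not pass on $dev" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling check_drives from a scheduled job would cover the overall verdict only. In this thread drive 2 reports PASSED while the dashboard marks it Bad, so the individual attributes and self-test logs still need to be looked at.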
=== START OF INFORMATION SECTION ===
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N3DXUEFJ
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 9
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Sep 10 04:41:43 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (40080) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
It appears the drive is indeed failing. The SMART attributes to monitor are 5, 197 and 198. When the drive fails to read a sector, it marks it as bad and moves on. This increments Current_Pending_Sector, and the drive firmware will eventually attempt to remap the sector. If it does so successfully, it increments Reallocated_Event_Count. In your case, this happened 6 times. This is considered normal, and remapping the sectors allows the drive to continue functioning.
However, if the drive has flagged sectors but not yet remapped them, they stay in the Current_Pending_Sector count. Your stats indicate that the drive has not remapped 68 sectors yet. It will deal with them eventually, so we don’t know at this moment whether the drive will mark them as reallocated or log them as Offline_Uncorrectable (a sign that the drive is faulty). If the latter, that is a sure sign that the drive needs to be returned under RMA. Did you buy it less than a year ago?
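To pull just these counters out of the full attribute table, something along the following lines may help. It is a sketch that assumes the usual ten-column layout of smartctl -A output (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value); the function name sector_counters is made up for illustration.

```sh
# Hypothetical helper: reads "smartctl -A" output on stdin and prints
# NAME=RAW_VALUE for the sector-related attributes 5, 196, 197 and 198.
# Assumes the raw value is the tenth whitespace-separated column.
sector_counters() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) {
        print $2 "=" $10
    }'
}
```

Usage would be smartctl -A /dev/sdb | sector_counters, run before and after any repair attempt so the two sets of numbers can be compared.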
I would low-level reformat the entire drive before bringing it back into the RAID (make sure you have backups). During the reformat every sector is rewritten, so the drive gets the chance to mark the failing ones and map them to the spare sectors it has available. Then you can run a further long smartctl test to compare the results. I would expect the sectors counted under 197 to shift to 196; some may even become readable again, so fewer may move across. Finally, if the drive returns to health, you can rebuild the mirror.
Thank you very much for your help and explanation
I think I will return it under RMA. I bought it in February this year and, honestly, even if it eventually returns to a ‘healthy’ state, I don’t trust it to work well in the future, and I will have a problem when it falls out of warranty.
I’m a photographer and all my work is on these drives; I cannot afford to lose all that.
You talk about low-level reformatting the entire drive; do you mean both drives? And how do I do a low-level format? Making a backup is also a problem: there is about 1 TB of data on it now, and the reason I bought this device was to have a backup of my data, so I have no way to back up the drive at the moment. Even a cloud solution (like Amazon or Backblaze) will cost extra money and take a lot of time, I guess.
I never tried rebooting the drive; do you think that might help too?
Anyway, I thank you very much for your time in helping me!
Very much appreciated!
I can see your concern. In fairness, not all hard drives are perfect. They are designed so that these very situations are covered and mitigated. I will admit that it is odd to see failing sectors after such a short period but I have had both failing drives which were recovered like this and have gone on to be reliable and drives which have not exhibited any failures after years of continuous service. Luck of the draw.
Be careful: I am proposing a low-level format of the single physical drive (sdb) only, to zero out all the sectors and start afresh with it. The RAID 1 policy means that your other drive (sda) still has the data intact. To do this, you would enter the command dd if=/dev/zero of=/dev/sdb. The drive will then get a chance to heal itself (the format takes several hours) and, if recoverable, update the SMART stats, which we can then verify. Rebuilding the mirror will be the next step, and you should be back to where you left off before the problem started.
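For a rough idea of how long the wipe might run, the arithmetic can be sketched as below. The capacity is the figure smartctl reports for this drive (3,000,592,982,016 bytes); the 100 MB/s sustained write speed is purely an assumption for illustration, not a measurement from this NAS, and the function name wipe_hours is made up.

```sh
# Hypothetical helper: estimate the hours needed to overwrite a drive.
#   $1 = capacity in bytes
#   $2 = assumed sustained write speed in MB/s (1 MB = 1,000,000 bytes)
wipe_hours() {
    LC_ALL=C awk -v bytes="$1" -v mbps="$2" \
        'BEGIN { printf "%.1f\n", bytes / (mbps * 1000000) / 3600 }'
}

wipe_hours 3000592982016 100   # prints 8.3
```

The real figure depends on the block size dd uses and on how the drive behaves when it hits bad sectors, so it could be considerably longer.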
As it stands, there is data on both the drives. sda has a good copy and sdb may have a similar copy but without knowledge where the damage to the sectors is, the integrity of that data is unclear at this stage. You can still plug in a USB3 drive (to a PC on a wired LAN or directly to the Cloud box) and copy the data across from the mirror to serve as a further offline backup. You can then either use this as the means to restore the data again or maintain as a fallback.
Personally, I would perhaps not reboot the drive until you can verify you have a copy of the data as backup. Once the array is flagged as degraded, it will remain like that after boot until the rebuild command is given (or if set to auto-rebuild). It’s clearly a hardware fault and not a glitch in the software.
I will try to do the dd if=/dev/zero of=/dev/sdb on the faulty drive.
When this is finished, how do I update the SMART stats and rebuild the mirror again?
Formatting with dd actually forces the drive to reallocate the bad sectors when it encounters them in turn so when you next run a short selftest, the SMART stats are updated accordingly. It’s worth running a long test just after too so that we can verify whether the drive is indeed healthy across the platters. Only then will we know whether the drive spotted, repaired/marked the sectors and then mapped them. Doing a smartctl -a /dev/sdb will reveal the results. Feel free to paste them in full here (also, use the preformatted text button in the html markup bar in your reply to preserve the text formatting).
If this works out, the My Cloud box will detect a new drive that is available for use and allow you to manually rebuild the mirror. Pages 132 onwards of the manual explain this a little further.
Just a quick question
I entered the command in terminal like below, is it correct that nothing is printed while this task is being performed? I only see the cursor blinking on the next line.
Is there no way to see the progress on this?
Only when it completes, I’m afraid. I believe there is a way, and I just googled it to remind me. If you open another console session and enter kill -USR1 $(pgrep ^dd), it will dump the progress to the original ssh session. However, I am not confident how the BusyBox version on the NAS will respond to this command.
Frankly, since this is a wipe command, I suppose little harm could come of it if dd is terminated!
Pops
UPDATE: I tested this command on my MyBookLive and it worked as intended.
I get this message when I enter the command:
root@CloudMirror2 root # kill -USR1 $(pgrep ^dd)
-sh: pgrep: not found
sh: you need to specify whom to kill
That’s OK, the command was not found so no harm came about. The NAS uses a different underlying version of Unix from the other WD products and they didn’t include this command. Better to leave dd to do its thing. It could take about as long as a long smartctl test, IMHO.
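Where pgrep is missing, the process ID can usually still be found by scanning /proc. The following is a sketch under the assumption that the NAS kernel exposes /proc/PID/comm; the function name pids_by_name is invented for illustration, and the optional second argument exists only so the lookup can be tried against a different directory.

```sh
# Hypothetical pgrep substitute: print the PIDs of all processes whose
# command name matches $1 exactly. $2 optionally overrides the /proc
# location (useful for trying the function out on a fake directory).
pids_by_name() {
    name=$1
    proc_root=${2:-/proc}
    for comm_file in "$proc_root"/[0-9]*/comm; do
        [ -r "$comm_file" ] || continue
        if [ "$(cat "$comm_file" 2>/dev/null)" = "$name" ]; then
            pid_dir=${comm_file%/comm}
            echo "${pid_dir##*/}"
        fi
    done
}
```

The progress request would then be kill -USR1 $(pids_by_name dd). One caution: a dd build that does not handle USR1 is terminated by that signal, since termination is the default action, so this is only worth trying if an interrupted wipe is acceptable.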
No idea either, I just couldn’t access it anymore, neither in Finder nor through the Dashboard.
Now I think it has finished, because I can access both Finder and the Dashboard again, but the status is still the same; nothing changed.
Also, the Terminal cursor is still blinking in the same position, with no output to show that it did or finished anything. Is it correct that there is no indication when this low-level format finishes?
Anyway, I still have the same Notifications as before, I did a quick Self test on Dashboard and in Terminal
Dashboard gives me this info:
Disk1 Passed - Quick disk test completed successfully.
Disk2 Failed - Quick disk test failed. Please backup your data and replace this disk. If you need further assistance, contact WD Support.
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
1    Extended offline  Completed without error  00%        3734             -
2    Extended offline  Aborted by host          90%        3727             -
3    Extended offline  Aborted by host          10%        3727             -
4    Short offline     Completed without error  00%        3716             -
5    Short offline     Completed without error  00%        0                -
6    Short offline     Completed without error  00%        0                -
I will run a long test again and paste the results here again.
But I’m not sure this low level formatting actually worked…
It could have been interrupted and crashed (energy saver?). Is there any reference to dd running if you type dmesg in the terminal? Look towards the bottom.
That SMART log doesn’t show the short test that was just run (no change from the earlier posted results). Can you do a smartctl -a /dev/sdb and paste the entire log? If there is no reference to a short offline test then run smartctl -t short /dev/sdb and paste the results after a few minutes.
The WD firmware would still detect a failed drive until the error is cleared by inserting a new one (and this requires powering it down). This may be the next step in fact since it may allow us to access sdb freely. You may be able to repeat the disk wipe again with better results (check if drive 2’s LED lights are blinking as a cursory observation).
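Because dmesg output tends to be long, a filter along these lines can narrow it to the lines that are likely to matter. This is only a sketch: the function name disk_messages is made up, and the search terms (the device name, "I/O error", and the ataN prefix the kernel uses for SATA ports) are a guess at what is relevant, not a complete list.

```sh
# Hypothetical helper: reads kernel log text on stdin and prints the
# last 20 lines that mention the given device (default sdb), an I/O
# error, or a SATA port such as ata2.
disk_messages() {
    dev=${1:-sdb}
    grep -i -E "(${dev}|I/O error|ata[0-9]+)" | tail -n 20
}
```

It would be run as dmesg | disk_messages sdb. An empty result only means the kernel logged nothing matching these terms, not that the wipe succeeded.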
If I run dmesg I get a very long list; I’m not sure where to look or what it means.
This is the result of smartctl -a /dev/sdb
=== START OF INFORMATION SECTION ===
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N3DXUEFJ
Firmware Version: 82.00A82
User Capacity: 3,000,592,982,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 9
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 11 11:53:00 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 20) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: (40080) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.