My Cloud drops from network 2-3 minutes after boot since 5.27.157 update

Agreed. Thus far, no joy from my attempts to update the firmware. I have examined the HTML source for the page in an attempt to understand where the device thinks the update server is or should be… no luck there.
I have strenuously requested to speak with a domestic WD expert on firmware updates and heard nothing but 'crickets.'

Can reference tables or metadata (e.g., the firmware update server URL) be ascertained via SSH commands?

Is there a debug mode whereby I can step through the execution logic used to manually retrieve and install the FW?
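
For what it's worth, one thing I plan to try over SSH is a recursive grep for a likely update URL; the paths and the WD download host below are guesses on my part, not confirmed EX4100 details:

    # search likely configuration areas for a firmware update/download URL (paths are assumptions)
    grep -ri "upgrade" /etc /usr/local/config 2>/dev/null | grep -i "http"
    # WD has historically hosted firmware on download.wdc.com, but treat that as an assumption too
    grep -ril "download.wdc.com" /etc /usr/local/config /usr/local/modules 2>/dev/null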

Early in my IT career, as an IBM MVS systems programmer, I spent many hours reading 'core dumps' and setting traps using S/370 instruction-step mode to track down memory corruption caused by errant routines operating in privileged state (i.e., root level).

Thank you for helping find potential diagnostics needed to identify and confirm the root cause of this problem.

There are several “extreme” methods of manually forcing a firmware update, but we’re not at the point of needing them just yet. The EX4100 can be bricked, so caution is warranted.

It’s a longshot, but what are the results of the following?

  • df -h
  • df -i

Hello, and thank you for the additional diagnostic guidance. I was able to get the reports and have reformatted them (see the post from this evening).

2023-11-01T04:00:00Z

df -i:
Filesystem Inodes Used Available Use% Mounted on
/dev/root 14,336 2,435 11,901 17% /
devtmpfs 24,168 755 23,413 3% /dev
mdev 24,168 755 23,413 3% /dev
ubi0:config - - - 0% /usr/local/config
/dev/loop0 8,629 8,629 - 100% /usr/local/modules
tmpfs - - - 0% /mnt
tmpfs - - - 0% /var/log
tmpfs 20,000 79 19,921 0% /tmp
/dev/md0p1 35,200 10 35,190 0% /usr/local/upload
/dev/sdc4 258,048 20 258,028 0% /mnt/HD_c4
/dev/sdd4 258,048 19 258,029 0% /mnt/HD_d4
/dev/sdb4 258,048 19 258,029 0% /mnt/HD_b4
/dev/sda4 258,048 125 257,923 0% /mnt/HD_a4
/dev/md1 457,605,120 306,116 457,299,004 0% /mnt/HD/HD_a2

df -h:
Filesystem Size Used Available Use% Mounted on
/dev/root 54.2M 19.9M 31.5M 39% /
devtmpfs 1017.6M 32.0K 1017.6M 0% /dev
mdev 1017.6M 32.0K 1017.6M 0% /dev
ubi0:config 12.1M 112.0K 11.3M 1% /usr/local/config
/dev/loop0 163.6M 163.6M 0 100% /usr/local/modules
tmpfs 1.0M 0 1.0M 0% /mnt
tmpfs 40.0M 8.0M 32.0M 20% /var/log
tmpfs 100.0M 8.8M 91.2M 9% /tmp
/dev/md0p1 525.3M 4.0K 514.3M 0% /usr/local/upload
/dev/sdc4 928.9M 56.0K 912.9M 0% /mnt/HD_c4
/dev/sdd4 928.9M 52.0K 912.9M 0% /mnt/HD_d4
/dev/sdb4 928.9M 52.0K 912.9M 0% /mnt/HD_b4
/dev/sda4 928.9M 109.4M 803.5M 12% /mnt/HD_a4
/dev/md1 27.2T 11.1T 15.8T 41% /mnt/HD/HD_a2

This version is easier to inspect.
Is there a critical issue due to ‘/dev/loop0’ being exhausted?

Once again, no problems are evident.

However, I have a feeling that the drives may somehow be connected to the true source of the problem. Do you have a spare hard drive, one that can be erased and used for a quick test?

If so, power off the NAS and remove all drives, making sure to label each drive with its corresponding bay number (1-4). Afterwards, erase the spare "test" drive using a computer and insert it into the first drive bay before powering on the NAS.

The idea is to see if everything returns to normal after the spare “test” drive has been initialized, and if a firmware update suddenly becomes possible.

No, loop devices only use the exact amount of space required. Hence, they show 100% utilization.
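
For example, /usr/local/modules is typically a read-only firmware image mounted through /dev/loop0, so it can never have free space. A quick way to confirm this (a sketch; the image type and tool availability are assumptions about the EX4100, not verified details):

    # show how /dev/loop0 is mounted; expect a read-only (ro) filesystem such as squashfs
    mount | grep loop0
    # show which image file backs each loop device, if losetup is available
    losetup -a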

2023-11-01T04:00:00Z
Yes, I have a spare, albeit smaller-capacity, SATA drive. I conclude capacity is irrelevant to the prescribed test case and will proceed in the morning (it is late at night now on the East Coast). Continued thanks and appreciation for the diagnostic guidance and analysis.

2023-11-03T04:00:00Z
I was unable to reformat my spare HD on my Mac, so I decided to order the cheapest SATA drive available. [Clarification on the Mac: apparently it sees and mounts NTFS-formatted drives read-only, and I no longer have a working WinTel box, so I decided to try an alternate approach.] Meanwhile, I ran the Dashboard Scan Disk utility and, as expected, it failed. However, on a lark I tried the manual firmware update and, to my surprise, it worked. I re-ran Scan Disk and it failed again.

Undeterred, I tried to copy another 42 GB file off the device, and after ~15 minutes the CPU spiked at 100%. I tried to display the offending process(es), but the display would not cooperate, and after about 15 more minutes I'm disappointed to report the device went offline, leaving ~160 files of 40+ GB each that I still need to copy.

Consequently, I'm left to wonder what will happen when I insert the one new hard drive; what difference might it make?

Thank you for any additional diagnostic recommendations.

P.S. WD Support's reply to the SSH report details (i.e., df -i and df -h) is in the block quote below.

A question or two come to mind. First, why are they consistently mute regarding the device locking up and going offline, a critical issue IMO? And second, why do they assert the files are corrupted, given that the smartctl diagnostics show NO ERRORS?

Hello TIM,
Thank you for your reply.
I got the case reviewed by the engineering team and this is what could be done:
You could try running the command "dmesg" over PuTTY, and you will be able to see the errors.
2023-10-28T16:14:51.134628-04:00 di=b2cj7JZ1ei warning kernel: [ 124.578329] EXT4-fs (md1): warning: mounting fs with errors, running e2fsck is recommended
2023-10-28T16:20:09.778070-04:00 di=b2cj7JZ1ei err kernel: [ 443.235401] inconsistent data on disk
2023-10-28T16:20:09.778081-04:00 di=b2cj7JZ1ei err kernel: [ 443.239124] EXT4-fs: ext4_free_blocks:4838: aborting transaction: IO failure in __ext4_forget
Also, you could try these steps:
Run a file system check; below is the link with the steps:
My Cloud: Scan Disk File System Check and Repair
If there is no error, try performing a system-only restore from the Dashboard:
My Cloud: System, Quick and Full Restore a EX4100
If the issue still persists, or if there are any errors reported in the file system check, then take a complete backup of your data and perform a full factory restore:
My Cloud: File System Check Failed or Has Detected Errors
If you have any further questions, please reply to this email and we will be happy to assist you further.
Sincerely,
George D
Western Digital Customer Service and Support


Bad idea! The drives should be taken OFFLINE until more is known about the problem.

Because while a drive may be OK per the S.M.A.R.T. results, the filesystem may NOT be OK. The dmesg log entries you posted clearly show a problem with the EXT4 filesystem; it's as plain as day.

  • [ 124.578329] EXT4-fs (md1): warning: mounting fs with errors, running e2fsck is recommended
  • [ 443.235401] inconsistent data on disk
  • [ 443.239124] EXT4-fs: ext4_free_blocks:4838: aborting transaction: IO failure in __ext4_forget

In short, the filesystem shown below is probably corrupted, and is the most likely cause of the problems you’ve been experiencing.

  • /dev/md1 /mnt/HD/HD_a2

WD Support also gave you links with instructions to begin addressing the problem, so I STRONGLY suggest that you follow them.
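
In broad strokes, those instructions come down to checking the EXT4 filesystem on the RAID volume itself, not on the individual member disks. As a rough illustration of what that looks like over SSH (the device path and mount point are taken from your df output; stopping services and unmounting cleanly first is an assumption on my part, and a half-repaired volume is a real risk, so treat this as a sketch rather than a recipe):

    # the data volume must not be mounted while it is being repaired
    umount /mnt/HD/HD_a2
    # forced check of the EXT4 filesystem on the RAID device, with a progress indicator
    e2fsck -f -C 0 /dev/md1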

Hello and thanks,

My back is against the wall. I get that it's not a 'good idea' to copy the data; however, I need to copy my data to a more stable and higher-capacity PR4100 NAS.

The sales pablum had me believing RAID 5 would enable me to recover any and all data across a drive failure. I inferred that file system issues could also be recoverable through a similar mechanism. Now it seems the only way to recover is a full backup of the entire 40 TB EX4100 device. This is not a viable option, because the device constantly locks up and goes offline.
Is there any way to put EX4100 drives into PR4100 solely to copy off the files I need to recover?

The Dashboard 'Scan Disk' appears to repair nothing. No matter how many times I run it, the results never vary: it doesn't report what it finds, and it doesn't repair anything. What am I missing?

Consequently, I examined and tried to run 'e2fsck -p -f -C 0 /dev/sda', and likewise for devices /sdb, /sdc, and /sdd.

Should e2fsck also be run against /dev/md1?

Perhaps I don't understand the proper syntax, or I have not properly conditioned the environment (i.e., ensured the device is not checking drive health). I sign into SSH as soon as possible after boot, due to the lock-up and timeout issue, because I don't know how long the scan/repair operation will take.
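
One thing I can verify before going further is which block device actually carries the EXT4 filesystem; my understanding (an assumption on my part) is that it lives on the RAID array /dev/md1 rather than on the raw member disks, which would make /dev/sda the wrong target for e2fsck. A couple of quick checks, assuming the usual BusyBox tools are present:

    # list the RAID arrays and the member partitions behind them
    cat /proc/mdstat
    # report the filesystem type on the array, if blkid is available
    blkid /dev/md1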

Thanks for help and further guidance.

Never believe marketing. RAID exists to provide high availability, at the expense of increased complexity and reduced reliability.

No. The EX4100 isn't the problem; the drives are. Or rather, the EXT4 filesystem on the drives is the problem.

Actually, I think you don’t understand the true nature of the problem, or the potential consequences.

Well, I can tell you that with roughly 12 TB of data on the drives, the process will take days, possibly even weeks. This is what people are signing up for when they use RAID. It's not a matter of if, but when.

Follow the instructions; I've done all I can do and must focus on other things.

Thank you for further clarifications.

I used SSH to initiate e2fsck -p -f -C 0 /dev/md1

As mentioned previously, the Dashboard Scan Disk utility reports failure and does nothing to repair the issues found during the scan phase.

Consequently, I manually entered e2fsck as outlined above. It ran for more than 20 minutes, with messages describing a variety of actions taken to fix discrepancies, and I put it in automated mode.
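
For clarity, my understanding of the relevant e2fsck modes, taken from the man page (a sketch of the options rather than gospel):

    e2fsck -p -f -C 0 /dev/md1    # "preen": automatically fix only safe problems, stop on anything serious
    e2fsck -y -f -C 0 /dev/md1    # answer "yes" to every repair prompt (fully automated repair)
    e2fsck -n -f -C 0 /dev/md1    # read-only check: report problems but change nothing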

After a short time it began a long series of rapid messages beginning 'Inode # block #', each appended with an action taken or with a note that the inode/block 'conflicts with critical metadata, skipping block checks.'

Unexpectedly, it stopped streaming the messages and became almost quiescent, with only sporadic updates (i.e., 'conflicts with critical metadata' messages spaced across long intervals, with barely perceptible disk activity).

Given that the messages are not time/date stamped and there's no other obvious diagnostic available, how can I confirm that repairs remain underway, or that it's time to pull the plug and start over?

Thank you for assisting to broaden my understanding and knowledge.

So… sorry to stick my nose in… I don't have much to add technically, however:

Heed the warning: you are diagnosing a corrupt file system. To me, "pulling the plug and starting over" means "reformat and restore data from backup." Having more money than sense, I would use fresh disks, so you could attempt to salvage the data off the old disks offline if you need to.

Having said that, I am not saying hope is lost.
I am saying you are rolling the dice.
I have had Windows OS corruption happen with no data loss. Maybe you will get lucky.

Is the diagnostic still using CPU cycles?
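
For instance, from a second SSH session you can usually tell whether e2fsck is still alive and doing disk I/O (a sketch using common BusyBox tools; exact output varies by firmware):

    # confirm the e2fsck process still exists
    ps | grep e2fsck
    # check whether it is consuming CPU or waiting on I/O
    top
    # run this twice, a minute apart; counters that keep climbing mean I/O is still happening
    cat /proc/diskstats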

Hello,
Thank you for your interest in this issue. After 3+ days of 'e2fsck', and beyond its first 15 minutes, there were no repairs reported. Conclusion: reformatting the device is needed to make it usable for file archival. It's already suitable for use as a doorstop.

Three days is nothing when trying to repair a RAID volume with 12 TB of data. Perhaps you forgot, but I tried telling you exactly what to expect. It’s a very slow process, and it requires lots of patience.

Thank you for your advice.