PR4100 with group descriptors corrupted!

I awoke to find my drive showing 4 red LEDs when it was working perfectly well the night before.

The drive is under warranty, but even though I have sent the logs and shown the issue to WD, they are refusing to help me, stating that the device has no mount, no volume and no RAID, that it did a factory reset by itself in the middle of the night, and that I should go and pay for data recovery because they don’t do that.

Seriously, this is their support response for an item under warranty. If I had known this is how they treat people before I purchased the item, things would have been different.

They said there is no RAID configured, yet dmesg clearly shows there is:

[ 38.038777] RAID conf printout:
[ 38.038779]  --- level:5 rd:4 wd:4
[ 38.038782] disk 0, o:1, dev:sda2
[ 38.038784] disk 1, o:1, dev:sdb2
[ 38.038786] disk 2, o:1, dev:sdc2
[ 38.038788] disk 3, o:1, dev:sdd2

It’s quite simple: the device had a RAID across all four 8TB WD drives, with around 16TB of space free; now it says I have 0MB free and no volume is configured or mounted.

If you look at the output from dmesg, you can see that it tries to mount md1 but hits lots of errors like:

[ 75.466431] EXT4-fs (md1): ext4_check_descriptors: Block bitmap for group 21568 not in group (block 2251799813718436)!
[ 75.478398] EXT4-fs (md1): group descriptors corrupted!

So, there is an issue here. Maybe there was a power cut during the night. After a full test, the drives are all good. I have checked the drives to see if they have any superblock backups, and they all have the same ones:

Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

This is the same on all four of sda2, sdb2, sdc2 and sdd2. (This tells me there is a RAID.)

So, with the lack of qualified support from WD, how can I fix the corruption using the backups? First I will need to mount md1, as it isn’t mounted. Then replace the corrupted descriptors with a backup?

I hope someone can help me. Unfortunately this device, which I trusted, holds my father’s data and photos; he passed away recently and it’s the only place we have the data.

Thanks!

Iain

Thanks for providing sufficient details.
A quick Google search shows that your data is not lost yet.

Translated to PRx100:

# list block devices; it should show /dev/md1
/usr/bin/blkid
# read-only verbose filesystem check
e2fsck -nv /dev/md1
# show backup superblock locations
dumpe2fs /dev/md1 | grep -i superblock
# repair using a backup superblock
e2fsck -b 32768 /dev/md1

Note that filesystem repair should happen when NOT mounted.
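For instance, to double-check that md1 really isn’t mounted before running e2fsck (a minimal sketch; adjust if your mount point differs):

# see whether md1 appears in the mount table
mount | grep md1
# if it does, unmount it first
umount /dev/md1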

Thanks for the response. I did find that article on Google; however, when I tried the very first command on my box, it just responds with:

sudo fdisk -l

sudo: fdisk: command not found

So I moved on…

md1 is not mounted; after it finds the corrupted group descriptors, the system unmounts it.

Here are some responses:

# mdadm -E -s

ARRAY /dev/md0 level=raid1 num-devices=4 UUID=7b2d612c:d9a2a690:037608cb:89d08506

ARRAY /dev/md/1 level=raid5 metadata=1.0 num-devices=4 UUID=7b532c3f:a9089a56:57b177b6:a8f67641 name=1

# mount /dev/md1 /mnt/

mount: special device /dev/md1 does not exist

# mount /dev/md/1 /mnt/

mount: special device /dev/md/1 does not exist

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid1 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      2097088 blocks [4/4] [UUUU]
      bitmap: 0/128 pages [0KB], 8KB chunk

unused devices: <none>

Yeah, it uses a GPT partition table, so it’s gdisk instead of fdisk… and it’s not a separate disk but an array, so blkid / lsblk (from Entware) is ideal.

The examine command shows 7b532c3f:a9089a56:57b177b6:a8f67641
Now try

mdadm --assemble --uuid=7b532c3f:a9089a56:57b177b6:a8f67641 /dev/md1

EDIT: added missing /dev/md1 argument
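Once the array assembles, something like this should confirm it is back up before you touch the filesystem (a sketch; the exact output will vary):

# confirm md1 is now listed as active
cat /proc/mdstat
# show member disks and the array state
mdadm --detail /dev/md1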

With gdisk I get:

# sudo gdisk -l
GPT fdisk (gdisk) version 0.8.10

Problem opening -l for reading! Error is 2.
The specified file does not exist!
# mdadm --assemble --uuid=7b532c3f:a9089a56:57b177b6:a8f67641
mdadm: an md device must be given in this mode

Oh right. Give it /dev/md1

# mdadm --assemble --uuid=7b532c3f:a9089a56:57b177b6:a8f67641 /dev/md1

mdadm: /dev/md1 has been started with 4 drives.

Okay, now run read-only e2fsck on /dev/md1 as above. Is there only a single error or are there plenty?

# e2fsck -nv /dev/md1

e2fsck 1.42.9 (28-Dec-2013)

ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap

e2fsck: Group descriptors look bad... trying backup blocks...

still running

Group 21628's inode bitmap at 4294967296 conflicts with some other fs block.

Relocate? no

Illegal block number passed to ext2fs_test_block_bitmap #3268640893040590848 for in-use block map

Illegal block number passed to ext2fs_mark_block_bitmap #3268640893040590848 for in-use block map

Group 21630's block bitmap at 28 conflicts with some other fs block.

Relocate? no

Group 21630's inode bitmap at 2046152288 conflicts with some other fs block.

Relocate? no

Error reading block 267225835849207 (Invalid argument) while getting next inode from scan. Ignore error? no

Error while scanning inodes (44171264): Can't read next inode

/dev/md1: ********** WARNING: Filesystem still has errors **********

e2fsck: aborted

/dev/md1: ********** WARNING: Filesystem still has errors **********

Yeah, this will take a while :slight_smile: better get comfy.
I’d suggest repairing… you might lose a few files, but it should recover most of it.
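If you go the repair route, it would be something along these lines (a sketch; -f forces a full check and -y answers yes to every prompt, so only use it if you accept that fsck may throw away what it cannot fix):

e2fsck -fy /dev/md1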

RAID5 protects you from single disk failures.
A power outage corrupts all disks a little at the same time, and no parity can fix that… only a UPS (uninterruptible power supply) prevents it.

:slight_smile: I’m comfy, and not too worried about a few files; I think there was about 4TB of useful data and 4TB of backups that are not needed. Thanks so far, much further than I got! :smiley: “on Amazon looking for a UPS”

Okay, good luck with it! Maybe try with another backup superblock, e.g. one towards the end of the disk?
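For example, picking one of the later backups from the list you posted (just a sketch; any of those block numbers will do, and e2fsck will complain if the backup is unusable):

e2fsck -b 644972544 /dev/md1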

There are plenty of superblock backups, so I’ll try one further in. Cheers!

So after 5 days of crunching and waiting for the e2fsck to complete, I got a “Killed”, did a reboot, and after 25 minutes waiting for the reboot I’m back where I started: 4 red lights and no drive.

Any other suggestions?

Hmmm… why the reboot? If e2fsck cached any progress, it’s gone now…
It could be that your SSH session timed out.
Better to try with screen/tmux (if you have Entware) or use nohup (no Entware required):

nohup e2fsck ...  > /tmp/fsck.log 2>&1 &

You can follow progress with

tail -f /tmp/fsck.log

Also, you can safely exit your SSH shell now and check back later… just make sure your /tmp doesn’t fill up completely.
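After reconnecting, a quick way to see that it is still alive and that /tmp has room (a sketch; the busybox ps on the PR4100 takes few options, so keep it plain):

# is e2fsck still running? (the [e] keeps grep from matching itself)
ps | grep [e]2fsck
# how full is /tmp?
df -h /tmp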

Note that FS corruption issues like this can occur with any brand of NAS; it’s a bit crazy to run a 16TB ext4 volume on RAID5. If you ever start over, you may want to consider 2 or 4 separate volumes with some rsync backup jobs. That doesn’t give you the huge single volume, but it means way less downtime for these types of problems.
Anyway, a backup of the crucial data to a separate device is always recommended.
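A simple rsync job between two volumes (or to an external disk) could look like this (a sketch; the /mnt/HD/HD_a2 and /mnt/HD/HD_b2 paths are assumptions about how the firmware names its volumes, so check what mount shows on your box):

# mirror volume a to volume b; --delete keeps the copy exact
rsync -a --delete /mnt/HD/HD_a2/ /mnt/HD/HD_b2/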

Good luck!

Apologies, I was also following another thread that said to reboot it. OK, I’ve started the e2fsck again and this time I haven’t rebooted.

Where should I look for a log or check to see what’s been done?

I suspect you don’t have enough memory, which may kill the process.
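You can usually confirm that by looking for the kernel’s out-of-memory killer in the logs (a sketch; the exact message wording varies between kernel versions):

dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"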

You could try adding an external HDD via USB and setting up a swap partition on it to get around this.
Be careful with these commands; don’t use them on your internal HDDs! Use the proper /dev/sdX here.
You may need to umount it from /mnt/USB first!

ls /dev/sd?
blkid
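On a 4-bay box the internal drives are sda to sdd, so the USB disk will most likely show up as sde (an assumption; double-check against the blkid output before touching anything). If the firmware auto-mounted it, unmount it first:

umount /mnt/USB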

Wipe the USB drive, create a fresh partition table and add a swap partition:

gdisk /dev/sdX
  p -- print
  x -- expert functions
  z -- zap all
  o -- create a new table
  n -- create a new partition with a large size 
     (just press enter for full disk)
      and hex code 8200 (linux swap)
  w -- write

Then create swap on this partition and enable it.
Note: gdisk works on the disk /dev/sdX, but here you need the partition /dev/sdX1

mkswap /dev/sdX1
swapon /dev/sdX1
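To check the swap actually came online before kicking off the long fsck run (a sketch):

# should list the new swap partition and its size
cat /proc/swaps
free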

Then try the e2fsck again…

Note to anyone reading this: use this (and anything you find on this forum) at your own risk.