Silent data corruption on Green SSD WDS120G2G0A-00JH30 with Linux


#1

Hi,
We have noticed a lot of silent data corruption on our WDS120G2G0A-00JH30 SSDs under Linux.
We have installed 40 Green SSD WDS120G2G0A-00JH30 drives (firmware UE300000) in Intel NUCs running Ubuntu Linux 16.04, and so far 8 disks (20%) have shown silent data loss.
There is no detailed information about firmware updates or about which problems newer firmware fixes.
FreeBSD users have reported the same problem:
https://www.mail-archive.com/freebsd-bugs@freebsd.org/msg38763.html


Has anyone else experienced this problem?
Where can we get detailed information about the problems solved in new firmware versions?
Thanks


#2

Hi Mau,

We would like to open a support ticket to gather more detailed information about your host, your application and the steps to reproduce. Please let us know your email address so we can proceed.


#4

Hi Chushi,

I can’t reproduce the problem. Our systems (Intel NUC with Green SSD
WDS120G2G0A-00JH30) run Ubuntu 16.04 and currently have the
4.4.0-137-generic kernel. Because we apply security updates, they
previously ran 4.4.0-135-generic, 4.4.0-134-generic and so on.

Our systems wake up every day, do some jobs and then shut down.

  • Some of them have lost the /boot files (ext4 sda1 partition), and
    we then cannot write to that partition without reformatting.
    Sometimes kern.log shows the error “rec_len is smaller than
    minimal”, but usually there are no errors in the log files.
  • Some of them have had /boot corruption: we can list the /boot
    files, but we can neither write new files there (an error is
    returned) nor read the existing /boot files.
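
Because most of these failures leave nothing in the logs, one way to notice them early is to compare file checksums against a known-good baseline. The following is only a minimal sketch under that assumption, not the procedure used on these hosts; the function names build_manifest and compare_manifests are hypothetical.

```python
import hashlib
import os


def build_manifest(root):
    """Return {relative_path: sha256_hex} for every regular file under root.

    Hypothetical helper for illustration; not taken from the thread.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            rel = os.path.relpath(path, root)
            digest = hashlib.sha256()
            try:
                with open(path, "rb") as handle:
                    # Read in 1 MiB chunks so large kernel images fit in memory.
                    for chunk in iter(lambda: handle.read(1 << 20), b""):
                        digest.update(chunk)
            except OSError as exc:
                # An unreadable file is itself a symptom worth recording.
                manifest[rel] = "unreadable: %s" % exc.strerror
                continue
            manifest[rel] = digest.hexdigest()
    return manifest


def compare_manifests(old, new):
    """Return (missing, changed): paths gone from new, and paths whose hash differs."""
    missing = sorted(p for p in old if p not in new)
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    return missing, changed
```

A baseline taken with build_manifest("/boot") would have to be regenerated after every legitimate kernel upgrade, since that changes the files on purpose. Any missing or changed entry between upgrades would then point at corruption rather than at normal activity.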

Sometimes we have seen errors in kern.log like these:

Feb 16 12:48:42 serverName kernel: [ 620.083112] EXT4-fs error (device
sda1): ext4_validate_block_bitmap:395: comm fstrim: bg 3: block 25104:
invalid block bitmap
Feb 17 12:53:54 serverName kernel: [87322.713730] EXT4-fs (sda1): error
count since last fsck: 1
Feb 17 12:53:54 serverName kernel: [87322.713783] EXT4-fs (sda1): initial
error at time 1518781722: ext4_validate_block_bitmap:395
Feb 17 12:53:54 serverName kernel: [87322.713800] EXT4-fs (sda1): last
error at time 1518781722: ext4_validate_block_bitmap:395

Oct 2 07:39:42 serverName kernel: [ 5783.223631] EXT4-fs error (device
sda1): ext4_readdir:230: inode #2: block 4589: comm update-grub-leg:
path /boot: bad entry in directory: rec_len is smaller than minimal -
offset=0(0), inode=0, rec_len=0, name_len=0
Oct 2 07:39:42 serverName kernel: [ 5783.230582] EXT4-fs error (device
sda1): ext4_readdir:230: inode #2: block 4590: comm update-grub-leg:
path /boot: bad entry in directory: rec_len is smaller than minimal -
offset=0(0), inode=0, rec_len=0, name_len=0
Oct 2 08:17:01 serverName kernel: [ 8021.836548] EXT4-fs (sda1): warning:
mounting fs with errors, running e2fsck is recommended
Oct 2 08:17:01 serverName kernel: [ 8021.837358] EXT4-fs (sda1): mounted
filesystem with ordered data mode. Opts: (null)
Oct 2 08:17:01 serverName kernel: [ 8021.880294] EXT4-fs (sda1): warning:
mounting fs with errors, running e2fsck is recommended
Oct 2 08:17:01 serverName kernel: [ 8021.881090] EXT4-fs (sda1): mounted
filesystem with ordered data mode. Opts: (null)
Oct 2 08:22:02 serverName kernel: [ 8322.150737] EXT4-fs (sda1): error
count since last fsck: 74
Oct 2 08:22:02 serverName kernel: [ 8322.150747] EXT4-fs (sda1): initial
error at time 1538458741: ext4_readdir:230: inode 2
Oct 2 08:22:02 serverName kernel: [ 8322.150750] EXT4-fs (sda1): last
error at time 1538459194: ext4_readdir:230: inode 2
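
The “initial error at time” and “last error at time” values in these lines are Unix epoch seconds, which makes them hard to line up with the syslog timestamps by eye. A minimal sketch of decoding them follows; the function name decode_error_times is hypothetical, and the pattern assumes the message layout shown in the excerpts above (it tolerates the line wrapping of this post).

```python
import re
from datetime import datetime, timezone

# Matches the ext4 summary lines quoted above; \s+ tolerates wrapped lines.
ERROR_TIME = re.compile(
    r"EXT4-fs\s+\((?P<dev>[^)]+)\):\s+(?P<kind>initial|last)\s+error\s+at\s+time\s+"
    r"(?P<epoch>\d+):\s+(?P<where>\S.*)"
)


def decode_error_times(log_text):
    """Return (device, kind, UTC datetime, location) for each summary line.

    Hypothetical helper for illustration; not part of any ext4 tooling.
    """
    results = []
    for match in ERROR_TIME.finditer(log_text):
        when = datetime.fromtimestamp(int(match.group("epoch")), tz=timezone.utc)
        results.append(
            (match.group("dev"), match.group("kind"), when, match.group("where").strip())
        )
    return results
```

For example, 1518781722 decodes to 2018-02-16 11:48:42 UTC, which lines up with the Feb 16 12:48:42 entry if the host clock is at UTC+1 (an assumption about these systems).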


#5

Thanks Mau,

I will discuss this with our engineering team


#6

Hello,
is there any news about this problem?
Should we request an RMA?
Is there any way to find out which issues the new firmware releases fix?
Thanks.


#7

Hi Mau,

Preliminary analysis suggests this is more of a host-side issue. However, if you do submit an RMA, please let me know the case or RMA number so we can route the drives back to HQ for further analysis.


#8

Hi Chushi,
Today we have had another disk failure (disk model number: WDC WDS120G2G0A-00JH30, firmware: UE300000).
Yesterday the host was working fine. It was powered off (shutdown -h now) at 15:20.
This morning the host booted at 6:00 without errors on the /dev/sda1 partition, but:

  • At 06:00, during boot, systemd-fsck finished without errors (systemd-fsck[730]: /dev/sda1: clean, 333/124928 files, 155940/498688 blocks)
  • At 06:17 our fsck test script (run hourly) reported errors (Directory inode 2, block #1, offset 0: directory corrupted) on /dev/sda1 (our /boot)
  • At 07:17 our fsck test script (run hourly) reported errors (Directory inode 2, block #1, offset 0: directory corrupted) on /dev/sda1 (our /boot)
  • At 07:42:01 a kernel upgrade began
  • At 07:42:20 the kernel upgrade ended with errors (due to the /boot filesystem corruption)
  • At 08:17 our fsck test script (run hourly) reported errors on /dev/sda1 (our /boot)
  • At 08:23 we ran “e2fsck -n -F -f /dev/sda1”, which failed with this error:

Directory inode 2, block #1, offset 0: directory corrupted
Salvage? no
e2fsck: aborted
/dev/sda1: ********** WARNING: Filesystem still has errors **********

  • We have usually detected the problem after a kernel upgrade.
  • The problems have usually affected the /dev/sda1 (/boot) partition.
  • Currently 9 disks have reported this problem.
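
For anyone who wants to run a similar hourly check, here is a minimal sketch of a read-only e2fsck wrapper. It is not the test script used on these hosts. The function names decode_exit_status and check_device are hypothetical; the exit-status bits are the ones listed in the e2fsck(8) man page. Note that, according to the same man page, results from a check of a mounted filesystem are not reliable, so this is best run against an unmounted partition.

```python
import subprocess

# Exit-status bits as documented in the e2fsck(8) man page.
E2FSCK_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_exit_status(status):
    """Translate an e2fsck exit status into a list of human-readable meanings."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_BITS.items()) if status & bit]


def check_device(device):
    """Run a forced, read-only e2fsck and return (status, meanings, output).

    -n answers "no" to every question, so nothing is modified;
    -f forces a check even if the filesystem is marked clean.
    """
    proc = subprocess.run(
        ["e2fsck", "-n", "-f", device],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, decode_exit_status(proc.returncode), proc.stdout
```

Calling check_device("/dev/sda1") from cron and alerting whenever the status is non-zero would flag cases like the one above, where the boot-time check reported a clean filesystem and a later check did not.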

Our hosts are all Intel NUCs. Here are the hardware details:

System Information
    Manufacturer: Intel Corporation
    Product Name: NUC7i5BNH
    Version: J31169-307
    Family: Intel NUC
Base Board Information
    Manufacturer: Intel Corporation
    Product Name: NUC7i5BNB
    Version: J31144-306
BIOS Information
    Vendor: Intel Corp.
    Version: BNKBL357.86A.0049.2017.0724.1541
    Release Date: 07/24/2017

Here is the partition info:

# parted /dev/sda print
Model: ATA WDC WDS120G2G0A- (scsi)
Disk /dev/sda: 120GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type      File system     Flags
 1      1049kB  512MB   511MB   primary   ext4            boot
 2      512MB   17.7GB  17.2GB  extended
 5      513MB   16.6GB  16.1GB  logical                   lvm
 6      16.6GB  17.7GB  1072MB  logical   linux-swap(v1)
 3      17.7GB  120GB   102GB   primary   ext4

Here is an excerpt from kern.log (the forum says that I’m not authorized to attach a complete log file):

Oct 23 06:17:02 serverName111 kernel: [ 984.393900] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 06:17:02 serverName111 kernel: [ 984.427994] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 06:25:24 serverName111 kernel: [ 1486.844918] perf interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Oct 23 07:17:01 serverName111 kernel: [ 4583.155637] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 07:17:01 serverName111 kernel: [ 4583.197692] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 08:03:55 serverName111 kernel: [ 7397.462235] perf interrupt took too long (5023 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Oct 23 08:17:01 serverName111 kernel: [ 8182.957020] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 08:17:01 serverName111 kernel: [ 8182.997370] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 08:22:23 serverName111 kernel: [ 8504.936018] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5812: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(1024), inode=0, rec_len=0, name_len=0
Oct 23 08:22:23 serverName111 kernel: [ 8504.937039] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5813: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(2048), inode=0, rec_len=0, name_len=0
Oct 23 08:22:47 serverName111 kernel: [ 8528.653551] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5812: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(1024), inode=0, rec_len=0, name_len=0
Oct 23 08:22:47 serverName111 kernel: [ 8528.654358] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5813: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(2048), inode=0, rec_len=0, name_len=0
Oct 23 08:24:54 serverName111 kernel: [ 8656.311958] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5812: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(1024), inode=0, rec_len=0, name_len=0
Oct 23 08:24:54 serverName111 kernel: [ 8656.312742] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5813: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(2048), inode=0, rec_len=0, name_len=0
Oct 23 08:28:16 serverName111 kernel: [ 8857.885739] EXT4-fs (sda1): warning: mounting fs with errors, running e2fsck is recommended

Here is apt/history.log:

Start-Date: 2018-10-23 07:42:01
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold dist-upgrade
Install: linux-image-4.4.0-138-generic:amd64 (4.4.0-138.164, automatic), linux-headers-4.4.0-138:amd64 (4.4.0-138.164, automatic), linux-image-extra-4.4.0-138-generic:amd64 (4.4.0-138.164, automatic), linux-headers-4.4.0-138-generic:amd64 (4.4.0-138.164, automatic)
Upgrade: linux-headers-generic:amd64 (4.4.0.137.143, 4.4.0.138.144), linux-libc-dev:amd64 (4.4.0-137.163, 4.4.0-138.164), linux-image-generic:amd64 (4.4.0.137.143, 4.4.0.138.144), linux-generic:amd64 (4.4.0.137.143, 4.4.0.138.144)
Error: Sub-process /usr/bin/dpkg returned an error code (1)
End-Date: 2018-10-23 07:42:20

Thanks.
Mau.


#9

Hi Mau,

The drive that is not detected is likely a drive failure, but we are more concerned about the “silent data corruption” (SDC) cases. Feel free to RMA the undetected one; however, let me know the RMA number for the SDC drive so I can route it back to HQ.


#10

Hi Chushi,
here is the RMA number: 87721007

Thanks for your help.
Mau


#11

Thank you Mau,

Please do not ship the product back to us yet. I will ask my team to create a pre-paid UPS shipping label to route the drive to me.


#12

Sorry, but I have already sent it to our vendor. They have requested the RMA.
Mau