Silent data corruption on Green SSD WDS120G2G0A-00JH30 with Linux


#1

Hi,
We have noticed a lot of silent data corruption on our WDS120G2G0A-00JH30 SSDs under Linux.
We installed 40 Green SSD WDS120G2G0A-00JH30 drives (firmware UE300000) in Intel NUCs running Ubuntu Linux 16.04, and so far 8 of them (20%) have shown silent data loss.
We cannot find detailed information about firmware updates or about which problems newer firmware fixes.
We have also found FreeBSD users reporting the same problem:
https://www.mail-archive.com/freebsd-bugs@freebsd.org/msg38763.html


Has anyone else experienced this problem?
Where can we get detailed information about the problems solved in new firmware versions?
Thanks


#2

Hi Mau,

We would like to open a support ticket to gather more detailed information about your host, your application, and the steps to reproduce. Please let us know your email address so we can proceed.


#4

Hi Chushi,

I can’t reproduce the problem. Our systems (Intel NUC with Green SSD
WDS120G2G0A-00JH30) run Ubuntu 16.04. They currently run the
4.4.0-137-generic kernel. Because we apply security updates, they
previously ran 4.4.0-135-generic, 4.4.0-134-generic and so on.

Our systems wake up every day, do some jobs and then shut down.

  • Some of them have lost the /boot files (ext4 sda1 partition), and
    we then cannot write to that partition without reformatting it.
    Sometimes kern.log shows the error “rec_len is smaller than
    minimal”, but usually there are no errors in the log files.
  • Some of them have had /boot corruption: we can list the /boot
    files, but we can neither read them nor write new files to the
    partition (an error is returned).

Sometimes we have seen errors in kern.log like these:

Feb 16 12:48:42 serverName kernel: [ 620.083112] EXT4-fs error (device
sda1): ext4_validate_block_bitmap:395: comm fstrim: bg 3: block 25104:
invalid block bitmap
Feb 17 12:53:54 serverName kernel: [87322.713730] EXT4-fs (sda1): error
count since last fsck: 1
Feb 17 12:53:54 serverName kernel: [87322.713783] EXT4-fs (sda1): initial
error at time 1518781722: ext4_validate_block_bitmap:395
Feb 17 12:53:54 serverName kernel: [87322.713800] EXT4-fs (sda1): last
error at time 1518781722: ext4_validate_block_bitmap:395

Oct 2 07:39:42 serverName kernel: [ 5783.223631] EXT4-fs error (device
sda1): ext4_readdir:230: inode #2: block 4589: comm update-grub-leg:
path /boot: bad entry in directory: rec_len is smaller than minimal -
offset=0(0), inode=0, rec_len=0, name_len=0
Oct 2 07:39:42 serverName kernel: [ 5783.230582] EXT4-fs error (device
sda1): ext4_readdir:230: inode #2: block 4590: comm update-grub-leg:
path /boot: bad entry in directory: rec_len is smaller than minimal -
offset=0(0), inode=0, rec_len=0, name_len=0
Oct 2 08:17:01 serverName kernel: [ 8021.836548] EXT4-fs (sda1): warning:
mounting fs with errors, running e2fsck is recommended
Oct 2 08:17:01 serverName kernel: [ 8021.837358] EXT4-fs (sda1): mounted
filesystem with ordered data mode. Opts: (null)
Oct 2 08:17:01 serverName kernel: [ 8021.880294] EXT4-fs (sda1): warning:
mounting fs with errors, running e2fsck is recommended
Oct 2 08:17:01 serverName kernel: [ 8021.881090] EXT4-fs (sda1): mounted
filesystem with ordered data mode. Opts: (null)
Oct 2 08:22:02 serverName kernel: [ 8322.150737] EXT4-fs (sda1): error
count since last fsck: 74
Oct 2 08:22:02 serverName kernel: [ 8322.150747] EXT4-fs (sda1): initial
error at time 1538458741: ext4_readdir:230: inode 2
Oct 2 08:22:02 serverName kernel: [ 8322.150750] EXT4-fs (sda1): last
error at time 1538459194: ext4_readdir:230: inode 2
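
The "initial error at time" and "last error at time" values in these messages are Unix epoch seconds, so they can be converted to dates and matched against the other log lines. A minimal Python sketch, assuming the wrapped log lines have first been joined back into single lines; the regular expression and the function name `ext4_error_times` are illustrative and not part of any existing tool:

```python
import re
from datetime import datetime, timezone

# Matches the ext4 summary lines "initial error at time N: func:line"
# and "last error at time N: func:line" as they appear in kern.log.
ERROR_TIME_RE = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): (?P<kind>initial|last) error at time "
    r"(?P<epoch>\d+): (?P<where>\w+:\d+)"
)

def ext4_error_times(lines):
    """Yield (device, kind, UTC datetime, location) for each matching line."""
    for line in lines:
        m = ERROR_TIME_RE.search(line)
        if m:
            when = datetime.fromtimestamp(int(m.group("epoch")), tz=timezone.utc)
            yield m.group("dev"), m.group("kind"), when, m.group("where")

# Two lines taken from the log above, joined into single lines.
sample = [
    "Feb 17 12:53:54 serverName kernel: [87322.713783] EXT4-fs (sda1): initial "
    "error at time 1518781722: ext4_validate_block_bitmap:395",
    "Oct 2 08:22:02 serverName kernel: [ 8322.150747] EXT4-fs (sda1): initial "
    "error at time 1538458741: ext4_readdir:230: inode 2",
]

for dev, kind, when, where in ext4_error_times(sample):
    print(dev, kind, when.isoformat(), where)
```

For the first line this gives 2018-02-16T11:48:42 UTC, which would be the same second as the fstrim error logged at Feb 16 12:48:42 if the host clock runs at UTC+1.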


#5

Thanks Mau,

I will discuss this with our engineering team.


#6

Hello,
Is there any news about this problem?
Should we request an RMA?
Is there any news on how we can find out which issues the new firmware releases fix?
Thanks.


#7

Hi Mau,

Preliminary analysis suggests this is more of a host-side issue. However, if you do submit an RMA, please let me know the case or RMA number so we can route the drives back to HQ for further analysis.


#8

Hi Chushi,
Today we had another disk failure (model number: WDC WDS120G2G0A-00JH30, firmware: UE300000).
Yesterday the host was working fine. It was powered off (shutdown -h now) at 15:20.
This morning the host booted at 06:00 without errors on the /dev/sda1 partition, but then:

  • At 06:00, during boot, systemd-fsck finished without errors (systemd-fsck[730]: /dev/sda1: clean, 333/124928 files, 155940/498688 blocks).
  • At 06:17 our fsck test script (run hourly) reported errors (Directory inode 2, block #1, offset 0: directory corrupted) on /dev/sda1 (our /boot).
  • At 07:17 our fsck test script (run hourly) reported errors (Directory inode 2, block #1, offset 0: directory corrupted) on /dev/sda1 (our /boot).
  • At 07:42:01 a kernel upgrade began.
  • At 07:42:20 the kernel upgrade ended with errors (due to the /boot filesystem corruption).
  • At 08:17 our fsck test script (run hourly) reported errors on /dev/sda1 (our /boot).
  • At 08:23 we ran “e2fsck -n -F -f /dev/sda1”, which failed with this error:

Directory inode 2, block #1, offset 0: directory corrupted
Salvage? no
e2fsck: aborted
/dev/sda1: ********** WARNING: Filesystem still has errors **********

  • We have usually detected the problem after a kernel upgrade.
  • The problems have usually affected the /dev/sda1 (/boot) partition.
  • We currently have 9 disks that have shown this problem.
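
The hourly fsck test script itself is not shown here. As a rough sketch of how such a check could interpret the result of the same read-only command, the Python below decodes the e2fsck exit status using the bit values documented in the e2fsck(8) man page. The function names `decode_e2fsck_status` and `check_device` are hypothetical, and the sketch assumes the partition is unmounted while it is checked, since e2fsck can report spurious errors on a mounted file system.

```python
import subprocess

# Exit-status bits documented in the e2fsck(8) man page; the status is the
# sum (bitwise OR) of the conditions that applied.
E2FSCK_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}

def decode_e2fsck_status(status):
    """Return the list of conditions encoded in an e2fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in E2FSCK_BITS.items() if status & bit]

def check_device(device):
    """Run a read-only forced check (flags as in the post) and decode the result."""
    result = subprocess.run(
        ["e2fsck", "-n", "-F", "-f", device],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return decode_e2fsck_status(result.returncode), result.stdout

# Decoding only; check_device() is not called here because it needs root
# access and a real block device.
print(decode_e2fsck_status(4))
```

Logging the decoded status together with the e2fsck output on every hourly run would make it easier to see when a partition first changed from clean to corrupted.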

Our hosts are all Intel NUCs. Here are the hardware details:

System Information
  Manufacturer: Intel Corporation
  Product Name: NUC7i5BNH
  Version: J31169-307
  Family: Intel NUC
Base Board Information
  Manufacturer: Intel Corporation
  Product Name: NUC7i5BNB
  Version: J31144-306
BIOS Information
  Vendor: Intel Corp.
  Version: BNKBL357.86A.0049.2017.0724.1541
  Release Date: 07/24/2017

Here is the partition info:

# parted /dev/sda print
Model: ATA WDC WDS120G2G0A- (scsi)
Disk /dev/sda: 120GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End     Size    Type      File system     Flags
 1      1049kB  512MB   511MB   primary   ext4            boot
 2      512MB   17.7GB  17.2GB  extended
 5      513MB   16.6GB  16.1GB  logical                   lvm
 6      16.6GB  17.7GB  1072MB  logical   linux-swap(v1)
 3      17.7GB  120GB   102GB   primary   ext4

Here is some kern.log output (the website says I’m not authorized to attach a complete log file):

Oct 23 06:17:02 serverName111 kernel: [ 984.393900] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 06:17:02 serverName111 kernel: [ 984.427994] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 06:25:24 serverName111 kernel: [ 1486.844918] perf interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Oct 23 07:17:01 serverName111 kernel: [ 4583.155637] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 07:17:01 serverName111 kernel: [ 4583.197692] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 08:03:55 serverName111 kernel: [ 7397.462235] perf interrupt took too long (5023 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
Oct 23 08:17:01 serverName111 kernel: [ 8182.957020] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 08:17:01 serverName111 kernel: [ 8182.997370] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Oct 23 08:22:23 serverName111 kernel: [ 8504.936018] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5812: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(1024), inode=0, rec_len=0, name_len=0
Oct 23 08:22:23 serverName111 kernel: [ 8504.937039] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5813: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(2048), inode=0, rec_len=0, name_len=0
Oct 23 08:22:47 serverName111 kernel: [ 8528.653551] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5812: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(1024), inode=0, rec_len=0, name_len=0
Oct 23 08:22:47 serverName111 kernel: [ 8528.654358] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5813: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(2048), inode=0, rec_len=0, name_len=0
Oct 23 08:24:54 serverName111 kernel: [ 8656.311958] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5812: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(1024), inode=0, rec_len=0, name_len=0
Oct 23 08:24:54 serverName111 kernel: [ 8656.312742] EXT4-fs error (device sda1): htree_dirblock_to_tree:986: inode #2: block 5813: comm ls: bad entry in directory: rec_len is smaller than minimal - offset=0(2048), inode=0, rec_len=0, name_len=0
Oct 23 08:28:16 serverName111 kernel: [ 8857.885739] EXT4-fs (sda1): warning: mounting fs with errors, running e2fsck is recommended

Here is apt/history.log:

Start-Date: 2018-10-23 07:42:01
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold dist-upgrade
Install: linux-image-4.4.0-138-generic:amd64 (4.4.0-138.164, automatic), linux-headers-4.4.0-138:amd64 (4.4.0-138.164, automatic), linux-image-extra-4.4.0-138-generic:amd64 (4.4.0-138.164, automatic), linux-headers-4.4.0-138-generic:amd64 (4.4.0-138.164, automatic)
Upgrade: linux-headers-generic:amd64 (4.4.0.137.143, 4.4.0.138.144), linux-libc-dev:amd64 (4.4.0-137.163, 4.4.0-138.164), linux-image-generic:amd64 (4.4.0.137.143, 4.4.0.138.144), linux-generic:amd64 (4.4.0.137.143, 4.4.0.138.144)
Error: Sub-process /usr/bin/dpkg returned an error code (1)
End-Date: 2018-10-23 07:42:20

Thanks.
Mau.


#9

Hi Mau,

The drive that is not detected is likely a drive failure, but we are more concerned about the “silent data corruption” (SDC). Feel free to RMA the undetected drive; however, let me know the RMA number for the SDC drive so I can route it back to HQ.


#10

Hi Chushi,
here is the RMA: 87721007

Thanks for your help.
Mau


#11

Thank you Mau,

Please do not ship the product back to us yet. I will ask my team to create a pre-paid UPS shipping label to route the drive to me.


#12

Sorry, but I have already sent it to our vendor. They have requested the RMA.
Mau


#13

Hello There,

I have the same problem.

We usually use the normal-size Intel NUC5I3RYH, but for one month we were not able to buy that model, so we bought some NUC5I5RYK units instead. We installed the WDS120G2G0B-00EPW0 in them.

One was shipped to Malaysia, where it failed to boot, so we had to reinstall it. I then rebooted it every 3 minutes, 90 times, without a problem. So I gave it back to the program testers, and after two days they returned it to me with another boot failure…

We have also received some RYK units with boot failures from other countries.

Here is the output of smartctl -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-42-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: WDC WDS120G2G0B-00EPW0
Serial Number: 1812B4804085
LU WWN Device Id: 5 001b44 8b6ad2ecd
Firmware Version: UI190000
User Capacity: 120.040.980.480 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: M.2
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Dec 21 08:08:11 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  32) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 ( 120) seconds.
Offline data collection
capabilities:                    (0x15) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  21) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           14
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           28
165 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           47
166 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always   -           1
167 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always   -           0
168 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always   -           4
169 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always   -           178
170 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always   -           0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
173 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           1
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           25
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always   -           0
194 Temperature_Celsius     0x0022   075   044   000    Old_age   Always   -           25 (Min/Max 19/44)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always   -           0
230 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always   -           55835885581
232 Available_Reservd_Space 0x0033   100   100   005    Pre-fail  Always   -           100
233 Media_Wearout_Indicator 0x0032   100   100   ---    Old_age   Always   -           120
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           266
241 Total_LBAs_Written      0x0030   100   100   000    Old_age   Offline  -           8
242 Total_LBAs_Read         0x0030   100   100   000    Old_age   Offline  -           44
244 Unknown_Attribute       0x0032   000   100   ---    Old_age   Always   -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
  1  Extended offline    Completed without error        00%               13  -
  2  Short offline       Completed without error        00%               12  -

Selective Self-tests/Logging not supported
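
To compare many drives quickly, the attribute table above can be parsed and the error-related counters checked for non-zero raw values. The Python below is only a sketch: the set of attribute IDs treated as error counters (5, 184, 187, 199) is an assumption, the function names are made up for illustration, and the sample rows are copied from the output above.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table. THRESH is matched
# loosely because it may be printed as "---" when the drive reports none.
ROW_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]{4})\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\S+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<failed>\S+)\s+(?P<raw>\d+)"
)

# Assumption of this sketch: these IDs are the ones that count media or
# link errors (names as printed by smartctl in the output above).
ERROR_IDS = {5, 184, 187, 199}

def parse_attributes(text):
    """Return {id: (name, raw_value)} for every attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            attrs[int(m.group("id"))] = (m.group("name"), int(m.group("raw")))
    return attrs

def nonzero_error_counters(attrs):
    """Return the error-counter attributes whose raw value is not zero."""
    return {i: v for i, v in attrs.items() if i in ERROR_IDS and v[1] != 0}

sample = """\
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 --- Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 075 044 000 Old_age Always - 25 (Min/Max 19/44)
199 UDMA_CRC_Error_Count 0x0032 100 100 --- Old_age Always - 0
"""

attrs = parse_attributes(sample)
print(nonzero_error_counters(attrs))
```

For this drive the result is empty: none of these counters has recorded an error, even though the unit failed to boot.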

I now have 4 SSDs here, and 21 more have been shipped out across Europe.

What can I do to help solve this issue?

I installed Ubuntu 16.04 once, made an image file of it with dd, and copied that image with dd to the WD SSDs.
From time to time I updated the image file.
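
One way to check whether a freshly written drive still holds exactly what dd wrote is to hash the image file and the same number of bytes read back from the device. A minimal Python sketch, with hypothetical function names and paths; it assumes the comparison is done before the file system on the SSD is mounted (mounting changes the data), and ideally after a power cycle so that the reads come from the drive rather than from the page cache:

```python
import hashlib
import os

def sha256_prefix(path, length, chunk_size=1024 * 1024):
    """Hash the first `length` bytes of a file or block device."""
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                raise IOError("%s is shorter than %d bytes" % (path, length))
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()

def copy_matches_image(image_path, device_path):
    """Compare the image with the same number of bytes read back from the device."""
    size = os.path.getsize(image_path)
    return sha256_prefix(image_path, size) == sha256_prefix(device_path, size)
```

It could be called as `copy_matches_image("ubuntu1604.img", "/dev/sdb")`, where both names are placeholders. Only the length of the image is compared, so the unused remainder of the SSD does not affect the result.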

This is one of the results:

Kind regards,
David


#14

Hi David,

I’m not sure it’s the same issue. I understand it’s the same product and host environment, but please open a support ticket with us.