SN770 NVMe controller resets when formatted with 4096-byte sectors

I first opened a case with support, but I got a standard “bad sectors” reply. I have pointed that out to support, and while waiting for a reply I’m asking here in the community as well.

I recently got two WD_BLACK SN770 drives. I want to use them as a mirrored pair in a ZFS pool for home use on a Linux system.

At first I created a ZFS pool and everything worked fine. I then noticed that the drives were using the 512-byte LBA format, so I reformatted both of them to the 4096-byte format:

:~# nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better
:~# nvme format --lbaf=1 /dev/nvme0n1
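
To confirm which LBA format is active after a format, the "(in use)" marker in the `nvme id-ns -H` output can be parsed. This is only a small sketch: the helper name `in_use_lba_size` is made up for illustration, and it assumes the output looks like the lines shown above.

```shell
# Print the data size (in bytes) of the LBA format marked "(in use)".
# Reads `nvme id-ns -H` output on stdin.
# The function name is hypothetical, not part of nvme-cli.
in_use_lba_size() {
    awk '/^LBA Format/ && /\(in use\)/ {
        # Find the "Data Size:" pair and print the number that follows it.
        for (i = 2; i < NF; i++)
            if ($(i-1) == "Data" && $i == "Size:") { print $(i+1); exit }
    }'
}

# Usage on a real system (needs root and nvme-cli installed):
#   nvme id-ns -H /dev/nvme0n1 | in_use_lba_size
```

Running it once per drive before recreating the pool should show whether both drives really ended up on the same format.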

I then recreated the pool and it worked at first, so I moved some data over to it.

Then I started getting errors in the kernel log, most notably these messages:

[28922.642400] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[28922.714515] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[28922.715277] nvme nvme0: Removing after probe failure status: -19
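
For anyone who wants to check their own kernel log for the same signature, here is a minimal sketch. The function name `nvme_reset_events` is made up for illustration. As far as I understand, CSTS=0xffffffff means the register read returned all ones, which usually indicates the device has stopped responding on the bus, and status -19 corresponds to -ENODEV.

```shell
# Extract "controller is down" events from kernel log text on stdin.
# Prints one line per event: "<controller> CSTS=<value>".
# The function name is hypothetical.
nvme_reset_events() {
    sed -n -E 's/.*nvme (nvme[0-9]+): controller is down; will reset: (CSTS=0x[0-9a-fA-F]+).*/\1 \2/p'
}

# Usage on a real system:
#   dmesg | nvme_reset_events
#   journalctl -k | nvme_reset_events
```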

The devices were no longer present under /dev.

I reverted to the 512-byte format:

:~# nvme format --lbaf=0 /dev/nvme0n1

I then recreated the pool and everything works fine again now.

I tested this on two different systems (one desktop, and one laptop with an NVMe Thunderbolt enclosure). Both had the same issue.

However, this still worked fine:

dd if=/dev/nvme0n1 of=ssd.dd bs=4M

This read the whole disk without problems.

I’m not sure what the actual cause is, but it looks like a firmware issue to me. Since dd can still read the whole disk, it might be related to how a filesystem talks to the drive.

Has anyone else had this problem? Does anyone know how to fix it? Obviously, I’m aware that just using 512-byte sectors works, but since the drive presumably uses 4096 bytes internally anyway, the 4096-byte format should perform better.

I have the same issue on an SN570 (updated to firmware 234200WD, which is the latest AFAIK). After some I/O to the drive with 4K blocks, the controller crashes/fails (same error message). A reset of the PCIe link brings it back up, so it’s not a permanent failure. Looks like a firmware bug to me.
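
One common way to bring a dropped PCIe device back on Linux is to remove it through sysfs and rescan the bus. This is only a sketch and not necessarily the exact reset used above: a remove/rescan is not the same thing as a true link reset. The function name `pci_remove_rescan` and its optional sysfs-root argument are made up for illustration, and the PCI address 0000:01:00.0 is taken from the kernel log earlier in the thread.

```shell
# Remove a PCI device from the kernel's view (if still present) and rescan.
# $1: PCI address, e.g. 0000:01:00.0
# $2: sysfs root, defaults to /sys (hypothetical parameter, only for dry runs)
pci_remove_rescan() {
    bdf=$1
    sysfs=${2:-/sys}
    dev="$sysfs/bus/pci/devices/$bdf"
    # The kernel may already have removed the device after the probe failure.
    if [ -e "$dev/remove" ]; then
        echo 1 > "$dev/remove"
    fi
    sleep 1
    # Ask the kernel to re-enumerate all PCI buses.
    echo 1 > "$sysfs/bus/pci/rescan"
}

# Usage on a real system (as root):
#   pci_remove_rescan 0000:01:00.0
```

If the namespace shows up under /dev again afterwards, the failure was not permanent, which matches what is described above.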

For what it’s worth, the case at WD support was escalated to the engineering team on March 23rd, 2023, but there is no ETA on any fix. Case number: 230311-002306

I too am having the same issue. A single drive seems to work fine, but with two drives at 4K it starts throwing errors within a few hours.