koldfront

NVMe related freezes #hardware

🕓︎ - 2024-10-03

My home server has frozen a handful of times since summer, around the time I switched to a new NVMe.

I was expecting the problem to be the kernel, because it got upgraded at the same time (yes, wrong move), so I downgraded the kernel package and went on summer vacation.

Each time it froze there was nothing in the log file, and the machine still responded to network packets, but access via ssh was not possible.

The previous time it froze I happened to have a terminal open, but no commands could be run, as if the disk had disappeared.

When it froze this morning, I was lucky enough to catch some log output

2024-10-03T07:51:32.710948+02:00 virgil kernel: [2481125.908216] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff 
2024-10-03T07:51:32.710967+02:00 virgil kernel: [2481125.908240] nvme nvme0: Does your device have a faulty power saving mode enabled? 
2024-10-03T07:51:32.710972+02:00 virgil kernel: [2481125.908254] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug 
2024-10-03T07:51:32.758943+02:00 virgil kernel: [2481125.957166] nvme0n1: I/O Cmd(0x2) @ LBA 6366795000, 32 blocks, I/O Error (sct 0x3 / sc 0x71)  
2024-10-03T07:51:32.758964+02:00 virgil kernel: [2481125.957192] blk_print_req_error: 1 callbacks suppressed 
2024-10-03T07:51:32.758969+02:00 virgil kernel: [2481125.957195] I/O error, dev nvme0n1, sector 6366795000 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2 
2024-10-03T07:51:32.780690+02:00 virgil kernel: [2481125.980262] nvme 0000:06:00.0: Unable to change power state from D3cold to D0, device inaccessible 
2024-10-03T07:51:32.780706+02:00 virgil kernel: [2481125.980486] nvme nvme0: Removing after probe failure status: -19 

which makes more sense than the arbitrary kernel guess.

So I added those parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, ran update-grub, and rebooted the machine. Hopefully that fixes the problem.
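For anyone wanting to script that edit, here is a minimal sketch in Python, not the exact steps I took. It assumes the usual single-line, double-quoted GRUB_CMDLINE_LINUX_DEFAULT="..." form in /etc/default/grub. The helper names add_params and missing_params are made up for illustration, and the "quiet" sample value is just an example. The functions only work on strings, so writing the file back, running update-grub and rebooting remain manual steps.

```python
# Parameters suggested by the kernel message quoted above.
SUGGESTED = [
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
]

KEY = "GRUB_CMDLINE_LINUX_DEFAULT="


def add_params(grub_text, params=SUGGESTED):
    """Return grub_text with params appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Assumes a simple quoted value without nested quotes (hypothetical helper).
    """
    out = []
    for line in grub_text.splitlines():
        if line.startswith(KEY):
            # Drop the surrounding quotes and split the value into words.
            current = line[len(KEY):].strip().strip("\"'").split()
            for p in params:
                if p not in current:  # keep the edit idempotent
                    current.append(p)
            line = KEY + '"' + " ".join(current) + '"'
        out.append(line)
    return "\n".join(out) + "\n"


def missing_params(cmdline, params=SUGGESTED):
    """Return the params absent from a /proc/cmdline style string."""
    present = cmdline.split()
    return [p for p in params if p not in present]


# Example with an illustrative sample line:
print(add_params('GRUB_CMDLINE_LINUX_DEFAULT="quiet"\n'), end="")
```

After the reboot, passing the contents of /proc/cmdline to missing_params should return an empty list if the parameters took effect.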

This is with 6.1.106-3 on Debian 12 (bookworm) and a Samsung SSD 990 PRO 4TB.

@blog I have the 2TB version in my laptop running Debian unstable and haven't had any problems. But I originally planned to use that in a desktop, where it turned out to be incompatible with that motherboard and didn't enumerate. So I wonder if you've found another odd incompatibility.

- https://tech.lgbt/users/bwh 🕦︎ - 2024-10-04

Ouch, yes, sounds like it could be.

The motherboard in my home server isn't exactly new (it is from around the time when an AMD Ryzen 5 2400GE was the newish desktop CPU with reasonably low power usage; dmidecode says Gigabyte B450M DS3H-CF).

I was quite impressed with the kernel error message giving me concrete settings to try.

I didn't report it anywhere, because I assume the kernel has moved on quite a bit since, and I can't provide much additional information.

- Adam Sjøgren 🕓︎ - 2024-10-04

@asjo @blog My desktop has an Asrock motherboard with the same B450M

- https://tech.lgbt/users/bwh 🕓︎ - 2024-10-04
