Bug 1165535 - System lockup under memory pressure using encrypted swap on SSD (failed command: WRITE FPDMA QUEUED)
Status: NEW
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.1
Hardware: Other Other
Priority: P5 - None  Severity: Major
Target Milestone: ---
Assigned To: openSUSE Kernel Bugs
Reported: 2020-03-03 10:38 UTC by Ulrich Windl
Modified: 2021-09-26 12:53 UTC (History)



Attachments
Screenshot with top running when freeze occurred (117.33 KB, image/png)
2020-10-02 06:07 UTC, Ulrich Windl

Description Ulrich Windl 2020-03-03 10:38:10 UTC
On a system that uses an SSD for /boot and encrypted swap only (the rest is on a traditional hard disk), I had the following problem:
I was working normally with the system, when I decided to launch "yast2 online_update".
Unfortunately the system became terribly slow (practically not responding for minutes), so after waiting about 30 minutes I could do nothing other than press the Reset button.

After reboot I see "odd" ATA errors related to the SSD. However the SSD is rather new and in good condition. So I guess the issue has to do with memory pressure.

Here are the logs when things started to become slow:
Mar 03 10:09:57 pc dbus-daemon[2706]: [session uid=1000 pid=2706] Successfully activated service 'org.gnome.evince.Daemon'
Mar 03 10:09:57 pc systemd[2686]: Starting Evince document viewer...
Mar 03 10:09:57 pc systemd[2686]: Started Evince document viewer.
Mar 03 10:10:13 pc kernel: ata2: log page 10h reported inactive tag 5
Mar 03 10:10:14 pc kernel: ata2.00: exception Emask 0x1 SAct 0x7fffffdf SErr 0x0 action 0x0
Mar 03 10:10:14 pc kernel: ata2.00: irq_stat 0x40000009
Mar 03 10:10:14 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 03 10:10:14 pc kernel: ata2.00: cmd 61/08:00:e0:b0:7d/00:00:00:00:00/40 tag 0 ncq dma 4096 out
                                           res 40/00:00:e0:b0:7d/00:00:00:00:00/40 Emask 0x1 (device error)
Mar 03 10:10:14 pc kernel: ata2.00: status: { DRDY }
Mar 03 10:10:14 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 03 10:10:14 pc kernel: ata2.00: cmd 61/08:08:e8:b0:7d/00:00:00:00:00/40 tag 1 ncq dma 4096 out
                                           res 40/00:00:e0:b0:7d/00:00:00:00:00/40 Emask 0x1 (device error)
Mar 03 10:10:14 pc kernel: ata2.00: status: { DRDY }
Mar 03 10:10:14 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 03 10:10:14 pc kernel: ata2.00: cmd 61/08:10:f0:b0:7d/00:00:00:00:00/40 tag 2 ncq dma 4096 out
                                           res 40/00:00:e0:b0:7d/00:00:00:00:00/40 Emask 0x1 (device error)
Mar 03 10:10:14 pc kernel: ata2.00: status: { DRDY }
[...]
Mar 03 10:10:14 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 03 10:10:14 pc kernel: ata2.00: cmd 61/08:f0:d8:b0:7d/00:00:00:00:00/40 tag 30 ncq dma 4096 out
                                           res 40/00:00:e0:b0:7d/00:00:00:00:00/40 Emask 0x1 (device error)
Mar 03 10:10:14 pc kernel: ata2.00: status: { DRDY }
Mar 03 10:10:14 pc kernel: ata2.00: supports DRM functions and may not be fully accessible
Mar 03 10:10:14 pc kernel: ata2.00: supports DRM functions and may not be fully accessible
Mar 03 10:10:14 pc kernel: ata2.00: configured for UDMA/133
Mar 03 10:10:14 pc kernel: ata2: EH complete
Mar 03 10:10:14 pc kernel: ata2.00: Enabling discard_zeroes_data
Mar 03 10:10:14 pc rsyslogd[1332]:  message repeated 13 times: [-- MARK --]
Mar 03 10:10:14 pc rsyslogd[1332]: action 'action 1' suspended (module 'builtin:ompipe'), retry 0. There should be messages before this one giving the reason for suspension. [v8.33.1 try http://www.rsyslog.com/e/2007 ]
Mar 03 10:10:14 pc rsyslogd[1332]: action 'action 1' suspended (module 'builtin:ompipe'), next retry is Tue Mar  3 10:10:44 2020, retry nbr 0. There should be messages before this one giving the reason for suspension. [v8.33.1 try http://www.rsyslog.com/e/2007 ]
[...]
Mar 03 10:15:48 pc /usr/lib/gdm/gdm-x-session[2700]: (II) event2  - Logitech USB-PS/2 Optical Mouse: SYN_DROPPED event - some input events have been lost.
Mar 03 10:16:17 pc /usr/lib/gdm/gdm-x-session[2700]: (II) event2  - Logitech USB-PS/2 Optical Mouse: SYN_DROPPED event - some input events have been lost.
[...]
Mar 03 10:25:13 pc systemd-journald[30532]: Journal started
Mar 03 10:25:18 pc systemd-journald[30532]: System journal (/var/log/journal/89c660865c00403a9bacef32b6828556) is 856.0M, max 819.2M, 0B free.
[...]
Mar 03 10:25:19 pc kernel: ata2.00: exception Emask 0x50 SAct 0x7f000007 SErr 0x800 action 0x6 frozen
Mar 03 10:25:19 pc kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Mar 03 10:25:19 pc kernel: ata2: SError: { HostInt }
Mar 03 10:25:19 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 03 10:25:19 pc kernel: ata2.00: cmd 61/c8:00:38:06:78/01:00:00:00:00/40 tag 0 ncq dma 233472 out
                                           res 40/00:c0:00:00:72/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Mar 03 10:25:19 pc kernel: ata2.00: status: { DRDY }
[...]
Mar 03 10:25:19 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 03 10:25:19 pc kernel: ata2.00: cmd 61/40:f0:f8:00:78/05:00:00:00:00/40 tag 30 ncq dma 688128 out
                                           res 40/00:c0:00:00:72/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Mar 03 10:25:19 pc kernel: ata2.00: status: { DRDY }
Mar 03 10:25:19 pc kernel: ata2: hard resetting link
Mar 03 10:25:19 pc kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 03 10:25:19 pc kernel: ata2.00: supports DRM functions and may not be fully accessible
Mar 03 10:25:19 pc kernel: ata2.00: supports DRM functions and may not be fully accessible
Mar 03 10:25:19 pc kernel: ata2.00: configured for UDMA/133
Mar 03 10:25:19 pc kernel: ata2: EH complete
Mar 03 10:25:19 pc kernel: ata2.00: Enabling discard_zeroes_data
[...]
Mar 03 10:25:19 pc systemd-coredump[30527]: Process 28448 (systemd-journal) of user 0 dumped core.
Mar 03 10:25:19 pc systemd-coredump[30527]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.92bd8e3e91344ef998c2c08ade695cd2.28448.1583227313000000.lz4
Mar 03 10:25:19 pc systemd-coredump[30527]: Stack trace of thread 28448:
Mar 03 10:25:19 pc systemd-coredump[30527]: #0  0x00007f52360114f5 journal_file_move_to_object (libsystemd-shared-234.so)
Mar 03 10:25:19 pc systemd-coredump[30527]: #1  0x00007f5236012f00 n/a (libsystemd-shared-234.so)
Mar 03 10:25:19 pc systemd-coredump[30527]: #2  0x00007f5236013548 n/a (libsystemd-shared-234.so)
Mar 03 10:25:19 pc systemd-coredump[30527]: #3  0x00007f523601470d journal_file_append_entry (libsystemd-shared-234.so)
[...]
Mar 03 10:25:54 pc systemd-journald[30532]: Journal stopped
Mar 03 10:26:38 pc systemd-journald[30532]: Received SIGTERM from PID 1 (systemd).
Mar 03 10:26:38 pc systemd-journald[30545]: Journal started
Mar 03 10:26:38 pc systemd-journald[30545]: System journal (/var/log/journal/89c660865c00403a9bacef32b6828556) is 856.0M, max 819.2M, 0B free.
[...]

From this point on I did not see any more ATA errors, but the system was still unusable, so I hit Reset:

Mar 03 10:55:58 pc kernel: SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=00:22:11:dd:00:11:a0:36:9f:b2:3f:25:08:00 SRC=1.9.154.170 1.9.5.2 LEN=44 TOS=0x00 PREC=0x00 TTL=56 ID=52696 PROTO=TCP SPT=47944 DPT=655 WINDOW=1024 RES=0x00 SYN URGP=0 OPT (020405B4) 
-- Reboot --
Mar 03 11:05:24 linux-abcd kernel: Linux version 4.12.14-lp151.28.36-default (geeko@buildhost) (gcc version 7.4.1 20190905 [gcc-7-branch revision 275407] (SUSE Linux) ) #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495)

Final note: The IP address and related identifiable information in the log have been anonymized for privacy.
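
Side note on reading the exception lines above: the SAct value in a libata exception line is a bitmask of the NCQ tags that were outstanding when the error was raised. The following is only a minimal Python sketch for decoding it; the helper name `active_tags` is made up for illustration and is not part of any tool mentioned in this report.

```python
def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SAct bitmask."""
    return [tag for tag in range(32) if sact & (1 << tag)]


# SAct values copied from the two exception lines in the log above
first = active_tags(0x7FFFFFDF)   # exception at 10:10:14
second = active_tags(0x7F000007)  # exception at 10:25:19

print(len(first), 5 in first)  # 30 False
print(second)                  # [0, 1, 2, 24, 25, 26, 27, 28, 29, 30]
```

Read this way, the first mask covers every tag from 0 to 30 except tag 5, which is the tag named in "log page 10h reported inactive tag 5".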
Comment 1 Hannes Reinecke 2020-03-03 13:00:21 UTC
Can you give some more details on how you configured 'encrypted swap' ?
dm-crypt? Hardware encryption?
Would it be possible to use 'normal' swap and check if the error persists?
Comment 2 Ulrich Windl 2020-03-04 07:16:20 UTC
(In reply to Hannes Reinecke from comment #1)
> Can you give some more details on how you configured 'encrypted swap' ?
> dm-crypt? Hardware encryption?

It's a LUKS volume via /etc/crypttab:
cr_swap         /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_S4YAMG8N878778W-part2 none       discard,swap

> Would it be possible use 'normal' swap and check if the error persists?

I think it's hard enough to reproduce the memory pressure (if that was the trigger at all). Maybe the encrypted swap does not even matter; maybe the discard option matters; I don't really know.
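
For readers unfamiliar with the format: a crypttab entry has four whitespace-separated fields (mapped name, source device, key file or "none", and a comma-separated option list). The sketch below merely splits the entry quoted above into those fields; `parse_crypttab_line` is a hypothetical helper, not an existing API.

```python
def parse_crypttab_line(line):
    """Split one /etc/crypttab entry into its whitespace-separated fields."""
    fields = line.split()
    name, device = fields[0], fields[1]
    # The third and fourth fields are optional in crypttab
    key = fields[2] if len(fields) > 2 else "none"
    options = fields[3].split(",") if len(fields) > 3 else []
    return {"name": name, "device": device, "key": key, "options": options}


entry = parse_crypttab_line(
    "cr_swap         /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_S4YAMG8N878778W-part2 none       discard,swap"
)
print(entry["name"], entry["options"])  # cr_swap ['discard', 'swap']
```

The two options are the interesting part for this bug: "discard" lets discard (TRIM) requests pass through the encrypted mapping to the SSD, and "swap" marks the mapped device for use as swap space.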

Also, I had been using this setup for a long time. Only a few weeks ago I replaced a dying hard disk with the SSD and adjusted the setup. This was the first time the problem occurred.

(the advantage of the old and noisy Seagate disk was that you could clearly hear when the system started paging, explaining the slowness of the system ;-) With SSD you hardly experience a major slowdown when the system begins paging)

Maybe "ata2: log page 10h reported inactive tag 5" indicates that something's wrong in the kernel. That was the message after which everything began.  I can exclude memory errors, as the system uses an old AMD Phenom II CPU with ECC RAM that is absolutely "rock solid stable".
Comment 3 Miroslav Beneš 2020-06-17 13:01:12 UTC
Hannes, do you need more information from Ulrich?

Ulrich, I suppose the issue is still happening with the newer kernel, or have you managed to find a solution?
Comment 4 Ulrich Windl 2020-06-18 07:21:09 UTC
It hasn't happened again yet. Currently running 4.12.14-lp151.28.52-default:
MiB Mem : 3935.504 total,  425.238 free, 2474.871 used, 1035.395 buff/cache
MiB Swap: 8191.996 total, 8057.703 free,  134.293 used. 1111.168 avail Mem 

But it's possible the memory pressure hasn't been that high since the last occurrence. Maybe I should run Eclipse ;-)
Comment 5 Hannes Reinecke 2020-07-01 07:36:11 UTC
I wouldn't discount the possibility that we're not properly checking for memory allocation problems, or that the code paths catching these issues are improperly done.
I.e., if there's a memory allocation error (in, say, dm-crypt), and dm-crypt improperly handles the failed allocation in that it catches -ENOMEM yet continues with the operation, we might end up in such a situation.

But as you figured, triggering this is extremely hard.
It may be worth a test case, though, and maybe our eager QA team can spin off something here.
Sebastian? Do you think it would be possible to devise a test case for this situation?
Comment 6 Sebastian Chlad 2020-07-01 08:11:16 UTC
Since we do not have many logs and there is no easy, known way of reproducing this problem, devising a test case certainly won't be straightforward.

However, it might be worth looking into the relevant code paths to see whether we can come up with some stress tests that hit conditions as similar as possible.

I will do so, starting by checking whether any existing test touches these specific code paths.
Comment 8 Sebastian Chlad 2020-07-07 08:53:26 UTC
The test is being run in a VM environment; however, even when stressing the memory with encrypted swap, I can't reproduce the problem yet.
The test will of course be run for longer to see whether those ATA problems show up.

I will also try it with a physical SSD, though not with a Samsung one, as I don't have access to one.

In the meantime I would like to point out this known problem of certain incompatibilities between Samsung SSDs and AMD SATA controllers.
See this for instance:
https://eu.community.samsung.com/t5/computers-it/860-evo-250gb-causing-freezes-on-amd-system/td-p/575813

@Dear Ulrich: could you please check your system, specifically which controller it has? Perhaps we are hitting the same issue here?

Since the last comment, have you observed the same problem again?
Comment 9 Ulrich Windl 2020-07-13 09:15:27 UTC
(In reply to Sebastian Chlad from comment #8)
> @Dear Ulrich: could you please see your system, specifically what controller
> there is? Perhaps we are hitting the same issue here?

(from lspci)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode]
00:14.1 IDE interface: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 IDE Controller

# lspci -nn -vv -s 0:14.1
00:14.1 IDE interface [0101]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 IDE Controller [1002:439c] (prog-if 8a [ISA Compatibility mode controller, supports both channels switched to PCI native mode, supports bus mastering])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5002]
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	NUMA node: 0
	Region 0: I/O ports at 01f0 [size=8]
	Region 1: I/O ports at 03f4
	Region 2: I/O ports at 0170 [size=8]
	Region 3: I/O ports at 0374
	Region 4: I/O ports at fa00 [size=16]
	Capabilities: [70] MSI: Enable- Count=1/2 Maskable- 64bit-
		Address: 00000000  Data: 0000
	Kernel driver in use: pata_atiixp
	Kernel modules: pata_atiixp, pata_acpi, ata_generic
# lspci -nn -vv -s 0:11
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode] [1002:4390] (prog-if 01 [AHCI 1.0])
	Subsystem: Gigabyte Technology Co., Ltd GA-MA770-DS3rev2.0 Motherboard [1458:b002]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 22
	NUMA node: 0
	Region 0: I/O ports at ff00 [size=8]
	Region 1: I/O ports at fe00 [size=4]
	Region 2: I/O ports at fd00 [size=8]
	Region 3: I/O ports at fc00 [size=4]
	Region 4: I/O ports at fb00 [size=16]
	Region 5: Memory at fe02f000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [60] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [70] SATA HBA v1.0 InCfgSpace
	Kernel driver in use: ahci
	Kernel modules: ahci
# hwinfo --storage-ctrl
01: None 00.0: 0102 Floppy disk controller                      
  [Created at floppy.112]
  Unique ID: rdCR.3wRL2_g4d2B
  Hardware Class: storage
  Model: "Floppy disk controller"
  I/O Port: 0x3f2 (rw)
  I/O Ports: 0x3f4-0x3f5 (rw)
  I/O Port: 0x3f7 (rw)
  DMA: 2
  Config Status: cfg=no, avail=yes, need=no, active=unknown

20: PCI 11.0: 0106 SATA controller (AHCI 1.0)
  [Created at pci.386]
  Unique ID: 7EWs.d7Jg0Ujuy3F
  SysFS ID: /devices/pci0000:00/0000:00:11.0
  SysFS BusID: 0000:00:11.0
  Hardware Class: storage
  Model: "ATI SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode]"
  Vendor: pci 0x1002 "ATI Technologies Inc"
  Device: pci 0x4390 "SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode]"
  SubVendor: pci 0x1458 "Gigabyte Technology Co., Ltd"
  SubDevice: pci 0xb002 "GA-MA770-DS3rev2.0 Motherboard"
  Driver: "ahci"
  Driver Modules: "ahci"
  I/O Ports: 0xff00-0xff07 (rw)
  I/O Ports: 0xfe00-0xfe03 (rw)
  I/O Ports: 0xfd00-0xfd07 (rw)
  I/O Ports: 0xfc00-0xfc03 (rw)
  I/O Ports: 0xfb00-0xfb0f (rw)
  Memory Range: 0xfe02f000-0xfe02f3ff (rw,non-prefetchable)
  IRQ: 22 (229212 events)
  Module Alias: "pci:v00001002d00004390sv00001458sd0000B002bc01sc06i01"
  Driver Info #0:
    Driver Status: ahci is active
    Driver Activation Cmd: "modprobe ahci"
  Config Status: cfg=no, avail=yes, need=no, active=unknown

35: PCI 14.1: 0101 IDE interface (ISA Compatibility mode controller, supports both channels switched to PCI native mode, supports bus mastering)
  [Created at pci.386]
  Unique ID: Eu86.tIFf2VYrYQC
  SysFS ID: /devices/pci0000:00/0000:00:14.1
  SysFS BusID: 0000:00:14.1
  Hardware Class: storage
  Model: "ATI SB7x0/SB8x0/SB9x0 IDE Controller"
  Vendor: pci 0x1002 "ATI Technologies Inc"
  Device: pci 0x439c "SB7x0/SB8x0/SB9x0 IDE Controller"
  SubVendor: pci 0x1458 "Gigabyte Technology Co., Ltd"
  SubDevice: pci 0x5002 
  Driver: "pata_atiixp"
  Driver Modules: "pata_atiixp"
  I/O Ports: 0x1f0-0x1f7 (rw)
  I/O Port: 0x3f6 (rw)
  I/O Ports: 0x170-0x177 (rw)
  I/O Port: 0x376 (rw)
  I/O Ports: 0xfa00-0xfa0f (rw)
  IRQ: 16 (14552 events)
  Module Alias: "pci:v00001002d0000439Csv00001458sd00005002bc01sc01i8A"
  Driver Info #0:
    Driver Status: pata_atiixp is active
    Driver Activation Cmd: "modprobe pata_atiixp"
  Driver Info #1:
    Driver Status: pata_acpi is not active
    Driver Activation Cmd: "modprobe pata_acpi"
  Driver Info #2:
    Driver Status: ata_generic is active
    Driver Activation Cmd: "modprobe ata_generic"
  Config Status: cfg=no, avail=yes, need=no, active=unknown

MB/BIOS:
  System Info: #1
    Manufacturer: "Gigabyte Technology Co., Ltd."
    Product: "GA-MA770T-UD3P"

CPU:
01: None 00.0: 10103 CPU                                        
  [Created at cpu.462]
  Unique ID: rdCR.j8NaKXDZtZ6
  Hardware Class: cpu
  Arch: X86-64
  Vendor: "AuthenticAMD"
  Model: 16.5.3 "AMD Phenom(tm) II X4 840 Processor"
  Features: fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,mmx,fxsr,sse,sse2,ht,syscall,nx,mmxext,fxsr_opt,pdpe1gb,rdtscp,lm,3dnowext,3dnow,constant_tsc,rep_good,nopl,nonstop_tsc,cpuid,extd_apicid,pni,monitor,cx16,popcnt,lahf_lm,cmp_legacy,svm,extapic,cr8_legacy,abm,sse4a,misalignsse,3dnowprefetch,osvw,ibs,skinit,wdt,hw_pstate,vmmcall,npt,lbrv,svm_lock,nrip_save
  Clock: 800 MHz
  BogoMips: 6429.02
  Cache: 512 kb
  Units/Processor: 4
  Config Status: cfg=no, avail=yes, need=no, active=unknown

> Since the last comment, have you observed the same problem again?

No. Maybe it was fixed in the kernel.
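
For comparing this controller with the one in the linked Samsung thread, the numeric vendor:device IDs printed by "lspci -nn" are more reliable than the marketing names. A small, hedged Python sketch for pulling them out of such a line follows; the helper name `pci_ids` is made up for illustration.

```python
import re

# Matches bracketed [vvvv:dddd] pairs; the 4-digit class code such as
# [0106] has no colon and is therefore not matched.
PCI_ID = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def pci_ids(lspci_line):
    """Return all (vendor, device) hex ID pairs found in one lspci -nn line."""
    return PCI_ID.findall(lspci_line)


line = ("00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] "
        "SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode] [1002:4390] (prog-if 01 [AHCI 1.0])")
print(pci_ids(line))  # [('1002', '4390')]
```

The result agrees with the hwinfo output above, which lists vendor 0x1002 and device 0x4390 for the same controller.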
Comment 10 Ulrich Windl 2020-07-27 12:48:33 UTC
When not expecting it, the unexpected happens:
Today, a few seconds after resuming from an automatic suspend to disk (due to inactivity), Firefox crashed first, then I realized that any access to /home resulted in an I/O error.
This may sound different from the original issue, but /home is an encrypted logical volume (LV), and the error messages in syslog read quite similarly:

Jul 27 08:02:52 pc kernel: ata1.00: exception Emask 0x0 SAct 0x8000000 SErr 0xc0000 action 0x0
Jul 27 08:02:52 pc kernel: ata1.00: irq_stat 0x40000008
Jul 27 08:02:52 pc kernel: ata1: SError: { CommWake 10B8B }
Jul 27 08:02:52 pc kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 27 08:02:52 pc kernel: ata1.00: cmd 60/20:d8:e0:72:19/00:00:22:00:00/40 tag 27 ncq dma 16384 in
                                           res 41/40:00:f1:72:19/00:00:22:00:00/40 Emask 0x409 (media error) <F>
Jul 27 08:02:52 pc kernel: ata1.00: status: { DRDY ERR }
Jul 27 08:02:52 pc kernel: ata1.00: error: { UNC }
Jul 27 08:02:52 pc kernel: ata1: hard resetting link
Jul 27 08:02:52 pc kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 27 08:02:52 pc kernel: ata1.00: configured for UDMA/133
Jul 27 08:02:52 pc kernel: sd 0:0:0:0: [sda] tag#27 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 27 08:02:52 pc kernel: sd 0:0:0:0: [sda] tag#27 Sense Key : Medium Error [current] 
Jul 27 08:02:52 pc kernel: sd 0:0:0:0: [sda] tag#27 Add. Sense: Unrecovered read error - auto reallocate failed
Jul 27 08:02:52 pc kernel: sd 0:0:0:0: [sda] tag#27 CDB: Read(10) 28 00 22 19 72 e0 00 00 20 00
Jul 27 08:02:52 pc kernel: print_req_error: I/O error, dev sda, sector 572093169
Jul 27 08:02:52 pc kernel: ata1: EH complete
Jul 27 08:02:52 pc kernel: XFS (dm-8): metadata I/O error: block 0x7e312e0 ("xfs_trans_read_buf_map") error 5 numblks 32
Jul 27 08:02:52 pc kernel: XFS (dm-8): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Jul 27 08:02:53 pc kernel: XFS (dm-8): xfs_do_force_shutdown(0x8) called from line 3496 of file ../fs/xfs/xfs_inode.c.  Return address = 0xffffffffa12b5c2a
Jul 27 08:02:53 pc kernel: XFS (dm-8): Corruption of in-memory data detected.  Shutting down filesystem
Jul 27 08:02:53 pc kernel: XFS (dm-8): Please umount the filesystem and rectify the problem(s)
...
Jul 27 12:27:15 pc kernel: ata2.00: exception Emask 0x50 SAct 0x780 SErr 0x800 action 0x6 frozen
Jul 27 12:27:15 pc kernel: ata2.00: irq_stat 0x08000000, interface fatal error
Jul 27 12:27:15 pc kernel: ata2: SError: { HostInt }
Jul 27 12:27:15 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jul 27 12:27:15 pc kernel: ata2.00: cmd 61/00:38:00:c0:09/08:00:00:00:00/40 tag 7 ncq dma 1048576 ou
                                    res 40/00:38:00:c0:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Jul 27 12:27:15 pc kernel: ata2.00: status: { DRDY }
Jul 27 12:27:15 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jul 27 12:27:15 pc kernel: ata2.00: cmd 61/00:40:00:c8:09/08:00:00:00:00/40 tag 8 ncq dma 1048576 ou
                                    res 40/00:38:00:c0:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Jul 27 12:27:15 pc kernel: ata2.00: status: { DRDY }
Jul 27 12:27:15 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jul 27 12:27:15 pc kernel: ata2.00: cmd 61/00:48:00:d0:09/08:00:00:00:00/40 tag 9 ncq dma 1048576 ou
                                    res 40/00:38:00:c0:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Jul 27 12:27:15 pc kernel: ata2.00: status: { DRDY }
Jul 27 12:27:15 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Jul 27 12:27:15 pc kernel: ata2.00: cmd 61/58:50:00:d8:09/08:00:00:00:00/40 tag 10 ncq dma 1093632 ou
                                    res 40/00:38:00:c0:09/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Jul 27 12:27:15 pc kernel: ata2.00: status: { DRDY }
Jul 27 12:27:15 pc kernel: ata2: hard resetting link
Jul 27 12:27:15 pc kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 27 12:27:15 pc kernel: ata2.00: supports DRM functions and may not be fully accessible
Jul 27 12:27:15 pc kernel: ata2.00: supports DRM functions and may not be fully accessible
Jul 27 12:27:15 pc kernel: ata2.00: configured for UDMA/133
Jul 27 12:27:15 pc kernel: ata2: EH complete
Jul 27 12:27:15 pc kernel: ata2.00: Enabling discard_zeroes_data

Note: The system has ECC RAM, so in-memory data corruption by bit rot is rather unlikely.
Kernel in use was 4.12.14-lp151.28.52-default
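
The "cmd" lines in these logs can be cross-checked against the byte counts and sector numbers that the kernel prints next to them. The sketch below assumes the field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, that NCQ commands carry the sector count in the feature fields and the tag in the upper five bits of nsect, and that sectors are 512 bytes. These are assumptions about the log layout; the function name `decode_ncq_cmd` is hypothetical.

```python
def decode_ncq_cmd(taskfile):
    """Decode a taskfile string as printed after 'cmd' for an NCQ command.

    Assumed layout (not taken from this report):
    CC/FF:NN:L0:L1:L2/hF:hN:L3:L4:L5/DD
    """
    command, low, high, _device = taskfile.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    sectors = (hob_feature << 8) | feature  # 16-bit sector count
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )  # 48-bit logical block address
    return {
        "opcode": int(command, 16),
        "tag": nsect >> 3,          # NCQ tag lives in bits 7:3
        "sectors": sectors,
        "bytes": sectors * 512,     # assumes 512-byte logical sectors
        "lba": lba,
    }


# The failed READ FPDMA QUEUED from the ata1 log above
read_cmd = decode_ncq_cmd("60/20:d8:e0:72:19/00:00:22:00:00/40")
print(read_cmd["tag"], read_cmd["bytes"], hex(read_cmd["lba"]))
# 27 16384 0x221972e0
```

Under these assumptions the decoded values line up with the rest of the log: tag 27 and 16384 bytes appear on the same line, and 0x221972e0 is the LBA in the "CDB: Read(10) 28 00 22 19 72 e0" line. Combining the LBA bytes of the matching "res" line the same way gives 0x221972f1, which is 572093169 in decimal, the sector reported by print_req_error.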
Comment 11 Miroslav Beneš 2020-09-16 11:26:26 UTC
Frankly, I am not sure what to do here. It would be too easy to blame it on HW incompatibility, which Sebastian mentioned.

Hannes, any idea?

Ulrich, Leap 15.2 has been released in the meantime with the 5.3 kernel, so things may change with that.
Comment 12 Ulrich Windl 2020-09-29 09:45:38 UTC
(In reply to Miroslav Beneš from comment #11)
> Ulrich, Leap 15.2 was released meanwhile with 5.3 kernel, so things may
> change with that.

Actually, I upgraded from Leap 15.1 to 15.2 at the beginning of this month.
Comment 13 Ulrich Windl 2020-10-02 06:07:06 UTC
Created attachment 842196 [details]
Screenshot with top running when freeze occurred

(In reply to Miroslav Beneš from comment #11)
> Frankly, I am not sure what to do here. It would be too easy to blame it on
> HW incompatibility, which Sebastian mentioned.

Counter-argument: On completely different, new hardware (a machine with four Xeon Gold CPUs: 72 cores, 144 threads, > 700GB RAM) I had a kernel "freeze" while performing a memory load test that just uses a few processes that shuffle process-local memory around. The machine in question only had SSDs, configured as two RAIDs via PERC: RAID1 for the base OS, and RAID6 for additional data.
When performing the test without any swap configured, the kernel started to use the OOM killer to terminate a few processes.
After I had configured 5GB swap @ prio=4 and >700GB swap @ prio=2, the kernel froze shortly after it had started using swap. The special thing about the new setup was that the big swap was configured via LVM and a thin volume.
The kernel in question was 5.3.18-24.15-default (actually from SLES15 SP2), and there was plenty of swap space available when this happened.
Interestingly the kernel still responded to PING, but that was all: no VT switching, no console getty, no new SSH connections, no reaction on existing SSH connections. However I had a "top" with sleep time 1s running at the moment when the kernel froze.
I gave it more than 10 minutes to recover, but nothing happened, and there was no message in /var/log/messages after the time top had last refreshed.
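
On the two priorities mentioned above: Linux fills swap areas with a higher priority number first, so the small prio=4 area would be used before the large thin-volume area at prio=2. The following sketch shows that ordering using text in the /proc/swaps column layout; the device names and sizes in SAMPLE are invented for illustration and are not from the machine described here.

```python
def swap_order(proc_swaps_text):
    """Return swap area names ordered from first-used to last-used."""
    rows = []
    for line in proc_swaps_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or malformed lines
        rows.append((int(fields[4]), fields[0]))   # (priority, filename)
    rows.sort(key=lambda row: -row[0])  # higher priority is used first
    return [name for _prio, name in rows]


# Hypothetical content in the format of /proc/swaps
SAMPLE = """Filename                Type        Size        Used    Priority
/dev/dm-3               partition   734003196   0       2
/dev/sda2               partition   5242876     0       4
"""
print(swap_order(SAMPLE))  # ['/dev/sda2', '/dev/dm-3']
```

If the freeze only starts once the higher-priority area is full, that would point at the second area (here the LVM thin volume) rather than at swapping as such, but that remains a guess until it is tested.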
Comment 14 Ulrich Windl 2020-10-02 07:11:22 UTC
I was able to reproduce the "freeze": I had top running with a delay of 0.2s, and shortly after I started my test, top became slower and slower until it finally stopped producing any output. Then the system did not accept any more commands.
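
The "top slows down, then stops" observation could be made measurable by recording a timestamp at a fixed interval (ideally on another host, since a local log may never reach the disk) and looking for gaps afterwards. This is a minimal sketch of that gap check only; the function name `find_stalls`, the threshold factor, and the sample values are assumptions for illustration.

```python
def find_stalls(timestamps_ms, interval_ms, factor=3.0):
    """Return (last_good_timestamp, gap) pairs where the gap between two
    consecutive samples exceeds factor * interval_ms."""
    stalls = []
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        gap = cur - prev
        if gap > factor * interval_ms:
            stalls.append((prev, gap))
    return stalls


# Hypothetical heartbeat log: samples every 200 ms, then a 1.2 s hole
samples = [0, 200, 400, 1600, 1800]
print(find_stalls(samples, 200))  # [(400, 1200)]
```

Growing gaps shortly before the last sample would match the gradual slowdown described above, while a clean stream that simply ends would suggest an abrupt hang.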
Comment 15 Nikolai Nikolaevskii 2021-09-26 12:53:31 UTC
ILL problem with Samsung SATA SSDs. Problem was solved by recent kernel update.

More info: https://forums.opensuse.org/showthread.php/559734-Avoid-using-Samsung-SATA-SSDs-especially-860-and-870