Bug 1202203 - kernel 5.19 causes lots of openQA failures (I/O errors+crashes)
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: Other Other
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assigned To: Jiri Slaby
Reported: 2022-08-08 05:53 UTC by Jiri Slaby
Modified: 2022-08-11 08:34 UTC

Description Jiri Slaby 2022-08-08 05:53:56 UTC
Like these:
https://openqa.opensuse.org/tests/2502148
loop2: detected capacity change from 0 to 72264
EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 57375 starting block 137216)
Buffer I/O error on device zram0, logical block 137216
Buffer I/O error on device zram0, logical block 137217
...
SQUASHFS error: xz decompression failed, data probably corrupt
SQUASHFS error: Failed to read block 0x2e41680: -5
SQUASHFS error: xz decompression failed, data probably corrupt
SQUASHFS error: Failed to read block 0x2e41680: -5
Bus error


https://openqa.opensuse.org/tests/2502145
FS-Cache: Loaded
begin 644 ldconfig.core.pid_2094.sig_7.time_1659859442


https://openqa.opensuse.org/tests/2502146
FS-Cache: Loaded
begin 644 Xorg.bin.core.pid_3733.sig_6.time_1659858784


https://openqa.opensuse.org/tests/2502148
EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 57375 starting block 137216)
Buffer I/O error on device zram0, logical block 137216
Buffer I/O error on device zram0, logical block 137217


https://openqa.opensuse.org/tests/2502154
[   13.158090][  T634] FS-Cache: Loaded
[  525.627024][    C0] sysrq: Show State
...


* These are various failures: crashes of ldconfig and Xorg, I/O failures on zram, and the last one is likely a lockup, where something invoked sysrq after a stall of about 500 s.
* I have not been able to reproduce this with the provided images from the assets (yet).
* CCing the fs and block folks; they might have a clue.
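
To make the grouping above easier to repeat on other openQA serial logs, here is a minimal sketch that counts the failure signatures quoted in this report. The category names and the `classify` helper are made up for illustration; the patterns are taken only from the log lines shown here and may miss other variants.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts in this report. The category
# names are hypothetical labels chosen for this sketch.
PATTERNS = {
    "zram_io_error": re.compile(
        r"(?:Buffer I/O error on device|EXT4-fs warning \(device) zram\d+"
    ),
    "squashfs_error": re.compile(r"SQUASHFS error:"),
    "core_dump": re.compile(r"begin 644 \S+\.core\.pid_\d+\.sig_\d+"),
    "sysrq_stall": re.compile(r"sysrq: Show State"),
}


def classify(log_text):
    """Count how many lines of log_text match each failure signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = """\
Buffer I/O error on device zram0, logical block 137216
SQUASHFS error: xz decompression failed, data probably corrupt
begin 644 ldconfig.core.pid_2094.sig_7.time_1659859442
[  525.627024][    C0] sysrq: Show State
"""
print(dict(classify(sample)))
```

A log with only `core_dump` hits and no `zram_io_error` hits would still be consistent with the same root cause, so the counts are a triage aid, not a diagnosis.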

They all look to me like a zram or FS-Cache failure. Are there any tests that you or I could run on 5.19 to exercise those?
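
One possible way to exercise a zram-backed filesystem is a simple write, fsync and read-back check, sketched below. This is not a known reproducer for this bug. It assumes a filesystem on a zram device is already mounted at a directory of your choice (the `/mnt/zram-test` path in the usage note is hypothetical), and `make_payload` and `stress` are names invented for this sketch. It mixes highly compressible and random data, since zram stores those differently.

```python
import hashlib
import os


def make_payload(index, size):
    """Even-numbered files are highly compressible, odd ones are random."""
    if index % 2 == 0:
        return bytes([index % 251]) * size
    return os.urandom(size)


def stress(directory, files=8, size=4 * 1024 * 1024):
    """Write files, fsync them, read them back; return a list of problems."""
    problems = []
    digests = {}
    for i in range(files):
        path = os.path.join(directory, f"stress-{i}.bin")
        data = make_payload(i, size)
        try:
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                # Force writeback so that write errors surface here.
                os.fsync(f.fileno())
        except OSError as exc:
            problems.append((path, f"write failed: {exc}"))
            continue
        digests[path] = hashlib.sha256(data).hexdigest()

    # Note: without dropping the page cache first, these reads may be
    # served from memory and not from the block device.
    for path, expected in digests.items():
        try:
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError as exc:
            problems.append((path, f"read failed: {exc}"))
            continue
        if actual != expected:
            problems.append((path, "checksum mismatch"))
    return problems
```

Calling `stress("/mnt/zram-test")` returns an empty list when every file was written and read back intact. Watching the kernel log for `Buffer I/O error on device zram0` while it runs is probably as useful as the return value.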
Comment 1 Jiri Slaby 2022-08-08 06:09:47 UTC
Interesting, I've just hit:
> init[1]: segfault at 18 ip 00007fb6154b4c81 sp 00007ffc243ed600 error 6 in libc.so.6[7fb61543f000+185000]
> Code: 41 5f c3 66 0f 1f 44 00 00 42 f6 44 10 08 01 0f 84 04 01 00 00 48 83 e1 fe 48 89 48 08 49 8b 47 70 49 89 5f 70 66 48 0f 6e c0 <48> 89 58 18 0f 16 44 24 08 48 81 fd ff 03 00 00 76 08 66 0f ef c9
> ***  signal 11 ***
> malloc(): unsorted double linked list corrupted
> traps: init[1] general protection fault ip:7fb61543f8b9 sp:7ffc243ebf40 error:0 in libc.so.6[7fb61543f000+185000]
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> CPU: 0 PID: 1 Comm: init Not tainted 5.19.0-1-default #1 openSUSE Tumbleweed e1df13166a33f423514290c702e43cfbb2b5b575

So this looks like a memory corruption to me. Let's try a KASAN kernel inside the image (finding out how ATM).
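
For reference, the segfault line above can be decoded mechanically. The sketch below assumes the x86 page-fault error-code layout (bit 0: page was present, bit 1: write access, bit 2: user mode) and the message format exactly as quoted; `decode_segfault` is a helper name invented here. The computed offset is relative to the start of the mapping shown in brackets, which is not necessarily the offset within the libc file.

```python
import re

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Extract the faulting address, mapping offset and error-code bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "fault_address": int(m["addr"], 16),
        # Offset of the instruction pointer from the mapping start.
        "offset_in_mapping": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = (
    "init[1]: segfault at 18 ip 00007fb6154b4c81 sp 00007ffc243ed600 "
    "error 6 in libc.so.6[7fb61543f000+185000]"
)
info = decode_segfault(line)
print(info["object"], hex(info["offset_in_mapping"]), info)
```

Error 6 decodes to a user-mode write to a page that was not present, and the faulting address 0x18 is a small offset from NULL. That fits a store through a pointer that had been overwritten with zero, which is in line with the malloc list corruption reported right after it.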
Comment 2 Jiri Slaby 2022-08-08 10:42:40 UTC
(In reply to Jiri Slaby from comment #1)
> So this looks like a memory corruption to me. Let's try a KASAN kernel
> inside the image (finding out how ATM).

Not very helpful:
kasan: KernelAddressSanitizer initialized
...
zram: module verification failed: signature and/or required key missing - tainting kernel
zram: Added device: zram0
zram0: detected capacity change from 0 to 2097152
EXT4-fs (zram0): mounting ext2 file system using the ext4 subsystem
EXT4-fs (zram0): mounted filesystem without journal. Quota mode: none.
EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 16386 starting block 159744)
Buffer I/O error on device zram0, logical block 159744
Buffer I/O error on device zram0, logical block 159745

I am actually out of ideas.
Comment 3 Jiri Slaby 2022-08-09 06:07:43 UTC
(In reply to Jiri Slaby from comment #2)
> I am actually out of ideas.

Upstream report:
https://lore.kernel.org/all/702b3187-14bf-b733-263b-20272f53105d@kernel.org/
Comment 4 Jiri Slaby 2022-08-09 08:14:14 UTC
This is likely the culprit:
commit e7be8d1dd983156bbdd22c0319b71119a8fbb697
Author: Alexey Romanov <avromanov@sberdevices.ru>
Date:   Thu May 12 20:23:07 2022 -0700

    zram: remove double compression logic 

Resubmitted with that reverted. Let's see.
Comment 5 Dominique Leuenberger 2022-08-09 14:39:38 UTC
(In reply to Jiri Slaby from comment #4)
> This is likely the culprit:
> commit e7be8d1dd983156bbdd22c0319b71119a8fbb697
> Author: Alexey Romanov <avromanov@sberdevices.ru>
> Date:   Thu May 12 20:23:07 2022 -0700
> 
>     zram: remove double compression logic 
> 
> Resubmitted with that reverted. Let's see.

https://openqa.opensuse.org/tests/overview?state=assigned&state=setup&state=running&state=uploading&state=scheduled&distri=opensuse&version=Staging%3AJ&build=J.481.4&groupid=2

The first tests have already passed the installer and are now busy testing the desktop-related features.
Comment 6 Jiri Slaby 2022-08-09 15:42:13 UTC
(In reply to Dominique Leuenberger from comment #5)
> First tests have passed over the installer already and are busy testing the
> desktop related features

Yeah, I've just noticed. There is an unrelated "breakage":
SELinux: CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE is non-zero. This is deprecated and will be rejected in a future kernel release.

I will create a bug for it and set it to 0.
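
A quick way to confirm that a given kernel config carries the intended value is sketched below. It assumes the usual `NAME=value` config file format; `parse_kconfig` and `checkreqprot_is_zero` are hypothetical helper names, and the config text would normally be read from wherever the distribution ships it (often `/boot/config-<release>`, which is an assumption here).

```python
OPTION = "CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE"


def parse_kconfig(text):
    """Return a dict of CONFIG_* options found in kernel config text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" are skipped on purpose.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def checkreqprot_is_zero(text):
    """True only if the option is present and explicitly set to 0."""
    return parse_kconfig(text).get(OPTION) == "0"


sample = """\
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_DISABLE is not set
"""
print(checkreqprot_is_zero(sample))
```

For the sample above the check fails because the value is 1, which is the situation that triggers the deprecation message.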
Comment 7 Jiri Slaby 2022-08-11 08:34:04 UTC
Fixed.