Bug 1219692 - [Build 53.1] openQA test fails in bootloader_start - Kernel panic - not syncing: Fatal exception: panic_on_oops
Summary: [Build 53.1] openQA test fails in bootloader_start - Kernel panic - not synci...
Status: RESOLVED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP6
Classification: openSUSE
Component: Kernel (show other bugs)
Version: unspecified
Hardware: S/390-64 SLES 15
: P2 - High : Critical
Target Milestone: ---
Assignee: Kernel Bugs
QA Contact:
URL: http://openqa.oqa.prg2.suse.org/tests...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-08 01:31 UTC by yutao wang
Modified: 2024-06-25 18:09 UTC (History)
13 users (show)

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
serial out put (7.80 KB, text/plain)
2024-02-08 01:31 UTC, yutao wang
Details
slub_debug boot.log (107.66 KB, text/x-log)
2024-02-08 14:45 UTC, Petr Pavlu
Details

Note You need to log in before you can comment on or make changes to this bug.
Description yutao wang 2024-02-08 01:31:46 UTC
Created attachment 872563 [details]
serial out put

Hardware: s390x
Software: SLE-15-SP6-Online-s390x-Build53.1-Media1.iso
**********************************************************************
## Observation

openQA test in scenario sle-15-SP6-Online-s390x-lvm_thin_provisioning@s390x-kvm fails in
[bootloader_start](http://openqa.oqa.prg2.suse.org/tests/13456376/modules/bootloader_start/steps/43)

## Reproducible

Fails since (at least) http://openqa.oqa.prg2.suse.org/tests/13456376

## Further details

Always latest result in this scenario: [latest](http://openqa.oqa.prg2.suse.org/tests/latest?arch=s390x&distri=sle&flavor=Online&machine=s390x-kvm&test=lvm_thin_provisioning&version=15-SP6)
********************************************************************
Test steps:
1. Install product: SLE-15-SP6-Online-s390x-Build53.1-Media1.iso
2. Did bootloader start
Got error message:
[    4.265998][  T648] Call Trace:

[    4.266000][  T648]  [<00000000366d065a>] __kmem_cache_alloc_node+0x362/0x5d8 

[    4.266004][  T648]  [<000000003664af26>] __kmalloc_node_track_caller+0x66/0x1e8 

[    4.266007][  T648]  [<000000003663abec>] kstrdup+0x54/0x90 

[    4.266010][  T648]  [<000000003649e54a>] mod_sysfs_setup+0x402/0x820 

[    4.266013][  T648]  [<000000003649aca4>] load_module+0x20bc/0x2200 

[    4.266018][  T648]  [<000000003649afc2>] __do_sys_init_module+0x1da/0x210 

[    4.266021][  T648]  [<0000000036d8a326>] __do_syscall+0x1d6/0x200 

[    4.266026][  T648]  [<0000000036d9d2a0>] system_call+0x70/0x98 

[    4.266029][  T648] Last Breaking-Event-Address:

[    4.266030][  T648]  [<00000000366d0410>] __kmem_cache_alloc_node+0x118/0x5d8

[    4.266033][  T648] Kernel panic - not syncing: Fatal exception: panic_on_oops
Comment 1 Timo Jyrinki 2024-02-08 09:12:55 UTC
This bug prevents all of testing of s390x Build53.1.
Comment 2 Takashi Iwai 2024-02-08 09:30:46 UTC
Which was the last working kernel?  Suppose this being a regression, we need to narrow down the range.
Comment 3 Denis Kirjanov 2024-02-08 10:11:21 UTC
(In reply to Takashi Iwai from comment #2)
> Which was the last working kernel?  Suppose this being a regression, we need
> to narrow down the range.

Is is always on loading netfs?
Comment 4 Eugenio Paolantonio 2024-02-08 10:29:46 UTC
Previous working kernel was the one shipped with Beta3-202401 i.e. bd9aaa617d13416dbc870f6991ef0556c1263b8c [0]


Build 53.1 shipped with the next week's submission [1], I *think* that the commit tip is 03f03f8a7447a61312a4934278098d9a41d9f6bc but I'm not 100% sure.


[0] https://build.suse.de/request/show/319318
[1] https://build.suse.de/request/show/320105
Comment 5 Vlastimil Babka 2024-02-08 10:40:38 UTC
Can you make it boot with slub_debug kernel parameter?
Comment 6 Anthony Iliopoulos 2024-02-08 12:27:20 UTC
kernel at /mounts/dist/ibs/SUSE:/SLE-15-SP6:/GA/standard/s390x/kernel-default-6.4.0-150600.6.3.s390x.rpm

../mm/slub.c:
395             return freelist_dereference(s, object + s->offset);
   0x000000000044464c <+852>:   lgr     %r2,%r10
   0x0000000000444650 <+856>:   algf    %r2,40(%r11)

../include/uapi/linux/swab.h:
139             return __swab64(y);
   0x0000000000444656 <+862>:   lrvgr   %r1,%r2

../mm/slub.c:
377             return (void *)((unsigned long)ptr ^ s->random ^
   0x000000000044465a <+866>:   lg      %r2,0(%r2)

so this is:

freelist_dereference() -> freelist_ptr() -> __builtin_bswap64()

the crash trace shows the faulting instruction within bswap64:

>00000000366d065a: e32020000004	lg	%r2,0(%r2)

so presumably %r2 (8730a257d0825dcf) cannot be dereferenced.

Not sure what to make of this.
Comment 7 Petr Pavlu 2024-02-08 14:45:35 UTC
Created attachment 872591 [details]
slub_debug boot.log
Comment 8 Petr Pavlu 2024-02-08 14:47:53 UTC
In case someone would like to check locally, the problem can be reproduced
under qemu-system-s390x as follows:
* Grab the installation image from
  https://download.suse.de/install/SLE-15-SP6-Online-TEST/SLE-15-SP6-Online-s390x-Build53.1-Media1.iso
  and mount it via loopback at $CDROM.
* Run:
  $ qemu-system-s390x -m 2G --nographic -kernel $CDROM/boot/s390x/linux -initrd $CDROM/boot/s390x/initrd

Kernel command line options can be specified via the "-append" parameter.

(In reply to Vlastimil Babka from comment #5)
> Can you make it boot with slub_debug kernel parameter?

Attachment 872591 [details] is a log with slub_debug, using the above steps.
Comment 9 Takashi Iwai 2024-02-08 15:22:04 UTC
(In reply to Petr Pavlu from comment #8)
> (In reply to Vlastimil Babka from comment #5)
> > Can you make it boot with slub_debug kernel parameter?
> 
> Attachment 872591 [details] is a log with slub_debug, using the above steps.

Looks like that the boot succeeded with this option?  If so, it can be some race that caused a crash (or memory corruption leading to a crash).
Comment 10 Takashi Iwai 2024-02-08 15:43:33 UTC
(In reply to Petr Pavlu from comment #8)
> In case someone would like to check locally, the problem can be reproduced
> under qemu-system-s390x as follows:
> * Grab the installation image from
>  
> https://download.suse.de/install/SLE-15-SP6-Online-TEST/SLE-15-SP6-Online-
> s390x-Build53.1-Media1.iso
>   and mount it via loopback at $CDROM.
> * Run:
>   $ qemu-system-s390x -m 2G --nographic -kernel $CDROM/boot/s390x/linux
> -initrd $CDROM/boot/s390x/initrd

I tried it now, and this showed a crash, but at a different point:

[    0.342684][    T0] Linux version 6.4.0-150600.6-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.41.0.20230908-150100.7.46) #1 SMP Thu Feb  1 20:06:48 UTC 2024 (03f03f8)
....
[   11.372544][    T1] mousedev: PS/2 mouse device common for all mice
[   11.372805][    T1] drop_monitor: Initializing network drop monitor service
[   11.440016][    T1] NET: Registered PF_INET6 protocol family
[   11.543004][    T1] Segment Routing with IPv6
[   11.543113][    T1] RPL Segment Routing with IPv6
[   11.543392][    T1] In-situ OAM (IOAM) with IPv6
[   11.544885][    T1] cio: Channel measurement facility initialized using format extended (mode autodetected)
[   11.547122][   T34] sclp_sd: Store Data request failed (eq=2, di=3, response=0x40f0, flags=0x00, status=0, rc=-5)
[   11.548242][    T1] ap: The hardware system does not support AP instructions
[   11.556554][    T1] Unable to handle kernel pointer dereference in virtual kernel address space
[   11.556599][    T1] Failing address: 500000000371e000 TEID: 500000000371e803
[   11.556634][    T1] Fault in home space mode while using kernel ASCE.
[   11.556758][    T1] AS:0000000001960007 R3:000000007c298007 S:000000007c27c000 P:000000000371e13f 
[   11.557081][    T1] Oops: 0038 ilc:3 [#1] SMP 
[   11.557175][    T1] Modules linked in:
[   11.557267][    T1] Supported: Yes
[   11.557380][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.6-default #1 SLE15-SP6 1cf778db29c1e7688f32f84b2af5d36b920ea634
[   11.557533][    T1] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[   11.557618][    T1] Krnl PSW : 0704d00180000000 000000000044465a (__kmem_cache_alloc_node+0x362/0x5d8)
[   11.557856][    T1]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[   11.557908][    T1] Krnl GPRS: 0000000000000cc0 18ed710300000050 500000000371ed18 0000000000000cc0
[   11.557946][    T1]            ffffffffffffffff 000000000000000e 000000000054aa2c 0000000000000cc0
[   11.557974][    T1]            00000000012c8ad8 0000000000000288 500000000371ed10 0000000001c81300
[   11.558002][    T1]            0000000001e78000 ffffffffffffffff 0000000000444346 0000037fffb13808
[   11.558430][    T1] Krnl Code: 000000000044464c: b904002a		lgr	%r2,%r10
[   11.558430][    T1]            0000000000444650: e320b028001a	algf	%r2,40(%r11)
[   11.558430][    T1]           #0000000000444656: b90f0012		lrvgr	%r1,%r2
[   11.558430][    T1]           >000000000044465a: e32020000004	lg	%r2,0(%r2)
[   11.558430][    T1]            0000000000444660: e320b0b80082	xg	%r2,184(%r11)
[   11.558430][    T1]            0000000000444666: b9820012		xgr	%r1,%r2
[   11.558430][    T1]            000000000044466a: e310f0c00024	stg	%r1,192(%r15)
[   11.558430][    T1]            0000000000444670: e310f0c00004	lg	%r1,192(%r15)
[   11.558784][    T1] Call Trace:
[   11.559003][    T1]  [<000000000044465a>] __kmem_cache_alloc_node+0x362/0x5d8 
[   11.559212][    T1]  [<00000000003bef26>] __kmalloc_node_track_caller+0x66/0x1e8 
[   11.559253][    T1]  [<00000000003aebec>] kstrdup+0x54/0x90 
[   11.559281][    T1]  [<000000000054aa2c>] __kernfs_new_node+0x44/0x238 
[   11.559308][    T1]  [<000000000054c66c>] kernfs_new_node+0x5c/0xa0 
[   11.559334][    T1]  [<000000000054cb88>] kernfs_create_dir_ns+0x30/0xc8 
[   11.559360][    T1]  [<000000000055059e>] sysfs_create_dir_ns+0x76/0xf0 
[   11.559388][    T1]  [<0000000000ad8e94>] kobject_add_internal+0xdc/0x2f8 
[   11.559418][    T1]  [<0000000000ad9468>] kobject_init_and_add+0x70/0xb0 
[   11.559443][    T1]  [<00000000014f5028>] locate_module_kobject+0xa0/0x110 
[   11.559474][    T1]  [<00000000014f51a8>] param_sysfs_builtin_init+0x110/0x200 
[   11.559501][    T1]  [<0000000000100b68>] do_one_initcall+0x40/0x210 
[   11.559528][    T1]  [<00000000014e89d8>] kernel_init_freeable+0x258/0x3c8 
[   11.559554][    T1]  [<0000000000aff36e>] kernel_init+0x2e/0x180 
[   11.559583][    T1]  [<0000000000102b00>] __ret_from_fork+0x40/0x58 
[   11.559609][    T1]  [<0000000000b112d2>] ret_from_fork+0xa/0x30 
[   11.559652][    T1] Last Breaking-Event-Address:
[   11.559671][    T1]  [<0000000000444410>] __kmem_cache_alloc_node+0x118/0x5d8
[   11.559883][    T1] Kernel panic - not syncing: Fatal exception: panic_on_oops
Comment 11 Anthony Iliopoulos 2024-02-08 15:47:59 UTC
(In reply to Takashi Iwai from comment #9)
> (In reply to Petr Pavlu from comment #8)
> > (In reply to Vlastimil Babka from comment #5)
> > > Can you make it boot with slub_debug kernel parameter?
> > 
> > Attachment 872591 [details] is a log with slub_debug, using the above steps.
> 
> Looks like that the boot succeeded with this option?  If so, it can be some
> race that caused a crash (or memory corruption leading to a crash).

yes seems to be booting with slub_debug since the code is doing the fixing.

reverting commit 7f146b24297b ("driver core: Replace kstrdup() + strreplace() with kstrdup_and_replace()") seems to be fixing the issue (can boot ok without slub_debug), but I am not sure this is the root case.

also the issue doesn't reproduce upstream.
Comment 12 Takashi Iwai 2024-02-08 15:56:36 UTC
(In reply to Anthony Iliopoulos from comment #11)
> (In reply to Takashi Iwai from comment #9)
> > (In reply to Petr Pavlu from comment #8)
> > > (In reply to Vlastimil Babka from comment #5)
> > > > Can you make it boot with slub_debug kernel parameter?
> > > 
> > > Attachment 872591 [details] is a log with slub_debug, using the above steps.
> > 
> > Looks like that the boot succeeded with this option?  If so, it can be some
> > race that caused a crash (or memory corruption leading to a crash).
> 
> yes seems to be booting with slub_debug since the code is doing the fixing.
> 
> reverting commit 7f146b24297b ("driver core: Replace kstrdup() +
> strreplace() with kstrdup_and_replace()") seems to be fixing the issue (can
> boot ok without slub_debug), but I am not sure this is the root case.
> 
> also the issue doesn't reproduce upstream.

Ah, that explains the possible memory corruption.

I believe the missing piece is rather the upstream commit d01a77afd6bef1b3a2ed15e8ca6887ca7da0cddc
    lib/string_helpers: Change returned value of the strreplace()

Without this, strreplace() returns the end of the string, and the use in commit  7f146b24297b assumes this fix.
That is, the return value (and the stored value) of device_get_devnode() is utterly bogus as of now.
Comment 13 Takashi Iwai 2024-02-08 16:12:30 UTC
Unfortunately the things aren't easy.  We have other user of strreplace() in fs/jbd2/journal.c.
So, at least, the commit 7afb6d8fa81fd8d332f70ead5e35d8c90abb8165
    jbd2: Avoid printing outside the boundary of the buffer
needs to be backported as well.

Above all, the remaining question is whether it's preferred to change the existing API function behavior for external drivers.
I *guess* it's likely OK, as we'd align with the current upstream.  But this needs a bit more cares...
Comment 14 Takashi Iwai 2024-02-08 16:32:36 UTC
As a safer option, I'm going to revert the backports of commit 045ad46441a and 7f146b24297b.  The former can be the likely culprit of the memory corruption, and the latter is obviously dangerous to use.

If we all agree for aligning with the upstream changes, we can re-backport those with other fixes later (but before kABI freeze :)
Comment 17 Takashi Iwai 2024-02-12 11:00:22 UTC
After discussions, I re-backported the full patches again but with the proper missing fixes at this time.

The (re-)backported upstream commits are:
7afb6d8fa81fd8d332f70ead5e35d8c90abb8165
 jbd2: Avoid printing outside the boundary of the buffer
d01a77afd6bef1b3a2ed15e8ca6887ca7da0cddc
 lib/string_helpers: Change returned value of the strreplace()
045ad46441a1cf2ff358af0126f8c57f337804fa
 lib/string_helpers: Add kstrdup_and_replace() helper
7f146b24297bf58c19f916aee530155c830c5598
 driver core: Replace kstrdup() + strreplace() with kstrdup_and_replace()
Comment 18 Takashi Iwai 2024-02-13 13:56:35 UTC
Let's close.  If the problem still persists with the latest kernel, please reopen.
Comment 20 Radoslav Tzvetkov 2024-02-26 10:24:20 UTC
A SR mentioning this bug was successfully integrated into Build 58.1. Please, if needed, set it to RESOLVED FIXED for QE to verify it.