Bugzilla – Bug 1219692
[Build 53.1] openQA test fails in bootloader_start - Kernel panic - not syncing: Fatal exception: panic_on_oops
Last modified: 2024-06-25 18:09:13 UTC
Created attachment 872563 [details] serial out put Hardware: s390x Software: SLE-15-SP6-Online-s390x-Build53.1-Media1.iso ********************************************************************** ## Observation openQA test in scenario sle-15-SP6-Online-s390x-lvm_thin_provisioning@s390x-kvm fails in [bootloader_start](http://openqa.oqa.prg2.suse.org/tests/13456376/modules/bootloader_start/steps/43) ## Reproducible Fails since (at least) http://openqa.oqa.prg2.suse.org/tests/13456376 ## Further details Always latest result in this scenario: [latest](http://openqa.oqa.prg2.suse.org/tests/latest?arch=s390x&distri=sle&flavor=Online&machine=s390x-kvm&test=lvm_thin_provisioning&version=15-SP6) ******************************************************************** Test steps: 1. Install product: SLE-15-SP6-Online-s390x-Build53.1-Media1.iso 2. Did bootloader start Got error message: [ 4.265998][ T648] Call Trace: [ 4.266000][ T648] [<00000000366d065a>] __kmem_cache_alloc_node+0x362/0x5d8 [ 4.266004][ T648] [<000000003664af26>] __kmalloc_node_track_caller+0x66/0x1e8 [ 4.266007][ T648] [<000000003663abec>] kstrdup+0x54/0x90 [ 4.266010][ T648] [<000000003649e54a>] mod_sysfs_setup+0x402/0x820 [ 4.266013][ T648] [<000000003649aca4>] load_module+0x20bc/0x2200 [ 4.266018][ T648] [<000000003649afc2>] __do_sys_init_module+0x1da/0x210 [ 4.266021][ T648] [<0000000036d8a326>] __do_syscall+0x1d6/0x200 [ 4.266026][ T648] [<0000000036d9d2a0>] system_call+0x70/0x98 [ 4.266029][ T648] Last Breaking-Event-Address: [ 4.266030][ T648] [<00000000366d0410>] __kmem_cache_alloc_node+0x118/0x5d8 [ 4.266033][ T648] Kernel panic - not syncing: Fatal exception: panic_on_oops
This bug prevents all of testing of s390x Build53.1.
Which was the last working kernel? Suppose this being a regression, we need to narrow down the range.
(In reply to Takashi Iwai from comment #2) > Which was the last working kernel? Suppose this being a regression, we need > to narrow down the range. Is is always on loading netfs?
Previous working kernel was the one shipped with Beta3-202401 i.e. bd9aaa617d13416dbc870f6991ef0556c1263b8c [0] Build 53.1 shipped with the next week's submission [1], I *think* that the commit tip is 03f03f8a7447a61312a4934278098d9a41d9f6bc but I'm not 100% sure. [0] https://build.suse.de/request/show/319318 [1] https://build.suse.de/request/show/320105
Can you make it boot with slub_debug kernel parameter?
kernel at /mounts/dist/ibs/SUSE:/SLE-15-SP6:/GA/standard/s390x/kernel-default-6.4.0-150600.6.3.s390x.rpm ../mm/slub.c: 395 return freelist_dereference(s, object + s->offset); 0x000000000044464c <+852>: lgr %r2,%r10 0x0000000000444650 <+856>: algf %r2,40(%r11) ../include/uapi/linux/swab.h: 139 return __swab64(y); 0x0000000000444656 <+862>: lrvgr %r1,%r2 ../mm/slub.c: 377 return (void *)((unsigned long)ptr ^ s->random ^ 0x000000000044465a <+866>: lg %r2,0(%r2) so this is: freelist_dereference() -> freelist_ptr() -> __builtin_bswap64() the crash trace shows the faulting instruction within bswap64: >00000000366d065a: e32020000004 lg %r2,0(%r2) so presumably %r2 (8730a257d0825dcf) cannot be dereferenced. Not sure what to make of this.
Created attachment 872591 [details] slub_debug boot.log
In case someone would like to check locally, the problem can be reproduced under qemu-system-s390x as follows: * Grab the installation image from https://download.suse.de/install/SLE-15-SP6-Online-TEST/SLE-15-SP6-Online-s390x-Build53.1-Media1.iso and mount it via loopback at $CDROM. * Run: $ qemu-system-s390x -m 2G --nographic -kernel $CDROM/boot/s390x/linux -initrd $CDROM/boot/s390x/initrd Kernel command line options can be specified via the "-append" parameter. (In reply to Vlastimil Babka from comment #5) > Can you make it boot with slub_debug kernel parameter? Attachment 872591 [details] is a log with slub_debug, using the above steps.
(In reply to Petr Pavlu from comment #8) > (In reply to Vlastimil Babka from comment #5) > > Can you make it boot with slub_debug kernel parameter? > > Attachment 872591 [details] is a log with slub_debug, using the above steps. Looks like that the boot succeeded with this option? If so, it can be some race that caused a crash (or memory corruption leading to a crash).
(In reply to Petr Pavlu from comment #8) > In case someone would like to check locally, the problem can be reproduced > under qemu-system-s390x as follows: > * Grab the installation image from > > https://download.suse.de/install/SLE-15-SP6-Online-TEST/SLE-15-SP6-Online- > s390x-Build53.1-Media1.iso > and mount it via loopback at $CDROM. > * Run: > $ qemu-system-s390x -m 2G --nographic -kernel $CDROM/boot/s390x/linux > -initrd $CDROM/boot/s390x/initrd I tried it now, and this showed a crash, but at a different point: [ 0.342684][ T0] Linux version 6.4.0-150600.6-default (geeko@buildhost) (gcc (SUSE Linux) 7.5.0, GNU ld (GNU Binutils; SUSE Linux Enterprise 15) 2.41.0.20230908-150100.7.46) #1 SMP Thu Feb 1 20:06:48 UTC 2024 (03f03f8) .... [ 11.372544][ T1] mousedev: PS/2 mouse device common for all mice [ 11.372805][ T1] drop_monitor: Initializing network drop monitor service [ 11.440016][ T1] NET: Registered PF_INET6 protocol family [ 11.543004][ T1] Segment Routing with IPv6 [ 11.543113][ T1] RPL Segment Routing with IPv6 [ 11.543392][ T1] In-situ OAM (IOAM) with IPv6 [ 11.544885][ T1] cio: Channel measurement facility initialized using format extended (mode autodetected) [ 11.547122][ T34] sclp_sd: Store Data request failed (eq=2, di=3, response=0x40f0, flags=0x00, status=0, rc=-5) [ 11.548242][ T1] ap: The hardware system does not support AP instructions [ 11.556554][ T1] Unable to handle kernel pointer dereference in virtual kernel address space [ 11.556599][ T1] Failing address: 500000000371e000 TEID: 500000000371e803 [ 11.556634][ T1] Fault in home space mode while using kernel ASCE. [ 11.556758][ T1] AS:0000000001960007 R3:000000007c298007 S:000000007c27c000 P:000000000371e13f [ 11.557081][ T1] Oops: 0038 ilc:3 [#1] SMP [ 11.557175][ T1] Modules linked in: [ 11.557267][ T1] Supported: Yes [ 11.557380][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.6-default #1 SLE15-SP6 1cf778db29c1e7688f32f84b2af5d36b920ea634 [ 11.557533][ T1] Hardware name: QEMU 8561 QEMU (KVM/Linux) [ 11.557618][ T1] Krnl PSW : 0704d00180000000 000000000044465a (__kmem_cache_alloc_node+0x362/0x5d8) [ 11.557856][ T1] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [ 11.557908][ T1] Krnl GPRS: 0000000000000cc0 18ed710300000050 500000000371ed18 0000000000000cc0 [ 11.557946][ T1] ffffffffffffffff 000000000000000e 000000000054aa2c 0000000000000cc0 [ 11.557974][ T1] 00000000012c8ad8 0000000000000288 500000000371ed10 0000000001c81300 [ 11.558002][ T1] 0000000001e78000 ffffffffffffffff 0000000000444346 0000037fffb13808 [ 11.558430][ T1] Krnl Code: 000000000044464c: b904002a lgr %r2,%r10 [ 11.558430][ T1] 0000000000444650: e320b028001a algf %r2,40(%r11) [ 11.558430][ T1] #0000000000444656: b90f0012 lrvgr %r1,%r2 [ 11.558430][ T1] >000000000044465a: e32020000004 lg %r2,0(%r2) [ 11.558430][ T1] 0000000000444660: e320b0b80082 xg %r2,184(%r11) [ 11.558430][ T1] 0000000000444666: b9820012 xgr %r1,%r2 [ 11.558430][ T1] 000000000044466a: e310f0c00024 stg %r1,192(%r15) [ 11.558430][ T1] 0000000000444670: e310f0c00004 lg %r1,192(%r15) [ 11.558784][ T1] Call Trace: [ 11.559003][ T1] [<000000000044465a>] __kmem_cache_alloc_node+0x362/0x5d8 [ 11.559212][ T1] [<00000000003bef26>] __kmalloc_node_track_caller+0x66/0x1e8 [ 11.559253][ T1] [<00000000003aebec>] kstrdup+0x54/0x90 [ 11.559281][ T1] [<000000000054aa2c>] __kernfs_new_node+0x44/0x238 [ 11.559308][ T1] [<000000000054c66c>] kernfs_new_node+0x5c/0xa0 [ 11.559334][ T1] [<000000000054cb88>] kernfs_create_dir_ns+0x30/0xc8 [ 11.559360][ T1] [<000000000055059e>] sysfs_create_dir_ns+0x76/0xf0 [ 11.559388][ T1] [<0000000000ad8e94>] kobject_add_internal+0xdc/0x2f8 [ 11.559418][ T1] [<0000000000ad9468>] kobject_init_and_add+0x70/0xb0 [ 11.559443][ T1] [<00000000014f5028>] locate_module_kobject+0xa0/0x110 [ 11.559474][ T1] [<00000000014f51a8>] param_sysfs_builtin_init+0x110/0x200 [ 11.559501][ T1] [<0000000000100b68>] do_one_initcall+0x40/0x210 [ 11.559528][ T1] [<00000000014e89d8>] kernel_init_freeable+0x258/0x3c8 [ 11.559554][ T1] [<0000000000aff36e>] kernel_init+0x2e/0x180 [ 11.559583][ T1] [<0000000000102b00>] __ret_from_fork+0x40/0x58 [ 11.559609][ T1] [<0000000000b112d2>] ret_from_fork+0xa/0x30 [ 11.559652][ T1] Last Breaking-Event-Address: [ 11.559671][ T1] [<0000000000444410>] __kmem_cache_alloc_node+0x118/0x5d8 [ 11.559883][ T1] Kernel panic - not syncing: Fatal exception: panic_on_oops
(In reply to Takashi Iwai from comment #9) > (In reply to Petr Pavlu from comment #8) > > (In reply to Vlastimil Babka from comment #5) > > > Can you make it boot with slub_debug kernel parameter? > > > > Attachment 872591 [details] is a log with slub_debug, using the above steps. > > Looks like that the boot succeeded with this option? If so, it can be some > race that caused a crash (or memory corruption leading to a crash). yes seems to be booting with slub_debug since the code is doing the fixing. reverting commit 7f146b24297b ("driver core: Replace kstrdup() + strreplace() with kstrdup_and_replace()") seems to be fixing the issue (can boot ok without slub_debug), but I am not sure this is the root case. also the issue doesn't reproduce upstream.
(In reply to Anthony Iliopoulos from comment #11) > (In reply to Takashi Iwai from comment #9) > > (In reply to Petr Pavlu from comment #8) > > > (In reply to Vlastimil Babka from comment #5) > > > > Can you make it boot with slub_debug kernel parameter? > > > > > > Attachment 872591 [details] is a log with slub_debug, using the above steps. > > > > Looks like that the boot succeeded with this option? If so, it can be some > > race that caused a crash (or memory corruption leading to a crash). > > yes seems to be booting with slub_debug since the code is doing the fixing. > > reverting commit 7f146b24297b ("driver core: Replace kstrdup() + > strreplace() with kstrdup_and_replace()") seems to be fixing the issue (can > boot ok without slub_debug), but I am not sure this is the root case. > > also the issue doesn't reproduce upstream. Ah, that explains the possible memory corruption. I believe the missing piece is rather the upstream commit d01a77afd6bef1b3a2ed15e8ca6887ca7da0cddc lib/string_helpers: Change returned value of the strreplace() Without this, strreplace() returns the end of the string, and the use in commit 7f146b24297b assumes this fix. That is, the return value (and the stored value) of device_get_devnode() is utterly bogus as of now.
Unfortunately the things aren't easy. We have other user of strreplace() in fs/jbd2/journal.c. So, at least, the commit 7afb6d8fa81fd8d332f70ead5e35d8c90abb8165 jbd2: Avoid printing outside the boundary of the buffer needs to be backported as well. Above all, the remaining question is whether it's preferred to change the existing API function behavior for external drivers. I *guess* it's likely OK, as we'd align with the current upstream. But this needs a bit more cares...
As a safer option, I'm going to revert the backports of commit 045ad46441a and 7f146b24297b. The former can be the likely culprit of the memory corruption, and the latter is obviously dangerous to use. If we all agree for aligning with the upstream changes, we can re-backport those with other fixes later (but before kABI freeze :)
After discussions, I re-backported the full patches again but with the proper missing fixes at this time. The (re-)backported upstream commits are: 7afb6d8fa81fd8d332f70ead5e35d8c90abb8165 jbd2: Avoid printing outside the boundary of the buffer d01a77afd6bef1b3a2ed15e8ca6887ca7da0cddc lib/string_helpers: Change returned value of the strreplace() 045ad46441a1cf2ff358af0126f8c57f337804fa lib/string_helpers: Add kstrdup_and_replace() helper 7f146b24297bf58c19f916aee530155c830c5598 driver core: Replace kstrdup() + strreplace() with kstrdup_and_replace()
Let's close. If the problem still persists with the latest kernel, please reopen.
A SR mentioning this bug was successfully integrated into Build 58.1. Please, if needed, set it to RESOLVED FIXED for QE to verify it.