|
Bugzilla – Full Text Bug Listing |
| Summary: | kernel 6.7.6-1-default Oops: unable to handle page faults and workqueue lockup (udev related?) | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Michael Burge <michael.burge77> |
| Component: | Kernel | Assignee: | openSUSE Kernel Bugs <kernel-bugs> |
| Status: | NEW --- | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | ||
| Priority: | P5 - None | CC: | ailiopoulos, luna.dragon |
| Version: | Current | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
full journalctl of the third crash
dmesg k6.7.9-1-default systemd Oops |
||
The first oops excerpt from the logs is the only reliable hint, the rest are just side-effects of the bug: Mar 01 15:55:44 ost kernel: BUG: unable to handle page fault for address: 000000000000394e Mar 01 15:55:44 ost kernel: #PF: supervisor read access in kernel mode Mar 01 15:55:44 ost kernel: #PF: error_code(0x0000) - not-present page Mar 01 15:55:44 ost kernel: PGD 0 P4D 0 Mar 01 15:55:44 ost kernel: Oops: 0000 [#2] PREEMPT SMP PTI Mar 01 15:55:44 ost kernel: CPU: 8 PID: 19366 Comm: tlp Tainted: P D OE 6.7.6-1-default #1 openSUSE Tumbleweed cd9816be5099dbe04750b2583fe34462de6dcdca Mar 01 15:55:44 ost kernel: Hardware name: ASUSTeK COMPUTER INC. Zephyrus M GU502GV_GU502GV/GU502GV, BIOS GU502GV.305 09/25/2019 Mar 01 15:55:44 ost kernel: RIP: 0010:simple_xattr_get+0x31/0xa0 Mar 01 15:55:44 ost kernel: Code: 00 00 41 56 49 89 ce 41 55 4c 8d 6f 08 41 54 49 89 d4 55 48 89 f5 53 48 89 fb 4c 89 ef e8 57 54 88 00 48 8b 1b 48 85 db 74 1b <48> 8b 7b 18 48 89 ee e8 63 75 85 00 85 c0 78 27 74 2b 48 8b 5b 08 Mar 01 15:55:44 ost kernel: RSP: 0018:ffffb415ca8c3b20 EFLAGS: 00010206 Mar 01 15:55:44 ost kernel: RAX: 0000000000000200 RBX: 0000000000003936 RCX: 0000000000000000 Mar 01 15:55:44 ost kernel: RDX: 0000000000000000 RSI: ffffffff99fc5446 RDI: ffff8fb6c38dac20 Mar 01 15:55:44 ost kernel: RBP: ffffffff99fc5446 R08: 0000000000000000 R09: 0000000000000000 Mar 01 15:55:44 ost kernel: R10: ffffffff99fc5446 R11: 0000000000000000 R12: 0000000000000000 Mar 01 15:55:44 ost kernel: R13: ffff8fb6c38dac20 R14: 0000000000000000 R15: ffff8fb795c31900 Mar 01 15:55:44 ost kernel: FS: 00007f55d60db600(0000) GS:ffff8fba2dc00000(0000) knlGS:0000000000000000 Mar 01 15:55:44 ost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 01 15:55:44 ost kernel: CR2: 000000000000394e CR3: 00000004278bc004 CR4: 00000000003706f0 Mar 01 15:55:44 ost kernel: Call Trace: Mar 01 15:55:44 ost kernel: <TASK> Mar 01 15:55:44 ost kernel: __vfs_getxattr+0x7f/0xb0 Mar 01 15:55:44 ost kernel: cap_inode_need_killpriv+0x1e/0x30 Mar 01 15:55:44 ost kernel: security_inode_need_killpriv+0x2d/0x50 Mar 01 15:55:44 ost kernel: dentry_needs_remove_privs+0x32/0x60 Mar 01 15:55:44 ost kernel: do_truncate+0x5b/0xd0 Mar 01 15:55:44 ost kernel: path_openat+0xfc3/0x1200 Mar 01 15:55:44 ost kernel: do_filp_open+0xb8/0x160 Mar 01 15:55:44 ost kernel: do_sys_openat2+0x91/0xc0 Mar 01 15:55:44 ost kernel: __x64_sys_openat+0x57/0xa0 Mar 01 15:55:44 ost kernel: do_syscall_64+0x61/0xe0 Mar 01 15:55:44 ost kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76 I think this happens during ltp trying to open some sysfs file with O_TRUNC. The failure is during rbtree_simple_xattr_cmp when strcmp attempts to deref xattr->name. xattr is at 0x0000000000003936 so accessing xattr->name is causing the oops. Not sure why xattrs->rb_root would be pointing there, this should have been initialized to null. If you could setup kdump [1] and reproduce the issue (with the latest stable kernel) there may be more hints there for analysis. Also please set /proc/sys/kernel/panic_on_oops = 1. [1] https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha-tuning-kexec.html#cha-tuning-kdump-basic (In reply to Anthony Iliopoulos from comment #1) > If you could setup kdump [1] and reproduce the issue (with the latest stable > kernel) > there may be more hints there for analysis. > > Also please set /proc/sys/kernel/panic_on_oops = 1. > > [1] > https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha- > tuning-kexec.html#cha-tuning-kdump-basic Thank you, I've done as recommended, but am currently rocking 6.7.7-1-default without issue. Will update if I get anything. Regards Maybe related, kernel 6.7.9-1-default
KERNEL: /usr/lib/modules/6.7.9-1-default/vmlinux.xz [TAINTED]
DEBUGINFO: /usr/lib/debug//usr/lib/modules/6.7.9-1-default//vmlinux.debug
DUMPFILE: /var/crash/2024-03-19-12-05/vmcore [PARTIAL DUMP]
CPUS: 12
DATE: Tue Mar 19 12:05:05 PDT 2024
UPTIME: 2 days, 01:10:21
LOAD AVERAGE: 2.65, 2.56, 2.31
TASKS: 624
NODENAME: ost
RELEASE: 6.7.9-1-default
VERSION: #1 SMP PREEMPT_DYNAMIC Thu Mar 7 06:07:11 UTC 2024 (6049de6)
MACHINE: x86_64 (2600 Mhz)
MEMORY: 15.9 GB
PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
PID: 2732
COMMAND: "systemd"
TASK: ffff9bd13d92d280 [THREAD_INFO: ffff9bd13d92d280]
CPU: 5
STATE: TASK_RUNNING (PANIC)
PID: 2732 TASK: ffff9bd13d92d280 CPU: 5 COMMAND: "systemd"
#0 [ffffbbf082f5fa18] machine_kexec at ffffffffad28f70f
#1 [ffffbbf082f5fa70] __crash_kexec at ffffffffad3c5dfe
#2 [ffffbbf082f5fb30] crash_kexec at ffffffffad3c727c
#3 [ffffbbf082f5fb38] oops_end at ffffffffad244fe4
#4 [ffffbbf082f5fb58] page_fault_oops at ffffffffad2a4411
#5 [ffffbbf082f5fbd8] exc_page_fault at ffffffffadebabf1
#6 [ffffbbf082f5fc00] asm_exc_page_fault at ffffffffae0012a6
[exception RIP: rb_first+15]
RIP: ffffffffadea1e0f RSP: ffffbbf082f5fcb8 RFLAGS: 00010202
RAX: 0000000000004ace RBX: ffff9bd0dcfd5280 RCX: 000000000100002e
RDX: 0000000000004ace RSI: 0000000000000000 RDI: ffff9bd0d13d4088
RBP: ffff9bd0dcfd5d00 R8: 0000000000000000 R9: 000000000100002e
R10: ffff9bd3d4e3ae10 R11: ffffbbf082f5fcb8 R12: 0000000000000000
R13: ffff9bd0d13d4088 R14: 0000000000000002 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffbbf082f5fcb8] simple_xattrs_free at ffffffffad64b6e9
#8 [ffffbbf082f5fce0] kernfs_put at ffffffffad6d3420
#9 [ffffbbf082f5fd08] evict at ffffffffad639091
#10 [ffffbbf082f5fd28] __dentry_kill at ffffffffad634493
#11 [ffffbbf082f5fd48] shrink_dentry_list at ffffffffad635e32
#12 [ffffbbf082f5fd70] shrink_dcache_parent at ffffffffad636110
#13 [ffffbbf082f5fdb0] vfs_rmdir at ffffffffad623620
#14 [ffffbbf082f5fdd8] do_rmdir at ffffffffad629dfd
#15 [ffffbbf082f5fe38] __x64_sys_rmdir at ffffffffad629ee2
#16 [ffffbbf082f5fe48] do_syscall_64 at ffffffffadeb4461
#17 [ffffbbf082f5ff50] entry_SYSCALL_64_after_hwframe at ffffffffae00012a
RIP: 00007fc5ffb05ebb RSP: 00007ffd88211988 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc5ffb05ebb
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000555c8e9eb820
RBP: 0000555c8ea17640 R8: 00007fc5ffebd4ee R9: 0000000000000007
R10: 0000555c8e9a49d0 R11: 0000000000000246 R12: 00007fc6000c6f3e
R13: 0000000000000001 R14: 0000000000000000 R15: 0000555c8e9eb820
ORIG_RAX: 0000000000000054 CS: 0033 SS: 002b
Happened on logging out of KDE after doing a zypper dup, so I don't expect to be able to repro it.
Produced a 332M vmcore
Kernel crashdump
----------------
dmesg status: saved successfully
vmcore status: saved successfully
Kernel version: 6.7.9-1-default
Crash time: 2024-03-19T12:05:05
Dump time: 2024-03-19T12:05:12
Host: ost
Dump level: 31
Dump format: compressed
Created attachment 873646 [details]
dmesg k6.7.9-1-default systemd Oops
please attach the kdump. (In reply to Anthony Iliopoulos from comment #5) > please attach the kdump. Would that be the vmcore file? It is 332MiB.. (In reply to Michael Burge from comment #6) > (In reply to Anthony Iliopoulos from comment #5) > > please attach the kdump. > > Would that be the vmcore file? It is 332MiB.. yes, at tarball with everything under /var/crash/2024-03-19-12-05 please. If this exceeds the bugzilla attachment limit, then any other file sharing service would be fine (not sure if opensuse.org offers something like that). (In reply to Anthony Iliopoulos from comment #7) > (In reply to Michael Burge from comment #6) > > (In reply to Anthony Iliopoulos from comment #5) > > > please attach the kdump. > > > > Would that be the vmcore file? It is 332MiB.. > > yes, at tarball with everything under /var/crash/2024-03-19-12-05 please. If > this exceeds the bugzilla attachment limit, then any other file sharing > service would be fine (not sure if opensuse.org offers something like that). https://drive.google.com/file/d/1d5pA29GWpGVuks2xNAxN56NfUojMRZLR/view?usp=sharing Some recent k6.8.1 dumps https://drive.google.com/drive/folders/1k1MjkZuTX5eVjSgxDlC2j2MtaBDUG8bu I think upstream commit 4207b556e62f ("kernfs: RCU protect kernfs_nodes and avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()") fixes this issue, and this was backported to v6.8.6 which is now available in TW.
Could you please give it a try and see if this is still reproducible?
(In reply to Anthony Iliopoulos from comment #10) > I think upstream commit 4207b556e62f ("kernfs: RCU protect kernfs_nodes and > avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()") fixes this > issue, and this was backported to v6.8.6 which is now available in TW. > > Could you please give it a try and see if this is still reproducible? I am quite confident this indeed fixes it. Timeframe is only ~24hrs, but I've observed the signs of the underlying state changes that would've previously crashed it, ie corresponding journalctl entries, software and physical battery indicators, fan powercurve changes, go by uneventfully(for a lack of a better word). Thank you for the insights and support ! *** Bug 1221884 has been marked as a duplicate of this bug. *** Update: 46hrs uptime! 6.8.6-1-default The issue is around `simple_xattr_get` this time: [134171.302114] BUG: unable to handle page fault for address: 000000000000557f [134171.302118] #PF: supervisor read access in kernel mode [134171.302120] #PF: error_code(0x0000) - not-present page [134171.302121] PGD 0 P4D 0 [134171.302124] Oops: 0000 [#1] PREEMPT SMP PTI [134171.302126] CPU: 5 PID: 1 Comm: systemd Kdump: loaded Tainted: P OE 6.8.6-1-default #1 openSUSE Tumbleweed 001ee2b6272db94efdc27082222fee99f65f968f [134171.302129] Hardware name: ASUSTeK COMPUTER INC. Zephyrus M GU502GV_GU502GV/GU502GV, BIOS GU502GV.311 01/05/2021 [134171.302131] RIP: 0010:simple_xattr_get+0x31/0xa0 [134171.302135] Code: 00 00 41 56 49 89 ce 41 55 4c 8d 6f 08 41 54 49 89 d4 55 48 89 f5 53 48 89 fb 4c 89 ef e8 67 52 89 00 48 8b 1b 48 85 db 74 1b <48> 8b 7b 18 48 89 ee e8 b3 79 86 00 85 c0 78 27 74 2b 48 8b 5b 08 [134171.302137] RSP: 0018:ffffa12bc0043c80 EFLAGS: 00010202 [134171.302139] RAX: 0000000000000000 RBX: 0000000000005567 RCX: 0000000000000067 [134171.302140] RDX: ffff8b36b50c2600 RSI: ffffa12bc0043d50 RDI: ffff8b2f8180d4a0 [134171.302142] RBP: ffffa12bc0043d50 R08: ffff8b36b50c2600 R09: 0000000000000067 [134171.302143] R10: ffffa12bc0043d50 R11: 0000000000000002 R12: ffff8b36b50c2600 [134171.302145] R13: ffff8b2f8180d4a0 R14: 0000000000000067 R15: ffffa12bc0043d50 [134171.302146] FS: 00007f5f23a41900(0000) GS:ffff8b36c7c80000(0000) knlGS:0000000000000000 [134171.302148] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134171.302149] CR2: 000000000000557f CR3: 0000000150e12003 CR4: 00000000003706f0 [134171.302151] Call Trace: [134171.302152] <TASK> [134171.302155] ? __die+0x23/0x70 [134171.302158] ? page_fault_oops+0x14d/0x490 [134171.302161] ? _copy_to_iter+0x8b/0x630 [134171.302164] ? seq_printf+0x92/0xb0 [134171.302167] ? exc_page_fault+0x71/0x160 [134171.302169] ? asm_exc_page_fault+0x26/0x30 [134171.302174] ? simple_xattr_get+0x31/0xa0 [134171.302177] __vfs_getxattr+0x7f/0xb0 [134171.302180] do_getxattr+0xf0/0x180 [134171.302182] getxattr+0x7f/0xc0 [134171.302185] ? __pte_offset_map+0x1b/0x180 [134171.302188] ? __check_object_size+0x25b/0x2c0 [134171.302191] ? strncpy_from_user+0x4f/0x150 [134171.302194] ? user_path_at_empty+0x44/0x60 [134171.302197] path_getxattr+0x6a/0xc0 [134171.302200] do_syscall_64+0x83/0x170 [134171.302203] ? do_user_addr_fault+0x304/0x670 [134171.302206] ? exc_page_fault+0x71/0x160 [134171.302208] entry_SYSCALL_64_after_hwframe+0x78/0x80 kdump: https://drive.google.com/drive/folders/1_B-9l3cAUMsYUZHHSlAAITWH7rCHKotZ?usp=drive_link Just experienced another crash and the stack trace looks all too familiar: [121413.471472] RIP: 0010:rb_first+0xf/0x30 ... [121413.471494] ? __die+0x23/0x70 [121413.471497] ? page_fault_oops+0x14d/0x490 [121413.471500] ? __free_one_page+0x62/0x500 [121413.471504] ? exc_page_fault+0x71/0x160 [121413.471507] ? asm_exc_page_fault+0x26/0x30 [121413.471511] ? rb_first+0xf/0x30 [121413.471513] simple_xattrs_free+0x29/0x90 [121413.471516] kernfs_free_rcu+0x2f/0x50 [121413.471519] rcu_do_batch+0x1ef/0x800 [121413.471523] rcu_core+0x1b5/0x4c0 [121413.471526] __do_softirq+0xbf/0x2a5 [121413.471529] ? __pfx_smpboot_thread_fn+0x10/0x10 [121413.471531] run_ksoftirqd+0x2c/0x40 [121413.471534] smpboot_thread_fn+0xd9/0x1d0 [121413.471536] kthread+0xe5/0x120 [121413.471539] ? __pfx_kthread+0x10/0x10 [121413.471542] ret_from_fork+0x31/0x50 [121413.471545] ? __pfx_kthread+0x10/0x10 [121413.471547] ret_from_fork_asm+0x1b/0x30 I were directed to this post on nvidia's forums https://forums.developer.nvidia.com/t/series-550-freezes-laptop/284772 and although I make use of CUDA in my daily tasks, have since uninstalled the driver (550.67 since 2024-03-20, don't remember the previous version at the time of filing this issue on 2024-03-02). Curious if the other person that had filed an issue is using the proprietary NVIDIA driver. PS Since I had no CUDA to do work, I decided to play a game which, in recent weeks, used to result in an outright restart without any log entries/stack traces/kdumps within minutes of it starting up. This time it didn't, and it ran for hours, which points to the nvidia driver being the cause.. |
Created attachment 873182 [details] full journalctl of the third crash I have an old portable with a wonky dc power jack which tends to generate a few quick AC0<->BAT0 events in succession. Starting with kernel 6.7.6-1-default the system locks up. Log is full of ``` Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Process '/usr/sbin/tlp auto' terminated by signal KILL. Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Failed to wait for spawned command '/usr/sbin/tlp auto': Input/output error Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Failed to execute '/usr/sbin/tlp auto', ignoring: Input/output error Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Process '/usr/sbin/tlp auto' terminated by signal KILL. Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Failed to wait for spawned command '/usr/sbin/tlp auto': Input/output error Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Failed to execute '/usr/sbin/tlp auto', ignoring: Input/output error Mar 01 17:34:17 ost plasmashell[2682]: plasma-pk-updates: acPluggedChanged onBattery: false -> true Mar 01 17:34:17 ost kernel: BUG: unable to handle page fault for address: 0000000000007b8a Mar 01 17:34:17 ost kernel: #PF: supervisor read access in kernel mode Mar 01 17:34:17 ost kernel: #PF: error_code(0x0000) - not-present page ``` until it locks up with ``` Mar 01 18:11:29 ost kernel: note: kworker/11:5[23457] exited with irqs disabled Mar 01 18:11:50 ost systemd-logind[1443]: Power key pressed short. Mar 01 18:11:54 ost systemd-logind[1443]: Power key pressed short. Mar 01 18:11:55 ost systemd-logind[1443]: Power key pressed short. Mar 01 18:11:58 ost plasmashell[5912]: [5912:0301/181158.744595:ERROR:command_buffer_proxy_impl.cc(128)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer. Mar 01 18:12:05 ost systemd-logind[1443]: Power key pressed short. Mar 01 18:12:06 ost kernel: BUG: workqueue lockup - pool cpus=11 node=0 flags=0x0 nice=0 stuck for 37s! ``` at which point only holding down the power button works.