Bug 1220814 - kernel 6.7.6-1-default Oops: unable to handle page faults and workqueue lockup (udev related?)
Summary: kernel 6.7.6-1-default Oops: unable to handle page faults and workqueue locku...
Status: NEW
: 1221884 (view as bug list)
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel (show other bugs)
Version: Current
Hardware: x86-64 Other
: P5 - None : Critical (vote)
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-02 02:32 UTC by Michael Burge
Modified: 2024-04-21 06:19 UTC (History)
2 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
full journalctl of the third crash (1.68 MB, text/plain)
2024-03-02 02:32 UTC, Michael Burge
Details
dmesg k6.7.9-1-default systemd Oops (125.95 KB, text/plain)
2024-03-19 20:02 UTC, Michael Burge
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Burge 2024-03-02 02:32:47 UTC
Created attachment 873182 [details]
full journalctl of the third crash

I have an old portable with a wonky dc power jack which tends to generate a few quick AC0<->BAT0 events in succession. Starting with kernel 6.7.6-1-default the system locks up.

Log is full of
```
Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Process '/usr/sbin/tlp auto' terminated by signal KILL.
Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Failed to wait for spawned command '/usr/sbin/tlp auto': Input/output error
Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Failed to execute '/usr/sbin/tlp auto', ignoring: Input/output error
Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Process '/usr/sbin/tlp auto' terminated by signal KILL.
Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Failed to wait for spawned command '/usr/sbin/tlp auto': Input/output error
Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Failed to execute '/usr/sbin/tlp auto', ignoring: Input/output error
Mar 01 17:34:17 ost plasmashell[2682]: plasma-pk-updates: acPluggedChanged onBattery: false -> true
Mar 01 17:34:17 ost kernel: BUG: unable to handle page fault for address: 0000000000007b8a
Mar 01 17:34:17 ost kernel: #PF: supervisor read access in kernel mode
Mar 01 17:34:17 ost kernel: #PF: error_code(0x0000) - not-present page
```
until it locks up with
```
Mar 01 18:11:29 ost kernel: note: kworker/11:5[23457] exited with irqs disabled
Mar 01 18:11:50 ost systemd-logind[1443]: Power key pressed short.
Mar 01 18:11:54 ost systemd-logind[1443]: Power key pressed short.
Mar 01 18:11:55 ost systemd-logind[1443]: Power key pressed short.
Mar 01 18:11:58 ost plasmashell[5912]: [5912:0301/181158.744595:ERROR:command_buffer_proxy_impl.cc(128)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
Mar 01 18:12:05 ost systemd-logind[1443]: Power key pressed short.
Mar 01 18:12:06 ost kernel: BUG: workqueue lockup - pool cpus=11 node=0 flags=0x0 nice=0 stuck for 37s!
```
at which point only holding down the power button works.
Comment 1 Anthony Iliopoulos 2024-03-04 13:38:04 UTC
The first oops excerpt from the logs is the only reliable hint,
the rest are just side-effects of the bug:

Mar 01 15:55:44 ost kernel: BUG: unable to handle page fault for address: 000000000000394e
Mar 01 15:55:44 ost kernel: #PF: supervisor read access in kernel mode
Mar 01 15:55:44 ost kernel: #PF: error_code(0x0000) - not-present page
Mar 01 15:55:44 ost kernel: PGD 0 P4D 0 
Mar 01 15:55:44 ost kernel: Oops: 0000 [#2] PREEMPT SMP PTI
Mar 01 15:55:44 ost kernel: CPU: 8 PID: 19366 Comm: tlp Tainted: P      D    OE      6.7.6-1-default #1 openSUSE Tumbleweed cd9816be5099dbe04750b2583fe34462de6dcdca
Mar 01 15:55:44 ost kernel: Hardware name: ASUSTeK COMPUTER INC. Zephyrus M GU502GV_GU502GV/GU502GV, BIOS GU502GV.305 09/25/2019
Mar 01 15:55:44 ost kernel: RIP: 0010:simple_xattr_get+0x31/0xa0
Mar 01 15:55:44 ost kernel: Code: 00 00 41 56 49 89 ce 41 55 4c 8d 6f 08 41 54 49 89 d4 55 48 89 f5 53 48 89 fb 4c 89 ef e8 57 54 88 00 48 8b 1b 48 85 db 74 1b <48> 8b 7b 18 48 89 ee e8 63 75 85 00 85 c0 78 27 74 2b 48 8b 5b 08
Mar 01 15:55:44 ost kernel: RSP: 0018:ffffb415ca8c3b20 EFLAGS: 00010206
Mar 01 15:55:44 ost kernel: RAX: 0000000000000200 RBX: 0000000000003936 RCX: 0000000000000000
Mar 01 15:55:44 ost kernel: RDX: 0000000000000000 RSI: ffffffff99fc5446 RDI: ffff8fb6c38dac20
Mar 01 15:55:44 ost kernel: RBP: ffffffff99fc5446 R08: 0000000000000000 R09: 0000000000000000
Mar 01 15:55:44 ost kernel: R10: ffffffff99fc5446 R11: 0000000000000000 R12: 0000000000000000
Mar 01 15:55:44 ost kernel: R13: ffff8fb6c38dac20 R14: 0000000000000000 R15: ffff8fb795c31900
Mar 01 15:55:44 ost kernel: FS:  00007f55d60db600(0000) GS:ffff8fba2dc00000(0000) knlGS:0000000000000000
Mar 01 15:55:44 ost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 01 15:55:44 ost kernel: CR2: 000000000000394e CR3: 00000004278bc004 CR4: 00000000003706f0
Mar 01 15:55:44 ost kernel: Call Trace:
Mar 01 15:55:44 ost kernel:  <TASK>
Mar 01 15:55:44 ost kernel:  __vfs_getxattr+0x7f/0xb0
Mar 01 15:55:44 ost kernel:  cap_inode_need_killpriv+0x1e/0x30
Mar 01 15:55:44 ost kernel:  security_inode_need_killpriv+0x2d/0x50
Mar 01 15:55:44 ost kernel:  dentry_needs_remove_privs+0x32/0x60
Mar 01 15:55:44 ost kernel:  do_truncate+0x5b/0xd0
Mar 01 15:55:44 ost kernel:  path_openat+0xfc3/0x1200
Mar 01 15:55:44 ost kernel:  do_filp_open+0xb8/0x160
Mar 01 15:55:44 ost kernel:  do_sys_openat2+0x91/0xc0
Mar 01 15:55:44 ost kernel:  __x64_sys_openat+0x57/0xa0
Mar 01 15:55:44 ost kernel:  do_syscall_64+0x61/0xe0
Mar 01 15:55:44 ost kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0x76

I think this happens during ltp trying to open some sysfs file with O_TRUNC.

The failure is during rbtree_simple_xattr_cmp when strcmp attempts to deref
xattr->name. xattr is at 0x0000000000003936 so accessing xattr->name is causing
the oops. Not sure why xattrs->rb_root would be pointing there, this should
have been initialized to null.

If you could setup kdump [1] and reproduce the issue (with the latest stable kernel)
there may be more hints there for analysis.

Also please set /proc/sys/kernel/panic_on_oops = 1.

[1] https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha-tuning-kexec.html#cha-tuning-kdump-basic
Comment 2 Michael Burge 2024-03-04 19:52:41 UTC
(In reply to Anthony Iliopoulos from comment #1)
> If you could setup kdump [1] and reproduce the issue (with the latest stable
> kernel)
> there may be more hints there for analysis.
> 
> Also please set /proc/sys/kernel/panic_on_oops = 1.
> 
> [1]
> https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha-
> tuning-kexec.html#cha-tuning-kdump-basic

Thank you,

I've done as recommended, but am currently rocking 6.7.7-1-default without issue.
Will update if I get anything.

Regards
Comment 3 Michael Burge 2024-03-19 20:01:15 UTC
Maybe related, kernel 6.7.9-1-default

      KERNEL: /usr/lib/modules/6.7.9-1-default/vmlinux.xz  [TAINTED]
   DEBUGINFO: /usr/lib/debug//usr/lib/modules/6.7.9-1-default//vmlinux.debug
    DUMPFILE: /var/crash/2024-03-19-12-05/vmcore  [PARTIAL DUMP]
        CPUS: 12
        DATE: Tue Mar 19 12:05:05 PDT 2024
      UPTIME: 2 days, 01:10:21
LOAD AVERAGE: 2.65, 2.56, 2.31
       TASKS: 624
    NODENAME: ost
     RELEASE: 6.7.9-1-default
     VERSION: #1 SMP PREEMPT_DYNAMIC Thu Mar  7 06:07:11 UTC 2024 (6049de6)
     MACHINE: x86_64  (2600 Mhz)
      MEMORY: 15.9 GB
       PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
         PID: 2732
     COMMAND: "systemd"
        TASK: ffff9bd13d92d280  [THREAD_INFO: ffff9bd13d92d280]
         CPU: 5
       STATE: TASK_RUNNING (PANIC)


PID: 2732     TASK: ffff9bd13d92d280  CPU: 5    COMMAND: "systemd"
 #0 [ffffbbf082f5fa18] machine_kexec at ffffffffad28f70f
 #1 [ffffbbf082f5fa70] __crash_kexec at ffffffffad3c5dfe
 #2 [ffffbbf082f5fb30] crash_kexec at ffffffffad3c727c
 #3 [ffffbbf082f5fb38] oops_end at ffffffffad244fe4
 #4 [ffffbbf082f5fb58] page_fault_oops at ffffffffad2a4411
 #5 [ffffbbf082f5fbd8] exc_page_fault at ffffffffadebabf1
 #6 [ffffbbf082f5fc00] asm_exc_page_fault at ffffffffae0012a6
    [exception RIP: rb_first+15]
    RIP: ffffffffadea1e0f  RSP: ffffbbf082f5fcb8  RFLAGS: 00010202
    RAX: 0000000000004ace  RBX: ffff9bd0dcfd5280  RCX: 000000000100002e
    RDX: 0000000000004ace  RSI: 0000000000000000  RDI: ffff9bd0d13d4088
    RBP: ffff9bd0dcfd5d00   R8: 0000000000000000   R9: 000000000100002e
    R10: ffff9bd3d4e3ae10  R11: ffffbbf082f5fcb8  R12: 0000000000000000
    R13: ffff9bd0d13d4088  R14: 0000000000000002  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffbbf082f5fcb8] simple_xattrs_free at ffffffffad64b6e9
 #8 [ffffbbf082f5fce0] kernfs_put at ffffffffad6d3420
 #9 [ffffbbf082f5fd08] evict at ffffffffad639091
#10 [ffffbbf082f5fd28] __dentry_kill at ffffffffad634493
#11 [ffffbbf082f5fd48] shrink_dentry_list at ffffffffad635e32
#12 [ffffbbf082f5fd70] shrink_dcache_parent at ffffffffad636110
#13 [ffffbbf082f5fdb0] vfs_rmdir at ffffffffad623620
#14 [ffffbbf082f5fdd8] do_rmdir at ffffffffad629dfd
#15 [ffffbbf082f5fe38] __x64_sys_rmdir at ffffffffad629ee2
#16 [ffffbbf082f5fe48] do_syscall_64 at ffffffffadeb4461
#17 [ffffbbf082f5ff50] entry_SYSCALL_64_after_hwframe at ffffffffae00012a
    RIP: 00007fc5ffb05ebb  RSP: 00007ffd88211988  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007fc5ffb05ebb
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: 0000555c8e9eb820
    RBP: 0000555c8ea17640   R8: 00007fc5ffebd4ee   R9: 0000000000000007
    R10: 0000555c8e9a49d0  R11: 0000000000000246  R12: 00007fc6000c6f3e
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000555c8e9eb820
    ORIG_RAX: 0000000000000054  CS: 0033  SS: 002b


Happened on logging out of KDE after doing a zypper dup, so I don't expect to be able to repro it.

Produced a 332M vmcore
Kernel crashdump
----------------
dmesg status: saved successfully
vmcore status: saved successfully
Kernel version: 6.7.9-1-default
Crash time: 2024-03-19T12:05:05
Dump time: 2024-03-19T12:05:12
Host: ost
Dump level: 31
Dump format: compressed
Comment 4 Michael Burge 2024-03-19 20:02:47 UTC
Created attachment 873646 [details]
dmesg k6.7.9-1-default systemd Oops
Comment 5 Anthony Iliopoulos 2024-03-20 05:38:32 UTC
please attach the kdump.
Comment 6 Michael Burge 2024-03-20 06:02:28 UTC
(In reply to Anthony Iliopoulos from comment #5)
> please attach the kdump.

Would that be the vmcore file? It is 332MiB..
Comment 7 Anthony Iliopoulos 2024-03-20 07:42:59 UTC
(In reply to Michael Burge from comment #6)
> (In reply to Anthony Iliopoulos from comment #5)
> > please attach the kdump.
> 
> Would that be the vmcore file? It is 332MiB..

yes, at tarball with everything under /var/crash/2024-03-19-12-05 please. If this exceeds the bugzilla attachment limit, then any other file sharing service would be fine (not sure if opensuse.org offers something like that).
Comment 8 Michael Burge 2024-03-20 08:07:43 UTC
(In reply to Anthony Iliopoulos from comment #7)
> (In reply to Michael Burge from comment #6)
> > (In reply to Anthony Iliopoulos from comment #5)
> > > please attach the kdump.
> > 
> > Would that be the vmcore file? It is 332MiB..
> 
> yes, at tarball with everything under /var/crash/2024-03-19-12-05 please. If
> this exceeds the bugzilla attachment limit, then any other file sharing
> service would be fine (not sure if opensuse.org offers something like that).

https://drive.google.com/file/d/1d5pA29GWpGVuks2xNAxN56NfUojMRZLR/view?usp=sharing
Comment 9 Michael Burge 2024-03-26 18:43:54 UTC
Some recent k6.8.1 dumps

https://drive.google.com/drive/folders/1k1MjkZuTX5eVjSgxDlC2j2MtaBDUG8bu
Comment 10 Anthony Iliopoulos 2024-04-17 16:59:37 UTC
I think upstream commit 4207b556e62f ("kernfs: RCU protect kernfs_nodes and avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()") fixes this issue, and this was backported to v6.8.6 which is now available in TW.

Could you please give it a try and see if this is still reproducible?
Comment 11 Michael Burge 2024-04-18 14:49:31 UTC
(In reply to Anthony Iliopoulos from comment #10)
> I think upstream commit 4207b556e62f ("kernfs: RCU protect kernfs_nodes and
> avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()") fixes this
> issue, and this was backported to v6.8.6 which is now available in TW.
> 
> Could you please give it a try and see if this is still reproducible?

I am quite confident this indeed fixes it.

Timeframe is only ~24hrs, but I've observed the signs of the underlying state changes that would've previously crashed it, ie corresponding journalctl entries, software and physical battery indicators, fan powercurve changes, go by uneventfully(for a lack of a better word).

Thank you for the insights and support !
Comment 12 Anthony Iliopoulos 2024-04-19 13:55:59 UTC
*** Bug 1221884 has been marked as a duplicate of this bug. ***
Comment 13 Michael Burge 2024-04-19 14:42:28 UTC
Update:
46hrs uptime!
6.8.6-1-default
The issue is around `simple_xattr_get` this time:

[134171.302114] BUG: unable to handle page fault for address: 000000000000557f
[134171.302118] #PF: supervisor read access in kernel mode
[134171.302120] #PF: error_code(0x0000) - not-present page
[134171.302121] PGD 0 P4D 0 
[134171.302124] Oops: 0000 [#1] PREEMPT SMP PTI
[134171.302126] CPU: 5 PID: 1 Comm: systemd Kdump: loaded Tainted: P           OE      6.8.6-1-default #1 openSUSE Tumbleweed 001ee2b6272db94efdc27082222fee99f65f968f
[134171.302129] Hardware name: ASUSTeK COMPUTER INC. Zephyrus M GU502GV_GU502GV/GU502GV, BIOS GU502GV.311 01/05/2021
[134171.302131] RIP: 0010:simple_xattr_get+0x31/0xa0
[134171.302135] Code: 00 00 41 56 49 89 ce 41 55 4c 8d 6f 08 41 54 49 89 d4 55 48 89 f5 53 48 89 fb 4c 89 ef e8 67 52 89 00 48 8b 1b 48 85 db 74 1b <48> 8b 7b 18 48 89 ee e8 b3 79 86 00 85 c0 78 27 74 2b 48 8b 5b 08
[134171.302137] RSP: 0018:ffffa12bc0043c80 EFLAGS: 00010202
[134171.302139] RAX: 0000000000000000 RBX: 0000000000005567 RCX: 0000000000000067
[134171.302140] RDX: ffff8b36b50c2600 RSI: ffffa12bc0043d50 RDI: ffff8b2f8180d4a0
[134171.302142] RBP: ffffa12bc0043d50 R08: ffff8b36b50c2600 R09: 0000000000000067
[134171.302143] R10: ffffa12bc0043d50 R11: 0000000000000002 R12: ffff8b36b50c2600
[134171.302145] R13: ffff8b2f8180d4a0 R14: 0000000000000067 R15: ffffa12bc0043d50
[134171.302146] FS:  00007f5f23a41900(0000) GS:ffff8b36c7c80000(0000) knlGS:0000000000000000
[134171.302148] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[134171.302149] CR2: 000000000000557f CR3: 0000000150e12003 CR4: 00000000003706f0
[134171.302151] Call Trace:
[134171.302152]  <TASK>
[134171.302155]  ? __die+0x23/0x70
[134171.302158]  ? page_fault_oops+0x14d/0x490
[134171.302161]  ? _copy_to_iter+0x8b/0x630
[134171.302164]  ? seq_printf+0x92/0xb0
[134171.302167]  ? exc_page_fault+0x71/0x160
[134171.302169]  ? asm_exc_page_fault+0x26/0x30
[134171.302174]  ? simple_xattr_get+0x31/0xa0
[134171.302177]  __vfs_getxattr+0x7f/0xb0
[134171.302180]  do_getxattr+0xf0/0x180
[134171.302182]  getxattr+0x7f/0xc0
[134171.302185]  ? __pte_offset_map+0x1b/0x180
[134171.302188]  ? __check_object_size+0x25b/0x2c0
[134171.302191]  ? strncpy_from_user+0x4f/0x150
[134171.302194]  ? user_path_at_empty+0x44/0x60
[134171.302197]  path_getxattr+0x6a/0xc0
[134171.302200]  do_syscall_64+0x83/0x170
[134171.302203]  ? do_user_addr_fault+0x304/0x670
[134171.302206]  ? exc_page_fault+0x71/0x160
[134171.302208]  entry_SYSCALL_64_after_hwframe+0x78/0x80

kdump:
https://drive.google.com/drive/folders/1_B-9l3cAUMsYUZHHSlAAITWH7rCHKotZ?usp=drive_link
Comment 14 Michael Burge 2024-04-21 06:19:12 UTC
Just experienced another crash and the stack trace looks all too familiar:

[121413.471472] RIP: 0010:rb_first+0xf/0x30
...
[121413.471494]  ? __die+0x23/0x70
[121413.471497]  ? page_fault_oops+0x14d/0x490
[121413.471500]  ? __free_one_page+0x62/0x500
[121413.471504]  ? exc_page_fault+0x71/0x160
[121413.471507]  ? asm_exc_page_fault+0x26/0x30
[121413.471511]  ? rb_first+0xf/0x30
[121413.471513]  simple_xattrs_free+0x29/0x90
[121413.471516]  kernfs_free_rcu+0x2f/0x50
[121413.471519]  rcu_do_batch+0x1ef/0x800
[121413.471523]  rcu_core+0x1b5/0x4c0
[121413.471526]  __do_softirq+0xbf/0x2a5
[121413.471529]  ? __pfx_smpboot_thread_fn+0x10/0x10
[121413.471531]  run_ksoftirqd+0x2c/0x40
[121413.471534]  smpboot_thread_fn+0xd9/0x1d0
[121413.471536]  kthread+0xe5/0x120
[121413.471539]  ? __pfx_kthread+0x10/0x10
[121413.471542]  ret_from_fork+0x31/0x50
[121413.471545]  ? __pfx_kthread+0x10/0x10
[121413.471547]  ret_from_fork_asm+0x1b/0x30

I were directed to this post on nvidia's forums
https://forums.developer.nvidia.com/t/series-550-freezes-laptop/284772
and although I make use of CUDA in my daily tasks, have since uninstalled the driver (550.67 since 2024-03-20, don't remember the previous version at the time of filing this issue on 2024-03-02).

Curious if the other person that had filed an issue is using the proprietary NVIDIA driver.

PS Since I had no CUDA to do work, I decided to play a game which, in recent weeks, used to result in an outright restart without any log entries/stack traces/kdumps within minutes of it starting up.
This time it didn't, and it ran for hours, which points to the nvidia driver being the cause..