Bug 1177018 - Rare kernel soft lockup causes PC to freeze
Status: NEW
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: Other openSUSE Tumbleweed
Priority: P5 - None
Severity: Normal
Assigned To: openSUSE Kernel Bugs
Reported: 2020-09-27 12:20 UTC by Michael Pujos
Modified: 2022-01-14 14:44 UTC (History)

Attachments
output of journalctl (1.53 MB, text/plain)
2020-09-27 12:20 UTC, Michael Pujos

Description Michael Pujos 2020-09-27 12:20:00 UTC
Created attachment 841985
output of journalctl

Using Kernel 5.4.10 and current TW as of the date of this report.


Today I was working normally in Xorg when my laptop suddenly froze: no keyboard input, the mouse cursor still moved but clicks did nothing, and ssh'ing in from another PC was impossible. Interestingly, I had audio playing and it continued to play normally. The laptop's fans spun up to full speed, indicating high CPU usage.

I rebooted the laptop with ALT+SysRq+b and looked at the journal, which contains a lot of "watchdog: BUG: soft lockup - CPU#1 stuck for 22s!" entries with stack traces that seem to refer to USB.
I have attached the full journal log; look at the end for the "BUG: soft lockup" entries. At that point, the laptop had an uptime of about 3 days with a few suspends in between.
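
For anyone going through the attached journal, the soft lockup entries can be pulled out with a small script. The following is only a minimal sketch: the regular expression is modelled on the watchdog lines quoted later in this report, the sample lines are copied from those traces, and the function name `find_soft_lockups` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the watchdog lines quoted in this report, e.g.
# "... kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:2:18895]"
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup line found in an iterable of journal lines."""
    hits = []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            hits.append({
                "cpu": int(m.group("cpu")),
                "seconds": int(m.group("secs")),
                "task": m.group("task"),
                "pid": int(m.group("pid")),
            })
    return hits


# Sample lines copied from the traces quoted in this report.
sample = [
    "Sep 28 17:28:24 p72 kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:2:18895]",
    "Sep 28 17:28:24 p72 kernel: Workqueue: usb_hub_wq hub_event [usbcore]",
    "Nov 19 08:37:18 hpprol2 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [systemd:1]",
]

lockups = find_soft_lockups(sample)
per_task = Counter(hit["task"] for hit in lockups)
for hit in lockups:
    print(hit)
```

To run it on a saved log instead of the sample, pass an open file object to `find_soft_lockups`, since it accepts any iterable of lines.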

As far as USB is concerned, I have a Thunderbolt 3 dock connected and audio was playing through it (via USB audio) when it happened. I also had an Android device connected directly to the laptop. There is also a Logitech Unifying receiver connected to the laptop and a keyboard connected to the dock.
I can include the output of hwinfo if necessary.

This freeze is rare in the grand scheme of things, but it also happened once 2 weeks ago with a previous 5.4.x kernel. At first I blamed it on the NVIDIA driver and did not investigate further, but now I'm not so sure, given that the stack trace refers to USB.
Comment 1 Michael Pujos 2020-09-28 15:49:13 UTC
More on this.

This happened again today, but it was not a complete lockup this time.

I was working as usual when suddenly adb (command line tool to communicate with an Android device over USB) stopped responding, and at the same time audio playing via USB to my Thunderbolt dock also cut out intermittently for several seconds. The adb process was unkillable with 'kill -9'.

'top' indicated that the culprit is the "kworker/7:2+usb_hub_wq" process, taking 100% CPU all the time, with regular traces like the one below in the journal. At that stage the machine was really unstable (Ethernet networking from the TB3 dock gone, temporary lockups) and I had to force power off the machine with the power button (as /sbin/poweroff remained stuck). So on my system, USB goes berserk at some point...

Sep 28 17:28:24 p72 kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:2:18895]
Sep 28 17:28:24 p72 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq st sr_mod cdrom lp parport_pc ppdev parport rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat md4 iptable_mangle>
Sep 28 17:28:24 p72 kernel:  mei_wdt iTCO_vendor_support intel_rapl_msr fuse fat mac80211 snd_hda_codec_generic snd_soc_core kvm snd_compress snd_usb_audio snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg libarc4 irqbypass efi_pstore snd_hda_codec btusb >
Sep 28 17:28:24 p72 kernel:  xhci_pci_renesas fb_sys_fops cec xhci_hcd rc_core aesni_intel drm glue_helper usbcore crypto_simd cryptd nvme nvme_core rtsx_pci serio_raw wmi battery pinctrl_cannonlake video pinctrl_intel button btrfs blake2b_generic libcrc>
Sep 28 17:28:24 p72 kernel: CPU: 7 PID: 18895 Comm: kworker/7:2 Kdump: loaded Tainted: P     U  W  OEL    5.8.10-1-default #1 openSUSE Tumbleweed
Sep 28 17:28:24 p72 kernel: Hardware name: LENOVO 20MBCTO1WW/20MBCTO1WW, BIOS N2CET50W (1.33 ) 01/15/2020
Sep 28 17:28:24 p72 kernel: Workqueue: usb_hub_wq hub_event [usbcore]
Sep 28 17:28:24 p72 kernel: RIP: 0010:try_to_grab_pending+0xa0/0x170
Sep 28 17:28:24 p72 kernel: Code: e7 e8 c4 b5 94 00 48 8b 03 a8 04 74 0d 48 25 00 ff ff ff 74 05 4c 39 20 74 64 4c 89 e7 c6 07 00 0f 1f 40 00 48 8b 7d 00 57 9d <0f> 1f 44 00 00 48 8b 13 b8 fe ff ff ff 83 e2 14 48 83 fa 10 74 85
Sep 28 17:28:24 p72 kernel: RSP: 0018:ffffb1ebc642fac0 EFLAGS: 00000286
Sep 28 17:28:24 p72 kernel: RAX: 00000000000001c1 RBX: ffff95d78718f790 RCX: 0000000000000000
Sep 28 17:28:24 p72 kernel: RDX: 0000000000000001 RSI: ffff95d787802518 RDI: 0000000000000286
Sep 28 17:28:24 p72 kernel: RBP: ffffb1ebc642fae8 R08: ffff95db1d3ee000 R09: ffffffff82e5c6d8
Sep 28 17:28:24 p72 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff95db1d3ee000
Sep 28 17:28:24 p72 kernel: R13: ffff95da7f248000 R14: ffff95d78718f020 R15: ffff95d78718f440
Sep 28 17:28:24 p72 kernel: FS:  0000000000000000(0000) GS:ffff95db1d3c0000(0000) knlGS:0000000000000000
Sep 28 17:28:24 p72 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 28 17:28:24 p72 kernel: CR2: 00007f2eec84f300 CR3: 000000019d60a005 CR4: 00000000003606e0
Sep 28 17:28:24 p72 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 28 17:28:24 p72 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 28 17:28:24 p72 kernel: Call Trace:
Sep 28 17:28:24 p72 kernel:  __cancel_work_timer+0x3c/0x190
Sep 28 17:28:24 p72 kernel:  ? _cond_resched+0x16/0x40
Sep 28 17:28:24 p72 kernel:  ? usb_kill_urb.part.0+0x30/0xa0 [usbcore]
Sep 28 17:28:24 p72 kernel:  acm_disconnect+0x13f/0x280 [cdc_acm]
Sep 28 17:28:24 p72 kernel:  usb_unbind_interface+0x8a/0x270 [usbcore]
Sep 28 17:28:24 p72 kernel:  __device_release_driver+0x15c/0x210
Sep 28 17:28:24 p72 kernel:  device_release_driver+0x24/0x30
Sep 28 17:28:24 p72 kernel:  bus_remove_device+0xdb/0x140
Sep 28 17:28:24 p72 kernel:  device_del+0x16f/0x2d0
Sep 28 17:28:24 p72 kernel:  ? kobject_cleanup+0x4f/0x140
Sep 28 17:28:24 p72 kernel:  usb_disable_device+0xc6/0x1f0 [usbcore]
Sep 28 17:28:24 p72 kernel:  usb_disconnect.cold+0x7e/0x20a [usbcore]
Sep 28 17:28:24 p72 kernel:  hub_port_connect+0x8a/0x820 [usbcore]
Sep 28 17:28:24 p72 kernel:  hub_port_connect_change+0xae/0x350 [usbcore]
Sep 28 17:28:24 p72 kernel:  port_event+0x321/0x500 [usbcore]
Sep 28 17:28:24 p72 kernel:  hub_event+0x1db/0x440 [usbcore]
Sep 28 17:28:24 p72 kernel:  process_one_work+0x1e3/0x3b0
Sep 28 17:28:24 p72 kernel:  worker_thread+0x46/0x340
Sep 28 17:28:24 p72 kernel:  ? process_one_work+0x3b0/0x3b0
Sep 28 17:28:24 p72 kernel:  kthread+0x11b/0x140
Sep 28 17:28:24 p72 kernel:  ? __kthread_bind_mask+0x60/0x60
Sep 28 17:28:24 p72 kernel:  ret_from_fork+0x1f/0x30
Comment 2 Takashi Iwai 2020-09-29 16:42:08 UTC
It seems something went south in the cdc-acm driver.
Adding Oliver to Cc, as he has worked on this.
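
The pointer to cdc-acm comes from reading the call trace in comment 1: the innermost frame that belongs to a loadable module and is not marked with "?" is acm_disconnect [cdc_acm]. As a rough sketch of that reading (the helper names `parse_call_trace` and `innermost_module_frame` are hypothetical, the frame pattern is modelled only on the traces quoted in this report, and the result is a triage hint rather than a diagnosis):

```python
import re

# A frame line looks like " acm_disconnect+0x13f/0x280 [cdc_acm]". A leading "?"
# marks an address found on the stack that the unwinder could not confirm as
# part of the call chain.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<unreliable>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def parse_call_trace(lines):
    """Return (function, module or None, reliable) tuples for the first call trace."""
    frames = []
    in_trace = False
    for line in lines:
        if line.rstrip().endswith("Call Trace:"):
            in_trace = True
            continue
        if not in_trace:
            continue
        m = FRAME_RE.search(line)
        if not m:
            break  # first non-frame line ends the trace
        frames.append((m.group("func"), m.group("module"), m.group("unreliable") is None))
    return frames


def innermost_module_frame(frames):
    """Return (function, module) of the first reliable frame that names a module."""
    for func, module, reliable in frames:
        if reliable and module is not None:
            return func, module
    return None


# Lines copied from the trace in comment 1.
sample = [
    "Sep 28 17:28:24 p72 kernel: Call Trace:",
    "Sep 28 17:28:24 p72 kernel:  __cancel_work_timer+0x3c/0x190",
    "Sep 28 17:28:24 p72 kernel:  ? _cond_resched+0x16/0x40",
    "Sep 28 17:28:24 p72 kernel:  ? usb_kill_urb.part.0+0x30/0xa0 [usbcore]",
    "Sep 28 17:28:24 p72 kernel:  acm_disconnect+0x13f/0x280 [cdc_acm]",
    "Sep 28 17:28:24 p72 kernel:  usb_unbind_interface+0x8a/0x270 [usbcore]",
]

frames = parse_call_trace(sample)
suspect = innermost_module_frame(frames)
print(suspect)
```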
Comment 3 Michael Pujos 2020-11-11 11:47:41 UTC
Still happening from time to time when unplugging/plugging my Samsung Galaxy S9 while adb is running. I need to hard power off the machine when this happens (the poweroff command remains stuck).

Here with kernel 5.9.1:

Nov 11 12:41:24 p72 kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:4:31218]
Nov 11 12:41:24 p72 kernel: Modules linked in: cdc_acm vhost_net vhost tap vhost_iotlb tun snd_seq_dummy snd_hrtimer snd_seq rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip>
Nov 11 12:41:24 p72 kernel:  x86_pkg_temp_thermal intel_powerclamp snd_hda_intel coretemp nls_iso8859_1 snd_intel_dspcfg nls_cp437 kvm_intel snd_hda_codec snd_usb_audio mac80211 vfat fuse kvm fat libarc4 irqbypass btusb snd_usbmidi_lib jo>
Nov 11 12:41:24 p72 kernel:  xhci_pci_renesas drm xhci_hcd aesni_intel nvme glue_helper crypto_simd cryptd usbcore nvme_core serio_raw rtsx_pci wmi battery video pinctrl_cannonlake pinctrl_intel button btrfs blake2b_generic libcrc32c crc3>
Nov 11 12:41:24 p72 kernel: CPU: 7 PID: 31218 Comm: kworker/7:4 Tainted: P S   U  W  OEL    5.9.1-1-default #1 openSUSE Tumbleweed
Nov 11 12:41:24 p72 kernel: Hardware name: LENOVO 20MBCTO1WW/20MBCTO1WW, BIOS N2CET54W (1.37 ) 06/20/2020
Nov 11 12:41:24 p72 kernel: Workqueue: usb_hub_wq hub_event [usbcore]
Nov 11 12:41:24 p72 kernel: RIP: 0010:try_to_grab_pending+0xb8/0x170
Nov 11 12:41:24 p72 kernel: Code: 74 64 4c 89 e7 c6 07 00 0f 1f 40 00 48 8b 7d 00 57 9d 0f 1f 44 00 00 48 8b 13 b8 fe ff ff ff 83 e2 14 48 83 fa 10 74 85 f3 90 <48> 83 c4 08 b8 f5 ff ff ff 5b 5d 41 5c c3 48 8d 7f 20 e8 31 92 07
Nov 11 12:41:24 p72 kernel: RSP: 0018:ffffb447053cbac0 EFLAGS: 00000287
Nov 11 12:41:24 p72 kernel: RAX: 00000000fffffffe RBX: ffff8b5d67785790 RCX: 0000000000000000
Nov 11 12:41:24 p72 kernel: RDX: 0000000000000000 RSI: ffff8b5d478029a8 RDI: 0000000000000286
Nov 11 12:41:24 p72 kernel: RBP: ffffb447053cbae8 R08: ffff8b60dd3ee000 R09: ffffffff9c661c98
Nov 11 12:41:24 p72 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b60dd3ee000
Nov 11 12:41:24 p72 kernel: R13: ffff8b60a7210000 R14: ffff8b5d67785020 R15: ffff8b5d67785440
Nov 11 12:41:24 p72 kernel: FS:  0000000000000000(0000) GS:ffff8b60dd3c0000(0000) knlGS:0000000000000000
Nov 11 12:41:24 p72 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 11 12:41:24 p72 kernel: CR2: 000055cbf00f11fc CR3: 0000000434a0e002 CR4: 00000000003726e0
Nov 11 12:41:24 p72 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 11 12:41:24 p72 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 11 12:41:24 p72 kernel: Call Trace:
Nov 11 12:41:24 p72 kernel:  __cancel_work_timer+0x3c/0x190
Nov 11 12:41:24 p72 kernel:  ? _cond_resched+0x16/0x40
Nov 11 12:41:24 p72 kernel:  ? usb_kill_urb.part.0+0x30/0xa0 [usbcore]
Nov 11 12:41:24 p72 kernel:  acm_disconnect+0x13f/0x280 [cdc_acm]
Nov 11 12:41:24 p72 kernel:  usb_unbind_interface+0x8a/0x270 [usbcore]
Nov 11 12:41:24 p72 kernel:  ? kernfs_find_ns+0x35/0xd0
Nov 11 12:41:24 p72 kernel:  __device_release_driver+0x16b/0x220
Nov 11 12:41:24 p72 kernel:  device_release_driver+0x24/0x30
Nov 11 12:41:24 p72 kernel:  bus_remove_device+0xdb/0x140
Nov 11 12:41:24 p72 kernel:  device_del+0x16f/0x3f0
Nov 11 12:41:24 p72 kernel:  ? kobject_cleanup+0x4f/0x140
Nov 11 12:41:24 p72 kernel:  usb_disable_device+0xc6/0x1f0 [usbcore]
Nov 11 12:41:24 p72 kernel:  usb_disconnect.cold+0x7e/0x20a [usbcore]
Nov 11 12:41:24 p72 kernel:  hub_port_connect+0x8a/0x820 [usbcore]
Nov 11 12:41:24 p72 kernel:  hub_port_connect_change+0xae/0x350 [usbcore]
Nov 11 12:41:24 p72 kernel:  port_event+0x321/0x500 [usbcore]
Nov 11 12:41:24 p72 kernel:  hub_event+0x1db/0x440 [usbcore]
Nov 11 12:41:24 p72 kernel:  process_one_work+0x1e3/0x3b0
Nov 11 12:41:24 p72 kernel:  worker_thread+0x46/0x340
Nov 11 12:41:24 p72 kernel:  ? process_one_work+0x3b0/0x3b0
Nov 11 12:41:24 p72 kernel:  kthread+0x11b/0x140
Nov 11 12:41:24 p72 kernel:  ? __kthread_bind_mask+0x60/0x60
Nov 11 12:41:24 p72 kernel:  ret_from_fork+0x1f/0x30
Comment 4 Philippe Condé 2020-11-19 12:48:10 UTC
Hello,

I'm on Tumbleweed and my system is updated on each snapshot via "zypper dup".

I see the same problem from time to time: sometimes it occurs when I try to unlock the system, sometimes when I scroll in Firefox on a page that is still loading.

I found this error:
"Nov 19 08:36:57 hpprol2 systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Nov 19 08:36:57 hpprol2 systemd[1]: systemd-udevd.service: Killing process 618 (systemd-udevd) with signal SIGABRT."

This is followed by this watchdog error, repeated more than 25 times.

"Nov 19 08:37:18 hpprol2 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [systemd:1]
Nov 19 08:37:18 hpprol2 kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse pppoe pppox af_packet ppp_generic slhc 8021q garp mrp xt_TCPMSS xt_state nf_nat_tftp n>
Nov 19 08:37:18 hpprol2 kernel:  snd_rawmidi snd_seq_device snd_pcm acpi_ipmi snd_timer snd soundcore ipmi_si thermal ipmi_devintf ipmi_msghandler tiny_power_button nfsd auth_r>
Nov 19 08:37:18 hpprol2 kernel: CPU: 1 PID: 1 Comm: systemd Tainted: G S                5.9.1-2-default #1 openSUSE Tumbleweed
Nov 19 08:37:18 hpprol2 kernel: Hardware name: HP ProLiant ML350p Gen8, BIOS P72 11/14/2013
Nov 19 08:37:18 hpprol2 kernel: RIP: e030:smp_call_function_many_cond+0x299/0x2e0
Nov 19 08:37:18 hpprol2 kernel: Code: 89 fe e8 ba 67 43 00 3b 05 88 c1 83 01 89 c7 0f 83 f9 fd ff ff 48 63 c7 49 8b 16 48 03 14 c5 00 c9 3c 82 8b 42 08 a8 01 74 09 <f3> 90 8b 4>
Nov 19 08:37:18 hpprol2 kernel: RSP: e02b:ffffc9004002fb48 EFLAGS: 00000202
Nov 19 08:37:18 hpprol2 kernel: RAX: 0000000000000011 RBX: ffff88838126f5c8 RCX: 0000000000000009
Nov 19 08:37:18 hpprol2 kernel: RDX: ffff8883814753a0 RSI: 0000000000000000 RDI: 0000000000000009
Nov 19 08:37:18 hpprol2 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000009
Nov 19 08:37:18 hpprol2 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Nov 19 08:37:18 hpprol2 kernel: R13: 0000000000000200 R14: ffff88838126f580 R15: ffff88838126f588
Nov 19 08:37:18 hpprol2 kernel: FS:  00007f02caf64940(0000) GS:ffff888381240000(0000) knlGS:0000000000000000
Nov 19 08:37:18 hpprol2 kernel: CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 19 08:37:18 hpprol2 kernel: CR2: 000055a28e658098 CR3: 00000003773ae000 CR4: 0000000000040660
Nov 19 08:37:18 hpprol2 kernel: Call Trace:
Nov 19 08:37:18 hpprol2 kernel:  ? __flush_tlb_all+0x30/0x30
Nov 19 08:37:18 hpprol2 kernel:  ? __flush_tlb_all+0x30/0x30
Nov 19 08:37:18 hpprol2 kernel:  on_each_cpu+0x2b/0x60
Nov 19 08:37:18 hpprol2 kernel:  __purge_vmap_area_lazy+0x5d/0x670
Nov 19 08:37:18 hpprol2 kernel:  ? do_jit+0xbe6/0x1ca0
Nov 19 08:37:18 hpprol2 kernel:  _vm_unmap_aliases.part.0+0x104/0x140
Nov 19 08:37:18 hpprol2 kernel:  change_page_attr_set_clr+0xb9/0x1c0
Nov 19 08:37:18 hpprol2 kernel:  set_memory_ro+0x26/0x30
Nov 19 08:37:18 hpprol2 kernel:  bpf_int_jit_compile+0x329/0x38f
Nov 19 08:37:18 hpprol2 kernel:  bpf_prog_select_runtime+0x101/0x1a0
Nov 19 08:37:18 hpprol2 kernel:  bpf_prog_load+0x47b/0x8b0
Nov 19 08:37:18 hpprol2 kernel:  ? _cond_resched+0x16/0x40
Nov 19 08:37:18 hpprol2 kernel:  ? slab_pre_alloc_hook.constprop.0+0xd0/0x110
Nov 19 08:37:18 hpprol2 kernel:  ? _kstrtoull+0x35/0xd0
Nov 19 08:37:18 hpprol2 kernel:  __do_sys_bpf+0x405/0x750
Nov 19 08:37:18 hpprol2 kernel:  do_syscall_64+0x33/0x80
Nov 19 08:37:18 hpprol2 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 19 08:37:18 hpprol2 kernel: RIP: 0033:0x7f02cba6357d
Nov 19 08:37:18 hpprol2 kernel: Code: d1 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f>
Nov 19 08:37:18 hpprol2 kernel: RSP: 002b:00007fffdccf74a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
Nov 19 08:37:18 hpprol2 kernel: RAX: ffffffffffffffda RBX: 000055a28e648990 RCX: 00007f02cba6357d
Nov 19 08:37:18 hpprol2 kernel: RDX: 0000000000000070 RSI: 00007fffdccf74b0 RDI: 0000000000000005
Nov 19 08:37:18 hpprol2 kernel: RBP: 0000000000000000 R08: 000055a28e3b6010 R09: 0000000800000008
Nov 19 08:37:18 hpprol2 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000055a28e5d5490
Nov 19 08:37:18 hpprol2 kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 000055a28e641f30
Nov 19 08:37:46 hpprol2 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [systemd:1]
...."

The system is then locked (Num Lock doesn't respond and switching to a VT is not possible), so I need to do a hard restart.

Regards
Philippe
Comment 5 Miroslav Beneš 2022-01-14 14:44:46 UTC
Philippe, yours seems to be a different bug. If it still persists with the latest TW kernel, could you report it separately, please?
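
One rough way to see why the two reports look unrelated is to compare the call traces: Michael's goes through the USB hub workqueue into acm_disconnect, while Philippe's goes through the bpf() syscall into a TLB flush. The sketch below computes a Jaccard similarity over the reliable (non-"?") frames hand-copied from comments 1 and 4. The `jaccard` helper is illustrative, and a similarity score like this is only a triage heuristic, not proof either way.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Reliable frames from the trace in comment 1 (kworker on usb_hub_wq).
usb_frames = [
    "__cancel_work_timer", "acm_disconnect", "usb_unbind_interface",
    "__device_release_driver", "device_release_driver", "bus_remove_device",
    "device_del", "usb_disable_device", "usb_disconnect.cold",
    "hub_port_connect", "hub_port_connect_change", "port_event", "hub_event",
    "process_one_work", "worker_thread", "kthread", "ret_from_fork",
]

# Reliable frames from the trace in comment 4 (systemd, PID 1).
bpf_frames = [
    "on_each_cpu", "__purge_vmap_area_lazy", "_vm_unmap_aliases.part.0",
    "change_page_attr_set_clr", "set_memory_ro", "bpf_int_jit_compile",
    "bpf_prog_select_runtime", "bpf_prog_load", "__do_sys_bpf",
    "do_syscall_64", "entry_SYSCALL_64_after_hwframe",
]

similarity = jaccard(usb_frames, bpf_frames)
shared = sorted(set(usb_frames) & set(bpf_frames))
print(f"similarity={similarity:.2f} shared={shared}")
```

The two traces share no frames at all, whereas the reliable frames of the traces in comments 1 and 3 are identical.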

Michael, does the issue still exist with the latest TW kernel, please?