Bugzilla – Bug 1106840
Kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 14032s!
Last modified: 2019-11-21 09:46:16 UTC
Created attachment 781634 [details]
Output of hwinfo for virtual guest

Once or twice a week we get a series of "BUG: workqueue lockup" messages in the journal. For example, today we had a whole series of them, the last three being:

Sep 02 06:01:29 nocc kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 14186s!
Sep 02 06:01:29 nocc kernel: Showing busy workqueues and worker pools:
Sep 02 06:01:29 nocc kernel: workqueue events: flags=0x0
Sep 02 06:01:29 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
Sep 02 06:01:29 nocc kernel: pending: vmstat_shepherd, cache_reap, wakeup_dirtytime_writeback
Sep 02 06:01:29 nocc kernel: workqueue events_long: flags=0x0
Sep 02 06:01:29 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:01:29 nocc kernel: pending: gc_worker [nf_conntrack]
Sep 02 06:01:29 nocc kernel: workqueue mm_percpu_wq: flags=0x8
Sep 02 06:01:29 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:01:29 nocc kernel: pending: vmstat_update
Sep 02 06:01:59 nocc kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 14217s!
Sep 02 06:01:59 nocc kernel: Showing busy workqueues and worker pools:
Sep 02 06:01:59 nocc kernel: workqueue events: flags=0x0
Sep 02 06:01:59 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
Sep 02 06:01:59 nocc kernel: pending: vmstat_shepherd, cache_reap, wakeup_dirtytime_writeback
Sep 02 06:01:59 nocc kernel: workqueue events_long: flags=0x0
Sep 02 06:01:59 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:01:59 nocc kernel: pending: gc_worker [nf_conntrack]
Sep 02 06:01:59 nocc kernel: workqueue mm_percpu_wq: flags=0x8
Sep 02 06:01:59 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:01:59 nocc kernel: pending: vmstat_update
Sep 02 06:02:30 nocc kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 14247s!
Sep 02 06:02:30 nocc kernel: Showing busy workqueues and worker pools:
Sep 02 06:02:30 nocc kernel: workqueue events: flags=0x0
Sep 02 06:02:30 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
Sep 02 06:02:30 nocc kernel: pending: vmstat_shepherd, cache_reap, wakeup_dirtytime_writeback
Sep 02 06:02:30 nocc kernel: workqueue events_long: flags=0x0
Sep 02 06:02:30 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:02:30 nocc kernel: pending: gc_worker [nf_conntrack]
Sep 02 06:02:30 nocc kernel: workqueue mm_percpu_wq: flags=0x8
Sep 02 06:02:30 nocc kernel: pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:02:30 nocc kernel: pending: vmstat_update

So the pool had been stuck for almost four hours. There are no other messages in the journal in between these messages. It happens on several machines, seemingly at random. The machines are all paravirtualized guests running on Xen 4.10. We can't find anything in the journal of the virtualization host during the time of the lockup. The output of hwinfo for the virtual guest is attached.
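For completeness, a minimal sketch of how these reports can be pulled out of the guest's journal (assuming systemd-journald, which the timestamps above suggest; the grep context width of 12 lines is just an illustrative choice):

    # kernel messages from the current boot, showing each lockup report
    # together with its "Showing busy workqueues and worker pools" dump
    journalctl -k -b --no-pager | grep -B1 -A12 "workqueue lockup"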
The same issue appeared in TW Build20190414:
https://openqa.opensuse.org/tests/908300#step/curl_https/2

Logs: https://openqa.opensuse.org/tests/908300#downloads

[  132.293565] show_signal_msg: 1 callbacks suppressed
[  132.293568] pool-tracker-st[3722]: segfault at ffffffffffffffb0 ip 00007f3aa471400b sp 00007f3aa132f5f0 error 5 in libtracker-data.so[7f3aa46f1000+47000]
[  132.295635] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  132.666166] pool-tracker-st[3739]: segfault at ffffffffffffffb0 ip 00007ff94db0a00b sp 00007ff94a725510 error 5 in libtracker-data.so[7ff94dae7000+47000]
[  132.667569] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  133.145171] pool-tracker-st[3750]: segfault at ffffffffffffffb0 ip 00007feb6eae500b sp 00007feb6b700660 error 5 in libtracker-data.so[7feb6eac2000+47000]
[  133.146591] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  133.667361] pool-tracker-st[3763]: segfault at ffffffffffffffb0 ip 00007f4c4c8db00b sp 00007f4c494f5be0 error 5 in libtracker-data.so[7f4c4c8b8000+47000]
[  133.668776] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  134.139026] pool-tracker-st[3777]: segfault at ffffffffffffffb0 ip 00007efef04f300b sp 00007efeed10e660 error 5 in libtracker-data.so[7efef04d0000+47000]
[  134.141086] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  254.294698] pool-tracker-st[3910]: segfault at ffffffffffffffb0 ip 00007fb0a063500b sp 00007fb09d250660 error 5 in libtracker-data.so[7fb0a0612000+47000]
[  254.296085] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  754.423393] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 56s!
[  754.424255] Showing busy workqueues and worker pools:
[  754.424791] workqueue events: flags=0x0
[  754.425205] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[  754.425807] pending: free_work, check_corruption, vmstat_shepherd, cache_reap
[  754.426574] workqueue events_power_efficient: flags=0x80
[  754.427114] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  754.427751] pending: ioc_release_fn, fb_flashcursor, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work
[  754.428921] workqueue events_freezable_power_: flags=0x84
[  754.429484] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  754.430094] pending: disk_events_workfn
[  754.430541] workqueue mm_percpu_wq: flags=0x8
[  754.430997] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  754.432071] pending: vmstat_update
[  754.432641] workqueue writeback: flags=0x4e
[  754.433288] pwq 2: cpus=0 flags=0x4 nice=0 active=1/256
[  754.434191] pending: wb_workfn
[  754.434742] workqueue kblockd: flags=0x18
[  754.435336] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
[  754.436265] pending: blk_mq_timeout_work, blk_mq_timeout_work
[  792.133047] BTRFS info (device vda2): balance: start -dusage=0
I do not think that the two reports are caused by the same bug.

Many kernel services use workqueues. A "workqueue lockup" means that the workqueue was not able to finish the currently running work for a long time. There might be different reasons, namely:

+ The currently processed work item simply has too much to do.
+ The workqueue kthread is not scheduled because the CPU is occupied by a higher-priority task.
+ The virtual machine does not get enough time to run because the host is overloaded.

We need to know what is running on the system when the problem happens.

Sigh, this bug is very old. First, do you still see the problems? If yes, could you please trigger "SysRq t" while the system is locked up and provide the kernel log?
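For reference, a minimal sketch of how the "SysRq t" task dump can be triggered and collected on the affected guest (assuming root access and a kernel built with CONFIG_MAGIC_SYSRQ; the output file name is just an example):

    # enable all SysRq functions in case they are restricted
    echo 1 > /proc/sys/kernel/sysrq
    # "SysRq t": dump the state of all tasks into the kernel log
    echo t > /proc/sysrq-trigger
    # collect the kernel log of the current boot and attach it to this bug
    journalctl -k -b > sysrq-t-dump.txt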
Thanks, Petr, for the explanation. No, we don't see the problem anymore, and have not since all the affected machines were updated to a newer kernel version. So this bug can be closed as far as I'm concerned.
Theo, thanks a lot for the update. Closing this bug based on comment #3.