Bug 1106840 - Kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 14032s!
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.0
Hardware: Other
OS: Other
Priority: P5 - None
Severity: Normal
Assigned To: Petr Mladek
Depends on:
Blocks:
 
Reported: 2018-09-02 02:10 UTC by Theo Lassonder
Modified: 2019-11-21 09:46 UTC
CC: 4 users

Flags: pmladek: needinfo? (mjuffermans)


Attachments
Output of hwinfo for virtual guest (86.56 KB, text/plain)
2018-09-02 02:10 UTC, Theo Lassonder

Description Theo Lassonder 2018-09-02 02:10:59 UTC
Created attachment 781634 [details]
Output of hwinfo for virtual guest

Once or twice a week we get a series of "BUG: workqueue lockup" messages in the journal. Today, for example, there was a long run of them; the last three were:

Sep 02 06:01:29 nocc kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 14186s!
Sep 02 06:01:29 nocc kernel: Showing busy workqueues and worker pools:
Sep 02 06:01:29 nocc kernel: workqueue events: flags=0x0
Sep 02 06:01:29 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
Sep 02 06:01:29 nocc kernel:     pending: vmstat_shepherd, cache_reap, wakeup_dirtytime_writeback
Sep 02 06:01:29 nocc kernel: workqueue events_long: flags=0x0
Sep 02 06:01:29 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:01:29 nocc kernel:     pending: gc_worker [nf_conntrack]
Sep 02 06:01:29 nocc kernel: workqueue mm_percpu_wq: flags=0x8
Sep 02 06:01:29 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:01:29 nocc kernel:     pending: vmstat_update

Sep 02 06:01:59 nocc kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 14217s!
Sep 02 06:01:59 nocc kernel: Showing busy workqueues and worker pools:
Sep 02 06:01:59 nocc kernel: workqueue events: flags=0x0
Sep 02 06:01:59 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
Sep 02 06:01:59 nocc kernel:     pending: vmstat_shepherd, cache_reap, wakeup_dirtytime_writeback
Sep 02 06:01:59 nocc kernel: workqueue events_long: flags=0x0
Sep 02 06:01:59 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:01:59 nocc kernel:     pending: gc_worker [nf_conntrack]
Sep 02 06:01:59 nocc kernel: workqueue mm_percpu_wq: flags=0x8
Sep 02 06:01:59 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:01:59 nocc kernel:     pending: vmstat_update

Sep 02 06:02:30 nocc kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 14247s!
Sep 02 06:02:30 nocc kernel: Showing busy workqueues and worker pools:
Sep 02 06:02:30 nocc kernel: workqueue events: flags=0x0
Sep 02 06:02:30 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
Sep 02 06:02:30 nocc kernel:     pending: vmstat_shepherd, cache_reap, wakeup_dirtytime_writeback
Sep 02 06:02:30 nocc kernel: workqueue events_long: flags=0x0
Sep 02 06:02:30 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:02:30 nocc kernel:     pending: gc_worker [nf_conntrack]
Sep 02 06:02:30 nocc kernel: workqueue mm_percpu_wq: flags=0x8
Sep 02 06:02:30 nocc kernel:   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
Sep 02 06:02:30 nocc kernel:     pending: vmstat_update

So it's been stuck for almost 4 hours (14247 s).

There are no other entries in the journal between these messages.

It happens on several machines, seemingly at random. The machines are all paravirtualized guests running on Xen 4.10. We can't find anything in the journal of the virtualization host during the lockups.

The output of hwinfo for the virtual guest is attached.
Comment 1 Martin Loviska 2019-04-15 14:59:44 UTC
The same issue appeared in TW Build20190414
https://openqa.opensuse.org/tests/908300#step/curl_https/2
Logs:
https://openqa.opensuse.org/tests/908300#downloads

[  132.293565] show_signal_msg: 1 callbacks suppressed
[  132.293568] pool-tracker-st[3722]: segfault at ffffffffffffffb0 ip 00007f3aa471400b sp 00007f3aa132f5f0 error 5 in libtracker-data.so[7f3aa46f1000+47000]
[  132.295635] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  132.666166] pool-tracker-st[3739]: segfault at ffffffffffffffb0 ip 00007ff94db0a00b sp 00007ff94a725510 error 5 in libtracker-data.so[7ff94dae7000+47000]
[  132.667569] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  133.145171] pool-tracker-st[3750]: segfault at ffffffffffffffb0 ip 00007feb6eae500b sp 00007feb6b700660 error 5 in libtracker-data.so[7feb6eac2000+47000]
[  133.146591] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  133.667361] pool-tracker-st[3763]: segfault at ffffffffffffffb0 ip 00007f4c4c8db00b sp 00007f4c494f5be0 error 5 in libtracker-data.so[7f4c4c8b8000+47000]
[  133.668776] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  134.139026] pool-tracker-st[3777]: segfault at ffffffffffffffb0 ip 00007efef04f300b sp 00007efeed10e660 error 5 in libtracker-data.so[7efef04d0000+47000]
[  134.141086] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  254.294698] pool-tracker-st[3910]: segfault at ffffffffffffffb0 ip 00007fb0a063500b sp 00007fb09d250660 error 5 in libtracker-data.so[7fb0a0612000+47000]
[  254.296085] Code: 41 5c 41 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 49 89 f5 41 54 55 53 48 83 ec 08 48 63 1d b0 de 05 00 48 01 fb <48> 8b 43 30 8b 50 08 85 d2 75 70 48 83 7b 58 00 74 69 49 89 fc 48
[  754.423393] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 56s!
[  754.424255] Showing busy workqueues and worker pools:
[  754.424791] workqueue events: flags=0x0
[  754.425205]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[  754.425807]     pending: free_work, check_corruption, vmstat_shepherd, cache_reap
[  754.426574] workqueue events_power_efficient: flags=0x80
[  754.427114]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  754.427751]     pending: ioc_release_fn, fb_flashcursor, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work
[  754.428921] workqueue events_freezable_power_: flags=0x84
[  754.429484]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  754.430094]     pending: disk_events_workfn
[  754.430541] workqueue mm_percpu_wq: flags=0x8
[  754.430997]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  754.432071]     pending: vmstat_update
[  754.432641] workqueue writeback: flags=0x4e
[  754.433288]   pwq 2: cpus=0 flags=0x4 nice=0 active=1/256
[  754.434191]     pending: wb_workfn
[  754.434742] workqueue kblockd: flags=0x18
[  754.435336]   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
[  754.436265]     pending: blk_mq_timeout_work, blk_mq_timeout_work
[  792.133047] BTRFS info (device vda2): balance: start -dusage=0
Comment 2 Petr Mladek 2019-11-19 14:20:47 UTC
I do not think that the two reports are caused by the same bug. Many kernel services use workqueues. A "workqueue lockup" means that a workqueue was not able to finish the currently running work item for a long time. There are several possible reasons, namely:

  + The currently processed work item simply has too much to do
    (see the sketch below this list).

  + The workqueue kthread is not scheduled because the CPU is occupied
    by a higher priority task.

  + The virtual machine does not get enough time to run because the host
    is overloaded.
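
To make the first case concrete, here is a minimal, purely illustrative kernel-module sketch (module, file and function names are made up, not taken from this report): its work item busy-waits forever without sleeping, so the per-CPU worker pool never finishes the item, the other queued items stay "pending", and the watchdog eventually prints exactly this kind of "BUG: workqueue lockup" message.

/* wq_hog.c - hypothetical example only, not part of any shipped code */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

static void hog_fn(struct work_struct *work)
{
        /* Busy-wait forever without ever sleeping.  The worker running
         * this item never finishes, and because it never sleeps, the
         * workqueue concurrency management does not wake another worker
         * for this CPU's pool, so items queued behind it (vmstat_update,
         * cache_reap, ...) stay "pending". */
        for (;;)
                mdelay(100);
}

static DECLARE_WORK(hog_work, hog_fn);

static int __init hog_init(void)
{
        /* system_wq is CPU-bound: the item runs on the pool of the CPU
         * that queued it, which the watchdog then reports as
         * "pool cpus=0 ... stuck for Ns". */
        schedule_work(&hog_work);
        return 0;
}

static void __exit hog_exit(void)
{
        /* Never reached in practice: the work item never completes, so
         * the module cannot be removed cleanly.  Sketch only. */
        cancel_work_sync(&hog_work);
}

module_init(hog_init);
module_exit(hog_exit);
MODULE_LICENSE("GPL");

Of course, the lockup messages alone do not say which of the three cases applies here; that is what the task dump requested below should show.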


We need to know what is running on the system when the problem happens.

Sigh, this bug is very old. First, do you still see the problems?

If yes, could you please trigger "SysRq t" when the system is locked and provide the kernel log?
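
For reference, SysRq-t does not require a keyboard: writing the character 't' to /proc/sysrq-trigger as root has the same effect and dumps the state of all tasks into the kernel log (collect it afterwards with "journalctl -k" or "dmesg"). A minimal C sketch of that write (the helper itself is illustrative, equivalent to "echo t > /proc/sysrq-trigger"):

/* sysrq_t.c - illustrative only */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /* Must run as root; /proc/sysrq-trigger takes a single command
         * character, 't' = dump all tasks to the kernel log. */
        FILE *f = fopen("/proc/sysrq-trigger", "w");

        if (!f) {
                perror("fopen /proc/sysrq-trigger");
                return EXIT_FAILURE;
        }
        if (fputc('t', f) == EOF || fclose(f) == EOF) {
                perror("write /proc/sysrq-trigger");
                return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
}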
Comment 3 Theo Lassonder 2019-11-19 22:22:38 UTC
Thanks, Petr, for the explanation. No, we don't see the problem anymore, and haven't since all the affected machines were updated to a newer kernel version. So this bug can be closed as far as I'm concerned.
Comment 4 Petr Mladek 2019-11-21 09:46:16 UTC
Theo, thanks a lot for the update.

Closing this bug based on comment #3.