Bug 1178911 - BUG: workqueue lockup - pool, MSI Bravo 17 kworker stuck
Status: RESOLVED NORESPONSE
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: x86-64 openSUSE Tumbleweed
Priority: P5 - None Severity: Critical
Target Milestone: ---
Assigned To: openSUSE Kernel Bugs
QA Contact: E-mail List
Depends on:
Blocks:
Reported: 2020-11-17 18:02 UTC by hanta
Modified: 2022-01-14 14:49 UTC (History)
CC List: 3 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
tiwai: needinfo? (busshanta)


Attachments
cpuinfo (23.60 KB, text/x-log)
2020-11-17 18:02 UTC, hanta
dmesg during session (84.97 KB, text/plain)
2020-11-17 18:03 UTC, hanta
lspci (3.16 KB, text/plain)
2020-11-17 18:04 UTC, hanta
glxinfo (146.25 KB, text/x-log)
2020-11-17 18:04 UTC, hanta
new dmesg log (620.71 KB, application/x-xz)
2020-11-18 13:41 UTC, hanta

Description hanta 2020-11-17 18:02:02 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0
Build Identifier: 

I have an MSI Bravo 17. I use openSUSE Tumbleweed with KDE, and the system is fully updated.
See the specs in the attached files.
Every time I play a Unity engine game or watch a video on YouTube in Firefox, I get a syslog warning message.

The syslog message reports a workqueue lockup.
My openSUSE Tumbleweed installation was upgraded from Leap, because I could not install Tumbleweed directly.

I use Steam to play games on my laptop. When I start a Unity game, I sometimes get the syslog message.



Reproducible: Always

Steps to Reproduce:
1. Start Steam.
2. Play a game or watch a video.
3. see th
Actual Results:  
A syslog message pops up telling me that a kworker has been stuck for a number of seconds, and sometimes the system freezes entirely.

Expected Results:  
The system should not freeze; the kworker should recover and resume processing its queue.
Comment 1 hanta 2020-11-17 18:02:42 UTC
Created attachment 843682 [details]
cpuinfo
Comment 2 hanta 2020-11-17 18:03:33 UTC
Created attachment 843683 [details]
dmesg during session
Comment 3 hanta 2020-11-17 18:04:11 UTC
Created attachment 843684 [details]
lspci
Comment 4 hanta 2020-11-17 18:04:52 UTC
Created attachment 843685 [details]
glxinfo
Comment 5 hanta 2020-11-17 18:05:54 UTC
uname: Linux 5.9.1-2-default #1 SMP Mon Oct 26 07:02:23 UTC 2020 (435e92d) x86_64 x86_64 x86_64 GNU/Linux
Comment 6 Takashi Iwai 2020-11-18 09:35:18 UTC
Is the kernel Oops stack trace recorded in the early log?  The attached dmesg output doesn't include that part, so it's very difficult to know what went wrong.

If you can reproduce and get the dmesg output containing it or if you find the relevant Oops from the early messages, please upload it.

Last but not least, try the newer 5.9.x kernel in OBS Kernel:stable repo.  It might have been already addressed.
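
One way to locate such a trace in a large saved log is to scan for the strings the kernel prints when it hits a BUG or an Oops. The following is only a minimal Python sketch, not an official tool: the function name `find_oops_markers` and the marker list are assumptions chosen for illustration, and the list is not exhaustive.

```python
import re

# Strings the kernel commonly prints at the start of a crash report.
# This list is an illustrative assumption, not a complete catalogue.
MARKERS = re.compile(
    r"cut here|kernel BUG at|Oops:|BUG: |Call Trace:|general protection fault"
)


def find_oops_markers(lines):
    """Return (line_number, text) pairs for lines that contain a marker.

    Line numbers are 1-based so they match what a text editor shows.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if MARKERS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file object (for example `find_oops_markers(open("dmesg.log", errors="replace"))`, where `dmesg.log` is a placeholder file name) yields the line numbers to look at, which can then be quoted in the report.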
Comment 7 hanta 2020-11-18 13:41:48 UTC
Created attachment 843704 [details]
new dmesg log

Look at 2020-11-15T08:58:08.358937+01:00 or line 21264 for the report.
Comment 8 hanta 2020-11-18 18:38:21 UTC
I added the updated dmesg in the attachment.
Comment 9 hanta 2020-11-19 16:07:52 UTC
(In reply to Takashi Iwai from comment #6)
> Is the kernel Oops stack trace recorded in the early log?  The attached
> dmesg output doesn't include that part, so it's very difficult to know what
> went wrong.
> 
> If you can reproduce and get the dmesg output containing it or if you find
> the relevant Oops from the early messages, please upload it.
> 
> Last but not least, try the newer 5.9.x kernel in OBS Kernel:stable repo. 
> It might have been already addressed.

The dmesg log has been updated.
Comment 10 Takashi Iwai 2020-11-23 08:04:28 UTC
There are a few stack traces found in the last log, showing AMDGPU-related problems.

The first stack trace is triggered by the AMDGPU reset, which hits a kernel BUG at mm/slub.c:304.

2020-11-15T01:03:55.878624+01:00 linux kernel: [ 3627.675607] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
2020-11-15T01:03:55.902815+01:00 linux kernel: [ 3627.699590] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
2020-11-15T01:03:55.902837+01:00 linux kernel: [ 3627.699602] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x00000036, smu fw if version = 0x00000037, smu fw version = 0x00351f00 (53.31.0)
2020-11-15T01:03:55.902842+01:00 linux kernel: [ 3627.699604] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
2020-11-15T01:03:58.696931+01:00 linux kernel: [ 3630.497375] amdgpu 0000:03:00.0: amdgpu: failed send message:     RunBtc (58)         param: 0x00000000 response 0xffffffc2
2020-11-15T01:03:58.696958+01:00 linux kernel: [ 3630.497379] amdgpu 0000:03:00.0: amdgpu: RunBtc failed!
2020-11-15T01:03:58.696961+01:00 linux kernel: [ 3630.497382] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
2020-11-15T01:03:58.696964+01:00 linux kernel: [ 3630.497580] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
2020-11-15T01:03:58.696967+01:00 linux kernel: [ 3630.497715] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-62).
2020-11-15T01:03:58.722615+01:00 linux kernel: [ 3630.519623] snd_hda_intel 0000:03:00.1: refused to change power state from D3hot to D0
2020-11-15T01:03:58.823491+01:00 linux kernel: [ 3630.624390] snd_hda_intel 0000:03:00.1: CORB reset timeout#2, CORBRP = 65535
2020-11-15T01:04:08.886729+01:00 linux kernel: [ 3640.683786] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=154, emitted seq=155
2020-11-15T01:04:08.886752+01:00 linux kernel: [ 3640.683947] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
2020-11-15T01:04:08.886755+01:00 linux kernel: [ 3640.683954] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
2020-11-15T01:04:08.886760+01:00 linux kernel: [ 3640.683984] ------------[ cut here ]------------
2020-11-15T01:04:08.886763+01:00 linux kernel: [ 3640.683986] kernel BUG at mm/slub.c:304!
2020-11-15T01:04:08.886765+01:00 linux kernel: [ 3640.683994] invalid opcode: 0000 [#1] SMP NOPTI
2020-11-15T01:04:08.886769+01:00 linux kernel: [ 3640.684002] CPU: 10 PID: 2135 Comm: kworker/10:3 Not tainted 5.9.1-2-default #1 openSUSE Tumbleweed
2020-11-15T01:04:08.886774+01:00 linux kernel: [ 3640.684006] Hardware name: Micro-Star International Co., Ltd. Bravo 17 A4DDR/MS-17FK, BIOS E17FKAMS.116 07/10/2020
2020-11-15T01:04:08.886776+01:00 linux kernel: [ 3640.684015] Workqueue: events drm_sched_job_timedout [gpu_sched]
2020-11-15T01:04:08.886780+01:00 linux kernel: [ 3640.684023] RIP: 0010:__slab_free+0x1f8/0x360
2020-11-15T01:04:08.886783+01:00 linux kernel: [ 3640.684027] Code: 44 24 20 e8 2a fc ff ff 44 8b 44 24 20 85 c0 0f 85 4a fe ff ff eb c6 41 f7 46 08 00 0d 21 00 0f 85 2c ff ff ff e9 1e ff ff ff <0f> 0b 80 4c 24 5b 80 45 31 c9 e9 8b fe ff ff 48 8d 65 d8 4c 89 e6
2020-11-15T01:04:08.886787+01:00 linux kernel: [ 3640.684033] RSP: 0018:ffff9a22431afc90 EFLAGS: 00010246
2020-11-15T01:04:08.886789+01:00 linux kernel: [ 3640.684036] RAX: ffff8cbd7f9332d0 RBX: 000000008080007e RCX: ffff8cbd7f9332c0
2020-11-15T01:04:08.886792+01:00 linux kernel: [ 3640.684039] RDX: ffff8cbd7f9332c0 RSI: ffffd8edc7fe4cc0 RDI: ffff8cbc87c43c00
2020-11-15T01:04:08.886796+01:00 linux kernel: [ 3640.684042] RBP: ffff9a22431afd30 R08: 0000000000000001 R09: ffffffffc05c4056
2020-11-15T01:04:08.886798+01:00 linux kernel: [ 3640.684045] R10: 0000000000000000 R11: 0000000000000001 R12: ffffd8edc7fe4cc0
2020-11-15T01:04:08.886801+01:00 linux kernel: [ 3640.684047] R13: ffff8cbd7f9332c0 R14: ffff8cbc87c43c00 R15: ffff8cbd7f9332c0
2020-11-15T01:04:08.886846+01:00 linux kernel: [ 3640.684051] FS:  0000000000000000(0000) GS:ffff8cbda7680000(0000) knlGS:0000000000000000
2020-11-15T01:04:08.886849+01:00 linux kernel: [ 3640.684054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-11-15T01:04:08.886853+01:00 linux kernel: [ 3640.684057] CR2: 0000557206f798d8 CR3: 000000009a40e000 CR4: 0000000000350ee0
2020-11-15T01:04:08.886856+01:00 linux kernel: [ 3640.684060] Call Trace:
2020-11-15T01:04:08.886859+01:00 linux kernel: [ 3640.684067]  ? _cond_resched+0x16/0x40
2020-11-15T01:04:08.886862+01:00 linux kernel: [ 3640.684073]  ? start_flush_work.constprop.0+0x18/0x1b0
2020-11-15T01:04:08.886866+01:00 linux kernel: [ 3640.684077]  ? __flush_work.isra.0+0x35/0x80
2020-11-15T01:04:08.886869+01:00 linux kernel: [ 3640.684088]  ? bus_find_device+0x95/0xc0
2020-11-15T01:04:08.886872+01:00 linux kernel: [ 3640.684247]  kfd_gtt_sa_free+0x56/0x80 [amdgpu]
2020-11-15T01:04:08.886876+01:00 linux kernel: [ 3640.684417]  stop_cpsch+0x96/0xc0 [amdgpu]
2020-11-15T01:04:08.886879+01:00 linux kernel: [ 3640.684576]  kgd2kfd_suspend.part.0+0x2f/0x40 [amdgpu]
2020-11-15T01:04:08.886882+01:00 linux kernel: [ 3640.684727]  kgd2kfd_pre_reset+0x35/0x50 [amdgpu]
2020-11-15T01:04:08.886885+01:00 linux kernel: [ 3640.684921]  amdgpu_device_gpu_recover.cold+0x1ec/0x67f [amdgpu]
2020-11-15T01:04:08.886889+01:00 linux kernel: [ 3640.685088]  amdgpu_job_timedout+0x11c/0x140 [amdgpu]
2020-11-15T01:04:08.886892+01:00 linux kernel: [ 3640.685098]  drm_sched_job_timedout+0x66/0xf0 [gpu_sched]
2020-11-15T01:04:08.886895+01:00 linux kernel: [ 3640.685108]  process_one_work+0x1e3/0x3b0
2020-11-15T01:04:08.886898+01:00 linux kernel: [ 3640.685116]  worker_thread+0x46/0x340
2020-11-15T01:04:08.886902+01:00 linux kernel: [ 3640.685122]  ? process_one_work+0x3b0/0x3b0
2020-11-15T01:04:08.886905+01:00 linux kernel: [ 3640.685129]  kthread+0x11b/0x140
2020-11-15T01:04:08.886908+01:00 linux kernel: [ 3640.685136]  ? __kthread_bind_mask+0x60/0x60
2020-11-15T01:04:08.886911+01:00 linux kernel: [ 3640.685143]  ret_from_fork+0x22/0x30
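
To see at a glance which module each frame belongs to, the call trace lines can be parsed mechanically. This is a minimal Python sketch under stated assumptions: the helper name `parse_frames` and the regular expression are illustrative, and the pattern only targets the `symbol+0xoffset/0xsize [module]` form seen in the log above.

```python
import re

# Matches a call-trace frame after the "[ timestamp]" prefix, for example
#   "]  kfd_gtt_sa_free+0x56/0x80 [amdgpu]"
# A leading "? " marks a frame the unwinder could not verify.
FRAME = re.compile(
    r"\]\s+(?P<guess>\? )?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def parse_frames(lines):
    """Extract symbol, module and reliability from call-trace lines.

    "module" is None for symbols built into the kernel image.
    Lines that are not frames (registers, RIP, headers) are skipped.
    """
    frames = []
    for line in lines:
        match = FRAME.search(line)
        if match:
            frames.append({
                "symbol": match.group("symbol"),
                "module": match.group("module"),
                "reliable": match.group("guess") is None,
            })
    return frames
```

Applied to the trace above, the six reliable frames from `kfd_gtt_sa_free` through `amdgpu_job_timedout` all report the module `amdgpu`, and the work item itself enters through `drm_sched_job_timedout` in `gpu_sched`.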

Other stack traces are also triggered by the AMDGPU reset, resulting in NULL dereferences in mutex_unlock or kq_uninitialize.

So the conclusion is that this is a bug in the GPU reset path of the AMDGPU driver.

Could you check whether the latest 5.9.x kernel in OBS Kernel:stable repo still shows the same behavior?

Last but not least, please don't touch the "Priority" field in Bugzilla.  This field is for developers only.  Thanks.
Comment 11 Miroslav Beneš 2022-01-14 14:49:32 UTC
No response, closing.