Bug 1179632 - Oops in handle_cmd_completion+0x7e9/0x1120
Oops in handle_cmd_completion+0x7e9/0x1120
Status: NEW
Classification: openSUSE
Product: openSUSE Distribution
Classification: openSUSE
Component: Kernel
Leap 15.2
Other Other
: P5 - None : Normal (vote)
: ---
Assigned To: openSUSE Kernel Bugs
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2020-12-05 18:25 UTC by Jeff Mahoney
Modified: 2022-01-24 09:01 UTC (History)
4 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
dmesg.txt (167.24 KB, text/plain)
2020-12-05 18:25 UTC, Jeff Mahoney
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jeff Mahoney 2020-12-05 18:25:10 UTC
Created attachment 844147 [details]
dmesg.txt

I have a UAS device that disconnects occasionally, causing a mess of I/O failures, SCSI disconnects, and USB resets.  Eventually, it oopses:

BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD db5b067 P4D db5b067 PUD db59067 PMD 0
Oops: 0000 [#1] SMP NOPTI
CPU: 7 PID: 16121 Comm: duplicacy Kdump: loaded Not tainted 5.3.18-lp152.50-default #1 openSUSE Leap 15.2
Hardware name: System manufacturer System Product Name/PRIME Z370-A II, BIOS 1202 08/15/2019
RIP: 0010:handle_cmd_completion+0x7e9/0x1120 [xhci_hcd]
Code: 0f 85 ec 06 00 00 45 0f b7 6c 24 0e 8b 04 24 48 89 ef 41 83 e5 1f 41 8d 55 ff 4c 8d 6c c5 00 49 8b 85 98 01 00 00 89 54 24 10 <48> 8b 70 08 e8 de 82 ff ff 48 89 c1 0f 1f 44 00 00 89 d9 48 c7 c2
RSP: 0000:ffffb20280270e28 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 000000000000000b RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000fffffffe0 RDI: ffff90ac0764e270
RBP: ffff90ac0764e270 R08: ffff90ab3f842a70 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff90ab3f840810
R13: ffff90ac0764e2c8 R14: ffff90ae07eee920 R15: ffff90ac0764e320
FS:  00007f2145068700(0000) GS:ffff90b34edc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000004c9e006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 ? try_to_wake_up+0x460/0x4d0
 xhci_irq+0x2c7/0x440 [xhci_hcd]
 ? run_timer_softirq+0x75/0x440
 __handle_irq_event_percpu+0x46/0x1a0
 handle_irq_event_percpu+0x30/0x80
 handle_irq_event+0x3c/0x60
 handle_edge_irq+0x9b/0x1e0
 handle_irq+0x1f/0x30
 do_IRQ+0x49/0xd0
 common_interrupt+0xf/0xf
 </IRQ>


The RIP resolves to drivers/usb/host/xhci-ring.c:1157

1148 static void xhci_handle_cmd_reset_ep(struct xhci_hcd *xhci, int slot_id,
1149                 union xhci_trb *trb, u32 cmd_comp_code)
1150 {
1151         struct xhci_virt_device *vdev;
1152         struct xhci_ep_ctx *ep_ctx;
1153         unsigned int ep_index;
1154
1155         ep_index = TRB_TO_EP_INDEX(le32_to_cpu(trb->generic.field[3]));
1156         vdev = xhci->devs[slot_id];
1157         ep_ctx = xhci_get_ep_ctx(xhci, vdev->out_ctx, ep_index);
1158         trace_xhci_handle_cmd_reset_ep(ep_ctx);
1159
1160         /* This command will only fail if the endpoint wasn't halted,
1161          * but we don't care.
1162          */
1163         xhci_dbg_trace(xhci, trace_xhci_dbg_reset_ep,
1164                 "Ignoring reset ep completion code of %u", cmd_comp_code);
1165

I expect that xhci->devs[slot_id] is NULL and is getting cleared via xhci_free_virt_device when the device disappears.
Comment 1 Jeff Mahoney 2020-12-05 18:26:07 UTC
Unfortunately, the kdump environment OOMed while capturing the vmcore so all I have is the dmesg.
Comment 2 Takashi Iwai 2020-12-05 20:32:31 UTC
As a mandatory question: is it a regression from the earlier releases?  If yes, which was the last working kernel?

Through a quick glance, dmesg showed quite lots of other errors beforehand, e.g. 
[28884.413438] sd 2:0:0:0: [sdz] tag#25 data cmplt err -32 uas-tag 22 inflight: CMD 
[28884.413443] sd 2:0:0:0: [sdz] tag#25 CDB: Read(16) 88 00 00 00 00 00 6b 8e 0b 00 00 00 01 00 00 00

but those don't seem always leading to the Oops.  Do such errors happen occasionally in the past, too?
Comment 3 Jeff Mahoney 2020-12-05 20:58:21 UTC
I don’t know if it’s a regression.  I don’t use these devices often and I started using them this week to migrate an array to larger disks so they’re seeing heavy I/O activity. I don’t expect the SCSI errors are necessarily part of it.  I think it’s the usb device disconnect in combination with the usb-level errors that are causing a use-after-free.

This is also not the most recent kernel.  I’ve installed the latest update, though nothing stood out in the changelog that looked like a fix.  I’ve also adjusted the memory reserved for kdump to (hopefully) capture a dump next time.
Comment 4 Miroslav Beneš 2022-01-21 12:21:27 UTC
Jeff, have you encountered the issue since then? It has been a while.
Comment 5 Jeff Mahoney 2022-01-22 22:16:51 UTC
I haven't, but that's more a reflection of my only using these devices when I'm migrating data between disks.
Comment 6 Miroslav Beneš 2022-01-24 09:01:12 UTC
Ok, so let's keep this open for a while and wait if you see a problem next time.