Bug 1213578

Summary: OOPS in amdgpu
Product: [openSUSE] openSUSE Distribution
Reporter: Andreas Jaeger <aj>
Component: Kernel
Assignee: Takashi Iwai <tiwai>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: aj, mge
Version: Leap 15.5
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: the two oops I could find in /var/log/messages
dmesg from home:tiwai:bsc1213578-2
dmesg from home:tiwai:bsc1213578-3

Description Andreas Jaeger 2023-07-24 06:49:26 UTC
I get an OOPS with both 5.14.21-150500.55.7-default and Takashi's 5.14.21-150500.3.g62ee467-default.
Comment 1 Andreas Jaeger 2023-07-24 06:51:32 UTC
Created attachment 868390 [details]
the two oops I could find in /var/log/messages
Comment 2 Andreas Jaeger 2023-07-24 06:52:15 UTC
# hwinfo --gfxcard
31: PCI 500.0: 0300 VGA compatible controller (VGA)             
  [Created at pci.386]
  Unique ID: Ddhb.uZbpCsxmrO5
  Parent ID: JZZT.nyyq4tDu6x8
  SysFS ID: /devices/pci0000:00/0000:00:08.1/0000:05:00.0
  SysFS BusID: 0000:05:00.0
  Hardware Class: graphics card
  Model: "ATI Picasso"
  Vendor: pci 0x1002 "ATI Technologies Inc"
  Device: pci 0x15d8 "Picasso"
  SubVendor: pci 0x17aa "Lenovo"
  SubDevice: pci 0x5127 
  Revision: 0xd1
  Driver: "amdgpu"
  Driver Modules: "amdgpu"
  Memory Range: 0xc0000000-0xcfffffff (ro,non-prefetchable)
  Memory Range: 0xd0000000-0xd01fffff (ro,non-prefetchable)
  I/O Ports: 0x1000-0x1fff (rw)
  Memory Range: 0xd0500000-0xd057ffff (rw,non-prefetchable)
  IRQ: 50 (no events)
  Module Alias: "pci:v00001002d000015D8sv000017AAsd00005127bc03sc00i00"
  Driver Info #0:
    Driver Status: amdgpu is active
    Driver Activation Cmd: "modprobe amdgpu"
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #25 (PCI bridge)

Primary display adapter: #31

# hwinfo --monitor
35: None 00.0: 10002 LCD Monitor                                
  [Created at monitor.125]
  Unique ID: rdCR.mQXMLz_WQq5
  Parent ID: Ddhb.uZbpCsxmrO5
  Hardware Class: monitor
  Model: "AUO LCD Monitor"
  Vendor: AUO "AUO"
  Device: eisa 0x573d 
  Serial ID: "0"
  Resolution: 1920x1080@60Hz
  Size: 309x174 mm
  Year of Manufacture: 2018
  Week of Manufacture: 0
  Detailed Timings #0:
     Resolution: 1920x1080
     Horizontal: 1920 1936 1952 2080 (+16 +32 +160) -hsync
       Vertical: 1080 1083 1088 1142 (+3 +8 +62) -vsync
    Frequencies: 142.60 MHz, 68.56 kHz, 60.03 Hz
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #25 (VGA compatible controller)

36: None 01.0: 10002 LCD Monitor
  [Created at monitor.125]
  Unique ID: wkFv.zdQ3vHfjlr1
  Parent ID: Ddhb.uZbpCsxmrO5
  Hardware Class: monitor
  Model: "DELL U2419H"
  Vendor: DEL "DELL"
  Device: eisa 0x4148 "DELL U2419H"
  Serial ID: "5ZC7SS2"
  Resolution: 720x400@70Hz
  Resolution: 640x480@60Hz
  Resolution: 640x480@75Hz
  Resolution: 800x600@60Hz
  Resolution: 800x600@75Hz
  Resolution: 1024x768@60Hz
  Resolution: 1024x768@75Hz
  Resolution: 1280x1024@75Hz
  Resolution: 1152x864@75Hz
  Resolution: 1280x1024@60Hz
  Resolution: 1600x900@60Hz
  Resolution: 1920x1080@60Hz
  Size: 527x296 mm
  Year of Manufacture: 2019
  Week of Manufacture: 44
  Detailed Timings #0:
     Resolution: 1920x1080
     Horizontal: 1920 2008 2052 2200 (+88 +132 +280) +hsync
       Vertical: 1080 1084 1089 1125 (+4 +9 +45) +vsync
    Frequencies: 148.50 MHz, 67.50 kHz, 60.00 Hz
  Driver Info #0:
    Max. Resolution: 1920x1080
    Vert. Sync Range: 56-76 Hz
    Hor. Sync Range: 30-83 kHz
    Bandwidth: 148 MHz
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #25 (VGA compatible controller)

37: None 02.0: 10002 LCD Monitor
  [Created at monitor.125]
  Unique ID: +rIN.8N48X7gRWVA
  Parent ID: Ddhb.uZbpCsxmrO5
  Hardware Class: monitor
  Model: "DELL U2414H"
  Vendor: DEL "DELL"
  Device: eisa 0xa0b2 "DELL U2414H"
  Serial ID: "X4J717CQ18UL"
  Resolution: 720x400@70Hz
  Resolution: 640x480@60Hz
  Resolution: 640x480@75Hz
  Resolution: 800x600@60Hz
  Resolution: 800x600@75Hz
  Resolution: 1024x768@60Hz
  Resolution: 1024x768@75Hz
  Resolution: 1280x1024@75Hz
  Resolution: 1152x864@75Hz
  Resolution: 1280x1024@60Hz
  Resolution: 1600x900@60Hz
  Resolution: 1600x1200@60Hz
  Resolution: 1920x1080@60Hz
  Size: 527x296 mm
  Year of Manufacture: 2017
  Week of Manufacture: 52
  Detailed Timings #0:
     Resolution: 1920x1080
     Horizontal: 1920 2008 2052 2200 (+88 +132 +280) +hsync
       Vertical: 1080 1084 1089 1125 (+4 +9 +45) +vsync
    Frequencies: 148.50 MHz, 67.50 kHz, 60.00 Hz
  Driver Info #0:
    Max. Resolution: 1920x1080
    Vert. Sync Range: 56-76 Hz
    Hor. Sync Range: 30-83 kHz
    Bandwidth: 148 MHz
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #25 (VGA compatible controller)
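
As a cross-check of the hwinfo output above: the "Frequencies" line of each detailed timing follows from the pixel clock and the last value of the "Horizontal" and "Vertical" rows (the totals including blanking). A minimal Python sketch of that arithmetic; the function name refresh_from_timing is made up for illustration, and the numbers are copied from the AUO panel and DELL monitor entries:

```python
def refresh_from_timing(pixel_clock_mhz, h_total, v_total):
    """Return (horizontal rate in kHz, vertical rate in Hz) for one timing."""
    h_freq_hz = pixel_clock_mhz * 1_000_000 / h_total  # lines per second
    v_freq_hz = h_freq_hz / v_total                    # frames per second
    return h_freq_hz / 1000, v_freq_hz

# Pixel clock, total pixels per line, total lines per frame,
# taken from the "Detailed Timings #0" blocks in the hwinfo output.
auo_khz, auo_hz = refresh_from_timing(142.60, 2080, 1142)
dell_khz, dell_hz = refresh_from_timing(148.50, 2200, 1125)

print(f"AUO:  {auo_khz:.2f} kHz, {auo_hz:.2f} Hz")
print(f"DELL: {dell_khz:.2f} kHz, {dell_hz:.2f} Hz")
```

Both results agree with the reported values (68.56 kHz / 60.03 Hz and 67.50 kHz / 60.00 Hz), so the mode lines themselves look consistent.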
Comment 3 Takashi Iwai 2023-07-24 08:06:08 UTC
Thanks.  This looks like the upstream issue
  https://gitlab.freedesktop.org/drm/amd/-/issues/2314

I'm building yet another test kernel with some backports in OBS home:tiwai:bsc1213578.  Please give it a try once the build finishes.
Comment 4 Takashi Iwai 2023-07-24 08:40:58 UTC
I'm also building two more test kernels in the OBS home:tiwai:bsc1213578-2 and home:tiwai:bsc1213578-3 repos.

The first one contains another upstream fix; please test it in any case to check whether it introduces further regressions.

The latter one contains a downstream fix for NULL dereferences, and it should at least work around the Oops.  If the previous two kernels don't work, please check this one.  If this is the only one that works, I'll add this workaround to the next update.
Comment 6 Andreas Jaeger 2023-07-24 08:54:23 UTC
Thanks, Takashi! Waiting for the builds now...
Comment 7 Andreas Jaeger 2023-07-24 09:21:36 UTC
Booted kernel-default-5.14.21-150500.1.1.g0e39bed.x86_64 from https://build.opensuse.org/repositories/home:tiwai:bsc1213578 - crashed when starting X11. No oops after reboot.

Now to the next one..
Comment 8 Andreas Jaeger 2023-07-24 09:25:05 UTC
I meant: no OOPS found in /var/log/messages.
Comment 9 Andreas Jaeger 2023-07-24 10:06:58 UTC
Created attachment 868394 [details]
dmesg from home:tiwai:bsc1213578-2

home:tiwai:bsc1213578-2 crashed when connecting external monitors; attaching dmesg output.
Comment 10 Andreas Jaeger 2023-07-24 10:09:10 UTC
Created attachment 868395 [details]
dmesg from home:tiwai:bsc1213578-3

home:tiwai:bsc1213578-3 produces an OOPS as well, see dmesg attachment.

BUT: I report this now from the system with two external monitors attached, so it recovered. I booted up without external monitors and then connected them.

$ uname -a
Linux t495s 5.14.21-150500.1.g06f3d0e-default #1 SMP PREEMPT_DYNAMIC Mon Jul 24 08:36:58 UTC 2023 (06f3d0e) x86_64 x86_64 x86_64 GNU/Linux
Comment 11 Takashi Iwai 2023-07-24 12:58:34 UTC
(In reply to Andreas Jaeger from comment #10)
> Created attachment 868395 [details]
> dmesg from home:tiwai:bsc1213578-3
> 
> home:tiwai:bsc1213578-3 produces an OOPS as well, see dmesg attachment.

Those are not real crashes, just kernel WARNINGs from ASSERT() macros.
They need to be fixed, of course.

> BUT: I report this now from the system with two external monitors attached,
> so it recovered. I booted up without external monitors and then connected
> them.
> 
> $ uname -a
> Linux t495s 5.14.21-150500.1.g06f3d0e-default #1 SMP PREEMPT_DYNAMIC Mon Jul
> 24 08:36:58 UTC 2023 (06f3d0e) x86_64 x86_64 x86_64 GNU/Linux

So, how does the *-3 kernel behave apart from those kernel warnings?
Does it still show other breakage?
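
To illustrate the distinction made in this comment between a real crash and a kernel WARNING, here is a rough Python sketch that sorts log lines into the two groups. The marker strings ("Oops:", "BUG:", "general protection fault", "WARNING: CPU:") are the prefixes the kernel commonly prints, but the list is an assumption and not exhaustive. The sample lines are made up for illustration and are not taken from the attachments in this report.

```python
import re

# Markers for fatal faults vs. non-fatal warnings (assumed, not exhaustive).
FATAL = re.compile(r"\b(Oops:|BUG:|general protection fault)")
WARN = re.compile(r"\bWARNING: CPU:")

def classify(line):
    """Return 'fatal', 'warning' or 'other' for one kernel log line."""
    if FATAL.search(line):
        return "fatal"
    if WARN.search(line):
        return "warning"
    return "other"

def summarize(lines):
    """Count how many lines fall into each class."""
    counts = {"fatal": 0, "warning": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts

# Hypothetical sample lines, NOT from the attached dmesg files.
sample = [
    "[   12.345678] WARNING: CPU: 2 PID: 900 at drivers/gpu/drm/example.c:42 example_fn+0x10/0x20 [amdgpu]",
    "[   12.345700] BUG: kernel NULL pointer dereference, address: 0000000000000008",
    "[   12.345800] Oops: 0000 [#1] PREEMPT SMP NOPTI",
    "[   13.000000] amdgpu 0000:05:00.0: example informational line",
]
print(summarize(sample))
```

Feeding the lines of a saved dmesg file to summarize() gives a quick idea of whether a log contains only warnings or also fatal faults.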
Comment 12 Andreas Jaeger 2023-07-24 13:29:46 UTC
The latest kernel initially had a network connection problem, and gnome-shell started without any extensions, which I was later able to enable. After that it worked fine for an hour until I rebooted.

I don't know whether the network and gnome-shell problems were related to the kernel.

Let me try that kernel again...
Comment 13 Andreas Jaeger 2023-07-24 13:36:08 UTC
Rebooted, all fine. Will use it for the next 2 hours and report if any problems arise.

No OOPS/assert - booted this time with external monitors attached directly.

uname -a
Linux t495s 5.14.21-150500.1.g06f3d0e-default #1 SMP PREEMPT_DYNAMIC Mon Jul 24 08:36:58 UTC 2023 (06f3d0e) x86_64 x86_64 x86_64 GNU/Linux
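
The kernels tried in this report can be told apart by their release string: the OBS builds carry a ".g<commit>" component (06f3d0e here, matching the hash in parentheses in the uname output), while 5.14.21-150500.55.7-default does not. A rough Python sketch of that check; the regular expression and the field names are assumptions derived only from the version strings quoted in this report, not from an official naming specification:

```python
import re

# Pattern inferred from the release strings in this report (an assumption).
RELEASE = re.compile(
    r"^(?P<base>\d+\.\d+\.\d+)-"      # upstream base version, e.g. 5.14.21
    r"(?P<build>[\d.]+?)"             # build/release counter, e.g. 150500.1
    r"(?:\.g(?P<commit>[0-9a-f]+))?"  # optional git commit suffix
    r"-(?P<flavor>\w+)$"              # kernel flavor, e.g. default
)

def parse_release(release):
    """Split a kernel release string into its parts, or return None."""
    match = RELEASE.match(release)
    return match.groupdict() if match else None

for rel in (
    "5.14.21-150500.55.7-default",
    "5.14.21-150500.1.g06f3d0e-default",
    "5.14.21-150500.158.g6eb8d8a-default",
):
    info = parse_release(rel)
    suffix = info["commit"] or "no commit suffix"
    print(f"{rel}: base={info['base']} build={info['build']} ({suffix})")
```

The release string itself can be obtained with "uname -r".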
Comment 22 Takashi Iwai 2023-08-07 12:01:58 UTC
Are there more bugs to be fixed with the latest SLE15-SP5 kernel?  (Ideally, check with the kernel in the OBS Kernel:SLE15-SP5 repo.)

If yes, could you elaborate on how to trigger them?
Comment 23 Andreas Jaeger 2023-08-07 13:05:18 UTC
OK, I downloaded the kernel from OBS Kernel:SLE15-SP5; uname -a reports:

Linux t495s 5.14.21-150500.158.g6eb8d8a-default #1 SMP PREEMPT_DYNAMIC Thu Aug 3 12:29:06 UTC 2023 (6eb8d8a) x86_64 x86_64 x86_64 GNU/Linux

Booted up fine, I'll run it now for some time and will then report back.

Thanks, Takashi!
Comment 24 Andreas Jaeger 2023-08-08 05:47:40 UTC
Still looking fine!
Comment 25 Takashi Iwai 2023-08-08 06:47:41 UTC
OK, then let's close now.  Feel free to reopen if you hit the same bug (but maybe better to open another entry as it can be a different problem).
Comment 33 Maintenance Automation 2023-08-14 08:30:30 UTC
SUSE-SU-2023:3302-1: An update that solves 28 vulnerabilities, contains two features and has 115 fixes can now be installed.

Category: security (important)
Bug References: 1150305, 1187829, 1193629, 1194869, 1206418, 1207129, 1207894, 1207948, 1208788, 1210335, 1210565, 1210584, 1210627, 1210780, 1210825, 1210853, 1211014, 1211131, 1211243, 1211738, 1211811, 1211867, 1212051, 1212256, 1212265, 1212301, 1212445, 1212456, 1212502, 1212525, 1212603, 1212604, 1212685, 1212766, 1212835, 1212838, 1212842, 1212846, 1212848, 1212861, 1212869, 1212892, 1212901, 1212905, 1212961, 1213010, 1213011, 1213012, 1213013, 1213014, 1213015, 1213016, 1213017, 1213018, 1213019, 1213020, 1213021, 1213024, 1213025, 1213032, 1213034, 1213035, 1213036, 1213037, 1213038, 1213039, 1213040, 1213041, 1213059, 1213061, 1213087, 1213088, 1213089, 1213090, 1213092, 1213093, 1213094, 1213095, 1213096, 1213098, 1213099, 1213100, 1213102, 1213103, 1213104, 1213105, 1213106, 1213107, 1213108, 1213109, 1213110, 1213111, 1213112, 1213113, 1213114, 1213116, 1213134, 1213167, 1213205, 1213206, 1213226, 1213233, 1213245, 1213247, 1213252, 1213258, 1213259, 1213263, 1213264, 1213272, 1213286, 1213287, 1213304, 1213417, 1213493, 1213523, 1213524, 1213533, 1213543, 1213578, 1213585, 1213586, 1213588, 1213601, 1213620, 1213632, 1213653, 1213705, 1213713, 1213715, 1213747, 1213756, 1213759, 1213777, 1213810, 1213812, 1213856, 1213857, 1213863, 1213867, 1213870, 1213871, 1213872
CVE References: CVE-2022-40982, CVE-2023-0459, CVE-2023-1829, CVE-2023-20569, CVE-2023-20593, CVE-2023-21400, CVE-2023-2156, CVE-2023-2166, CVE-2023-2430, CVE-2023-2985, CVE-2023-3090, CVE-2023-31083, CVE-2023-3111, CVE-2023-3117, CVE-2023-31248, CVE-2023-3212, CVE-2023-3268, CVE-2023-3389, CVE-2023-3390, CVE-2023-35001, CVE-2023-3567, CVE-2023-3609, CVE-2023-3611, CVE-2023-3776, CVE-2023-3812, CVE-2023-38409, CVE-2023-3863, CVE-2023-4004
Jira References: PED-4718, PED-4758
Sources used:
openSUSE Leap 15.5 (src): kernel-livepatch-SLE15-SP5-RT_Update_3-1-150500.11.5.1, kernel-syms-rt-5.14.21-150500.13.11.1, kernel-source-rt-5.14.21-150500.13.11.1
SUSE Linux Enterprise Live Patching 15-SP5 (src): kernel-livepatch-SLE15-SP5-RT_Update_3-1-150500.11.5.1
SUSE Real Time Module 15-SP5 (src): kernel-syms-rt-5.14.21-150500.13.11.1, kernel-source-rt-5.14.21-150500.13.11.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 34 Maintenance Automation 2023-08-14 16:30:32 UTC
SUSE-SU-2023:3311-1: An update that solves 15 vulnerabilities and has 27 fixes can now be installed.

Category: security (important)
Bug References: 1206418, 1207129, 1207948, 1210627, 1210780, 1210825, 1211131, 1211738, 1211811, 1212445, 1212502, 1212604, 1212766, 1212901, 1213167, 1213272, 1213287, 1213304, 1213417, 1213578, 1213585, 1213586, 1213588, 1213601, 1213620, 1213632, 1213653, 1213713, 1213715, 1213747, 1213756, 1213759, 1213777, 1213810, 1213812, 1213856, 1213857, 1213863, 1213867, 1213870, 1213871, 1213872
CVE References: CVE-2022-40982, CVE-2023-0459, CVE-2023-20569, CVE-2023-21400, CVE-2023-2156, CVE-2023-2166, CVE-2023-31083, CVE-2023-3268, CVE-2023-3567, CVE-2023-3609, CVE-2023-3611, CVE-2023-3776, CVE-2023-38409, CVE-2023-3863, CVE-2023-4004
Sources used:
openSUSE Leap 15.5 (src): kernel-syms-5.14.21-150500.55.19.1, kernel-default-base-5.14.21-150500.55.19.1.150500.6.6.4, kernel-livepatch-SLE15-SP5_Update_3-1-150500.11.3.4, kernel-source-5.14.21-150500.55.19.1, kernel-obs-qa-5.14.21-150500.55.19.1, kernel-obs-build-5.14.21-150500.55.19.1
Basesystem Module 15-SP5 (src): kernel-default-base-5.14.21-150500.55.19.1.150500.6.6.4, kernel-source-5.14.21-150500.55.19.1
Development Tools Module 15-SP5 (src): kernel-obs-build-5.14.21-150500.55.19.1, kernel-syms-5.14.21-150500.55.19.1, kernel-source-5.14.21-150500.55.19.1
SUSE Linux Enterprise Live Patching 15-SP5 (src): kernel-livepatch-SLE15-SP5_Update_3-1-150500.11.3.4

Comment 35 Maintenance Automation 2023-08-22 16:30:07 UTC
SUSE-SU-2023:3376-1: An update that solves 15 vulnerabilities and has 27 fixes can now be installed.

Category: security (important)
Bug References: 1206418, 1207129, 1207948, 1210627, 1210780, 1210825, 1211131, 1211738, 1211811, 1212445, 1212502, 1212604, 1212766, 1212901, 1213167, 1213272, 1213287, 1213304, 1213417, 1213578, 1213585, 1213586, 1213588, 1213601, 1213620, 1213632, 1213653, 1213713, 1213715, 1213747, 1213756, 1213759, 1213777, 1213810, 1213812, 1213856, 1213857, 1213863, 1213867, 1213870, 1213871, 1213872
CVE References: CVE-2022-40982, CVE-2023-0459, CVE-2023-20569, CVE-2023-21400, CVE-2023-2156, CVE-2023-2166, CVE-2023-31083, CVE-2023-3268, CVE-2023-3567, CVE-2023-3609, CVE-2023-3611, CVE-2023-3776, CVE-2023-38409, CVE-2023-3863, CVE-2023-4004
Sources used:
openSUSE Leap 15.5 (src): kernel-syms-azure-5.14.21-150500.33.14.1, kernel-source-azure-5.14.21-150500.33.14.1
Public Cloud Module 15-SP5 (src): kernel-syms-azure-5.14.21-150500.33.14.1, kernel-source-azure-5.14.21-150500.33.14.1
