Bug 1228093 - [amdgpu] Secondary monitor does not come up with 6.10
Summary: [amdgpu] Secondary monitor does not come up with 6.10
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: x86-64 Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Jiri Slaby
QA Contact: E-mail List
URL: https://gitlab.freedesktop.org/drm/am...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-07-18 06:33 UTC by Jiri Slaby
Modified: 2024-10-05 05:25 UTC
CC List: 3 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
patch (6.49 KB, patch)
2024-07-25 11:15 UTC, Jiri Slaby

Description Jiri Slaby 2024-07-18 06:33:07 UTC
I bisected the issue to:
commit 8b2cb32cf0c613fd937ebb49a331798985f50826
Author: Hersen Wu <hersenxs.wu@amd.com>
Date:   Mon Mar 11 18:18:34 2024 -0400

    drm/amd/display: FEC overhead should be checked once for mst slot nums

I am going to revert it in stable temporarily and report it upstream.
Comment 1 Jiri Slaby 2024-07-18 06:38:53 UTC
The monitor simply does not come up in a Plasma 6 Wayland session (it does on the console). The desktop behaves as if the monitor were there (windows open on it and the mouse cursor can move onto it), but the monitor itself stays in DPMS off.
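
One way to capture the connector state for a report like this is to dump what the kernel exposes per connector. The following is a minimal sketch, assuming the usual DRM sysfs layout (/sys/class/drm/card*-*/ with "status", "enabled" and "dpms" attributes); the function name is made up for illustration, and the values seen here may not match what the compositor believes it has set.

```python
from pathlib import Path


def read_connector_states(drm_root="/sys/class/drm"):
    """Return {connector_name: {attr: value or None}} for each DRM connector.

    Assumption: connectors appear as directories named like card1-HDMI-A-1
    and expose the text attributes 'status', 'enabled' and 'dpms'.
    """
    states = {}
    for conn in sorted(Path(drm_root).glob("card*-*")):
        entry = {}
        for attr in ("status", "enabled", "dpms"):
            try:
                entry[attr] = (conn / attr).read_text().strip()
            except OSError:
                # Attribute missing or unreadable on this system.
                entry[attr] = None
        states[conn.name] = entry
    return states


if __name__ == "__main__":
    for name, attrs in read_connector_states().items():
        print(name, attrs)
```

Comparing the output of a good and a bad kernel would show whether the kernel itself reports the connector as powered down.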

There is no difference in dmesg regarding [drm].

Reverting the above commit on top of 6.10 makes it work again.

git bisect log for reference:
> # bad: [0c3836482481200ead7b416ca80c68a29cfdaabd] Linux 6.10
> # good: [a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6] Linux 6.9
> git bisect start 'v6.10' 'v6.9' '--' 'drivers/gpu/drm/amd/'
> # bad: [27e718ac8b8194d13eee5738c4d3fd247736186e] drm/amd/display: fix disable otg wa logic in DCN316
> git bisect bad 27e718ac8b8194d13eee5738c4d3fd247736186e
> # good: [20fd14460f45a01b9ec63aa7b12e6c3c66e54fa7] drm/amdgpu: Fix 'fw_name' buffer size to prevent truncations in amdgpu_mes_init_microcode
> git bisect good 20fd14460f45a01b9ec63aa7b12e6c3c66e54fa7
> # bad: [14f9db4271ef5c78ae87237af844f03fb192d139] drm/amd/display: Enable DTBCLK DTO earlier in the sequence
> git bisect bad 14f9db4271ef5c78ae87237af844f03fb192d139
> # good: [1c5c36530a573de1a4b647b7d8c36f3b298e60ed] drm/amd/display: Set DCN351 BB and IP the same as DCN35
> git bisect good 1c5c36530a573de1a4b647b7d8c36f3b298e60ed
> # good: [d045f4ad7700c271fa1278b78ef7722f833a8068] drm/amd/swsmu: Update smu v14.0.0 headers to be 14.0.1 compatible
> git bisect good d045f4ad7700c271fa1278b78ef7722f833a8068
> # good: [029faefb7302f1079173410697b0e14d2e56e19a] drm/amdgpu: implement IRQ_STATE_ENABLE for SDMA v4.4.2
> git bisect good 029faefb7302f1079173410697b0e14d2e56e19a
> # bad: [b7a1a0ef12b81957584fef7b61e2d5ec049c7209] drm/amd/amdgpu: add pipe1 hardware support
> git bisect bad b7a1a0ef12b81957584fef7b61e2d5ec049c7209
> # bad: [60df5628144b59d5876f8ceac624a7661c336665] drm/amd/display: handle invalid connector indices
> git bisect bad 60df5628144b59d5876f8ceac624a7661c336665
> # bad: [8b2cb32cf0c613fd937ebb49a331798985f50826] drm/amd/display: FEC overhead should be checked once for mst slot nums
> git bisect bad 8b2cb32cf0c613fd937ebb49a331798985f50826
> # first bad commit: [8b2cb32cf0c613fd937ebb49a331798985f50826] drm/amd/display: FEC overhead should be checked once for mst slot nums
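
For readers who want to pull the verdicts out of a log like the one above, here is a small sketch that parses the "# good:", "# bad:" and "# first bad commit:" comment lines. The sample embeds only a few lines copied from the log; the function name and the handling of the "> " quote prefix are choices made for this illustration.

```python
import re

# A shortened excerpt of the bisect log quoted above (quote prefix included).
BISECT_LOG = """\
> # bad: [0c3836482481200ead7b416ca80c68a29cfdaabd] Linux 6.10
> # good: [a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6] Linux 6.9
> git bisect start 'v6.10' 'v6.9' '--' 'drivers/gpu/drm/amd/'
> # bad: [60df5628144b59d5876f8ceac624a7661c336665] drm/amd/display: handle invalid connector indices
> git bisect bad 60df5628144b59d5876f8ceac624a7661c336665
> # first bad commit: [8b2cb32cf0c613fd937ebb49a331798985f50826] drm/amd/display: FEC overhead should be checked once for mst slot nums
"""

LINE_RE = re.compile(
    r"^#\s*(good|bad|first bad commit):\s*\[([0-9a-f]{40})\]\s*(.*)$"
)


def summarize_bisect(log_text):
    """Return (verdicts, first_bad) parsed from bisect log comment lines."""
    verdicts = []
    first_bad = None
    for raw in log_text.splitlines():
        # Drop the "> " quoting used in the bug comment, if present.
        line = raw.lstrip("> ").strip()
        match = LINE_RE.match(line)
        if match is None:
            continue  # e.g. the "git bisect ..." command lines
        kind, sha, subject = match.groups()
        if kind == "first bad commit":
            first_bad = (sha, subject)
        else:
            verdicts.append((kind, sha, subject))
    return verdicts, first_bad


if __name__ == "__main__":
    verdicts, first_bad = summarize_bisect(BISECT_LOG)
    print(len(verdicts), "verdicts; first bad:", first_bad[0][:12])
```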
Comment 2 Jiri Slaby 2024-07-18 06:44:15 UTC
The external monitor is connected by an HDMI cable to a Lenovo dock, which is attached via Thunderbolt.

The card in question:
64:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 [1002:15bf] (rev dd) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device [17aa:50da]
        Flags: bus master, fast devsel, latency 0, IRQ 57, IOMMU group 16
        Memory at 2400000000 (64-bit, prefetchable) [size=256M]
        Memory at 78000000 (64-bit, prefetchable) [size=2M]
        I/O ports at 1000 [size=256]
        Memory at 78500000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, IntMsgNum 0
        Capabilities: [a0] MSI: Enable- Count=1/4 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2a0] Access Control Services
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [410] Physical Layer 16.0 GT/s <?>
        Capabilities: [450] Lane Margining at the Receiver
        Kernel driver in use: amdgpu
Comment 3 Daniel Schemp 2024-07-19 18:43:47 UTC
Hi, I was pointed to this ticket from a forum topic I opened. Hopefully I can help, but my issue is tied to kernel 6.9.3+ rather than 6.10, and it already shows up at boot.

Topic: https://forums.opensuse.org/t/system-crashes-when-second-daisy-chained-monitor-is-attached-with-amd-gpu-with-kernel-6-9-3/176886

System:
  Kernel: 6.9.7-1-default arch: x86_64 bits: 64 compiler: gcc v: 13.3.0
    clocksource: tsc avail: hpet,acpi_pm
    parameters: initrd=\opensuse-tumbleweed\6.9.7-1-default\initrd-78cac3084ea8018dc0df08f7fd3831a49a0967c4
    root=UUID=[REDACTED] splash=silent quiet
    security=apparmor mitigations=auto
    systemd.machine_id=[REDACTED]
  Desktop: KDE Plasma v: 6.1.2 tk: Qt v: N/A info: frameworks v: 6.3.0
    wm: kwin_x11 tools: avail: xscreensaver vt: 2 dm: SDDM Distro: openSUSE
    Tumbleweed 20240712
Graphics:
  Device-1: AMD Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT]
    vendor: XFX driver: amdgpu v: kernel arch: RDNA-2 code: Navi-2x
    process: TSMC n7 (7nm) built: 2020-22 pcie: gen: 4 speed: 16 GT/s
    lanes: 16 ports: active: DP-4 empty: DP-1, DP-2, DP-3, DP-5, HDMI-A-1,
    Writeback-1 bus-ID: 2d:00.0 chip-ID: 1002:73df class-ID: 0300
  Display: x11 server: X.Org v: 21.1.12 with: Xwayland v: 24.1.0
    compositor: kwin_x11 driver: X: loaded: modesetting unloaded: fbdev,vesa
    dri: radeonsi gpu: amdgpu display-ID: :0 screens: 1
  Screen-1: 0 s-res: 2560x1440 s-dpi: 96 s-size: 677x381mm (26.65x15.00")
    s-diag: 777mm (30.58")
  Monitor-1: DP-4 model: HP Z27u G3 serial: <filter> built: 2021
    res: 2560x1440 hz: 60 dpi: 109 gamma: 1.2 size: 597x336mm (23.5x13.23")
    diag: 685mm (27") ratio: 16:9 modes: max: 2560x1440 min: 720x400
  API: EGL v: 1.5 hw: drv: amd radeonsi platforms: device: 0 drv: radeonsi
    device: 1 drv: swrast surfaceless: drv: radeonsi x11: drv: radeonsi
    inactive: gbm,wayland
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.1.3 glx-v: 1.4
    direct-render: yes renderer: AMD Radeon RX 6700 XT (radeonsi navi22 LLVM
    18.1.8 DRM 3.57 6.9.7-1-default) device-ID: 1002:73df memory: 11.72 GiB
    unified: no
  API: Vulkan v: 1.3.283 layers: 5 device: 0 type: discrete-gpu name: AMD
    Radeon RX 6700 XT (RADV NAVI22) driver: N/A device-ID: 1002:73df
    surfaces: xcb,xlib
Comment 4 Jiri Slaby 2024-07-22 06:48:00 UTC
(In reply to Daniel Schemp from comment #3)
> Topic:
> https://forums.opensuse.org/t/system-crashes-when-second-daisy-chained-
> monitor-is-attached-with-amd-gpu-with-kernel-6-9-3/176886

That'd be a different issue. This bug is in 6.10 only. Please create a new bug. Ideally at https://gitlab.freedesktop.org/drm/amd/-/issues, so that upstream devs are made aware of the issue (or you will be pointed to some preexisting bug).
Comment 5 Jiri Slaby 2024-07-22 06:49:54 UTC
(In reply to Jiri Slaby from comment #4)
> (In reply to Daniel Schemp from comment #3)
> > Topic:
> > https://forums.opensuse.org/t/system-crashes-when-second-daisy-chained-
> > monitor-is-attached-with-amd-gpu-with-kernel-6-9-3/176886
> 
> That'd be a different issue. This bug is in 6.10 only. Please create a new
> bug. Ideally at https://gitlab.freedesktop.org/drm/amd/-/issues, so that
> upstream devs are made aware of the issue (or you will be pointed to some
> preexisting bug).

And it might be worth testing 6.10 first. E.g. from:
https://download.opensuse.org/repositories/Kernel:/stable/standard/
(this bug is fixed there)
Comment 6 OBSbugzilla Bot 2024-07-22 07:15:04 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1188940 Factory / kernel-source
Comment 7 OBSbugzilla Bot 2024-07-25 06:15:06 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1189502 Factory / kernel-source
Comment 8 Jiri Slaby 2024-07-25 11:15:21 UTC
Created attachment 876262 [details]
patch
Comment 9 OBSbugzilla Bot 2024-07-26 08:45:05 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1189731 Factory / kernel-source
Comment 10 Jiri Slaby 2024-07-26 10:58:19 UTC
Pushed to stable + master.
Comment 11 Vlastimil Babka 2024-07-31 08:51:20 UTC
(In reply to Jiri Slaby from comment #8)
> Created attachment 876262 [details]
> patch

I think this is what crashed my 6.11-rc1 from master on the T14s gen3 laptop with a dock and external monitor. See the oops:

https://paste.opensuse.org/pastes/d8a33a929c71

excerpt:

BUG: unable to handle page fault for address: 00000000000012b8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 6 UID: 0 PID: 2541 Comm: Xorg.bin Not tainted 6.11.0-rc1-1.gc7e21a2-default #1 openSUSE Tumbleweed (unreleased) 59ccf8feca6c7>
Hardware name: LENOVO 21CRS0K63K/21CRS0K63K, BIOS R22ET70W (1.40 ) 03/21/2024
RIP: 0010:compute_mst_dsc_configs_for_link+0x577/0xa90 [amdgpu]
Code: 63 56 20 48 8d 2c c7 48 b8 cf f7 53 e3 a5 9b c4 20 48 69 d2 ee 03 00 00 48 c1 ea 03 48 f7 e2 49 8b 45 40 48 89 d1 48 c1 e9 0>
RSP: 0018:ffffa37d4121f698 EFLAGS: 00010216
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000dad5a0
RDX: 000000000dad5a00 RSI: 000000000047747f RDI: ffffa37d4121f9e8
RBP: ffffa37d4121f9e8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
R13: ffffa37d4121f790 R14: ffffa37d4121f748 R15: 0000000000000000
FS:  00007f1ad194edc0(0000) GS:ffff88e6aed00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000012b8 CR3: 000000011675a000 CR4: 0000000000750ef0

objdump tells me (with RIP being 340627):

/usr/src/debug/kernel-default-6.11~rc1/linux-6.11-rc1/linux-obj/../include/linux/math64.h:29
  340620:       48 89 d1                mov    %rdx,%rcx
  340623:       48 c1 e9 04             shr    $0x4,%rcx
kbps_to_peak_pbn():
/usr/src/debug/kernel-default-6.11~rc1/linux-6.11-rc1/linux-obj/../drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.c:814
  340627:       80 b8 b8 12 00 00 00    cmpb   $0x0,0x12b8(%rax)
  34062e:       0f 85 00 00 00 00       jne    340634 <compute_mst_dsc_configs_for_link+0x584>
                        340630: R_X86_64_PC32   .text.unlikely+0x285f7
/usr/src/debug/kernel-default-6.11~rc1/linux-6.11-rc1/linux-obj/../drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.c:819
  340634:       48 c1 e1 06             shl    $0x6,%rcx
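
As a cross-check of the mapping between the oops and the objdump listing, the arithmetic can be spelled out. This is only a sketch: the addresses are copied from the excerpts above, and the helper name is made up for illustration.

```python
import re


def parse_rip(symbol_with_offset):
    """Split 'symbol+0xOFF/0xSIZE' (the oops RIP format) into its parts."""
    match = re.fullmatch(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", symbol_with_offset)
    if match is None:
        raise ValueError("unexpected RIP format: " + symbol_with_offset)
    return match.group(1), int(match.group(2), 16), int(match.group(3), 16)


name, offset, size = parse_rip("compute_mst_dsc_configs_for_link+0x577/0xa90")

# Address of the faulting instruction in the objdump listing.
faulting_insn = 0x340627

# Start of the function in the object file: instruction address minus offset.
func_start = faulting_insn - offset

# The jne in the listing targets 340634; relative to the function start this
# should equal the +0x584 that objdump prints next to it.
jne_target_offset = 0x340634 - func_start

if __name__ == "__main__":
    print(name, hex(func_start), hex(jne_target_offset))
```

Both numbers line up with the listing, which supports reading 340627 as the instruction the oops points at.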

That's the line "if (aconnector->is_synaptics_cascaded)" in kbps_to_peak_pbn(). RAX is zero, and pahole tells me is_synaptics_cascaded is indeed at offset 0x12b8, which matches the faulting address in CR2. So aconnector is NULL.
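
The reasoning can be illustrated with a stand-in structure: reading a member through a NULL base pointer touches address 0 plus the member offset. This is a sketch using ctypes; the struct below is a placeholder with opaque padding, not the real amdgpu_dm_connector layout, and only the 0x12b8 offset is taken from the pahole result above.

```python
import ctypes

# Offset of is_synaptics_cascaded as reported by pahole in this comment.
MEMBER_OFFSET = 0x12B8


class AConnectorSketch(ctypes.Structure):
    """Placeholder layout: opaque bytes followed by the flag of interest."""
    _fields_ = [
        ("_opaque", ctypes.c_ubyte * MEMBER_OFFSET),
        ("is_synaptics_cascaded", ctypes.c_bool),
    ]


def fault_address(base_pointer):
    """Address accessed when reading the flag through base_pointer."""
    return base_pointer + AConnectorSketch.is_synaptics_cascaded.offset


if __name__ == "__main__":
    # With a NULL base pointer this is the address the CPU would fault on.
    print(hex(fault_address(0)))
```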

I don't know yet which of the several callsites to kbps_to_peak_pbn() this is.
Comment 12 Vlastimil Babka 2024-07-31 08:53:59 UTC
(In reply to Vlastimil Babka from comment #11)
> I don't know yet which of the several callsites to kbps_to_peak_pbn() this
> is.

The closest preceding one in the objdump output (unless the compiler shuffled the code too much) is:

try_disable_dsc():
/usr/src/debug/kernel-default-6.11~rc1/linux-6.11-rc1/linux-obj/../drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.c:1048

vars[next_index].pbn = kbps_to_peak_pbn(params[next_index].bw_range.stream_kbps, params[i].aconnector);
Comment 13 Jiri Slaby 2024-07-31 09:39:48 UTC
Let's fall back to the revert, which upstream is likely going to do as well:
https://lore.kernel.org/all/CO6PR12MB5489857D91F3CDC7F7517D02FCB02@CO6PR12MB5489.namprd12.prod.outlook.com/
Comment 14 Jiri Slaby 2024-07-31 09:49:39 UTC
Pushed to master+stable.
Comment 15 OBSbugzilla Bot 2024-08-05 05:15:04 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1191566 Factory / kernel-source
Comment 16 Michal Kubeček 2024-08-19 07:13:10 UTC
Mainline commit 338567d17627 ("drm/amd/display: Fix MST BW calculation
Regression") in 6.11-rc4 which is supposed to revert commit 8b2cb32cf0c6
looks similar to this patch but there are differences. Someone who was
affected by this issue should probably check that everything is OK with
current master branch snapshot (based on v6.11-rc4).
Comment 17 Jiri Slaby 2024-08-19 07:51:03 UTC
(In reply to Michal Kubeček from comment #16)
> Mainline commit 338567d17627 ("drm/amd/display: Fix MST BW calculation
> Regression") in 6.11-rc4 which is supposed to revert commit 8b2cb32cf0c6
> looks similar to this patch but there are differences. Someone who was
> affected by this issue should probably check that everything is OK with
> current master branch snapshot (based on v6.11-rc4).

I am listed as Reported-by there but got no notification, weird.

The change appears to be wrong: 
-+                      vars[next_index].pbn = kbps_to_peak_pbn(params[next_index].bw_range.max_kbps, fec_overhead_multiplier_x1000);
++                      vars[next_index].pbn = kbps_to_peak_pbn(params[next_index].bw_range.stream_kbps, fec_overhead_multiplier_x1000);
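
To see why the choice of input matters, here is a rough user-space model of the conversion. It is a sketch only: the 1006/1000 margin, the 54 * 8 * 1000 divisor and the 1031 multiplier are my recollection of the driver code and should be treated as assumptions, and the bandwidth figures are invented for illustration.

```python
def kbps_to_peak_pbn_sketch(kbps, fec_overhead_multiplier_x1000):
    """Approximate model of a kbps -> PBN conversion with FEC overhead.

    Assumptions (not verified against any particular kernel tree):
    a 0.6% margin (1006/1000), the FEC multiplier expressed in 1/1000,
    and a round-up division by 54 * 8 * 1000 after scaling by 64.
    """
    peak_kbps = kbps * 1006 * fec_overhead_multiplier_x1000 // (1000 * 1000)
    numerator = peak_kbps * 64
    denominator = 54 * 8 * 1000
    return -(-numerator // denominator)  # ceiling division


# Hypothetical values for one stream: the compressed (DSC) rate and the
# uncompressed maximum. Neither number comes from this bug report.
stream_kbps = 1_000_000
max_kbps = 3_000_000
fec_x1000 = 1031  # assumed multiplier for 8b/10b links

if __name__ == "__main__":
    print("pbn from stream_kbps:", kbps_to_peak_pbn_sketch(stream_kbps, fec_x1000))
    print("pbn from max_kbps:   ", kbps_to_peak_pbn_sketch(max_kbps, fec_x1000))
```

The two variants in the diff differ only in which of these two figures is fed in, so the resulting PBN, and with it the bandwidth check, can come out quite differently.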
Comment 18 Jiri Slaby 2024-08-19 08:20:24 UTC
(In reply to Jiri Slaby from comment #17)
> The change appears to be wrong: 
> -+                      vars[next_index].pbn =
> kbps_to_peak_pbn(params[next_index].bw_range.max_kbps,
> fec_overhead_multiplier_x1000);
> ++                      vars[next_index].pbn =
> kbps_to_peak_pbn(params[next_index].bw_range.stream_kbps,
> fec_overhead_multiplier_x1000);

Fixed exactly by:
https://lore.kernel.org/all/20240815224525.3077505-13-Roman.Li@amd.com/
Comment 19 OBSbugzilla Bot 2024-08-30 06:05:02 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1197685 Factory / kernel-source
Comment 20 OBSbugzilla Bot 2024-09-05 06:25:04 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1198865 Factory / kernel-source
Comment 21 OBSbugzilla Bot 2024-09-23 09:05:05 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1202559 Factory / kernel-source
Comment 22 OBSbugzilla Bot 2024-09-24 16:55:06 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1203029 Factory / kernel-source
Comment 23 OBSbugzilla Bot 2024-09-26 06:05:03 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1203745 Factory / kernel-source
Comment 24 OBSbugzilla Bot 2024-10-05 05:25:04 UTC
This is an autogenerated message for OBS integration:
This bug (1228093) was mentioned in
https://build.opensuse.org/request/show/1205774 Factory / kernel-source