Bug 1222082

Summary: [SD-149531] After updating openSUSE to 15.5 Thunderbolt Dock 4 isn't working anymore
Product: [openSUSE] PUBLIC SUSE Linux Enterprise Desktop 15 SP5
Reporter: ralph roth <ralph.roth>
Component: Kernel
Assignee: Kernel Bugs <kernel-bugs>
Status: NEW ---
QA Contact:
Severity: Major
Priority: P3 - Medium
CC: oneukum, ralph.roth, shawn.lee, shung-hsi.yu, tiwai
Version: unspecified
Target Milestone: ---
Hardware: x86-64
OS: openSUSE Leap 15.5
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Bug Depends on:
Bug Blocks: 1222236

Description ralph roth 2024-03-27 15:49:40 UTC
SD-Ticket:  https://sd.suse.com/servicedesk/customer/portal/1/SD-149531

After troubleshooting the SD-Ticket I was referred to open this BSC

The upgrade from openSUSE 15.4 to 15.5 on my Lenovo P16 notebook went well. After the upgrade, the Thunderbolt docking station is only “half” recognized: the wired network and the 2nd monitor aren’t working anymore.

Is this a known firmware issue with openSUSE? The BIOS and the Thunderbolt Dock 4 (TBD4) firmware have since been updated to the latest versions.

LENOVO BIOS N3FET38W (1.23 ) 09/27/2023

 ● Lenovo ThinkPad Thunderbolt 4 Dock
   ├─ type:          peripheral
   ├─ name:          ThinkPad Thunderbolt 4 Dock
   ├─ vendor:        Lenovo
   ├─ uuid:          b0010080-0170-7c9c-01a0-c8db2884c808
   ├─ generation:    USB4
   ├─ status:        authorized
   │  ├─ domain:     d4030000-0091-8d08-22ec-7f0b12843120
   │  ├─ rx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  ├─ tx speed:   40 Gb/s = 2 lanes * 20 Gb/s
   │  └─ authflags:  none
   ├─ authorized:    Wed Mar 27 10:42:08 2024
   ├─ connected:     Wed Mar 27 10:42:08 2024
   └─ stored:        Fri Feb 16 15:23:36 2024
      ├─ policy:     auto
      └─ key:        no
Comment 1 ralph roth 2024-03-27 15:51:12 UTC
See also https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1942999
Comment 3 ralph roth 2024-03-27 16:00:43 UTC
Just for the record:  I meanwhile cold installed openSUSE 15.5 on this machine, problems are exactly the same.

dmesg is from this new installation

Old 15.4 Kernel with openSUSE 15.5 worked, but I meanwhile lost this kernel
Comment 4 Takashi Iwai 2024-03-27 16:07:50 UTC
Please verify whether it's a regression in the recent SP5 kernel updates.
That is, try to downgrade to the older SP5 kernels, and confirm that the problem persists.  You can start from SP5 GA kernel, for example.

Judging from the attached log, the igc device got probed, but it was detached later, spewing kernel warnings:

[ 2526.650192] igc 0000:6d:00.0 eth0: PCIe link lost, device now detached
[ 2526.650215] ------------[ cut here ]------------
[ 2526.650216] igc: Failed to read reg 0xc030!
[ 2526.650225] WARNING: CPU: 2 PID: 2974 at ../drivers/net/ethernet/intel/igc/igc_main.c:6470 igc_rd32+0x94/0xa0 [igc]

So it's likely an issue in PCIe core in Thunderbolt, I suppose.
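
To check whether the same detach sequence shows up on another boot, the relevant lines can be filtered out of the kernel log. This is only a minimal sketch: the helper name `igc_link_errors` is hypothetical, and the two patterns are taken from the messages quoted above, so other kernel versions may word them differently.

```shell
# Hypothetical helper: keep only the igc link-loss and register-read
# failure lines from a kernel log given on stdin or as file arguments.
igc_link_errors() {
    grep -E 'igc .*PCIe link lost|igc: Failed to read reg' "$@"
}

# Usage on a live system:
#   dmesg | igc_link_errors
```

An empty result after plugging in the dock would suggest the NIC stayed attached, at least as far as these two messages go.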

(In reply to ralph roth from comment #1)
> See also https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1942999

Do you see this problem?  That is, an invalid MAC address?
The patch from Ubuntu wasn't taken upstream in the end, AFAIK.
Comment 5 Takashi Iwai 2024-03-27 16:12:00 UTC
(In reply to ralph roth from comment #3)
> Just for the record:  I meanwhile cold installed openSUSE 15.5 on this
> machine, problems are exactly the same.

Leap 15.5 and SLE15-SP5 use the very same binaries, so no wonder :)

> Old 15.4 Kernel with openSUSE 15.5 worked, but I meanwhile lost this kernel

You can get the one in OBS, e.g.
  http://download.opensuse.org/update/leap/15.4/sle/x86_64/
Comment 6 ralph roth 2024-03-27 16:27:34 UTC
(In reply to Takashi Iwai from comment #4)
> Please verify whether it's a regression in the recent SP5 kernel updates.
> That is, try to downgrade to the older SP5 kernels, and confirm that the
> problem persists.  You can start from SP5 GA kernel, for example.

zypper install --oldpackage kernel-default-5.14.21-150500.53.2.x86_64

The following 2 NEW packages are going to be installed:
  dracut-mkinitrd-deprecated kernel-default-5.14.21-150500.53.2
Comment 7 ralph roth 2024-03-27 16:41:16 UTC
(In reply to ralph roth from comment #6)
> (In reply to Takashi Iwai from comment #4)
> > Please verify whether it's a regression in the recent SP5 kernel updates.
> > That is, try to downgrade to the older SP5 kernels, and confirm that the
> > problem persists.  You can start from SP5 GA kernel, for example.
> 
> zypper install --oldpackage kernel-default-5.14.21-150500.53.2.x86_64
> 
> The following 2 NEW packages are going to be installed:
>   dracut-mkinitrd-deprecated kernel-default-5.14.21-150500.53.2

NO, same errors with:

Linux p16s23 5.14.21-150500.53-default #1 SMP PREEMPT_DYNAMIC Wed May 10 07:56:26 UTC 2023 (b630043) x86_64 x86_64 x86_64 GNU/Linux
Comment 8 Shung-Hsi Yu 2024-03-28 09:46:18 UTC
(In reply to Takashi Iwai from comment #4)
> (In reply to ralph roth from comment #1)
> > See also https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1942999
> 
> Do you see this problem?  That is, an invalid MAC address?

I tried to help Ralph a bit (off Bugzilla), and the "invalid MAC address" is seen when the machine is booted with the kernel parameter `iommu=off`, which was suggested as a blind try (after trying `bolt authorize ...` to authorize the dock, which also didn't help).
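
For reference, the authorization state can be read back from a device listing like the one in the description. The sketch below assumes that listing comes from `boltctl list` (the CLI of the bolt daemon); the helper name `bolt_status` is hypothetical.

```shell
# Hypothetical helper: print the "status:" field of every device in a
# `boltctl list`-style listing read from stdin or from the given files.
bolt_status() {
    sed -n 's/^.*status:[[:space:]]*//p' "$@"
}

# Usage on a live system (assumes boltctl is installed):
#   boltctl list | bolt_status
```

In this report the dock shows `authorized` even though the NIC fails, so the status alone does not rule out the problem.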
Comment 9 Takashi Iwai 2024-03-28 10:40:32 UTC
OK, then it's a regression that has been present from SLE15-SP5 GA kernel.
The next question would be to check with a newer release, e.g. the SLE15-SP6 kernel.  Could you check the one in the OBS Kernel:SLE15-SP6 repo?
  http://download.opensuse.org/repositories/Kernel:/SLE15-SP6/pool/

Also, the recent 6.8.x kernel from OBS Kernel:stable:Backport, too:
  http://download.opensuse.org/repositories/Kernel:/stable:/Backport/standard/
Comment 10 Takashi Iwai 2024-03-28 10:43:17 UTC
(In reply to Shung-Hsi Yu from comment #8)
> (In reply to Takashi Iwai from comment #4)
> > (In reply to ralph roth from comment #1)
> > > See also https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1942999
> > 
> > Do you see this problem?  That is, an invalid MAC address?
> 
> I tried to help Ralph a bit (off Bugzilla), and the "invalid MAC address" is
> seen when the machine is booted with the kernel parameter `iommu=off`, which
> was suggested as a blind try (after trying `bolt authorize ...` to authorize
> the dock, which also didn't help).

Hm, in such an unusual situation, some workaround might be still needed.
FWIW, the patch submitted to the upstream was
  https://patchwork.kernel.org/project/netdevbpf/patch/20210702045120.22855-2-aaron.ma@canonical.com/#24308349

that just adds a delay of 600ms.  We can give it a try, too.
Comment 13 Takashi Iwai 2024-03-28 13:23:32 UTC
Good to know that SP6 kernel works.  So it's a regression specific to SP5.

The problem with the second monitor is more likely an issue in the amdgpu driver.  OTOH, the detached Ethernet device is likely a PCIe or Thunderbolt problem.

Adding Oliver to Cc.
Comment 14 Takashi Iwai 2024-03-28 13:36:11 UTC
Skimming over the net hits the following:
  https://github.com/fwupd/firmware-lenovo/issues/191
mentioning that this problem can be worked around by a BIOS setup change.

Go to the BIOS setup menu, and change
  Bios -> Config -> Thunderbolt 4 -> PCIe Tunneling
to OFF, reboot and retest.
Comment 16 Takashi Iwai 2024-03-28 14:29:07 UTC
For the monitor problem, please open another bugzilla entry.  It's a different issue from the Ethernet stuff.

The workaround via BIOS setup might be effective for the monitor, too.  Please check it.
Comment 17 ralph roth 2024-04-02 12:10:43 UTC
Workaround fixes the Ethernet NIC problem with 15.5 and 15.6 Kernel.

Monitor still *not* working at all (DP and/or HDMI cable).  

As this is only a workaround, eliminating the root cause would be nice....
Comment 18 Takashi Iwai 2024-04-02 12:14:25 UTC
Please open another bug report for the graphics problem.
Comment 19 Takashi Iwai 2024-04-02 12:15:40 UTC
Also test with 6.8.x kernel from OBS Kernel:stable:Backport repo.
If the problem persists, test with 6.9-rc kernel from OBS Kernel:HEAD:Backport repo, too.
Comment 20 ralph roth 2024-04-02 12:29:32 UTC
Please provide the URLs for zypper ar -f

Backport kernel Mid-March 24 didn't work
Comment 21 ralph roth 2024-04-02 12:32:36 UTC
(In reply to Takashi Iwai from comment #18)
> Please open another bug report for the graphics problem.


Bug 1222236 - Monitor BSC - [SD-149531] After updating openSUSE to 15.5 Thunderbolt Dock 4 isn't working anymore
Comment 22 Takashi Iwai 2024-04-02 12:39:02 UTC
(In reply to ralph roth from comment #20)
> Please provide the URLs for zypper ar -f

It's derived as http://download.opensuse.org/repositories/...., with a slash inserted after each colon in the project name.

OBS Kernel:stable:Backport
  http://download.opensuse.org/repositories/Kernel:/stable:/Backport/standard/
OBS Kernel:HEAD:Backport
  http://download.opensuse.org/repositories/Kernel:/HEAD:/Backport/standard/
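
The colon-to-path mapping can be sketched as a small shell helper. The function name `obs_repo_url` is hypothetical, and the trailing `standard/` repository name is simply copied from the two URLs listed here; other projects may use a different one (the SLE15-SP6 link earlier in this report ends in `pool/`).

```shell
# Hypothetical helper: map an OBS project name to its download URL by
# inserting a slash after every colon. The "standard" repository name is
# an assumption taken from the two URLs quoted in this comment.
obs_repo_url() {
    project=$(printf '%s' "$1" | sed 's|:|:/|g')
    printf 'http://download.opensuse.org/repositories/%s/standard/\n' "$project"
}

obs_repo_url Kernel:stable:Backport
obs_repo_url Kernel:HEAD:Backport

# Possible use with zypper (the alias "kernel-stable-backport" is made up):
#   zypper ar -f "$(obs_repo_url Kernel:stable:Backport)" kernel-stable-backport
```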

Also, you don't need to add the repo each time. You can just download kernel-default.rpm from the URL and install it directly.

> Backport kernel Mid-March 24 didn't work

Which one...?  Please elaborate.
Comment 23 ralph roth 2024-04-02 13:10:39 UTC
(In reply to Takashi Iwai from comment #19)
> Also test with 6.8.x kernel from OBS Kernel:stable:Backport repo.

$ uname -a ; cat /proc/cmdline 
Linux p16s23 6.8.2-lp155.3.g2daf2c2-default #1 SMP PREEMPT_DYNAMIC Thu Mar 28 07:04:20 UTC 2024 (2daf2c2) x86_64 x86_64 x86_64 GNU/Linux

BOOT_IMAGE=/boot/vmlinuz-6.8.2-lp155.3.g2daf2c2-default root=/dev/mapper/system-root preempt=full quiet security=apparmor drm.debug=0x1e log_buf_len=16M mitigations=auto


eth: OK
Monitor: Failed
Comment 25 Takashi Iwai 2024-04-02 13:15:05 UTC
For this bug report, drop the drm.debug option.  It produces a lot of noise that is irrelevant to the Ethernet driver problem.

For another bug report (bsc#1222236), please upload the dmesg output with the drm.debug=0x1e boot option instead.
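
After editing the boot configuration, it can help to confirm which options the running kernel really received. A minimal sketch follows; the helper name `has_boot_param` is hypothetical, and it only does a plain word match on the command line text.

```shell
# Hypothetical helper: succeed if the kernel command line on stdin
# contains the named parameter, either bare or in name=value form.
has_boot_param() {
    tr -s ' \t' '\n\n' |
        awk -v p="$1" '$0 == p || index($0, p "=") == 1 { found = 1 }
                       END { exit !found }'
}

# Usage on a live system:
#   has_boot_param drm.debug < /proc/cmdline && echo "drm.debug is still set"
```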
Comment 26 Takashi Iwai 2024-04-02 13:16:21 UTC
... and please check 6.9-rc kernel, too.  If the problem persists there, it's basically an upstream problem and the upstream devs should be involved.
Comment 27 ralph roth 2024-04-02 13:23:15 UTC
(In reply to Takashi Iwai from comment #26)
> ... and please check 6.9-rc kernel, too.  If the problem persists there,
> it's basically an upstream problem and the upstream devs should be involved.

$ uname -a; cat /proc/cmdline 
Linux p16s23 6.9.0-rc2-lp155.2.g0788112-default #1 SMP PREEMPT_DYNAMIC Sun Mar 31 23:08:51 UTC 2024 (0788112) x86_64 x86_64 x86_64 GNU/Linux
BOOT_IMAGE=/boot/vmlinuz-6.9.0-rc2-lp155.2.g0788112-default root=/dev/mapper/system-root preempt=full quiet security=apparmor drm.debug=0x1e log_buf_len=16M mitigations=auto

eth: OK
2nd Monitor: Failed
Comment 28 Takashi Iwai 2024-04-02 13:45:02 UTC
Just to be sure: eth0 test is with the BIOS setup workaround?  Or does 6.8.x / 6.9-rc kernels pass even after reverting the BIOS setup?
Comment 29 ralph roth 2024-04-02 13:46:52 UTC
(In reply to Takashi Iwai from comment #28)
> Just to be sure: eth0 test is with the BIOS setup workaround?  Or does 6.8.x
> / 6.9-rc kernels pass even after reverting the BIOS setup?

ethX:  Currently with the BIOS workaround. If needed I can check that tomorrow
Comment 30 Takashi Iwai 2024-04-02 13:54:36 UTC
Yes, please test without BIOS workaround.  If the problem persists there, it might be worth to report to the upstream, too.  OTOH, if the problem is fixed in the upstream, we can look for some materials to backport to SLE15-SP6 kernel.
Comment 31 ralph roth 2024-04-02 14:42:41 UTC
(In reply to Takashi Iwai from comment #30)
> Yes, please test without BIOS workaround.  If the problem persists there, it
> might be worth to report to the upstream, too.  OTOH, if the problem is
> fixed in the upstream, we can look for some materials to backport to
> SLE15-SP6 kernel.

Without the BIOS workaround the NIC won't work. Tested with 15.4 L&G Kernel and 6.9.0rc2 Kernel

Also:  15.4 with L&G Kernel didn't work anymore after the cold install of openSUSE Leap 15.5 :-(

Linux p16s23 5.14.21-150400.22-default #1 SMP PREEMPT_DYNAMIC Wed May 11 06:57:18 UTC 2022 (49db222) x86_64 x86_64 x86_64 GNU/Linux
Comment 32 Takashi Iwai 2024-04-02 15:38:17 UTC
Have you upgraded BIOS on Dock, or applied anything since Leap 15.4?  Just wondering what triggers the breakage.
Comment 33 ralph roth 2024-04-02 15:41:38 UTC
(In reply to Takashi Iwai from comment #32)
> Have you upgraded BIOS on Dock, or applied anything since Leap 15.4?  Just
> wondering what triggers the breakage.

Nothing that I am aware of. But I made a mistake and downloaded the 15.4 GA kernel. I will give the L&G 15.4 kernel a try later.
Comment 34 ralph roth 2024-04-15 08:01:07 UTC
Tried all available Kernels so far.
Any idea how to proceed?
Comment 35 Takashi Iwai 2024-04-19 09:47:30 UTC
If the issue is still seen with the latest upstream kernel, you should report to the upstream devs and let them fix the bugs.  Care to report from your side?  It's a hardware-specific issue and better to involve you directly.

For the graphics issue of AMDGPU, it'd be gitlab.freedesktop.org Issues.
For the network, you can try to report to bugzilla.kernel.org, too.
Comment 36 ralph roth 2024-04-29 08:58:55 UTC
I don't know how to do that.

Also according to co-workers, the Ubuntu Kernel (6.5) works fine with that hardware constellation.
Comment 37 Takashi Iwai 2024-04-29 14:05:34 UTC
You need to send a bug report to the upstream bug tracker.  e.g. for the graphics issue, at best, gitlab.freedesktop.org; choose the right component (e.g. DRM/Intel or such).  For other issues (igc and PCI core), maybe you can report to bugzilla.kernel.org.

A report from your side is best, as it's pretty much a device-specific problem and you are the one who owns the hardware and has tested / suffered from the issue.

After reporting, just let us know the URL, and I can join the reports later to assist from the distro side.