Bug 1213491

Summary: Realtek ethernet adapter stops working after update to 6.4.2 and 6.4.3
Product: [openSUSE] openSUSE Tumbleweed Reporter: Ferdinando Vivacqua <ferdinando.vivacqua>
Component: Kernel Assignee: openSUSE Kernel Bugs <kernel-bugs>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: ferdinando.vivacqua, luigi.tarenga, tiwai
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: openSUSE Tumbleweed   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=217596

Description Ferdinando Vivacqua 2023-07-19 15:41:00 UTC
With kernel versions 6.4.2 and 6.4.3, the RTL8111/8168/8411 Ethernet adapter on my laptop stops working after some time, around 1 hour. There is no issue with the same controller on a desktop PC.
It seems to be related to a power management issue with ASPM.
There is already a bug tracked upstream: https://bugzilla.kernel.org/show_bug.cgi?id=217596
Rebooting the PC temporarily solves the problem by re-enabling the Ethernet adapter, but the problem comes back.

The only workaround is to not use the 6.4.2 and 6.4.3 kernels.
Comment 1 Takashi Iwai 2023-07-19 16:06:50 UTC
The bug you referenced in the upstream bug tracker should already have hit with the 6.4 release, so this might be a different bug.

In any case, please try the pcie_aspm=force boot option with the 6.4.3 kernel.  This should be a workaround for that bug.
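
When testing a boot option like this, it helps to confirm that the running kernel was actually booted with it. A minimal Python sketch, assuming a Linux system where /proc/cmdline is readable; the helper names kernel_cmdline_options and pcie_aspm_setting are hypothetical and not part of any existing tool:

```python
from pathlib import Path


def kernel_cmdline_options(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    options = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "root=UUID=abcd" keeps its full value.
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def pcie_aspm_setting(cmdline: str):
    """Return the pcie_aspm= value from the command line, or None if absent."""
    return kernel_cmdline_options(cmdline).get("pcie_aspm")


if __name__ == "__main__":
    # /proc/cmdline holds the options the running kernel was booted with.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print("pcie_aspm =", pcie_aspm_setting(proc_cmdline.read_text()))
```

If this prints None after a reboot, the option did not reach the kernel and the test result says nothing about ASPM.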
Comment 2 Ferdinando Vivacqua 2023-07-19 16:58:18 UTC
(In reply to Takashi Iwai from comment #1)
> The bug you suggested in the upstream bug tracker should have hit already
> with 6.4 release, so it might be a different bug.
> 
> In anyway, please try pcie_aspm=force boot option with 6.4.3 kernel.  This
> should be a workaround for that bug.

As noted in the upstream bug, this problem is still present with 6.4.3, and pcie_aspm=force seems to cause a noticeable performance degradation.
Comment 3 Takashi Iwai 2023-07-20 06:04:28 UTC
Well, it's still unclear why 6.4.2 worked, then.  The buggy commit 2ab19de62d67e403105ba860971e5ff0d511ad15
    r8169: remove ASPM restrictions now that ASPM is disabled during NAPI poll
is already included in the 6.4 release.  So, 6.4.2 should have hit the same problem if that's the cause.

And, more puzzling is that there are really only a few changes between the 6.4.2 and 6.4.3 kernels.  Most of them are VM fixes and unrelated to the Realtek Ethernet driver.

I asked you to test with the pcie_aspm=force option to confirm whether the above is the cause.  It's of course not a solution per se.
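
To see which ASPM policy the kernel is actually applying during such a test, the active policy can be read from sysfs. A minimal Python sketch, assuming the policy is exposed at /sys/module/pcie_aspm/parameters/policy with the active entry shown in square brackets; the helper name active_aspm_policy is hypothetical:

```python
import re
from pathlib import Path

# Assumed sysfs location of the kernel-wide ASPM policy.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text: str):
    """Return the policy marked with [brackets], or None if none is marked."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    if ASPM_POLICY_PATH.exists():
        print("active ASPM policy:", active_aspm_policy(ASPM_POLICY_PATH.read_text()))
    else:
        print("ASPM policy file not found; the kernel may lack PCIe ASPM support")
```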
Comment 4 Ferdinando Vivacqua 2023-07-20 10:47:18 UTC
(In reply to Takashi Iwai from comment #3)
> Well, it's still doubtful why 6.4.2 worked, then.  The buggy commit
> 2ab19de62d67e403105ba860971e5ff0d511ad15
>     r8169: remove ASPM restrictions now that ASPM is disabled during NAPI
> poll
> is already included in 6.4 release.  So, 6.4.2 should have hit the same
> problem if that's the cause.
> 
> And, more puzzling is that there is really only few changes between 6.4.2
> and 6.4.3 kernels.  Most of them are only about the VM fixes, and irrelevant
> with the Realtek Ethernet driver.
> 
> I asked to test with pcie_aspm=force option for confirming whether the above
> is the cause or not.  It's of course no solution, per se.

Just to clarify, 6.4.2 doesn't work either.

I tried 6.4.3 with pcie_aspm=force, with an unexpected outcome.
I found no performance degradation, but the adapter stopped again after around 3 hours. I'm not sure whether the timing is relevant.
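
To measure the time to failure rather than estimate it, the link state can be polled until it changes. A minimal Python sketch, assuming the failure shows up as the interface's operstate in /sys/class/net leaving "up"; the helper names read_operstate and watch_link and the polling interval are hypothetical:

```python
import time
from pathlib import Path


def read_operstate(iface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the interface's operstate ("up", "down", ...) or "missing"."""
    state_file = Path(sysfs_root) / iface / "operstate"
    try:
        return state_file.read_text().strip()
    except OSError:
        # The interface directory or file does not exist or cannot be read.
        return "missing"


def watch_link(iface: str, interval_s: float = 10.0, sysfs_root: str = "/sys/class/net"):
    """Poll until the link is no longer "up"; return (elapsed seconds, state)."""
    start = time.monotonic()
    while True:
        state = read_operstate(iface, sysfs_root)
        if state != "up":
            return time.monotonic() - start, state
        time.sleep(interval_s)
```

Calling watch_link with the adapter's interface name blocks until the link drops and returns how long it stayed up, which makes runs with and without the boot option comparable.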

It is clear that something in the 6.4.x kernels broke the Ethernet adapter.
Comment 5 Takashi Iwai 2023-07-20 10:59:01 UTC
Ah, then I totally misunderstood the description.  The workaround is to go back to 6.3.x...

(In reply to Ferdinando Vivacqua from comment #4)
> Tried 6.4.3 with pcie_aspm=force: unexpected outcome.
> I've no found performance degradation, but it stopped again, after around 3
> hours. Not sure if the timing is something relevant or not.

Interesting.

To verify whether it's the same problem, I'm building a test kernel with a revert of the commit.  It's being built in the OBS home:tiwai:bsc1213491 repo.
Once the build finishes (it takes an hour or so), the package will be available at:
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1213491/standard/

Could you give it a try later?
Comment 6 Ferdinando Vivacqua 2023-07-20 12:34:37 UTC
(In reply to Takashi Iwai from comment #5)
> Ah, then I totally misunderstood the description.  The workaround is to go
> back to 6.3.x...
> 
> (In reply to Ferdinando Vivacqua from comment #4)
> > Tried 6.4.3 with pcie_aspm=force: unexpected outcome.
> > I've no found performance degradation, but it stopped again, after around 3
> > hours. Not sure if the timing is something relevant or not.
> 
> Interesting.
> 
> To verify whether it's the same problem, I'm building a test kernel with the
> revert of the commit.  It's being built in OBS home:tiwai:bsc1213491 repo.
> Once after the build finishes (takes an hour or so), the package will be
> available at:
>   http://download.opensuse.org/repositories/home:/tiwai:/bsc1213491/standard/
> 
> Could you give it a try later?

Is it the kernel-default-6.4.4-1.1.g903492f.x86_64.rpm package? I'm not able to boot it; it stops with the error: ..../efi/linux.c:168 you need to load the kernel first
Comment 7 Takashi Iwai 2023-07-20 12:37:22 UTC
If Secure Boot is enabled in your BIOS, turn it off and retest.
Comment 8 Ferdinando Vivacqua 2023-07-20 15:31:27 UTC
(In reply to Takashi Iwai from comment #7)
> If Secure Boot is enabled on your BIOS, turn it off and retest.

It seems to work! It has been running for more than 3 hours without problems.
Comment 9 Takashi Iwai 2023-07-20 15:34:55 UTC
OK, thanks, then this is indeed the same problem as in the upstream bugzilla.

Let's see whether there is any development upstream.  If nothing happens, I'll add a temporary revert patch as a regression workaround.
Comment 10 Ferdinando Vivacqua 2023-07-20 15:36:00 UTC
(In reply to Takashi Iwai from comment #9)
> OK, thanks, then this is indeed the same problem as in the upstream bugzilla.
> 
> Let's see whether there will be any development in the upstream.  If nothing
> happens, I'll put a temporary revert patch as a regression workaround.

Thank you!
Comment 11 Takashi Iwai 2023-07-23 10:16:37 UTC
Upstream took three fix commits for r8169, which have now landed in Linus' tree:
162d626f3013215b82b6514ca14f20932c7ccce5
  r8169: fix ASPM-related problem for chip version 42 and 43
cf2ffdea0839398cb0551762af7f5efb0a6e0fea
  r8169: revert 2ab19de62d67 ("r8169: remove ASPM restrictions now that ASPM is disabled during NAPI poll")
e31a9fedc7d8d80722b19628e66fcb5a36981780
  Revert "r8169: disable ASPM during NAPI poll"

I backported those to the TW stable branch.
Comment 12 Takashi Iwai 2023-07-23 10:18:48 UTC
... and another test kernel is being built in the OBS home:tiwai:bsc1213491-2 repo.
You can test it once the build finishes.
Comment 13 Luigi Tarenga 2023-07-23 20:15:48 UTC
Thanks, Takashi.
In the meantime I'm testing 6.4.1-1.g6fd2851-default, and after 10 hours of uptime all is fine. Tomorrow I will try to test the new build.
Comment 14 Luigi Tarenga 2023-07-24 06:19:52 UTC
Last night I still hit a problem with the first custom kernel. I have yet to test your second build. Here is the log:

Jul 24 01:28:53 alfred kernel: ------------[ cut here ]------------
Jul 24 01:28:53 alfred kernel: NETDEV WATCHDOG: eno1 (r8169): transmit queue 0 timed out 6437 ms
Jul 24 01:28:53 alfred kernel: WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x21e/0x230
Jul 24 01:28:53 alfred kernel: Modules linked in: ccm af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink qrtr msr ext4 nls_iso8859_1 nls_cp437 mbcache vfat jbd2 fat iwlmvm snd_hda_codec_hdmi snd_sof_pci_intel_icl snd_sof_intel_hda_common mac80211 snd_hda_codec_realtek soundwire_intel soundwire_cadence snd_sof_intel_hda_mlink snd_hda_codec_generic snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp ledtrig_audio snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core libarc4 snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus snd_soc_core snd_compress snd_pcm_dmaengine x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_intel snd_intel_dspcfg kvm_intel snd_intel_sdw_acpi snd_hda_codec spi_pxa2xx_platform dw_dmac spi_nor snd_hda_core ee1004 mei_pxp mei_hdcp mtd kvm intel_rapl_msr snd_hwdep iwlwifi snd_pcm btusb irqbypass processor_thermal_device_pci_legacy
Jul 24 01:28:53 alfred kernel:  btrtl snd_timer processor_thermal_device btbcm processor_thermal_rfim pcspkr btintel processor_thermal_mbox i2c_i801 r8169 snd btmtk processor_thermal_rapl wmi_bmof cfg80211 bluetooth soundcore intel_rapl_common spi_intel_pci i2c_smbus realtek int340x_thermal_zone spi_intel mei_me mdio_devres libphy intel_lpss_pci intel_lpss ecdh_generic joydev mei rfkill idma64 intel_soc_dts_iosf fan tiny_power_button thermal acpi_tad intel_pmc_core acpi_pad button fuse efi_pstore configfs dmi_sysfs ip_tables x_tables uas usb_storage hid_logitech_hidpp hid_logitech_dj hid_generic crct10dif_pclmul crc32_pclmul usbhid polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 i915 nvme xhci_pci xhci_pci_renesas xhci_hcd aesni_intel crypto_simd cryptd sdhci_pci cqhci wdat_wdt i2c_algo_bit sdhci drm_buddy nvme_core drm_display_helper usbcore mmc_core cec rc_core ttm video wmi pinctrl_jasperlake btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
Jul 24 01:28:53 alfred kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: G     U             6.4.1-1.g6fd2851-default #1 openSUSE Tumbleweed (unreleased) a74e0f0a6765b1b2b400108eb36c99233f07085b
Jul 24 01:28:53 alfred kernel: Hardware name: Intel(R) Client Systems NUC11ATKC4/NUC11ATBC4, BIOS ATJSLCPX.0039.2023.0221.1502 02/21/2023
Jul 24 01:28:53 alfred kernel: RIP: 0010:dev_watchdog+0x21e/0x230
Jul 24 01:28:53 alfred kernel: Code: ff ff ff 48 89 df c6 05 d5 85 fd 00 01 e8 9a 3e fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 90 90 8a b8 e8 12 1d 5f ff <0f> 0b e9 2d ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90
Jul 24 01:28:53 alfred kernel: RSP: 0018:ffffa568c017cea0 EFLAGS: 00010286
Jul 24 01:28:53 alfred kernel: RAX: 0000000000000000 RBX: ffff88fa8af00000 RCX: 000000000000083f
Jul 24 01:28:53 alfred kernel: RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
Jul 24 01:28:53 alfred kernel: RBP: ffff88fa8af004c8 R08: 0000000000000000 R09: ffffa568c017cd48
Jul 24 01:28:53 alfred kernel: R10: 0000000000000003 R11: ffffffffb8b58cc8 R12: ffff88fa81798000
Jul 24 01:28:53 alfred kernel: R13: ffff88fa8af0041c R14: 0000000000000000 R15: 0000000000001925
Jul 24 01:28:53 alfred kernel: FS:  0000000000000000(0000) GS:ffff88fdefe80000(0000) knlGS:0000000000000000
Jul 24 01:28:53 alfred kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 24 01:28:53 alfred kernel: CR2: 00007f983affdd58 CR3: 0000000438c36000 CR4: 0000000000350ee0
Jul 24 01:28:53 alfred kernel: Call Trace:
Jul 24 01:28:53 alfred kernel:  <IRQ>
Jul 24 01:28:53 alfred kernel:  ? dev_watchdog+0x21e/0x230
Jul 24 01:28:53 alfred kernel:  ? __warn+0x81/0x130
Jul 24 01:28:53 alfred kernel:  ? dev_watchdog+0x21e/0x230
Jul 24 01:28:53 alfred kernel:  ? report_bug+0x171/0x1a0
Jul 24 01:28:53 alfred kernel:  ? native_write_msr+0xa/0x30
Jul 24 01:28:53 alfred kernel:  ? handle_bug+0x3c/0x80
Jul 24 01:28:53 alfred kernel:  ? exc_invalid_op+0x17/0x70
Jul 24 01:28:53 alfred kernel:  ? asm_exc_invalid_op+0x1a/0x20
Jul 24 01:28:53 alfred kernel:  ? dev_watchdog+0x21e/0x230
Jul 24 01:28:53 alfred kernel:  ? __pfx_dev_watchdog+0x10/0x10
Jul 24 01:28:53 alfred kernel:  ? __pfx_dev_watchdog+0x10/0x10
Jul 24 01:28:53 alfred kernel:  call_timer_fn+0x24/0x130
Jul 24 01:28:53 alfred kernel:  __run_timers.part.0+0x1d8/0x280
Jul 24 01:28:53 alfred kernel:  ? __hrtimer_run_queues+0x121/0x2b0
Jul 24 01:28:53 alfred kernel:  ? ktime_get+0x39/0xa0
Jul 24 01:28:53 alfred kernel:  run_timer_softirq+0x2a/0x50
Jul 24 01:28:53 alfred kernel:  __do_softirq+0xc7/0x2a5
Jul 24 01:28:53 alfred kernel:  __irq_exit_rcu+0xae/0xe0
Jul 24 01:28:53 alfred kernel:  sysvec_apic_timer_interrupt+0x72/0x90
Jul 24 01:28:53 alfred kernel:  </IRQ>
Jul 24 01:28:53 alfred kernel:  <TASK>
Jul 24 01:28:53 alfred kernel:  asm_sysvec_apic_timer_interrupt+0x1a/0x20
Jul 24 01:28:53 alfred kernel: RIP: 0010:cpuidle_enter_state+0xcc/0x440
Jul 24 01:28:53 alfred kernel: Code: 1a 35 48 ff e8 d5 f1 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 03 42 47 ff 45 84 ff 0f 85 56 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
Jul 24 01:28:53 alfred kernel: RSP: 0018:ffffa568c012fe90 EFLAGS: 00000246
Jul 24 01:28:53 alfred kernel: RAX: ffff88fdefeba040 RBX: ffff88fdefec5700 RCX: 0000000000000000
Jul 24 01:28:53 alfred kernel: RDX: 0000000000000001 RSI: fffffff64d245137 RDI: 0000000000000000
Jul 24 01:28:53 alfred kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: 00000000401a41a4
Jul 24 01:28:53 alfred kernel: R10: ffff88fdefeb8a44 R11: 000000000000bc50 R12: ffffffffb8c25c40
Jul 24 01:28:53 alfred kernel: R13: 00002cef36771d18 R14: 0000000000000003 R15: 0000000000000000
Jul 24 01:28:53 alfred kernel:  cpuidle_enter+0x2d/0x40
Jul 24 01:28:53 alfred kernel:  do_idle+0x20d/0x270
Jul 24 01:28:53 alfred kernel:  cpu_startup_entry+0x1d/0x20
Jul 24 01:28:53 alfred kernel:  start_secondary+0x12e/0x150
Jul 24 01:28:53 alfred kernel:  secondary_startup_64_no_verify+0x10b/0x10b
Jul 24 01:28:53 alfred kernel:  </TASK>
Jul 24 01:28:53 alfred kernel: ---[ end trace 0000000000000000 ]---
Jul 24 01:28:55 alfred kernel: pcieport 0000:00:1c.7: Data Link Layer Link Active not set in 1000 msec
Jul 24 01:28:55 alfred kernel: r8169 0000:02:00.0 eno1: Can't reset secondary PCI bus, detach NIC
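
To find events like this in a longer journal without reading it by hand, the NETDEV WATCHDOG lines can be extracted with a regular expression. A minimal Python sketch, assuming the message format shown in the log above; the helper name find_tx_timeouts is hypothetical, and the millisecond field is treated as optional on the assumption that not every kernel version prints it:

```python
import re

# Matches lines such as:
#   NETDEV WATCHDOG: eno1 (r8169): transmit queue 0 timed out 6437 ms
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out(?: (?P<ms>\d+) ms)?"
)


def find_tx_timeouts(log_text: str):
    """Return one dict per NETDEV WATCHDOG line found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            events.append({
                "iface": match.group("iface"),
                "driver": match.group("driver"),
                "queue": int(match.group("queue")),
                # None when the line carries no millisecond value.
                "timeout_ms": int(match.group("ms")) if match.group("ms") else None,
            })
    return events
```

The function takes the log as a string, so it can be fed saved journal output or any other text capture of the kernel messages.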
Comment 16 Ferdinando Vivacqua 2023-07-28 18:51:48 UTC
(In reply to Takashi Iwai from comment #12)
> ... and another test kernel is being built in OBS home:tiwai:bsc1213491-2
> repo.
> You can test it later once after the build finishes.

Hi Takashi, sorry for being late.
Do you still need me to test this second custom kernel? From the upstream bug tracker, it seems that we need to wait for the 6.5 branch, right?
If so, can the revert patch be brought into the openSUSE TW kernel earlier?
Thank you!
Comment 17 Takashi Iwai 2023-07-29 07:11:48 UTC
All the revert and fix patches for r8169 have already been merged into the TW stable git branch, and they'll eventually be included in a later TW release.

So, could you instead confirm that the kernel in the OBS Kernel:stable repo works?
If the bug still isn't fixed there, we'll need to report it upstream.
Comment 18 Ferdinando Vivacqua 2023-07-29 07:26:05 UTC
(In reply to Takashi Iwai from comment #17)
> The all revert and fix patches for r8169 have been already merged in TW
> stable git branch, and it'll be eventually included in TW release later.
> 
> So, could you rather confirm that the kernel in OBS Kernel:stable repo works?
> If the bug isn't still fixed there, we'll need to report to the upstream.

OK, I'm going to test kernel-default-6.4.6-3.1.g74a8144.x86_64.rpm and let you know.
Thanks!
Comment 19 Ferdinando Vivacqua 2023-07-30 18:05:01 UTC
(In reply to Takashi Iwai from comment #17)
> The all revert and fix patches for r8169 have been already merged in TW
> stable git branch, and it'll be eventually included in TW release later.
> 
> So, could you rather confirm that the kernel in OBS Kernel:stable repo works?
> If the bug isn't still fixed there, we'll need to report to the upstream.

Hi! After more than 30 hours without any issue, I think your kernel-default-6.4.6-3.1.g74a8144.x86_64.rpm kernel is OK. It works!
Comment 20 Takashi Iwai 2023-07-31 06:26:09 UTC
OK, let's close now.