|
Bugzilla – Full Text Bug Listing |
| Summary: | Unable to boot kernels 6.5.2 or 6.5.3 unless acpi=noirq specified | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Stuart Rogers <stuart> |
| Component: | Kernel | Assignee: | Jiri Slaby <jslaby> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Major | ||
| Priority: | P5 - None | CC: | bhs_suse, jslaby, o-takashi, stuart |
| Version: | Current | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | openSUSE Tumbleweed | ||
| See Also: | https://bugzilla.kernel.org/show_bug.cgi?id=217994 | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
dmidecode and journal output
Journal output
hwinfo
lspci -vvnnxxx
lspci -tv
Journal output from booting 11.1 kernel
dmesg output from booting kernel 11.1
Section of video of kernel 14.1 booting
Photo of Firewire PCI-E card |
||
Created attachment 869568 [details]
Journal output
Just to add the /proc/cmdline is

BOOT_IMAGE=/boot/vmlinuz-6.5.3-1-default root=UUID=190442a2-32b3-47e7-bac7-a39e6841236f splash=silent resume=/dev/disk/by-id/nvme-KINGSTON_SA2000M8250G_50026B728266A552-part3 quiet acpi=noirq mitigations=auto

(In reply to Stuart Rogers from comment #2)
> Just to add the /proc/cmdline is
>
> BOOT_IMAGE=/boot/vmlinuz-6.5.3-1-default
> root=UUID=190442a2-32b3-47e7-bac7-a39e6841236f splash=silent
> resume=/dev/disk/by-id/nvme-KINGSTON_SA2000M8250G_50026B728266A552-part3
> quiet acpi=noirq mitigations=auto

If you pass neither quiet nor acpi=noirq, and pass loglevel=7 instead, can you actually see something?

Tried that loglevel=7 without quiet and acpi=noirq, but all I get is a load of messages flashing past on screen, too fast to see clearly. From memory it looks the same as when I boot without quiet and acpi=noirq. Then it just reboots.

On the next good boot with acpi=noirq I can see no messages anywhere relating to this failed boot, nothing in dmesg or the journal.

Where would messages be expected to be saved, if any?

(In reply to Stuart Rogers from comment #4)
> Tried that loglevel=7 without quiet and acpi-noirq but all I get is a load
> of messages flashing past on screen but too fast to see clearly. Looks from
> memory the same as when I do it without quiet and acpi. Then it just reboots.

Hm, we might need to enable CONFIG_BOOT_PRINTK_DELAY and add boot_delay=30 to the kernel command line. (It will wait 30 ms after each message printed.)

Or, according to the dmidecode dump, your motherboard is a B450-A PRO MAX, which is supposed to have:

Internal Connectors
...
* 1x Serial port connector

Maybe you can capture serial console output? console=ttyS0 would write there.

> On next good boot with acpi=noirq I can see no messages anywhere relating to
> this fail boot, nothing in dmesg or journal relating to the fail.
>
> Where would messages be expected to be saved if any?

Unfortunately nowhere, as the disks are apparently not up yet.
(In reply to Jiri Slaby from comment #5)
> (In reply to Stuart Rogers from comment #4)
> > Tried that loglevel=7 without quiet and acpi-noirq but all I get is a load
> > of messages flashing past on screen but too fast to see clearly. Looks from
> > memory the same as when I do it without quiet and acpi. Then it just reboots.
>
> Hm, we might need to enable CONFIG_BOOT_PRINTK_DELAY and add boot_delay=30
> to kernel commandline. (It will wait 30 ms after each message printed.)

Kernel with the config set is building at:
https://build.opensuse.org/project/monitor/home:jirislaby:stable-boot_delay

I downloaded kernel-default-6.5.4-3.1.x86_64.rpm when it finished and it boots OK when I use acpi=noirq. However, when I remove quiet and acpi=noirq and add boot_delay, it waits a long time before the messages start, but they still go through too quickly to really see, even when I used a delay of 100.

I was not sure what the devel and vdso kernels were for.

Is it worth trying the 6.6 kernel as mentioned in bug 1215328 to see if a vanilla kernel still fails?

(In reply to Stuart Rogers from comment #7)
> I downloaded kernel-default-6.5.4-3.1.x86_64.rpm when it finished and it
> boots OK when I use acpi=noirq, however when I remove quiet and acpi=noirq
> and add boot_delay it waits a ling time before the messages start

You might add earlycon=efifb to see the messages before the real console comes up.

> but they
> still go through too quickly to really see, even when I used a delay of 100.

You might need to pass lpj=12310136 too, as the default is 1 million and the calibration happens later in the boot cycle. Then boot_delay=100 should be better (12 times slower than without lpj=).

Maybe you can capture a video?

> I was not sure what the devel and vdso kernels were for.

You can ignore those.

> Is it worth tying the 6.6 kernel as mentioned in bug 1215328 to see if a
> vanilla kernel still fails?

You might try to confirm whether it helps or not.
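The "12 times slower" figure in the reply above is just the ratio of the suggested lpj= value to the 1 million default. A minimal sketch of that arithmetic follows; it assumes the pause per message scales linearly with loops-per-jiffy, and the function and constant names are illustrative only.

```python
# Sketch: how much longer each boot_delay pause becomes once lpj= is preset.
# Assumption: the busy-wait loop count scales linearly with loops-per-jiffy.

DEFAULT_LPJ_GUESS = 1_000_000  # "the default is 1 million" (from the reply above)
PRESET_LPJ = 12_310_136        # value suggested for the reporter's machine


def delay_scale(preset_lpj: int, default_lpj: int = DEFAULT_LPJ_GUESS) -> float:
    """Return how many times longer each delay is with the preset lpj."""
    if preset_lpj <= 0 or default_lpj <= 0:
        raise ValueError("lpj values must be positive")
    return preset_lpj / default_lpj


scale = delay_scale(PRESET_LPJ)
print(f"boot_delay pauses are about {scale:.1f}x longer with lpj={PRESET_LPJ}")
```

With the values from the thread the ratio comes out at roughly 12.3, which matches the rounded "12 times" quoted above.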
(In reply to Jiri Slaby from comment #8)
> > but they
> > still go through too quickly to really see, even when I used a delay of 100.

Oh, wait. If the reboot happens *after* init is started, boot_delay has no effect. Please pass sysctl.kernel.printk_delay=100 (or 500 or 1000) instead (no need for lpj and boot_delay then).

(In reply to Jiri Slaby from comment #9)
> (In reply to Jiri Slaby from comment #8)
> > > but they
> > > still go through too quickly to really see, even when I used a delay of 100.
>
> Oh, wait. If the reboot happens *after* init is started, boot_delay has no
> effect. Please pass sysctl.kernel.printk_delay=100 (or 500 or 1000) instead
> (no need for lpj and boot_delay then).

One more thing: add also initcall_debug parameter. That will dump which driver is being initialized at which point.

(In reply to Jiri Slaby from comment #10)
> (In reply to Jiri Slaby from comment #9)
> > (In reply to Jiri Slaby from comment #8)
> > > > but they
> > > > still go through too quickly to really see, even when I used a delay of 100.
> >
> > Oh, wait. If the reboot happens *after* init is started, boot_delay has no
> > effect. Please pass sysctl.kernel.printk_delay=100 (or 500 or 1000) instead
> > (no need for lpj and boot_delay then).
>
> One more thing: add also initcall_debug parameter. That will dump which
> driver is being initialized at which point.

Sorry for spamming, but initcall_debug in turn needs ignore_loglevel as klog.service lowers the loglevel, so you need to add all three:

sysctl.kernel.printk_delay=100 initcall_debug ignore_loglevel

Well by the time I read your update I had already removed your kernel, so I tried those options with the 6.5.3 kernel but the messages still flew by too fast.

Which is best now to try the 6.6 kernel or reload your 6.5.4 kernel and try those options with that?
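As an illustration of the change being asked for, the sketch below takes the /proc/cmdline posted in comment #2, drops quiet and acpi=noirq, and appends the three debugging parameters. It only manipulates a string; the helper name debug_cmdline is made up for this example, and how the result gets into the boot loader entry is left to the reader.

```python
# Sketch: derive a debugging kernel command line from the normal one.
# The helper name is hypothetical; the parameters come from the comments above.

DROP = {"quiet", "acpi=noirq"}
ADD = ["sysctl.kernel.printk_delay=100", "initcall_debug", "ignore_loglevel"]


def debug_cmdline(cmdline: str) -> str:
    """Remove the parameters that hide the failure and add the debug ones."""
    kept = [tok for tok in cmdline.split() if tok not in DROP]
    for param in ADD:
        if param not in kept:  # avoid duplicates if already present
            kept.append(param)
    return " ".join(kept)


original = (
    "BOOT_IMAGE=/boot/vmlinuz-6.5.3-1-default "
    "root=UUID=190442a2-32b3-47e7-bac7-a39e6841236f splash=silent "
    "resume=/dev/disk/by-id/nvme-KINGSTON_SA2000M8250G_50026B728266A552-part3 "
    "quiet acpi=noirq mitigations=auto"
)
print(debug_cmdline(original))
```

The function is idempotent, so running it on an already adjusted command line leaves it unchanged.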
(In reply to Stuart Rogers from comment #12)
> Well by the time I read your update I had already removed your kernel, so I
> tried those options with the 6.5.3 kernel but the messages still flew by too
> fast.

Hm, other than boot_delay should work fine even with stock kernel. It also makes me wonder why the messages are still printed fast.

> Which is best now to try the 6.6 kernel or reload your 6.5.4 kernel and try
> those options with that?

You can install both (using "rpm -i" to keep all of the kernels) and test them one after each other in a row. So I would boot 6.6 first to see if it is OK. Then I would boot my kernel with all:

sysctl.kernel.printk_delay=100 initcall_debug ignore_loglevel lpj=12310136 boot_delay=100 earlycon=efifb

(You can add this directly to /boot/grub2/grub.cfg to my kernel, it will be removed automatically by uninstalling the kernel. Note that it is autogenerated/overwritten upon each kernel installation/removal.)

Well I just tried the 6.6 kernel with the sysctl etc additions and removing acpi & quiet but it still fails to boot exactly like the others. The messages are still too fast for the naked eye so I might try a video with my phone!

Anyway 6.6 boots OK with acpi=noirq.

Is it still worth trying your 6.5.4?

I have managed to video the messages on screen with all those parameters added and on viewing it I can see no obvious errors. The last message which shows prior to it rebooting relates to the firewire interface on the motherboard; it starts with initcall fw_ohci_int and has [firewire_ohci] as part of it. I'll try to extract the full message in the morning as it is a bit blurred!

Anyway the screen then goes blank and it reboots to the hardware messages about entering BIOS etc and then goes to the standard GRUB menu.

(In reply to Stuart Rogers from comment #15)
> I have managed to video the messages on screen with all those parameters
> added and on viewing it I can see no obvious errors.
> The last message which
> shows prior to it rebooting relates to the firewire interface on the
> motherboard it starts with initcall fw_ohci_int and has [firewire_ohci] as

It's a shot in the dark, but if it is the last, you can try to add:

module_blacklist=firewire_ohci

Note that modules are loaded in parallel. So it can be any of the last initcalls which has not dumped "finished" yet.

Can you upload the video somewhere (perhaps attach here if not too large)?

Well adding that module_blacklist=firewire_ohci works and the system boots up fine. This was using the standard kernel 6.5.4.1-1.

Where do we go from here? I do occasionally use firewire on this desktop as I have an old Camcorder which has that interface.

(In reply to Stuart Rogers from comment #17)
> Well adding that module_blacklist=firewire_ohci works and the system boots
> up fine. This was using the standard kernel 6.5.4.1-1.

Perfect! We narrowed the problem a heap.

> Where do we go from here? I do occasionally use firewire on this desktop as
> I have an old Camcorder which has that interface.

Don't worry, we need to find out the root cause. There are ~ 10 commits in firewire code between 6.4 and 6.5. Let me ask upstream devs what they think (I do not see anything wrong on the commits on first glance).

BTW this commit from 6.5:

commit 06f45435d985d60d7d2fe2424fbb9909d177a63d
Author: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Date: Sun Jun 4 16:02:55 2023 +0900

    firewire: core: obsolete usage of GFP_ATOMIC at building node tree

was reverted in 6.6-rc2 and 6.5.5. Have you booted any of those yet?

I tried kernel-default-6.6~rc2-1.1.g8a1f7fd.x86_64.rpm but it failed as 6.5.4 does without the blacklist.

Could you try the kernel from:
https://build.opensuse.org/project/monitor/home:jirislaby:stable-boot_delay
again (once it builds)? It reverts all the 6.5 firewire commits.

Will do, I also just tried 6.6 RC3 on my test system and that still only works with the blacklist added.
Just tested your latest build 6.5.5 and it works fine without the blacklist.

Hi,

I'm the current maintainer of the Linux FireWire subsystem. I learned of your issue from a message from Jiri Slaby[1], and apologize for the inconvenience.

This morning I installed openSUSE Tumbleweed into my virtual machine (x86_64) on a host machine (AMD Ryzen 5 2400G/Gigabyte AX370-Gaming 5, BIOS F51h). The 1394 OHCI hardware is bound to the virtual machine by vfio-pci in the host OS (Ubuntu 23.04 amd64). As a result, I have no issue in the virtual machine.

(In guest system)
~> cat /etc/os-release
NAME="openSUSE Tumbleweed"
# VERSION="20230922"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20230922"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20230922"
BUG_REPORT_URL="https://bugzilla.opensuse.org"
SUPPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo-Tumbleweed"

~> cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.5.4-1-default root=UUID=ca1f9882-855b-4588-955d-8adf468e4fbb splash=silent mitigations=auto quiet security=apparmor

~> uname -r
6.5.4-1-default

~> sudo lspci -vvvv
08:01.0 FireWire (IEEE 1394): Texas Instruments XIO2213A/B/XIO2221 IEEE-1394b OHCI Controller [Cheetah Express] (rev 01) (prog-if 10 [OHCI])
    Subsystem: Device 3412:7856
    Physical Slot: 1
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 32 (500ns min, 1000ns max), Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 23
    Region 0: Memory at c1804000 (32-bit, non-prefetchable) [size=2K]
    Region 1: Memory at c1800000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [44] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Kernel driver in use: firewire_ohci
    Kernel modules: firewire_ohci

~> LC_ALL=C sudo -E journalctl -k | grep ohci
Sep 27 08:11:06 localhost kernel: firewire_ohci 0000:08:01.0: added OHCI v1.10 device as card 0, 8 IR + 8 IT contexts, quirks 0x2

~> udevadm info -e
...
P: /devices/pci0000:00/0000:00:02.6/0000:07:00.0/0000:08:01.0
M: 0000:08:01.0
R: 0
U: pci
V: firewire_ohci
E: DEVPATH=/devices/pci0000:00/0000:00:02.6/0000:07:00.0/0000:08:01.0
E: SUBSYSTEM=pci
E: DRIVER=firewire_ohci
E: PCI_CLASS=C0010
E: PCI_ID=104C:823F
E: PCI_SUBSYS_ID=3412:7856
E: PCI_SLOT_NAME=0000:08:01.0
E: MODALIAS=pci:v0000104Cd0000823Fsv00003412sd00007856bc0Csc00i10
E: USEC_INITIALIZED=4075041
E: ID_PCI_CLASS_FROM_DATABASE=Serial bus controller
E: ID_PCI_SUBCLASS_FROM_DATABASE=FireWire (IEEE 1394)
E: ID_PCI_INTERFACE_FROM_DATABASE=OHCI
E: ID_VENDOR_FROM_DATABASE=Texas Instruments
E: ID_MODEL_FROM_DATABASE=XIO2213A/B/XIO2221 IEEE-1394b OHCI Controller [Cheetah Express]
E: ID_PATH=pci-0000:08:01.0
E: ID_PATH_TAG=pci-0000_08_01_0

P: /devices/pci0000:00/0000:00:02.6/0000:07:00.0/0000:08:01.0/fw0
M: fw0
R: 0
U: firewire
D: c 243:0
N: fw0
L: 0
E: DEVPATH=/devices/pci0000:00/0000:00:02.6/0000:07:00.0/0000:08:01.0/fw0
E: SUBSYSTEM=firewire
E: DEVNAME=/dev/fw0
E: MAJOR=243
E: MINOR=0
...

I think the 1394 OHCI hardware in question seems to bring the issue, as I mentioned in reply to the message[2]. My hardware integrates a PCIe-PCI bridge as well as a PCI-1394-bus bridge (OHCI), like:

(in host system)
~> sudo lspci -v
...
01:00.0 PCI bridge: Texas Instruments XIO2213A/B/XIO2221 PCI Express to PCI Bridge [Cheetah Express] (rev 01) (prog-if 00 [Normal decode])
    Subsystem: Device 3412:7856
    Flags: bus master, fast devsel, latency 0, IOMMU group 8
    Memory at fce00000 (32-bit, non-prefetchable) [size=4K]
    Bus: primary=01, secondary=02, subordinate=02, sec-latency=32
    I/O behind bridge: [disabled] [32-bit]
    Memory behind bridge: fcd00000-fcdfffff [size=1M] [32-bit]
    Prefetchable memory behind bridge: [disabled] [64-bit]
    Capabilities: [50] Power Management version 3
    Capabilities: [60] MSI: Enable- Count=1/16 Maskable- 64bit+
    Capabilities: [80] Subsystem: Device 3412:7856
    Capabilities: [90] Express PCI-Express to PCI/PCI-X Bridge, MSI 00
    Capabilities: [100] Advanced Error Reporting

02:00.0 FireWire (IEEE 1394): Texas Instruments XIO2213A/B/XIO2221 IEEE-1394b OHCI Controller [Cheetah Express] (rev 01) (prog-if 10 [OHCI])
    Subsystem: Device 3412:7856
    Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 119, IOMMU group 8
    Memory at fcd04000 (32-bit, non-prefetchable) [size=2K]
    Memory at fcd00000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [44] Power Management version 3
    Kernel driver in use: vfio-pci
    Kernel modules: firewire_ohci

If the hardware in question has a different design related to the bus bridge, that seems to be a hint of the issue. If not, I'm going to investigate the issue without the virtual environment.

[1] https://lore.kernel.org/lkml/ZRLFP3UAX288JgAK@krava/
[2] https://lore.kernel.org/lkml/20230926140922.GA6538@workstation.local/

Regards

@Takashi san, thanks for jumping in.

@Stuart: could you retest the kernel once it builds?
Now, it contains only these reverts:

0001-Revert-firewire-net-fix-use-after-free-in-fwnet_fini.patch
0002-Revert-firewire-ohci-release-buffer-for-AR-req-resp-.patch
0003-Revert-firewire-ohci-use-devres-for-content-of-confi.patch
0004-Revert-firewire-ohci-use-devres-for-IT-IR-AT-receive.patch
0005-Revert-firewire-ohci-use-devres-for-list-of-isochron.patch
0006-Revert-firewire-ohci-use-devres-for-requested-IRQ.patch
0007-Revert-firewire-ohci-use-devres-for-misc-DMA-buffer.patch
0008-Revert-firewire-ohci-use-devres-for-MMIO-region-mapp.patch
0009-Revert-firewire-ohci-use-devres-for-PCI-related-reso.patch
0010-Revert-firewire-ohci-use-devres-for-memory-object-of.patch
0011-Revert-firewire-fix-warnings-to-generate-UAPI-docume.patch

Also, can you upload here outputs of below commands from a working kernel?

hwinfo
lspci -vvnnxxx
lspci -tv

Created attachment 869781 [details]
hwinfo
Created attachment 869782 [details]
lspci -vvnnxxx
Created attachment 869783 [details]
lspci -tv
Tested kernel-default-6.5.5-7.1.x86_64.rpm this morning and it fails to boot if I remove the blacklist. The command outputs are from a working kernel and are added as attachments.

(In reply to Stuart Rogers from comment #29)
> Tested kernel-default-6.5.5-7.1.x86_64.rpm this morning and it fails to boot
> if I remove the blacklist. The command outputs are from a working kernel and
> are added as attachments.

Huh, that's sort of unexpected. So it is one of cdev patches, I reverted 6 more:

0012-Revert-firewire-fix-build-failure-due-to-missing-mod.patch
0013-Revert-firewire-cdev-implement-new-event-relevant-to.patch
0014-Revert-firewire-cdev-add-new-event-to-notify-phy-pac.patch
0015-Revert-firewire-cdev-code-refactoring-to-dispatch-ev.patch
0016-Revert-firewire-cdev-implement-new-event-to-notify-r.patch
0017-Revert-firewire-cdev-add-new-event-to-notify-respons.patch

Could you test once built?

Just tested kernel-default-6.5.5-8.1.x86_64.rpm and it fails to boot with no blacklist, boots OK if I leave blacklist in.

I note that the hardware is the combination of ASM1083/1085 and VT6306/7/8.

I just went to look this morning and found kernel-default-6.5.5-8.2.x86_64.rpm so I tested that as well just in case but it still fails if I remove the blacklist.

As far as I tested the Linux FireWire stack on an actual machine, it works well.

https://lore.kernel.org/lkml/0ed4012a-83a7-4849-92c4-87a86e1bbb84@app.fastmail.com/

As supplements:

-> journalctl -k
kernel: smpboot: CPU0: AMD Ryzen 5 2400G with Radeon Vega Graphics (family: 0x17, model: 0x11, stepping: 0x0)
...
kernel: DMI: Gigabyte Technology Co., Ltd. AX370-Gaming 5/AX370-Gaming 5, BIOS F51h 02/09/2023

My issue is that any kernel after 6.4.12 fails unless I blacklist firewire. Now at some point in kernel 6.5.x a fix was applied which stopped this working. So I need help to determine what the change was and what in that change was causing it to fail so completely that the system rebooted.
If I can get some assistance to get that far and it turns out to be an issue with my hardware/BIOS I can take it up with MSI who make my motherboard. I stress that up to and including 6.4.12 I had no issues with it at all.

I do not have the information or experience to move this further bisecting the kernel without some help from someone with the relevant knowledge of the fixes etc.

Checked again this evening and found kernel-default-6.5.5-8.4.x86_64.rpm which I thought I'd try, still no go with blacklist removed sadly.

I really appreciate the help I'm getting here.

(In reply to Stuart Rogers from comment #36)
> Checked again this evening and found kernel-default-6.5.5-8.4.x86_64.rpm
> which I thought I'd try, still no go with blacklist removed sadly.

OK, so it still fails, I enabled more reverts:

0018-Revert-firewire-cdev-code-refactoring-to-operate-eve.patch
0019-Revert-firewire-core-implement-variations-to-send-re.patch
0020-Revert-firewire-core-use-union-for-callback-of-trans.patch

These remain to test after this step (if it still fails):

#0021-Revert-firewire-cdev-implement-new-event-to-notify-r.patch
#0022-Revert-firewire-cdev-add-new-event-to-notify-request.patch
#0023-Revert-firewire-cdev-add-new-version-of-ABI-to-notif.patch
#0024-Revert-firewire-add-KUnit-test-to-check-layout-of-UA.patch

OK success this time it booted fine without the blacklist using kernel-default-6.5.5-9.1.x86_64.rpm, so I'm guessing one of these last reverts is the issue.

In that case, it's one of the previous:

0018-Revert-firewire-cdev-code-refactoring-to-operate-eve.patch
0019-Revert-firewire-core-implement-variations-to-send-re.patch
0020-Revert-firewire-core-use-union-for-callback-of-trans.patch

I disabled the last now.

Once it has built I'll test it later today as I have to go out now.

So kernel-default-6.5.5-9.1.x86_64.rpm earlier today worked fine without the blacklist.

Now kernel-default-6.5.5-10.1.x86_64.rpm fails to boot without the blacklist.
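The narrowing done through the test kernels above (reverts 0001 to 0020 applied: boots; reverts 0001 to 0019 applied: fails) amounts to finding the smallest prefix of the revert series that yields a bootable kernel. The sketch below shows the same idea as a binary search. It is not the procedure actually used in this bug, which stepped through batches by hand; it assumes a single culprit and that adding more reverts never breaks a working kernel, and simulated_boot merely stands in for building and booting a test kernel.

```python
# Sketch: locate the revert that makes the kernel bootable, by prefix length.
# Assumptions: one culprit; results are monotone in the number of reverts.
from typing import Callable


def smallest_fixing_prefix(total: int, boots_ok: Callable[[int], bool]) -> int:
    """Return the smallest k such that reverting patches 1..k boots.

    boots_ok(k) must report whether a kernel with reverts 1..k applied boots,
    and boots_ok(total) is assumed to be True.
    """
    lo, hi = 1, total
    while lo < hi:
        mid = (lo + hi) // 2
        if boots_ok(mid):
            hi = mid        # culprit is at or before mid
        else:
            lo = mid + 1    # culprit is after mid
    return lo


# Stand-in for "build the kernel, boot it without the blacklist, report result".
tested = []


def simulated_boot(k: int) -> bool:
    tested.append(k)
    return k >= 20  # matches the outcome reported in the thread


culprit = smallest_fixing_prefix(24, simulated_boot)
print(f"revert number {culprit} is the one that matters, after {len(tested)} test boots")
```

Each probe costs one kernel build and one reboot of the affected machine, which is why keeping the number of test boots low matters for a reporter who cannot build kernels locally.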
So it is caused by this commit (if I made no mistake):

commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb
Author: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Date: Tue May 30 08:12:40 2023 +0900

    firewire: core: use union for callback of transaction completion

But I fail to see the cause. Takashi?

(In reply to Jiri Slaby from comment #42)
> So it is caused by this commit (if I made no mistake):
> commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb
> Author: Takashi Sakamoto <o-takashi@sakamocchi.jp>
> Date: Tue May 30 08:12:40 2023 +0900
>
> firewire: core: use union for callback of transaction completion
>
> But I fail to see the cause. Takashi?

I applied a debug patch. Could you test?

Well I just tested the 11.1 kernel with all the extra parameters to slow down booting so I could watch the messages. Still went too fast for me but it did boot to desktop eventually so I have captured the journal and dmesg output which I will upload.

Created attachment 869871 [details]
Journal output from booting 11.1 kernel
Created attachment 869872 [details]
dmesg output from booting kernel 11.1
So one of those get_cycle_time() triggers the reboot. Let's dump the users and add some delay. It's there building...

Initial test boot of 14.1 kernel did not boot, so tried again with all the delay and other added parameters and it still did not boot, loads of messages but I blinked and missed the final ones. Anyway I'll try again later with phone ready to capture the messages and see what I find.

Managed to video the messages this time and around 4:11:16 into the video you can see the message saying the firewire_ohci was added but then a whole bunch more messages before it gives up and reboots. I'll upload the video in case anything can be seen in it of more value.

Video as is is too large to upload and I'm not sure I have anywhere I can upload it to, it's 1.3 GB. Is there anything I should look out for in the messages which might help and perhaps I can reduce the video to just the relevant part to upload?

Created attachment 869879 [details]
Section of video of kernel 14.1 booting
I've had a go at reducing the video to where it says it is booting paravirtualized kernel up to the point it reboots. Hope this might give what you need.

I'm out of ideas (I will communicate with Takashi). I thought it would be the first get_cycle_time() to crash. But apparently, there are several calls to it and it goes on. Until "something" happens.

One final check from me -- I let get_cycle_time() always dump a single line (not a stack trace as before) and return 0 without accessing the timestamp reg. Could you check this really avoids the problem?

Ah, I received an e-mail from Takashi in the meantime:
https://lore.kernel.org/all/20231004002407.GA48535@workstation.local/

Stating:
=====
> ... it looks
> to be an issue specific to the reporter's 1394 OHCI hardware. I suspect
> a quirk specific to it related to accessing to CYCLE_TIME register in
> early time after powering on. It is the reason that I can regenerate the
> issue in my set of hardware.

I suppose so. (I believe you wanted to write "cannot" in there.)

> Would I ask you to request the reporter to inform the detail of
> hardware? If possible, let the reporter open PC box and take some picture
> of the hardware so that we can identify the ICs on the hardware?
>
> Via pci.ids, we can see both 'ASM1083/1085' and 'VT6306/7/8' are used,
> while I need to identify the IC to purchase an alternative so that I can
> regenerate the issue.

@Stuart: are you willing to open the box?

Yes that's no problem as it is a desktop I assembled myself so no warranty worries. It might be a short while before I get the chance so will update when done with photos.

OK so I opened the PC case and to be honest I'd forgotten this was a PCI-E add-on firewire card, anyway I have photographed it so you can see the two chips. It was a purchase back in 2020 and was described as PCI-e 1X IEEE 1394A 4 Port (3+1) Firewire Card Adapter. I will upload the photo.

Created attachment 869905 [details]
Photo of Firewire PCI-E card
I discovered two new kernels 16.1 and 17.1 so decided to test them to see what happened. Both booted successfully to the desktop without the blacklist for firewire.

(In reply to Stuart Rogers from comment #58)
> I discovered two new kernels 16.1 and 17.1 so decided to test them to see
> what happened. Both booted successfully to the desktop without the blacklist
> for firewire.

Great, that confirms the reads from the timestamp register cause the reboot. Weird, but Takashi presumed that. I hope he will come up with something.

You can keep kernel 6.4.* locked if you need to use firewire, so that it's not uninstalled. (Until we have a fix/quirk available for 6.5.)

I note that an issue has been filed at kernel.org and I added a comment to it.

* https://bugzilla.kernel.org/show_bug.cgi?id=217994#c5

I'm still investigating, while currently I think the issue relates to some hardware quirk in Asmedia ASM1083 and ASM1085.

I just replaced my Firewire card with one which does not have the Asmedia chip on it and the system boots perfectly, this card has a VIA chip. I can test any fix easily by replacing my old card in the PC when required.

I can confirm Mr. Rogers' finding. Removing a firewire card visually identical to comment 57, with matching VIA and Asmedia chips, allows my MSI motherboard with AMD 3600 CPU to boot current tumbleweed for the first time since kernel 6.4.12-1.

So there is an upstream patch available:
https://lore.kernel.org/all/20240102110150.244475-1-o-takashi@sakamocchi.jp/

Do you still have the HW to test the above?

Yes I still have my old firewire card which causes the issue. My problem is that I am unfamiliar with the methods of patching having never done this or compiled a kernel. If there is a kernel to test then I can certainly do that.

I've downloaded it and installed OK. Rebooted with new card to make sure it runs OK which it does. Next I will install the previous card to see if it fixes the issue. May take a few hours before I can do that.
I will update again once tested.

Pushed the patch to the stable branch.

Hi,
The change for 1394 OHCI driver, aimed at suppressing the unexpected
system reboot in AMD Ryzen machine[1], has been merged into Linux kernel
v6.7[2]. It has also been applied to the following releases of stable and
longterm kernels.
* 6.6.11[3]
* 6.1.72[4]
* 5.15.147[5]
* 5.10.208[6]
* 5.4.267[7]
* 4.19.305[8]
* 4.14.336[9]
Once the downstream distribution project provides the corresponding kernel
packages, you should no longer encounter the unexpected system reboot.
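For readers who want to check a running kernel against the list above, a minimal sketch follows. It compares an upstream-style release string (as printed by uname -r) with the versions named in this message. The function name is made up for this example, and a result of None only means the series is not covered by the list: distribution kernels may carry the patch separately, as happened in this bug.

```python
# Sketch: does an upstream kernel release contain the fix, per the list above?
# Assumption: the release string starts with MAJOR.MINOR[.PATCH].
import re
from typing import Optional

# First fixed release per stable/longterm series, taken from this message.
FIXED_IN = {
    (6, 6): 11, (6, 1): 72, (5, 15): 147, (5, 10): 208,
    (5, 4): 267, (4, 19): 305, (4, 14): 336,
}


def upstream_has_fix(release: str) -> Optional[bool]:
    """Return True/False for listed series, None when the list does not say."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    if (major, minor) >= (6, 7):      # merged into v6.7
        return True
    if (major, minor) in FIXED_IN:
        return patch >= FIXED_IN[(major, minor)]
    return None                        # series not covered by the list


for rel in ("6.6.11", "6.6.10", "6.7.0-1-default", "6.5.4-1-default"):
    print(rel, upstream_has_fix(rel))
```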
Note that the following combination of hardware is not necessarily suitable,
depending on your use case:
* Any type of AMD Ryzen machine
* 1394 OHCI hardware consists of:
* Asmedia ASM1083/1085
* VIA VT6306/6307/6308
When working with time-aware protocol, such as audio sample processing, it
is advisable to avoid the combination. The change accompanies a functional
limitation that the software stack does not provide precise hardware time
in this case.
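A quick way to see whether a machine has the chip combination named above is to look for both device names in lspci output. The sketch below does only that string check: it does not verify that the CPU is an AMD Ryzen, the function name is made up, and the sample lines are illustrative text written for this example, not taken from the reporter's attachments.

```python
# Sketch: detect the ASM1083/1085 + VT6306/7/8 combination in lspci text.
# The device name fragments are the pci.ids names quoted earlier in the thread.

BRIDGE_NAME = "ASM1083/1085"
OHCI_NAME = "VT6306/7/8"


def has_affected_combo(lspci_text: str) -> bool:
    """Return True when both the Asmedia bridge and the VIA OHCI chip appear."""
    return BRIDGE_NAME in lspci_text and OHCI_NAME in lspci_text


# Illustrative sample only (bus addresses and revisions are made up).
sample = """\
25:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge
26:00.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 IEEE 1394 OHCI Controller
"""
print(has_affected_combo(sample))
```

On a real system the text would come from running lspci and passing its output to the function.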
If you choose to continue using AMD Ryzen machine, the recommendation is
to replace the 1394 OHCI hardware with another one. Conversely, if you
choose to continue using the 1394 OHCI hardware, the recommendation is to
use the machine provided by vendors other than AMD.
Thanks for your report and long patience.
[1] https://git.kernel.org/torvalds/linux/c/ac9184fbb847
[2] https://lore.kernel.org/lkml/CAHk-=widprp4XoHUcsDe7e16YZjLYJWra-dK0hE1MnfPMf6C3Q@mail.gmail.com/
[3] https://lore.kernel.org/lkml/2024011058-sheep-thrower-d2f8@gregkh/
[4] https://lore.kernel.org/lkml/2024011052-unsightly-bronze-e628@gregkh/
[5] https://lore.kernel.org/lkml/2024011541-defective-scuff-c55e@gregkh/
[6] https://lore.kernel.org/lkml/2024011532-lustiness-hybrid-fc72@gregkh/
[7] https://lore.kernel.org/lkml/2024011519-mating-tag-1f62@gregkh/
[8] https://lore.kernel.org/lkml/2024011508-shakiness-resonant-f15e@gregkh/
[9] https://lore.kernel.org/lkml/2024011046-ecology-tiptoeing-ce50@gregkh/
Thanks
Takashi Sakamoto
|
Created attachment 869567 [details]
dmidecode and journal output

On doing a zypper dup the other day it installed kernel 6.5.2 and immediately the system failed to boot: the grub menu appeared, and when the timeout expired the display went black for a while and the system then rebooted to the grub menu again.

I was able to switch back to kernel 6.4.12 and my system booted normally, as it has done for months on kernel 6.4.x.

After trying nomodeset to no avail I tried acpi=noirq on the grub linux command and my system would then boot OK. I also tried acpi_enforce_resources=lax but again the system failed to boot. So I'm left having to run with acpi=noirq to get a working system.

The CPU is an AMD Ryzen 5 3400G with Radeon Vega Graphics on an MSI B450-A PRO MAX motherboard with the latest available BIOS from July 2023.

I have run dmidecode and journalctl -b and captured the data which is appended here.