Bug 1215436

Summary: Unable to boot kernels 6.5.2 or 6.5.3 unless acpi=noirq specified
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Stuart Rogers <stuart>
Component: Kernel
Assignee: Jiri Slaby <jslaby>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
CC: bhs_suse, jslaby, o-takashi, stuart
Version: Current
Target Milestone: ---
Hardware: x86-64
OS: openSUSE Tumbleweed
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=217994
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: dmidecode and journal output
Journal output
hwinfo
lspci -vvnnxxx
lspci -tv
Journal output from booting 11.1 kernel
dmesg output from booting kernel 11.1
Section of video of kernel 14.1 booting
Photo of Firewire PCI-E card

Description Stuart Rogers 2023-09-18 09:50:30 UTC
Created attachment 869567 [details]
dmidecode and journal output

On doing a zypper dup the other day it installed kernel 6.5.2 and immediately the system failed to boot: the grub menu appeared, and when the timeout expired the display went black for a while and the system then rebooted to the grub menu again. I was able to switch back to kernel 6.4.12 and my system booted normally, as it has done for months on kernel 6.4.x. After trying nomodeset to no avail I tried acpi=noirq on the grub linux command line and my system would then boot OK. I also tried acpi_enforce_resources=lax but again the system failed to boot. So I'm left having to run with acpi=noirq to get a working system.

The CPU is an AMD Ryzen 5 3400G with Radeon Vega Graphics on an MSI B450-A PRO MAX motherboard with the latest available BIOS from July 2023. I have run dmidecode and journalctl -b and captured the data, which is attached here.
Comment 1 Stuart Rogers 2023-09-18 09:52:17 UTC
Created attachment 869568 [details]
Journal output
Comment 2 Stuart Rogers 2023-09-18 09:57:13 UTC
Just to add the /proc/cmdline is

BOOT_IMAGE=/boot/vmlinuz-6.5.3-1-default root=UUID=190442a2-32b3-47e7-bac7-a39e6841236f splash=silent resume=/dev/disk/by-id/nvme-KINGSTON_SA2000M8250G_50026B728266A552-part3 quiet acpi=noirq mitigations=auto
Comment 3 Jiri Slaby 2023-09-20 07:55:03 UTC
(In reply to Stuart Rogers from comment #2)
> Just to add the /proc/cmdline is
> 
> BOOT_IMAGE=/boot/vmlinuz-6.5.3-1-default
> root=UUID=190442a2-32b3-47e7-bac7-a39e6841236f splash=silent
> resume=/dev/disk/by-id/nvme-KINGSTON_SA2000M8250G_50026B728266A552-part3
> quiet acpi=noirq mitigations=auto

If you pass neither quiet, nor acpi=noirq and instead, you pass loglevel=7, can you actually see something?
Comment 4 Stuart Rogers 2023-09-20 09:43:50 UTC
Tried that loglevel=7 without quiet and acpi=noirq but all I get is a load of messages flashing past on screen, too fast to see clearly. From memory it looks the same as when I do it without quiet and acpi. Then it just reboots.

On the next good boot with acpi=noirq I can see no messages anywhere relating to this failed boot, nothing in dmesg or journal relating to the failure.

Where would messages be expected to be saved if any?
Comment 5 Jiri Slaby 2023-09-20 10:01:56 UTC
(In reply to Stuart Rogers from comment #4)
> Tried that loglevel=7 without quiet and acpi=noirq but all I get is a load
> of messages flashing past on screen but too fast to see clearly. Looks from
> memory the same as when I do it without quiet and acpi. Then it just reboots.

Hm, we might need to enable CONFIG_BOOT_PRINTK_DELAY and add boot_delay=30 to kernel commandline. (It will wait 30 ms after each message printed.)

Or, according to the dmidecode dump, your motherboard is B450-A PRO MAX which is supposed to have:
Internal Connectors
...
* 1x Serial port connector

Maybe you can capture serial console output? console=ttyS0 would write there.

> On next good boot with acpi=noirq I can see no messages anywhere relating to
> this fail boot, nothing in dmesg or journal relating to the fail.
> 
> Where would messages be expected to be saved if any?

Unfortunately nowhere as disks are apparently not up quite yet.
Comment 6 Jiri Slaby 2023-09-20 10:06:45 UTC
(In reply to Jiri Slaby from comment #5)
> (In reply to Stuart Rogers from comment #4)
> > Tried that loglevel=7 without quiet and acpi=noirq but all I get is a load
> > of messages flashing past on screen but too fast to see clearly. Looks from
> > memory the same as when I do it without quiet and acpi. Then it just reboots.
> 
> Hm, we might need to enable CONFIG_BOOT_PRINTK_DELAY and add boot_delay=30
> to kernel commandline. (It will wait 30 ms after each message printed.)

Kernel with the config set is building at:
https://build.opensuse.org/project/monitor/home:jirislaby:stable-boot_delay
Comment 7 Stuart Rogers 2023-09-20 15:53:45 UTC
I downloaded kernel-default-6.5.4-3.1.x86_64.rpm when it finished and it boots OK when I use acpi=noirq. However, when I remove quiet and acpi=noirq and add boot_delay, it waits a long time before the messages start, but they still go through too quickly to really see, even when I used a delay of 100. I was not sure what the devel and vdso kernels were for.

Is it worth trying the 6.6 kernel as mentioned in bug 1215328 to see if a vanilla kernel still fails?
Comment 8 Jiri Slaby 2023-09-21 08:28:09 UTC
(In reply to Stuart Rogers from comment #7)
> I downloaded kernel-default-6.5.4-3.1.x86_64.rpm when it finished and it
> boots OK when I use acpi=noirq, however when I remove quiet and acpi=noirq
> and add boot_delay it waits a long time before the messages start

You might add earlycon=efifb to see the messages before the real console comes up.

> but they
> still go through too quickly to really see, even when I used a delay of 100.

You might need to pass lpj=12310136 too, as the default is 1 million and the calibration happens later in the boot cycle. Then boot_delay=100 should be better (12 times slower than without lpj=).
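
The "12 times" figure is just the ratio of the two loops-per-jiffy values. A minimal sketch of that arithmetic, assuming (as stated above) that the delay is scaled by a default of 1 million loops per jiffy until lpj= is given; the names below are illustrative only:

```python
# Loops-per-jiffy values taken from the comment above.
DEFAULT_LPJ = 1_000_000       # pre-calibration default, per the comment
SUGGESTED_LPJ = 12_310_136    # value suggested for lpj=


def delay_scale(lpj: int, default_lpj: int = DEFAULT_LPJ) -> float:
    """Return how many times longer each boot_delay wait becomes with lpj= set."""
    return lpj / default_lpj


scale = delay_scale(SUGGESTED_LPJ)
print(f"each boot_delay wait is roughly {scale:.1f}x longer with lpj={SUGGESTED_LPJ}")
```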

Maybe you can capture a video?

> I was not sure what the devel and vdso kernels were for.

You can ignore those.

> Is it worth trying the 6.6 kernel as mentioned in bug 1215328 to see if a
> vanilla kernel still fails?

You might try to confirm whether it helps or not.
Comment 9 Jiri Slaby 2023-09-21 08:39:38 UTC
(In reply to Jiri Slaby from comment #8)
> > but they
> > still go through too quickly to really see, even when I used a delay of 100.

Oh, wait. If the reboot happens *after* init is started, boot_delay has no effect. Please pass sysctl.kernel.printk_delay=100 (or 500 or 1000) instead (no need for lpj and boot_delay then).
Comment 10 Jiri Slaby 2023-09-21 08:55:05 UTC
(In reply to Jiri Slaby from comment #9)
> (In reply to Jiri Slaby from comment #8)
> > > but they
> > > still go through too quickly to really see, even when I used a delay of 100.
> 
> Oh, wait. If the reboot happens *after* init is started, boot_delay has no
> effect. Please pass sysctl.kernel.printk_delay=100 (or 500 or 1000) instead
> (no need for lpj and boot_delay then).

One more thing: also add the initcall_debug parameter. That will dump which driver is being initialized at which point.
Comment 11 Jiri Slaby 2023-09-21 08:59:35 UTC
(In reply to Jiri Slaby from comment #10)
> (In reply to Jiri Slaby from comment #9)
> > (In reply to Jiri Slaby from comment #8)
> > > > but they
> > > > still go through too quickly to really see, even when I used a delay of 100.
> > 
> > Oh, wait. If the reboot happens *after* init is started, boot_delay has no
> > effect. Please pass sysctl.kernel.printk_delay=100 (or 500 or 1000) instead
> > (no need for lpj and boot_delay then).
> 
> One more thing: add also initcall_debug parameter. That will dump which
> driver is being initialized at which point.

Sorry for spamming, but initcall_debug in turn needs ignore_loglevel as klog.service lowers the loglevel, so you need to add all three:
sysctl.kernel.printk_delay=100 initcall_debug ignore_loglevel
Comment 12 Stuart Rogers 2023-09-21 09:58:28 UTC
Well by the time I read your update I had already removed your kernel, so I tried those options with the 6.5.3 kernel but the messages still flew by too fast.

Which is best now to try the 6.6 kernel or reload your 6.5.4 kernel and try those options with that?
Comment 13 Jiri Slaby 2023-09-21 10:15:26 UTC
(In reply to Stuart Rogers from comment #12)
> Well by the time I read your update I had already removed your kernel, so I
> tried those options with the 6.5.3 kernel but the messages still flew by too
> fast.

Hm, everything other than boot_delay should work fine even with the stock kernel. It also makes me wonder why the messages are still printed fast.

> Which is best now to try the 6.6 kernel or reload your 6.5.4 kernel and try
> those options with that?

You can install both (using "rpm -i" to keep all of the kernels) and test them one after the other.

So I would boot 6.6 first to see if it is OK. Then I would boot my kernel with all of:
sysctl.kernel.printk_delay=100 initcall_debug ignore_loglevel lpj=12310136 boot_delay=100 earlycon=efifb
(You can add this directly to the entry for my kernel in /boot/grub2/grub.cfg; it will be removed automatically when the kernel is uninstalled. Note that the file is autogenerated/overwritten upon each kernel installation/removal.)
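
To make the intended command line concrete, here is a small sketch that takes the command line from comment 2, drops quiet and acpi=noirq, and appends the debug parameters listed above. The helper name debug_cmdline is hypothetical and only manipulates a string; it does not touch grub.cfg.

```python
# Parameters suggested above for slowing down and annotating the boot messages.
DEBUG_PARAMS = [
    "sysctl.kernel.printk_delay=100",
    "initcall_debug",
    "ignore_loglevel",
    "lpj=12310136",
    "boot_delay=100",
    "earlycon=efifb",
]
# Parameters that hide the messages or work around the failure.
REMOVE = {"quiet", "acpi=noirq"}


def debug_cmdline(cmdline: str) -> str:
    """Drop the masking parameters and append the debug ones (no duplicates)."""
    kept = [tok for tok in cmdline.split() if tok not in REMOVE]
    added = [p for p in DEBUG_PARAMS if p not in kept]
    return " ".join(kept + added)


# Command line as reported in comment 2.
original = (
    "BOOT_IMAGE=/boot/vmlinuz-6.5.3-1-default "
    "root=UUID=190442a2-32b3-47e7-bac7-a39e6841236f splash=silent "
    "resume=/dev/disk/by-id/nvme-KINGSTON_SA2000M8250G_50026B728266A552-part3 "
    "quiet acpi=noirq mitigations=auto"
)
print(debug_cmdline(original))
```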
Comment 14 Stuart Rogers 2023-09-22 08:53:00 UTC
Well I just tried the 6.6 kernel with the sysctl etc. additions and with acpi & quiet removed, but it still fails to boot exactly like the others. The messages are still too fast for the naked eye so I might try a video with my phone!

Anyway 6.6 boots OK with acpi=noirq. Is it still worth trying your 6.5.4?
Comment 15 Stuart Rogers 2023-09-25 21:57:25 UTC
I have managed to video the messages on screen with all those parameters added and on viewing it I can see no obvious errors. The last message which shows prior to it rebooting relates to the firewire interface on the motherboard; it starts with initcall fw_ohci_int and has [firewire_ohci] as part of it. I'll try to extract the full message in the morning as it is a bit blurred! Anyway the screen then goes blank and it reboots to the hardware messages about entering BIOS etc. and then goes to the standard GRUB menu.
Comment 16 Jiri Slaby 2023-09-26 05:02:42 UTC
(In reply to Stuart Rogers from comment #15)
> I have managed to video the messages on screen with all those parameters
> added and on viewing it I can see no obvious errors. The last message which
> shows prior to it rebooting relates to the firewire interface on the
> motherboard it starts with initcall fw_ohci_int and has [firewire_ohci] as

It's a shot in the dark, but if it is the last, you can try to add:
module_blacklist=firewire_ohci

Note that modules are loaded in parallel. So it can be any of the last initcalls which has not dumped "finished" yet.
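
Because of that, the suspects are all initcalls that logged a start but no completion. A minimal sketch of picking those out of a captured log, assuming the lines have the shape "calling  <fn>+0x.../0x... @ <pid>" and "initcall <fn>+0x.../0x... returned <ret> after <n> usecs" as printed with initcall_debug. The function and module names in the sample are made up for illustration:

```python
import re

# Assumed line shapes printed with initcall_debug (see the lead-in above).
CALLING = re.compile(r"calling\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+")
RETURNED = re.compile(r"initcall\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+.*returned")


def unfinished_initcalls(lines):
    """Return initcalls that were started but never reported a return."""
    pending = []
    for line in lines:
        started = CALLING.search(line)
        if started:
            pending.append(started.group(1))
            continue
        finished = RETURNED.search(line)
        if finished and finished.group(1) in pending:
            pending.remove(finished.group(1))
    return pending


# Made-up sample: two initcalls start in parallel, only one returns.
sample = [
    "calling  example_a_init+0x0/0x1000 [example_a] @ 412",
    "calling  example_b_init+0x0/0x1000 [example_b] @ 430",
    "initcall example_a_init+0x0/0x1000 [example_a] returned 0 after 120 usecs",
]
print(unfinished_initcalls(sample))
```

Every name this prints is a candidate for module_blacklist=, not only the very last line seen on screen.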

Can you upload the video somewhere (perhaps attach here if not too large)?
Comment 17 Stuart Rogers 2023-09-26 08:54:23 UTC
Well adding that module_blacklist=firewire_ohci works and the system boots up fine. This was using the standard kernel 6.5.4.1-1.

Where do we go from here? I do occasionally use firewire on this desktop as I have an old Camcorder which has that interface.
Comment 18 Jiri Slaby 2023-09-26 08:59:38 UTC
(In reply to Stuart Rogers from comment #17)
> Well adding that module_blacklist=firewire_ohci works and the system boots
> up fine. This was using the standard kernel 6.5.4.1-1.

Perfect! We narrowed the problem a heap.
 
> Where do we go from here? I do occasionally use firewire on this desktop as
> I have an old Camcorder which has that interface.

Don't worry, we need to find out the root cause. There are ~ 10 commits in firewire code between 6.4 and 6.5. Let me ask upstream devs what they think (I do not see anything wrong with the commits at first glance).
Comment 19 Jiri Slaby 2023-09-26 09:06:44 UTC
BTW this commit from 6.5:
commit 06f45435d985d60d7d2fe2424fbb9909d177a63d
Author: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Date:   Sun Jun 4 16:02:55 2023 +0900

    firewire: core: obsolete usage of GFP_ATOMIC at building node tree

was reverted in 6.6-rc2 and 6.5.5. Have you booted any of those yet?
Comment 20 Stuart Rogers 2023-09-26 09:56:09 UTC
I tried kernel-default-6.6~rc2-1.1.g8a1f7fd.x86_64.rpm but it failed as 6.5.4 does without the blacklist.
Comment 21 Jiri Slaby 2023-09-26 10:44:23 UTC
Could you try the kernel from:
https://build.opensuse.org/project/monitor/home:jirislaby:stable-boot_delay
again (once it builds)? It reverts all the 6.5 firewire commits.
Comment 22 Stuart Rogers 2023-09-26 10:52:16 UTC
Will do, I also just tried 6.6 RC3 on my test system and that still only works with the blacklist added.
Comment 23 Stuart Rogers 2023-09-26 14:45:58 UTC
Just tested your latest build 6.5.5 and it works fine without the blacklist.
Comment 24 Takashi Sakamoto 2023-09-26 23:40:36 UTC
Hi,

I'm the current maintainer of the Linux FireWire subsystem. I learned of your issue from a message from Jiri Slaby[1], and I apologize for the inconvenience.

This morning I installed openSUSE Tumbleweed into my virtual machine (x86_64) on a host machine (AMD Ryzen 5 2400G/Gigabyte AX370-Gaming 5, BIOS F51h). The 1394 OHCI hardware is bound to the virtual machine by vfio-pci in the host OS (Ubuntu 23.04 amd64).

As a result, I see no issue in the virtual machine.

(In guest system)

~> cat /etc/os-release
NAME="openSUSE Tumbleweed"
# VERSION="20230922"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20230922"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20230922"
BUG_REPORT_URL="https://bugzilla.opensuse.org"
SUPPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo-Tumbleweed"

~> cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.5.4-1-default root=UUID=ca1f9882-855b-4588-955d-8adf468e4fbb splash=silent mitigations=auto quiet security=apparmor

~> uname -r
6.5.4-1-default

~> sudo lspci -vvvv
08:01.0 FireWire (IEEE 1394): Texas Instruments XIO2213A/B/XIO2221 IEEE-1394b OHCI Controller [Cheetah Express] (rev 01) (prog-if 10 [OHCI])
        Subsystem: Device 3412:7856
        Physical Slot: 1
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 32 (500ns min, 1000ns max), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 23
        Region 0: Memory at c1804000 (32-bit, non-prefetchable) [size=2K]
        Region 1: Memory at c1800000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [44] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: firewire_ohci
        Kernel modules: firewire_ohci

~> LC_ALL=C sudo -E journalctl -k | grep ohci
Sep 27 08:11:06 localhost kernel: firewire_ohci 0000:08:01.0: added OHCI v1.10 device as card 0, 8 IR + 8 IT contexts, quirks 0x2

~> udevadm info -e
...
P: /devices/pci0000:00/0000:00:02.6/0000:07:00.0/0000:08:01.0
M: 0000:08:01.0
R: 0
U: pci
V: firewire_ohci
E: DEVPATH=/devices/pci0000:00/0000:00:02.6/0000:07:00.0/0000:08:01.0
E: SUBSYSTEM=pci
E: DRIVER=firewire_ohci
E: PCI_CLASS=C0010
E: PCI_ID=104C:823F
E: PCI_SUBSYS_ID=3412:7856
E: PCI_SLOT_NAME=0000:08:01.0
E: MODALIAS=pci:v0000104Cd0000823Fsv00003412sd00007856bc0Csc00i10
E: USEC_INITIALIZED=4075041
E: ID_PCI_CLASS_FROM_DATABASE=Serial bus controller
E: ID_PCI_SUBCLASS_FROM_DATABASE=FireWire (IEEE 1394)
E: ID_PCI_INTERFACE_FROM_DATABASE=OHCI
E: ID_VENDOR_FROM_DATABASE=Texas Instruments
E: ID_MODEL_FROM_DATABASE=XIO2213A/B/XIO2221 IEEE-1394b OHCI Controller [Cheetah Express]
E: ID_PATH=pci-0000:08:01.0
E: ID_PATH_TAG=pci-0000_08_01_0

P: /devices/pci0000:00/0000:00:02.6/0000:07:00.0/0000:08:01.0/fw0
M: fw0
R: 0
U: firewire
D: c 243:0
N: fw0
L: 0
E: DEVPATH=/devices/pci0000:00/0000:00:02.6/0000:07:00.0/0000:08:01.0/fw0
E: SUBSYSTEM=firewire
E: DEVNAME=/dev/fw0
E: MAJOR=243
E: MINOR=0
...


I think the 1394 OHCI hardware in question causes the issue, as I mentioned in my reply to the message[2]. My hardware integrates a PCIe-PCI bridge as well as a PCI-1394-bus bridge (OHCI), like:


(in host system)
~> sudo lspci -v
...
01:00.0 PCI bridge: Texas Instruments XIO2213A/B/XIO2221 PCI Express to PCI Bridge [Cheetah Express] (rev 01) (prog-if 00 [Normal decode])
        Subsystem: Device 3412:7856
        Flags: bus master, fast devsel, latency 0, IOMMU group 8
        Memory at fce00000 (32-bit, non-prefetchable) [size=4K]
        Bus: primary=01, secondary=02, subordinate=02, sec-latency=32
        I/O behind bridge: [disabled] [32-bit]
        Memory behind bridge: fcd00000-fcdfffff [size=1M] [32-bit]
        Prefetchable memory behind bridge: [disabled] [64-bit]
        Capabilities: [50] Power Management version 3
        Capabilities: [60] MSI: Enable- Count=1/16 Maskable- 64bit+
        Capabilities: [80] Subsystem: Device 3412:7856
        Capabilities: [90] Express PCI-Express to PCI/PCI-X Bridge, MSI 00
        Capabilities: [100] Advanced Error Reporting

02:00.0 FireWire (IEEE 1394): Texas Instruments XIO2213A/B/XIO2221 IEEE-1394b OHCI Controller [Cheetah Express] (rev 01) (prog-if 10 [OHCI])
        Subsystem: Device 3412:7856
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 119, IOMMU group 8
        Memory at fcd04000 (32-bit, non-prefetchable) [size=2K]
        Memory at fcd00000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [44] Power Management version 3
        Kernel driver in use: vfio-pci
        Kernel modules: firewire_ohci

If the hardware in question has a different bus bridge design, that would be a hint about the cause. If not, I'm going to investigate the issue without the virtual environment.

[1] https://lore.kernel.org/lkml/ZRLFP3UAX288JgAK@krava/
[2] https://lore.kernel.org/lkml/20230926140922.GA6538@workstation.local/


Regards
Comment 25 Jiri Slaby 2023-09-27 04:57:41 UTC
@Takashi san, thanks for jumping in.

@Stuart: could you retest the kernel once it builds? Now, it contains only these reverts:
0001-Revert-firewire-net-fix-use-after-free-in-fwnet_fini.patch
0002-Revert-firewire-ohci-release-buffer-for-AR-req-resp-.patch
0003-Revert-firewire-ohci-use-devres-for-content-of-confi.patch
0004-Revert-firewire-ohci-use-devres-for-IT-IR-AT-receive.patch
0005-Revert-firewire-ohci-use-devres-for-list-of-isochron.patch
0006-Revert-firewire-ohci-use-devres-for-requested-IRQ.patch
0007-Revert-firewire-ohci-use-devres-for-misc-DMA-buffer.patch
0008-Revert-firewire-ohci-use-devres-for-MMIO-region-mapp.patch
0009-Revert-firewire-ohci-use-devres-for-PCI-related-reso.patch
0010-Revert-firewire-ohci-use-devres-for-memory-object-of.patch
0011-Revert-firewire-fix-warnings-to-generate-UAPI-docume.patch

Also, can you upload here the output of the commands below from a working kernel?
hwinfo
lspci -vvnnxxx
lspci -tv
Comment 26 Stuart Rogers 2023-09-27 08:34:59 UTC
Created attachment 869781 [details]
hwinfo
Comment 27 Stuart Rogers 2023-09-27 08:35:29 UTC
Created attachment 869782 [details]
lspci -vvnnxxx
Comment 28 Stuart Rogers 2023-09-27 08:36:00 UTC
Created attachment 869783 [details]
lspci -tv
Comment 29 Stuart Rogers 2023-09-27 08:43:59 UTC
Tested kernel-default-6.5.5-7.1.x86_64.rpm this morning and it fails to boot if I remove the blacklist. The command outputs are from a working kernel and are added as attachments.
Comment 30 Jiri Slaby 2023-09-27 08:51:59 UTC
(In reply to Stuart Rogers from comment #29)
> Tested kernel-default-6.5.5-7.1.x86_64.rpm this morning and it fails to boot
> if I remove the blacklist. The command outputs are from a working kernel and
> are added as attachments.

Huh, that's sort of unexpected. So it is one of the cdev patches. I reverted 6 more:
0012-Revert-firewire-fix-build-failure-due-to-missing-mod.patch
0013-Revert-firewire-cdev-implement-new-event-relevant-to.patch
0014-Revert-firewire-cdev-add-new-event-to-notify-phy-pac.patch
0015-Revert-firewire-cdev-code-refactoring-to-dispatch-ev.patch
0016-Revert-firewire-cdev-implement-new-event-to-notify-r.patch
0017-Revert-firewire-cdev-add-new-event-to-notify-respons.patch

Could you test once built?
Comment 31 Stuart Rogers 2023-09-27 10:07:38 UTC
Just tested kernel-default-6.5.5-8.1.x86_64.rpm and it fails to boot with no blacklist, boots OK if I leave blacklist in.
Comment 32 Takashi Sakamoto 2023-09-27 12:26:26 UTC
I note that the hardware is the combination of ASM1083/1085 and VT6306/7/8.
Comment 33 Stuart Rogers 2023-09-30 08:35:50 UTC
I just went to look this morning and found kernel-default-6.5.5-8.2.x86_64.rpm so I tested that as well just in case but it still fails if I remove the blacklist.
Comment 34 Takashi Sakamoto 2023-10-01 05:43:41 UTC
As far as I tested the Linux FireWire stack on an actual machine, it works well.

https://lore.kernel.org/lkml/0ed4012a-83a7-4849-92c4-87a86e1bbb84@app.fastmail.com/

As supplements:

-> journalctl -k
kernel: smpboot: CPU0: AMD Ryzen 5 2400G with Radeon Vega Graphics (family: 0x17, model: 0x11, stepping: 0x0)
...
kernel: DMI: Gigabyte Technology Co., Ltd. AX370-Gaming 5/AX370-Gaming 5, BIOS F51h 02/09/2023
Comment 35 Stuart Rogers 2023-10-01 08:44:04 UTC
My issue is that any kernel after 6.4.12 fails unless I blacklist firewire. Now at some point in kernel 6.5.x a fix was applied which stopped this working. So I need help to determine what the change was and what in that change was causing it to fail so completely that the system rebooted. If I can get some assistance to get that far and it turns out to be an issue with my hardware/BIOS I can take it up with MSI who make my motherboard. I stress that up to and including 6.4.12 I had no issues with it at all.

I do not have the information or experience to move this further bisecting the kernel without some help from someone with the relevant knowledge of the fixes etc.
Comment 36 Stuart Rogers 2023-10-01 21:20:50 UTC
Checked again this evening and found kernel-default-6.5.5-8.4.x86_64.rpm which I thought I'd try, still no go with blacklist removed sadly. I really appreciate the help I'm getting here.
Comment 37 Jiri Slaby 2023-10-02 05:11:43 UTC
(In reply to Stuart Rogers from comment #36)
> Checked again this evening and found kernel-default-6.5.5-8.4.x86_64.rpm
> which I thought I'd try, still no go with blacklist removed sadly.

OK, so it still fails, I enabled more reverts:

0018-Revert-firewire-cdev-code-refactoring-to-operate-eve.patch
0019-Revert-firewire-core-implement-variations-to-send-re.patch
0020-Revert-firewire-core-use-union-for-callback-of-trans.patch

These remain to test after this step (if it still fails):
#0021-Revert-firewire-cdev-implement-new-event-to-notify-r.patch
#0022-Revert-firewire-cdev-add-new-event-to-notify-request.patch
#0023-Revert-firewire-cdev-add-new-version-of-ABI-to-notif.patch
#0024-Revert-firewire-add-KUnit-test-to-check-layout-of-UA.patch
Comment 38 Stuart Rogers 2023-10-02 08:19:38 UTC
OK success this time it booted fine without the blacklist using kernel-default-6.5.5-9.1.x86_64.rpm, so I'm guessing one of these last reverts is the issue.
Comment 39 Jiri Slaby 2023-10-02 08:57:08 UTC
In that case, it's one of the previous:
0018-Revert-firewire-cdev-code-refactoring-to-operate-eve.patch
0019-Revert-firewire-core-implement-variations-to-send-re.patch
0020-Revert-firewire-core-use-union-for-callback-of-trans.patch

I disabled the last now.
Comment 40 Stuart Rogers 2023-10-02 09:25:30 UTC
Once it has built I'll test it later today as I have to go out now.
Comment 41 Stuart Rogers 2023-10-02 11:24:58 UTC
So kernel-default-6.5.5-9.1.x86_64.rpm earlier today worked fine without the blacklist.

Now kernel-default-6.5.5-10.1.x86_64.rpm fails to boot without the blacklist.
Comment 42 Jiri Slaby 2023-10-03 05:08:27 UTC
So it is caused by this commit (if I made no mistake):
commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb
Author: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Date:   Tue May 30 08:12:40 2023 +0900

    firewire: core: use union for callback of transaction completion

But I fail to see the cause. Takashi?
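
The narrowing above can be reproduced from the test reports. A minimal sketch, assuming the reverts were applied cumulatively (0001 up to some number) and that a single commit is responsible; the pairs below are the results reported in comments 29, 31, 38 and 41:

```python
# (highest revert patch number applied, booted without the blacklist?)
results = [
    (11, False),  # comment 29: 0001-0011 reverted, still fails
    (17, False),  # comment 31: 0001-0017 reverted, still fails
    (20, True),   # comment 38: 0001-0020 reverted, boots
    (19, False),  # comment 41: 0020 re-applied (0001-0019 reverted), fails
]


def culprit_range(results):
    """Return (first, last) revert numbers that can still hide the culprit.

    Raises ValueError if there is not at least one failing and one booting result.
    """
    highest_failing = max(n for n, booted in results if not booted)
    lowest_booting = min(n for n, booted in results if booted)
    return highest_failing + 1, lowest_booting


print(culprit_range(results[:3]))  # after comment 38
print(culprit_range(results))      # after comment 41
```

When both numbers are equal, only one revert is left, here 0020 ("firewire: core: use union for callback of transaction completion").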
Comment 43 Jiri Slaby 2023-10-03 07:31:09 UTC
(In reply to Jiri Slaby from comment #42)
> So it is caused by this commit (if I made no mistake):
> commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb
> Author: Takashi Sakamoto <o-takashi@sakamocchi.jp>
> Date:   Tue May 30 08:12:40 2023 +0900
> 
>     firewire: core: use union for callback of transaction completion
> 
> But I fail to see the cause. Takashi?

I applied a debug patch. Could you test?
Comment 44 Stuart Rogers 2023-10-03 10:39:38 UTC
Well I just tested the 11.1 kernel with all the extra parameters to slow down booting so I could watch the messages. It still went too fast for me but it did boot to desktop eventually, so I have captured the journal and dmesg output which I will upload.
Comment 45 Stuart Rogers 2023-10-03 10:40:42 UTC
Created attachment 869871 [details]
Journal output from booting 11.1 kernel
Comment 46 Stuart Rogers 2023-10-03 10:41:17 UTC
Created attachment 869872 [details]
dmesg output from booting kernel 11.1
Comment 47 Jiri Slaby 2023-10-03 11:22:56 UTC
So one of those get_cycle_time() triggers the reboot. Let's dump the users and add some delay. It's there building...
Comment 48 Stuart Rogers 2023-10-03 12:26:48 UTC
Initial test boot of 14.1 kernel did not boot, so tried again with all the delay and other added parameters and it still did not boot, loads of messages but I blinked and missed the final ones. Anyway I'll try again later with phone ready to capture the messages and see what I find.
Comment 49 Stuart Rogers 2023-10-03 13:28:49 UTC
Managed to video the messages this time and around 4:11:16 into the video you can see the message saying the firewire_ohci was added, but then a whole bunch more messages appear before it gives up and reboots. I'll upload the video in case anything of more value can be seen in it.
Comment 50 Stuart Rogers 2023-10-03 13:41:47 UTC
The video as it stands is too large to upload (it's 1.3 GB) and I'm not sure I have anywhere I can upload it to. Is there anything I should look out for in the messages which might help, and perhaps I can reduce the video to just the relevant part to upload?
Comment 51 Stuart Rogers 2023-10-03 13:49:22 UTC
Created attachment 869879 [details]
Section of video of kernel 14.1 booting
Comment 52 Stuart Rogers 2023-10-03 13:50:38 UTC
I've had a go at reducing the video to the part from where it says it is booting a paravirtualized kernel up to the point it reboots. Hope this gives you what you need.
Comment 53 Jiri Slaby 2023-10-04 08:19:49 UTC
I'm out of ideas (I will communicate with Takashi). I thought it would be the first get_cycle_time() to crash. But apparently, there are several calls to it and it goes on. Until "something" happens.

One final check from me -- I made get_cycle_time() always dump a single line (not a stack trace as before) and return 0 without accessing the timestamp reg. Could you check whether this really avoids the problem?
Comment 54 Jiri Slaby 2023-10-04 08:28:06 UTC
Ah, I received an e-mail from Takashi in the meantime:
  https://lore.kernel.org/all/20231004002407.GA48535@workstation.local/
Stating:
=====
> ... it looks
> to be an issue specific to the reporter's 1394 OHCI hardware. I suspect
> a quirk specific to it related to accessing to CYCLE_TIME register in
> early time after powering on. It is the reason that I can regenerate the
> issue in my set of hardware.

I suppose so. (I believe you wanted to write "cannot" in there.)

> Would I ask you to request the reporter to inform the detail of
> hardware? If possible, let the reporter open PC box and take some picture
> of the hardware so that we can identify the ICs on the hardware?
>
> Via pci.ids, we can see both 'ASM1083/1085' and 'VT6306/7/8' are used,
> while I need to identify the IC to purchase an alternative so that I can
> regenerate the issue.

@Stuart: are you willing to open the box?
Comment 55 Stuart Rogers 2023-10-04 08:31:47 UTC
Yes that's no problem as it is a desktop I assembled myself so no warranty worries. It might be a short while before I get the chance so will update when done with photos.
Comment 56 Stuart Rogers 2023-10-04 12:01:02 UTC
OK so I opened the PC case and to be honest I'd forgotten this was a PCI-E add-on firewire card. Anyway I have photographed it so you can see the two chips. It was a purchase back in 2020 and was described as PCI-e 1X IEEE 1394A 4 Port (3+1) Firewire Card Adapter. I will upload the photo.
Comment 57 Stuart Rogers 2023-10-04 12:02:32 UTC
Created attachment 869905 [details]
Photo of Firewire PCI-E card
Comment 58 Stuart Rogers 2023-10-05 11:57:09 UTC
I discovered two new kernels 16.1 and 17.1 so decided to test them to see what happened. Both booted successfully to the desktop without the blacklist for firewire.
Comment 59 Jiri Slaby 2023-10-06 06:48:57 UTC
(In reply to Stuart Rogers from comment #58)
> I discovered two new kernels 16.1 and 17.1 so decided to test them to see
> what happened. Both booted successfully to the desktop without the blacklist
> for firewire.

Great, that confirms that the reads from the timestamp register cause the reboot. Weird, but Takashi presumed that. I hope he will come up with something.

You can keep kernel 6.4.* locked if you need to use firewire, so that it's not uninstalled. (Until we have a fix/quirk available for 6.5.)
Comment 60 Takashi Sakamoto 2023-10-12 00:30:17 UTC
I note that an issue is filed to kernel.org and I added a comment to it.

* https://bugzilla.kernel.org/show_bug.cgi?id=217994#c5

I'm still investigating, but currently I think the issue relates to some hardware quirk in the Asmedia ASM1083 and ASM1085.
Comment 61 Stuart Rogers 2023-10-17 12:37:58 UTC
I just replaced my Firewire card with one which does not have the Asmedia chip on it and the system boots perfectly, this card has a VIA chip. I can test any fix easily by replacing my old card in the PC when required.
Comment 62 Ben Steel 2023-11-15 00:42:57 UTC
I can confirm Mr. Rogers' finding. Removing a firewire card visually identical to the one in comment 57, with matching VIA and Asmedia chips, allows my MSI motherboard with AMD 3600 CPU to boot current Tumbleweed for the first time since kernel 6.4.12-1.
Comment 63 Jiri Slaby 2024-01-03 11:06:15 UTC
So there is an upstream patch available:
https://lore.kernel.org/all/20240102110150.244475-1-o-takashi@sakamocchi.jp/

Do you still have the HW to test the above?
Comment 64 Stuart Rogers 2024-01-03 11:25:40 UTC
Yes I still have my old firewire card which causes the issue. My problem is that I am unfamiliar with the methods of patching having never done this or compiled a kernel. If there is a kernel to test then I can certainly do that.
Comment 66 Stuart Rogers 2024-01-03 15:52:17 UTC
I've downloaded it and installed OK. Rebooted with new card to make sure it runs OK which it does. Next I will install the previous card to see if it fixes the issue. May take a few hours before I can do that. I will update again once tested.
Comment 67 Jiri Slaby 2024-01-04 06:36:37 UTC
Pushed the patch to the stable branch.
Comment 68 Takashi Sakamoto 2024-01-16 01:44:57 UTC
Hi,

The change for the 1394 OHCI driver, aimed at suppressing the unexpected
system reboot on AMD Ryzen machines[1], has been merged into Linux kernel
v6.7[2]. It has also been applied to the following releases of stable and
longterm kernels.

* 6.6.11[3]
* 6.1.72[4]
* 5.15.147[5]
* 5.10.208[6]
* 5.4.267[7]
* 4.19.305[8]
* 4.14.336[9]

Once the downstream distribution project provides the corresponding kernel
packages, you should no longer encounter the unexpected system reboot.
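
For checking a running kernel against the list above, a minimal sketch follows. It compares a release string such as the output of uname -r with the first fixed release of each listed series. The helper name has_upstream_fix is hypothetical, and a series not listed above (for example 6.5) is reported as unknown, since a distribution kernel may carry the patch as a backport.

```python
import re

# First release carrying the change in each stable/longterm series, as listed above.
FIXED_IN = {
    (6, 6): 11,
    (6, 1): 72,
    (5, 15): 147,
    (5, 10): 208,
    (5, 4): 267,
    (4, 19): 305,
    (4, 14): 336,
}
MAINLINE_FIXED = (6, 7)  # the change was merged into v6.7


def has_upstream_fix(release: str):
    """Return True/False for listed series, None when the series is not listed."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    if series >= MAINLINE_FIXED:
        return True
    if series in FIXED_IN:
        return patch >= FIXED_IN[series]
    return None  # unknown: check the distribution changelog instead


for rel in ("6.6.11", "6.1.71", "6.5.4-1-default"):
    print(rel, has_upstream_fix(rel))
```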

Note that the following combination of hardware is not necessarily suitable,
depending on your use case:

* Any type of AMD Ryzen machine
* 1394 OHCI hardware consisting of:
    * Asmedia ASM1083/1085
    * VIA VT6306/6307/6308

When working with a time-aware protocol, such as audio sample processing, it
is advisable to avoid the combination. The change comes with a functional
limitation: the software stack does not provide precise hardware time
in this case.

If you choose to continue using an AMD Ryzen machine, the recommendation is
to replace the 1394 OHCI hardware with another one. Conversely, if you
choose to continue using the 1394 OHCI hardware, the recommendation is to
use a machine provided by vendors other than AMD.

Thanks for your report and long patience.

[1] https://git.kernel.org/torvalds/linux/c/ac9184fbb847
[2] https://lore.kernel.org/lkml/CAHk-=widprp4XoHUcsDe7e16YZjLYJWra-dK0hE1MnfPMf6C3Q@mail.gmail.com/
[3] https://lore.kernel.org/lkml/2024011058-sheep-thrower-d2f8@gregkh/
[4] https://lore.kernel.org/lkml/2024011052-unsightly-bronze-e628@gregkh/
[5] https://lore.kernel.org/lkml/2024011541-defective-scuff-c55e@gregkh/
[6] https://lore.kernel.org/lkml/2024011532-lustiness-hybrid-fc72@gregkh/
[7] https://lore.kernel.org/lkml/2024011519-mating-tag-1f62@gregkh/
[8] https://lore.kernel.org/lkml/2024011508-shakiness-resonant-f15e@gregkh/
[9] https://lore.kernel.org/lkml/2024011046-ecology-tiptoeing-ce50@gregkh/


Thanks

Takashi Sakamoto