Bug 1187329 - Kernel Panic loading initial ramdisk with kernel version 5.3.18-57.3 and pm80xx module
Status: RESOLVED WORKSFORME
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.3
Hardware: x86-64 openSUSE Leap 15.3
Importance: P5 - None, Major
Target Milestone: ---
Assigned To: Lee Duncan
Depends on:
Blocks:
Reported: 2021-06-14 22:11 UTC by Dave Addison
Modified: 2022-03-21 17:50 UTC
CC List: 3 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Minicom capture of the serial console output up to the panic (8.04 KB, text/plain), 2021-06-30 20:20 UTC, Dave Addison
Capture of the serial console output up to the panic with logging_level=4095 (60.13 KB, text/plain), 2021-07-03 22:43 UTC, Dave Addison
Output of diff command comparing two pm8001 directories (105.38 KB, text/plain), 2021-07-03 22:53 UTC, Dave Addison
Crash Kernel Dump File (36 bytes, text/plain), 2021-09-22 20:24 UTC, Dave Addison
Console Messages associated with crash kernel dump (85.47 KB, text/plain), 2021-09-22 20:25 UTC, Dave Addison

Description Dave Addison 2021-06-14 22:11:44 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0
Build Identifier: 

After upgrading to Leap 15.3, the system fails to boot with a kernel panic while processing the initial ramdisk. The error message reports a fatal exception while handling an interrupt in the pm80xx module.

The system will boot if the option pci=nomsi is added to the kernel command line, but the boot log then reports an error saying the interrupts could not be initialised, and the disks on the PM8001 SAS controller are inaccessible.

The following kernel command-line options have no effect: acpi=off, pci=noacpi, acpi=noirq.

With module_blacklist=pm80xx, the system boots to a rescue prompt, as systemd times out waiting for the disks on the controller to become available.

The system boots successfully with kernel version 5.3.18-lp152.75.
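
(For anyone reproducing the workarounds above: a minimal sketch of making a kernel option such as pci=nomsi persistent on openSUSE, assuming the stock GRUB2 layout. For a one-off test, the option can instead be appended to the "linux" line from the GRUB menu via 'e'.)

    # Add the option to the default kernel command line (assumes a quoted
    # GRUB_CMDLINE_LINUX_DEFAULT entry already exists):
    sudo sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT=".*\)"$/\1 pci=nomsi"/' /etc/default/grub
    # Regenerate the GRUB configuration:
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg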

Output from an "lspci -s 01:00.0 -vv" command is included under Additional Information below.

The motherboard is an ASUS P8H61-M-LE-USB3. The BIOS is updated to the latest version available.

Reproducible: Always

Steps to Reproduce:
1. boot with default menu option 
2.
3.
Actual Results:  
kernel panic when processing initial RAM disk

Expected Results:  
boot to command line

01:00.0 Serial Attached SCSI controller: Adaptec PMC-Sierra PM8001 SAS HBA [Series 6H] (rev 05)
        Subsystem: Adaptec Device 0800
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at f7960000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at f7950000 (64-bit, non-prefetchable) [size=64K]
        Region 4: Memory at f7940000 (32-bit, non-prefetchable) [size=64K]
        Region 5: Memory at f7900000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at f7800000 [disabled] [size=1M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [ac] MSI-X: Enable+ Count=16 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00004000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Kernel driver in use: pm80xx
        Kernel modules: pm80xx
Comment 1 Takashi Iwai 2021-06-15 07:40:22 UTC
Sounds like a regression in the SLE15-SP3 kernel.

Reassigned to Lee, who has touched this driver code the most.
Comment 2 Lee Duncan 2021-06-28 17:41:51 UTC
Can I see the kernel panic output?
Comment 3 Lee Duncan 2021-06-28 17:46:16 UTC
Upstream commit:

> 196ba6629cf9 ("scsi: pm80xx: Fixed kernel panic during error recovery for SATA drive")

Looks interesting, but I need to see what your panic stack looks like.
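
(For reference, one way to check whether that commit is already in a given kernel tree, assuming a mainline git clone:)

    # List release tags that contain the fix:
    git tag --contains 196ba6629cf9
    # Show which driver files the commit changed:
    git show --stat 196ba6629cf9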
Comment 4 Dave Addison 2021-06-28 19:23:55 UTC
(In reply to Lee Duncan from comment #2)
> Can I see the kernel panic output?

Yes, but it might take a few days to get the info. I haven't had any luck generating a kernel dump, so I was planning to rig up a serial console to capture the output.
Comment 5 Dave Addison 2021-06-30 20:20:58 UTC
Created attachment 850688 [details]
Minicom capture of the serial console output up to the panic

I've attached a capture of the serial console output up to the panic. Hopefully this is the information you wanted; if not, please let me know.
Comment 6 Lee Duncan 2021-07-02 18:37:25 UTC
Even with the panic output, I have no idea what the issue is. I looked over the pm80* patches I added for SLE-15-SP3, and none of the 32 commits looks suspect, on the surface.

Can you try passing in the "logging_level" parameter to pm80xx? I believe it's something like "pm80xx.logging_level=4095" on the boot command line. I _think_ this will work, because it seems like you are already accessing the disc by the time this panic occurs, and the panic is in the pm80xx module.
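
(For reference, a sketch of setting and verifying this, assuming the parameter is exported via sysfs as usual; since the module loads from the initrd, the boot command line is the simplest place to set it:)

    # Append to the kernel command line at the GRUB menu:
    #   pm80xx.logging_level=4095
    # After boot, confirm the value took effect:
    cat /sys/module/pm80xx/parameters/logging_level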

Lastly, I'm not clear on which version of Leap worked and which one didn't.

I am checking on SLES 15.3 (same kernel), and it has 5.3.18-57-default. You said that "5.3.18-lp152.75" worked? Where did that version come from?
Comment 7 Dave Addison 2021-07-02 21:48:53 UTC
(In reply to Lee Duncan from comment #6)
> Even with the panic output, I have no idea what the issue is. I looked over
> the pm80* patches I added for SLE-15-SP3, and none of the 32 commits looks
> suspect, on the surface.
> 
> Can you try passing in the "logging_level" parameter to pm80xx? I believe
> it's something like "pm80xx.logging_level=4095" on the boot command line. I
> _think_ this will work, because it seems like you are already accessing the
> disc by the time this panic occurs, and the panic is in the pm80xx module.
> 
> Lastly, I'm not clear on which version of Leap worked and which one didn't.
> 
> I am checking on SLES 15.3 (same kernel), and it has 5.3.18-57-default. You
> said that "5.3.18-lp152.75" worked? Where did that version come from?

I'll try increasing the logging level and I'll upload whatever the output is if it adds more detail.

The system was previously running leap 15.2. I performed a distribution upgrade to 15.2 so 5.3.18-57-default was the last kernel from the 15.2 installation
Comment 8 Dave Addison 2021-07-02 21:50:15 UTC
(In reply to Dave Addison from comment #7)
> (In reply to Lee Duncan from comment #6)
> > Even with the panic output, I have no idea what the issue is. I looked over
> > the pm80* patches I added for SLE-15-SP3, and none of the 32 commits looks
> > suspect, on the surface.
> > 
> > Can you try passing in the "logging_level" parameter to pm80xx? I believe
> > it's something like "pm80xx.logging_level=4095" on the boot command line. I
> > _think_ this will work, because it seems like you are already accessing the
> > disc by the time this panic occurs, and the panic is in the pm80xx module.
> > 
> > Lastly, I'm not clear on which version of Leap worked and which one didn't.
> > 
> > I am checking on SLES 15.3 (same kernel), and it has 5.3.18-57-default. You
> > said that "5.3.18-lp152.75" worked? Where did that version come from?
> 
> I'll try increasing the logging level and I'll upload whatever the output is
> if it adds more detail.
> 
> The system was previously running leap 15.2. I performed a distribution
> upgrade to 15.2 so 5.3.18-57-default was the last kernel from the 15.2
> installation
Sorry, I should have written "I performed a distribution upgrade to 15.3", not 15.2.
Comment 9 Dave Addison 2021-07-03 22:43:41 UTC
Created attachment 850752 [details]
Capture of the serial console output up to the panic with logging_level=4095

I've attached a new capture of the serial console output with pm80xx.logging_level=4095 added to the kernel options.
Comment 10 Dave Addison 2021-07-03 22:53:11 UTC
Created attachment 850753 [details]
output of diff command comparing two pm8001 directories

In case it's of any use, this is the output of "diff -ruN" comparing the pm8001 directories from the kernel source trees for 5.3.18-lp152.75 and 5.3.18-57.3.
Comment 11 Dave Addison 2021-08-11 07:50:47 UTC
(In reply to Dave Addison from comment #7)
> (In reply to Lee Duncan from comment #6)
> > Even with the panic output, I have no idea what the issue is. I looked over
> > the pm80* patches I added for SLE-15-SP3, and none of the 32 commits looks
> > suspect, on the surface.
> > 
> > Can you try passing in the "logging_level" parameter to pm80xx? I believe
> > it's something like "pm80xx.logging_level=4095" on the boot command line. I
> > _think_ this will work, because it seems like you are already accessing the
> > disc by the time this panic occurs, and the panic is in the pm80xx module.
> > 
> > Lastly, I'm not clear on which version of Leap worked and which one didn't.
> > 
> > I am checking on SLES 15.3 (same kernel), and it has 5.3.18-57-default. You
> > said that "5.3.18-lp152.75" worked? Where did that version come from?
> 
> I'll try increasing the logging level and I'll upload whatever the output is
> if it adds more detail.
> 
> The system was previously running leap 15.2. I performed a distribution
> upgrade to 15.2 so 5.3.18-57-default was the last kernel from the 15.2
> installation

Sorry, I should have written "5.3.18-lp152.75 was the last kernel from the 15.2 installation".
Comment 12 Lee Duncan 2021-09-02 01:17:55 UTC
(In reply to Dave Addison from comment #10)
> Created attachment 850753 [details]
> output of diff command comparing two pm8001 directories
> 
> In case it's of any use, this is the output of "diff -ruN" comparing the
> pm8001 directories from the kernel source trees for 5.3.18-lp152.75 and
> 5.3.18-57.3

Not helpful at all, which is why I haven't tried to figure it out from the diffs: there are a *bunch* of diffs. Better to figure out the actual problem IMHO.
Comment 13 Lee Duncan 2021-09-02 19:52:04 UTC
I went through all 30 of the pm80xx patches that were added (by me) to SLE-15-SP3, to see if I incorrectly applied any of them, but I found no differences.

I went through the diffs you supplied to see if any new printk()s were added that were not protected by debugging mode, and there were none.

Are you sure your hardware hasn't developed an issue? Is there any way you can try a replacement disc?

Also, a crash kernel dump would help track down where the code is actually dying.
Comment 14 Dave Addison 2021-09-05 19:35:21 UTC
(In reply to Lee Duncan from comment #13)
> I went through all 30 of the pm80xx patches that were added (by me) to
> SLE-15-SP3, to see if I incorrectly applied any of them, but I found no
> differences.
> 
> I sent through the diffs you supplied to see if any new printk()s were added
> that were not protected by debugging mode, and there were none.
> 
> Are you sure your hardware hasn't developed an issue? Is there any way you
> can try a replacement disc?
> 
> Also, a crash kernel dump would help track down where the code is actually
> dying.

I'm putting a test machine together so I can try things on a system without any live data. 

I should be able to swap the card into the new machine at some point this week. 
This will allow me to see if the card will work with a different BIOS and pair of discs.

I didn't have much luck getting a crash dump the last time I tried. I'll have another go once I've moved the card.
Comment 15 Dave Addison 2021-09-12 19:13:30 UTC
I've moved the card into a new machine. Even with a different motherboard and two new discs, I still get a repeatable crash during boot.

The only difference is that the EFI version on the new motherboard has an additional option ("Launch Storage OpROM Policy"). If this is enabled for legacy cards, then the card is initialised by the EFI, the discs are detected and the card BIOS is loaded. This is the same as for the other machine and, in this case, the crash occurs.

If I disable the option, so that the card BIOS isn't loaded, then the system boots successfully.

I've had no luck so far with the crash dump. I've set up the kdump kernel using the YaST module and it's all working OK, but the kernel panic occurs before the crash kernel is loaded.

I'll either need to load the kdump kernel earlier (which looks like it might be possible) or defer loading the pm80xx module somehow, as sketched below.
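
(A sketch of that second approach, assuming kdump is already configured: blacklist the module at boot so the crash kernel is armed first, then load the module by hand to trigger the panic.)

    # Boot with module_blacklist=pm80xx on the kernel command line, then:
    systemctl status kdump                # kdump service should be active
    cat /sys/kernel/kexec_crash_loaded    # should print 1
    sudo modprobe pm80xx                  # deliberately triggers the panic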
Comment 16 Dave Addison 2021-09-22 20:24:44 UTC
Created attachment 852701 [details]
Crash Kernel Dump File

The linked crash kernel dump file was created by blacklisting the module during boot and then manually triggering the panic by loading the module using modprobe.

The options set were LZO-compressed format, excluding pages filled with zeros, cache pages, user data pages, and free pages.

If you'd prefer different options to be used when generating the file, please let me know.
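
(For reference, those options correspond roughly to the openSUSE kdump settings sketched below; KDUMP_DUMPLEVEL is the makedumpfile -d bitmask, and 31 excludes zero, cache, user data, and free pages. Variable names are from the SUSE kdump package.)

    # /etc/sysconfig/kdump
    KDUMP_DUMPLEVEL=31        # exclude zero, cache, user data, and free pages
    KDUMP_DUMPFORMAT="lzo"    # LZO-compressed dump format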
Comment 17 Dave Addison 2021-09-22 20:25:54 UTC
Created attachment 852702 [details]
Console Messages associated with crash kernel dump

Console messages leading up to kernel panic
Comment 18 Hannes Reinecke 2021-10-12 15:20:55 UTC
Can you try switching off the IOMMU by specifying

amd_iommu=off

on the kernel command line?
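
(A quick way to check which IOMMU, if any, is active on the test machine:)

    dmesg | grep -iE 'iommu|amd-vi|dmar'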
Comment 19 Dave Addison 2021-10-14 21:07:22 UTC
(In reply to Hannes Reinecke from comment #18)
> Can you try to switch off the iommu by specifying
> 
> amd_iommu=off
> 
> on the kernel commandline?

A kernel panic still occurs with amd_iommu=off. If it would be useful, I can generate another crash dump.
Comment 20 Lee Duncan 2022-03-14 20:54:15 UTC
I looked at this quite a bit again, but I still cannot see anything obvious in the crash, or in the changes between the two releases.

If you could build and test a kernel, or if I could reproduce the issue here, it seems like we would be able to dissect the issue in the pm80xx driver.

I will see if any of our lab systems has the hardware for this, but it seems to be quite old. I will also see whether the crash dump can help me figure out the problem.

It looks like the problem occurs shortly after enabling interrupts. How do you have the drives on this adapter configured? Perhaps it's a spurious interrupt at startup that the driver is not ready for?
Comment 23 Lee Duncan 2022-03-16 21:43:10 UTC
Hi Dave: Are you still around and having this issue?

If you would like to continue working on this, I will have to build you a kernel, or multiple kernels, for you to test.

Initially I will probably just pass you a debugging kernel.

There are many fixes for the pm80xx driver upstream since your kernel, but sadly most of them seem to apply to the newer PM80xx hardware, while your HBA uses the older PM8001 path. Your HBA seems to be quite old, and we do not have one in our hardware lab, so to debug this I will need your help.

So an update on your current status on this issue would be most helpful. Thanks.
Comment 24 Dave Addison 2022-03-18 21:39:21 UTC
(In reply to Lee Duncan from comment #23)
> Hi Dave: Are you still around and having this issue?
> 
> If you would like to continue to fix this, I will have to build you a
> kernel, or multiple kernels, so that you can test.
> 
> Initially I will probably just pass you a debugging kernel.
> 
> There are many fixes for the pm80xx driver upstream since your kernel, but
> sadly most of them seem to be on the pm80xx drivers, and you use the pm8001
> driver. Your HBA is quite old it seems. We do not have one present in our
> hardware lab. So to debug this I will need your help.
> 
> So an update on your current status on this issue would be most helpful.
> Thanks.

Hello Lee

Yes, I'm still around. Seeing your email prompted me to hook up my test machine and see if the panic still occurred with the latest kernel version. Updating the kernel didn't make a difference, but I also reflashed the card with a firmware file from the Microsemi website. This updated the card firmware from 01.14.05.00 to 01.14.07.00 and, with this firmware, the problem no longer occurs. The fixed firmware is in the archive 6805h_fw_b10624.zip.
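
(For reference, assuming the driver logs the firmware revision in its probe messages, a quick check after reflashing might be:)

    dmesg | grep -i pm80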

For my part, this means that this is no longer a severe issue for me. However, if you see any advantage in trying to track the problem down, I'm happy to help. I still have one more card with the failing firmware.

Kind Regards
Dave
Comment 25 Lee Duncan 2022-03-21 17:50:21 UTC
Hi Dave:

Thank you for the prompt reply.

Interesting that firmware fixed the issue. I perhaps should have thought of that. Perhaps I work with hardware a little less than I used to.

No, I don't need to continue debugging why the system had issues with the old firmware. In general, though it might be interesting to work on this, I don't have the time to track down possible problems when real problems are so abundant.

Thank you for your help through this process. I'm sorry it took so long.

I will close this bug. Please reopen it if you still have this issue, or if it recurs. I am closing it as "works for me", even though I don't have that adapter, since I didn't see a better option.