Bug 1224464 - [Build 91.1] system gets stuck and fails to collect kdump core after triggering a crash
Status: NEW
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP6
Classification: openSUSE
Component: Kernel
Version: unspecified
Hardware: PowerPC-64 Other
Priority: P3 - Medium, Severity: Major
Target Milestone: ---
Assignee: Jiri Bohac
QA Contact:
URL: https://openqa.suse.de/tests/14365468...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-20 05:53 UTC by Richard Fan
Modified: 2024-06-14 17:03 UTC
CC List: 7 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
serial logs (159.78 KB, text/plain), 2024-05-20 05:53 UTC, Richard Fan
screen shot (161.53 KB, image/png), 2024-05-20 05:54 UTC, Richard Fan
screen shot after 5 minutes (49.08 KB, image/png), 2024-05-20 07:36 UTC, Richard Fan

Description Richard Fan 2024-05-20 05:53:35 UTC
Created attachment 874963 [details]
serial logs

## The issue is similar to https://bugzilla.suse.com/show_bug.cgi?id=1218180, but I am not sure whether it is a regression in build 91.1

So far, the issue has only been seen on the ppc64le platform.

Kernel: "6.4.0-150600.21-default"
Memory: 4 GB / 8 GB
kdump memory: 1 GB

Steps to reproduce the issue

>1. Enable the kdump service with kdump memory=1024m
>2. Trigger a system crash: "echo c > /proc/sysrq-trigger"
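Before step 2 it can help to confirm that kdump is actually armed. The following is a minimal sketch of that pre-flight check, not part of the openQA test: it assumes the standard sysfs files /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size are present, and the function names are made up for illustration.

```python
from pathlib import Path

# Standard kernel interfaces; the helper names below are hypothetical.
CRASH_LOADED = Path("/sys/kernel/kexec_crash_loaded")  # "1" if a crash kernel is loaded
CRASH_SIZE = Path("/sys/kernel/kexec_crash_size")      # reserved crash memory in bytes
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def kdump_ready(loaded_text, size_text):
    """Return True if a crash kernel is loaded and memory is reserved for it."""
    return loaded_text.strip() == "1" and int(size_text.strip()) > 0


def trigger_crash():
    """Step 2: same effect as `echo c > /proc/sysrq-trigger` (needs root)."""
    SYSRQ_TRIGGER.write_text("c")


def reproduce():
    """Run the pre-flight check, then crash the machine on purpose."""
    if not kdump_ready(CRASH_LOADED.read_text(), CRASH_SIZE.read_text()):
        raise SystemExit("kdump is not armed; enable the service and reboot first")
    trigger_crash()
```

Nothing runs on import. Calling reproduce() as root crashes the machine immediately, so it should only be used on a disposable test system.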

Expected result:
The system collects the crash dump file and reboots.

Actual result:

The system hangs and fails to reboot.

Please refer to the attachments for the last screenshot and the full serial logs.

Please let me know if dev needs access to my setup.

## openQA Observation [automation tests]

openQA test in scenario sle-15-SP6-Online-ppc64le-toolchain_zypper@ppc64le fails in
[kdump_and_crash](https://openqa.suse.de/tests/14365468/modules/kdump_and_crash/steps/83)

## Test suite description
Maintainer: QE Core, mnowak

Install toolchain packages and test the toolchain. Uses a more powerful machine configuration.


## Reproducible

Fails since (at least) Build [91.1](https://openqa.suse.de/tests/14338091)


## Expected result

Last good: [90.1](https://openqa.suse.de/tests/14305219) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le&test=toolchain_zypper&version=15-SP6)
Comment 1 Richard Fan 2024-05-20 05:54:01 UTC
Created attachment 874964 [details]
screen shot
Comment 2 Richard Fan 2024-05-20 07:35:59 UTC
I can capture more console logs if I wait more than 5 minutes. Please see the attached file.
Comment 3 Richard Fan 2024-05-20 07:36:34 UTC
Created attachment 874967 [details]
screen shot after 5 minutes
Comment 4 Santiago Zarate 2024-05-22 15:01:33 UTC
(In reply to Richard Fan from comment #3)
> Created attachment 874967 [details]
> screen shot after 5 minutes

(In reply to Richard Fan from comment #0)

This fails consistently (even with 16 GB of RAM: https://openqa.suse.de/tests/14415084#step/kdump_and_crash/83), but passes for the kernel team.
Comment 5 Santiago Zarate 2024-05-22 16:35:30 UTC
OK, so this seems to be bsc#1161421 again:
- Passes: https://openqa.suse.de/tests/14415136 - CRASH_MEMORY 2048
- Fails: https://openqa.suse.de/tests/14415419 - CRASH_MEMORY 1200
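With one failing value (1200) and one passing value (2048), the threshold could be narrowed down by bisection. This is only a sketch: `passes` is a hypothetical callback that would run the kdump test with a given CRASH_MEMORY and report the result, and the search assumes the outcome is monotonic in the reserved size, which has not been confirmed for this bug.

```python
def smallest_passing(fail_mib, pass_mib, passes, step=64):
    """Binary-search the smallest reservation (in MiB) for which passes() is True.

    Assumes passes(fail_mib) is False, passes(pass_mib) is True, and that
    every size above a passing size also passes. Stops once the bracket is
    no wider than `step` MiB and returns its upper (passing) end.
    """
    lo, hi = fail_mib, pass_mib
    while hi - lo > step:
        mid = (lo + hi) // 2
        if passes(mid):
            hi = mid  # mid works, so the threshold is at or below mid
        else:
            lo = mid  # mid fails, so the threshold is above mid
    return hi
```

For example, smallest_passing(1200, 2048, run_job) would need about four test runs to bracket the threshold to within 64 MiB, where run_job stands in for whatever schedules the test.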
Comment 7 Richard Fan 2024-05-23 04:24:01 UTC
@Santiago Zarate,

The kernel tests probably pass because only 1 vCPU is assigned (see job setting 'QEMUCPUS=1'). With 4 vCPUs assigned, the issue can be reproduced there as well:

>http://openqa.suse.de/tests/overview?build=rfan0523_kernel&distri=sle&version=15-SP6


Conversely, if we set 'QEMUCPUS=1' for the qe-core tests, the issue is gone:

>http://openqa.suse.de/tests/overview?version=15-SP6&build=rfan0523&distri=sle

----------------

Do we need more crash dump memory if more vCPUs are assigned to the VM in this case?
Comment 8 Jiri Bohac 2024-06-11 16:17:21 UTC
This is strange. Kdump does require a little more reserved memory when more CPUs are configured for the kdump environment (using KDUMP_CPUS in /etc/sysconfig/kdump). But the default (and I checked it has not been changed in your QEMU image) is KDUMP_CPUS=1.

So is the statement from Comment #5 about CRASH_MEMORY 2048 passing and CRASH_MEMORY 1200 failing correct?

I tried with your QEMU image (on an x86-64 host, using QEMU full emulation) and it works for me with either 1 or 4 CPUs, with the default crashkernel=460M.

Is there a way I could manually play with a failing system?
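The two settings discussed in this comment can be read back on a failing system for comparison. The sketch below is an illustration with hypothetical helper names: it handles only the simple `crashkernel=<size>[KMG]` form (optionally with `@offset`), not the range or `,high`/`,low` syntaxes, and it uses the KDUMP_CPUS default of 1 stated above.

```python
import re
import shlex

_UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def crashkernel_mib(cmdline):
    """Return the simple crashkernel=<size>[KMG] reservation in MiB, or None."""
    for token in cmdline.split():
        m = re.fullmatch(r"crashkernel=(\d+)([KMG])(?:@\S+)?", token, re.IGNORECASE)
        if m:
            return int(m.group(1)) * _UNIT_TO_MIB[m.group(2).upper()]
    return None


def kdump_cpus(sysconfig_text, default=1):
    """Return KDUMP_CPUS from sysconfig-style text; the last assignment wins."""
    cpus = default
    for line in sysconfig_text.splitlines():
        line = line.strip()
        if line.startswith("KDUMP_CPUS="):
            # shlex strips the optional shell quoting around the value
            value = shlex.split(line.split("=", 1)[1])
            if value and value[0]:
                cpus = int(value[0])
    return cpus
```

On the system under test, the inputs would come from reading /proc/cmdline and /etc/sysconfig/kdump as text.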