Bug 1220723

Summary: [Build 59.2] openQA test fails in first_boot - display manager can not be shown after migration from 12SP5 to 15SP6
Product: [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP6 Reporter: Lemon Li <leli>
Component: GNOME Assignee: E-mail List <gnome-bugs>
Status: NEW --- QA Contact:
Severity: Normal    
Priority: P3 - Medium CC: alynx.zhou, hector.oron, xiaoguang.wang, yfjiang
Version: unspecified   
Target Milestone: ---   
Hardware: PowerPC-64   
OS: SLES 15   
URL: https://openqa.suse.de/tests/13636406/modules/first_boot/steps/4
Whiteboard:
Found By: openQA Services Priority:
Business Priority: Blocker: No
Marketing QA Status: --- IT Deployment: ---
Attachments: serial0.txt

Description Lemon Li 2024-03-01 02:08:36 UTC
Created attachment 873141 [details]
serial0.txt

## Observation
This test migrates SLES 12SP5 to 15SP6 with GNOME. After migration and reboot, the display manager sporadically fails to appear. The failure has only been seen on ppc64le, and the reproduction rate is about 1/5.

When the issue occurs, the system appears to hang, and it is not possible to switch tty to collect the journal log.

In addition, I switched the test to a multipath target and have not seen the issue since: https://openqa.suse.de/tests/13639655#step/first_boot/5 (shows the login screen). See the jobs whose names are prefixed with multipath at https://openqa.suse.de/tests/overview?version=15-SP6&build=lemon-suse%2Fos-autoinst-distri-opensuse%23ppc64le-fisrt-boot-slow&distri=sle (0/10 runs reproduced the issue).
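
As a rough sanity check on the 0/10 multipath result, the sketch below computes how likely ten clean runs would be by chance. It assumes the failure rate is really about 1/5 per run and that runs are independent; both are assumptions, not measured facts, and the function name is only illustrative.

```python
def prob_no_failures(failure_rate: float, runs: int) -> float:
    """Probability of seeing zero failures in `runs` independent runs."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return (1.0 - failure_rate) ** runs


# Rate taken from this report (about 1/5); independence is an assumption.
print(f"P(0 failures in 10 runs) = {prob_no_failures(0.2, 10):.3f}")
```

Under those assumptions the result is about 0.107, so ten passing runs make the multipath setup look promising but do not rule the issue out. More runs would be needed for a firm conclusion.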

openQA test in scenario sle-15-SP6-Regression-on-Migration-from-SLE12-SPx-ppc64le-offline_sles12sp5_pscc_sdk-lp-asmm-contm-lgm-tcm-wsm-pcm_all_full@ppc64le-4g fails in
[first_boot](https://openqa.suse.de/tests/13636406/modules/first_boot/steps/4)

## Test suite description
The base test suite is used for job templates defined in YAML documents. It has no settings of its own.


## Reproducible

Fails since (at least) Build [53.1](https://openqa.suse.de/tests/13458620)


## Expected result

Last good: (unknown) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Regression-on-Migration-from-SLE12-SPx&machine=ppc64le-4g&test=offline_sles12sp5_pscc_sdk-lp-asmm-contm-lgm-tcm-wsm-pcm_all_full&version=15-SP6)
Comment 1 Alynx Zhou 2024-03-01 09:17:02 UTC
There seem to be no logs related to the display manager or the GNOME session. I also checked the systemctl output, and there is not even a graphical target...
Comment 4 Lemon Li 2024-03-05 05:53:08 UTC
Hi, we reproduced this issue on the latest build, 62.1: https://openqa.suse.de/tests/13714553#step/first_boot/4

Is any log or other information needed? We will try to provide it to help clarify things. Thanks.
Comment 5 xiaoguang wang 2024-03-05 08:07:50 UTC
(In reply to Ming Li from comment #4)
> Hi, we reproduced this issue on latest build 62.1,
> https://openqa.suse.de/tests/13714553#step/first_boot/4
> 
> Any log or info needed? We will try to provide to help to make thing clear.
> Thanks.

I can't find the journal log in this job.
Could you collect the journal log with "journalctl -b" and the package information with "rpm -qa"?
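
A minimal sketch of how the two requested outputs could be captured into files once a shell on the affected system is reachable. The function name, the output file names and the `--no-pager` option are choices made here for illustration, not something the test suite already provides.

```python
import shutil
import subprocess
from pathlib import Path

# The two commands requested in this comment; output file names are arbitrary.
DEFAULT_COMMANDS = {
    "journal.txt": ["journalctl", "-b", "--no-pager"],
    "packages.txt": ["rpm", "-qa"],
}


def collect_logs(out_dir, commands=None):
    """Run each command, save its stdout to a file, return the files written."""
    commands = DEFAULT_COMMANDS if commands is None else commands
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, argv in commands.items():
        if shutil.which(argv[0]) is None:
            # Tool is not installed on this system: skip it rather than fail.
            continue
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        target = out / name
        target.write_text(result.stdout)
        written.append(target)
    return written
```

Calling `collect_logs("/tmp/bug-logs")` as root (the path is only an example) would leave `journal.txt` and `packages.txt` in that directory, ready to attach to this bug.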
Comment 6 Lemon Li 2024-03-07 05:44:23 UTC
(In reply to xiaoguang wang from comment #5)
> (In reply to Ming Li from comment #4)
> > Hi, we reproduced this issue on latest build 62.1,
> > https://openqa.suse.de/tests/13714553#step/first_boot/4
> > 
> > Any log or info needed? We will try to provide to help to make thing clear.
> > Thanks.
> 
> From this case I don't find the journal log.
> Could you collect the journal log by "journalctl -b", and the package
> information by "rpm -qa".

I tried to switch tty when the failure happened, but that failed, so I can't provide the journal log. The system seems to hang.

I only see these failures in the worker log:
[2024-03-07T03:31:18.125684Z] [debug] [pid:77254] QEMU: KVM: Failed to create TCE64 table for liobn 0x80000000
[2024-03-07T03:32:47.256961Z] [debug] [pid:77254] QEMU: KVM: Failed to create TCE64 table for liobn 0x80000001
I'm not sure whether they are related to this issue.
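
To check whether these messages also show up in passing jobs, a worker log could be scanned with something like the sketch below. The regular expression is derived only from the two lines quoted above, so it is an assumption about the log format, and the function name is illustrative.

```python
import re

# Pattern derived from the two worker log lines quoted in this comment.
TCE_FAILURE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] \[debug\] \[pid:(?P<pid>\d+)\] "
    r"QEMU: KVM: Failed to create TCE64 table for liobn (?P<liobn>0x[0-9a-fA-F]+)"
)


def find_tce_failures(log_text):
    """Return (timestamp, pid, liobn) for every TCE64 failure line found."""
    hits = []
    for line in log_text.splitlines():
        match = TCE_FAILURE.search(line)
        if match:
            hits.append(
                (match.group("timestamp"), int(match.group("pid")), match.group("liobn"))
            )
    return hits
```

Comparing the hits from a failing job with those from a passing job on the same worker would show whether the message correlates with the hang or appears on every boot.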

In addition, I disabled kdump and cloned the test 10 times, and the issue still reproduced in at least 4 of the 10 runs: https://openqa.suse.de/tests/overview?distri=sle&build=lemon-suse%2Fos-autoinst-distri-opensuse%23master&version=15-SP6 (jobs prefixed with no-kdump)
Comment 7 Yifan Jiang 2024-03-07 06:28:17 UTC
(In reply to Ming Li from comment #6)
> (In reply to xiaoguang wang from comment #5)
> > (In reply to Ming Li from comment #4)
> > > Hi, we reproduced this issue on latest build 62.1,
> > > https://openqa.suse.de/tests/13714553#step/first_boot/4
> > > 
> > > Any log or info needed? We will try to provide to help to make thing clear.
> > > Thanks.
> > 
> > From this case I don't find the journal log.
> > Could you collect the journal log by "journalctl -b", and the package
> > information by "rpm -qa".
> 
> I tried to switch tty when the failure happened but failed, so can't provide
> the journal log, it seems the system is hang. 

If a graphical environment issue is suspected here, we need the relevant logs to diagnose it further. What about the network status: does ssh work for collecting logs?
Comment 8 Lemon Li 2024-03-08 08:33:20 UTC
(In reply to Yifan Jiang from comment #7)
> (In reply to Ming Li from comment #6)
> > (In reply to xiaoguang wang from comment #5)
> > > (In reply to Ming Li from comment #4)
> > > > Hi, we reproduced this issue on latest build 62.1,
> > > > https://openqa.suse.de/tests/13714553#step/first_boot/4
> > > > 
> > > > Any log or info needed? We will try to provide to help to make thing clear.
> > > > Thanks.
> > > 
> > > From this case I don't find the journal log.
> > > Could you collect the journal log by "journalctl -b", and the package
> > > information by "rpm -qa".
> > 
> > I tried to switch tty when the failure happened but failed, so can't provide
> > the journal log, it seems the system is hang. 
> 
> If the graphical environment issue is suspected here, the necessary logs
> make much better sense to diagnose it further. How about the network status,
> does ssh work to collect logs?
I'm trying to check whether ssh works when the issue happens, but the issue is random and, unluckily, I haven't hit it while running in developer mode.
In addition, the migration test currently has an ssh issue: root login is disabled after migration, so root login needs to be enabled before first_boot.
I have another thought on this: on one worker, 'mania', it is sometimes possible to switch tty when the issue happens. So one idea is to create a branch and run the test more times to get the needed logs; another is to update the openQA worker and check the results.
Comment 9 Lemon Li 2024-03-08 08:55:44 UTC
(In reply to Ming Li from comment #8)
> (In reply to Yifan Jiang from comment #7)
> > (In reply to Ming Li from comment #6)
> > > (In reply to xiaoguang wang from comment #5)
> > > > (In reply to Ming Li from comment #4)
> > > > > Hi, we reproduced this issue on latest build 62.1,
> > > > > https://openqa.suse.de/tests/13714553#step/first_boot/4
> > > > > 
> > > > > Any log or info needed? We will try to provide to help to make thing clear.
> > > > > Thanks.
> > > > 
> > > > From this case I don't find the journal log.
> > > > Could you collect the journal log by "journalctl -b", and the package
> > > > information by "rpm -qa".
> > > 
> > > I tried to switch tty when the failure happened but failed, so can't provide
> > > the journal log, it seems the system is hang. 
> > 
> > If the graphical environment issue is suspected here, the necessary logs
> > make much better sense to diagnose it further. How about the network status,
> > does ssh work to collect logs?
> I'm trying to check whether ssh work or not when issue happened, but it is a
> random issue and un-luky for me that I haven't met this issue when I set
> developer mode.
> Besides, currently migration test have a ssh issue that after migration the
> root login is disabled, so need enable root login before first_boot.
> And I have another thinking about this, I found sometimes on one worker
> 'mania' it can switch tty when issue happend, so one idea is to create a
> branch to run more times to get the needed log; another is try to update the
> openQA worker to check the results.

I think I just reproduced the issue in developer mode, but I can't log in over ssh because root login is disabled. I tried to change /etc/ssh/sshd_config over VNC but could not switch tty.
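
Since the file cannot be edited once the hang occurs, root login would have to be enabled before first_boot. A minimal sketch of the text change, assuming a plain /etc/ssh/sshd_config without Match blocks or Include drop-ins; the helper name is hypothetical and not part of the openQA test distribution.

```python
def enable_root_login(config_text):
    """Return sshd_config text with a single 'PermitRootLogin yes' directive."""
    result = []
    replaced = False
    for line in config_text.splitlines():
        # Treat both active and commented-out directives as candidates.
        candidate = line.strip().lstrip("#").strip()
        if candidate.lower().startswith("permitrootlogin"):
            if not replaced:
                result.append("PermitRootLogin yes")
                replaced = True
            # Drop any further occurrences so only one directive remains.
            continue
        result.append(line)
    if not replaced:
        # sshd uses the first value it finds for a keyword, so put it first.
        result.insert(0, "PermitRootLogin yes")
    return "\n".join(result) + "\n"
```

The function only transforms text; writing the result back to the file and restarting sshd would still need to happen in a step that runs before the reboot into first_boot.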