Bug 1225352 - [Build 13.199] openQA test fails in prepare_firstboot: RPi3 not booting?
Summary: [Build 13.199] openQA test fails in prepare_firstboot: RPi3 not booting?
Status: NEW
Duplicates: 1225787
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Kernel
Version: Leap 15.5
Hardware: aarch64 Other
Importance: P5 - None : Major
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL: https://openqa.opensuse.org/tests/421...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-27 12:45 UTC by Fabian Vogt
Modified: 2024-06-18 06:52 UTC
CC List: 12 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
15.5 boot log on serial (from openQA run) (13.90 KB, text/plain)
2024-05-27 18:17 UTC, Guillaume GARDET
15.5 serial boot log with a longer timeout (26.99 KB, text/plain)
2024-05-28 07:17 UTC, Guillaume GARDET

Description Fabian Vogt 2024-05-27 12:45:03 UTC
The Leap 15.5 JeOS image has been failing to boot in openQA for some time now.
Initially it looked to me like an issue with the test infra, but the TW and 15.6 images boot while 15.5 fails consistently.

I tried to reproduce the issue locally and, apart from a slow boot due to the missing Ethernet cable, it worked.

The package diff between last working and latest build shows changes in aaa_base, coreutils, kernel, less, protobuf, libsemanage, perl, rpm and yast. Could be a kernel issue?

## Observation

openQA test in scenario opensuse-15.5-JeOS-for-RPi-aarch64-jeos@RPi3 fails in
[prepare_firstboot](https://openqa.opensuse.org/tests/4217406/modules/prepare_firstboot/steps/1)

## Test suite description
Maintainer: fvogt, mnowak

Start JeOS from the HDD image, configure it using the firstboot wizard and then run basic tests. console=tty0 added as needed for aarch64.


## Reproducible

Fails since (at least) Build [13.183](https://openqa.opensuse.org/tests/4120363)


## Expected result

Last good: [13.182](https://openqa.opensuse.org/tests/4119085) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.opensuse.org/tests/latest?arch=aarch64&distri=opensuse&flavor=JeOS-for-RPi&machine=RPi3&test=jeos&version=15.5)
Comment 1 Fabian Vogt 2024-05-27 12:48:30 UTC
Could you have a look at what the SUT is doing when the test fails?
Comment 2 Guillaume GARDET 2024-05-27 13:22:20 UTC
I would guess a problem with the network. I will have a deeper look.
Comment 3 Guillaume GARDET 2024-05-27 18:17:23 UTC
Created attachment 875139 [details]
15.5 boot log on serial (from openQA run)
Comment 4 Guillaume GARDET 2024-05-28 07:17:27 UTC
Created attachment 875148 [details]
15.5 serial boot log with a longer timeout

With a longer timeout, it manages to boot until login prompt and starts jeos-firstboot on serial.

I think jeos-firstboot should not be started as it blocks the start of the ssh server.
Comment 5 Fabian Vogt 2024-05-28 07:29:13 UTC
(In reply to Guillaume GARDET from comment #4)
> Created attachment 875148 [details]
> 15.5 serial boot log with a longer timeout
> 
> With a longer timeout, it manages to boot until login prompt and starts
> jeos-firstboot on serial.
>
> I think jeos-firstboot should not be started as it blocks the start of the
> ssh server.

Why does it start jeos-firstboot? AFAIK it shouldn't be enabled in this image?

In my local test run it did not.
Comment 6 Guillaume GARDET 2024-05-28 07:59:22 UTC
(In reply to Fabian Vogt from comment #5)
> (In reply to Guillaume GARDET from comment #4)
> > Created attachment 875148 [details]
> > 15.5 serial boot log with a longer timeout
> > 
> > With a longer timeout, it manages to boot until login prompt and starts
> > jeos-firstboot on serial.
> >
> > I think jeos-firstboot should not be started as it blocks the start of the
> > ssh server.
> 
> Why does it start jeos-firstboot? AFAIK it shouldn't be enabled in this
> image?
> 
> In my local test run it did not.

Ah no, sorry, an additional boot from a Tumbleweed test polluted the serial log.

It hung after:
**********
[  OK  ] Listening on Load/Save RF …itch Status /dev/rfkill Watch.
         Starting Security Auditing Service...
         Starting Rebuild Journal Catalog...
[  OK  ] Finished Commit a transient machine-id on disk.
         Starting Load/Save RF Kill Switch Status...
**********
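
To narrow down where the boot stalls, the serial log can be scanned for units that printed "Starting ..." but never a matching "[  OK  ] Started/Finished ..." line. A minimal Python sketch, assuming the usual systemd console status format seen in the excerpt above; `pending_units` and the ANSI-stripping step are illustrative helpers, not part of openQA or systemd:

```python
import re

# Assumption: systemd console status lines look like the excerpt above.
ANSI = re.compile(r"\x1b\[[0-9;]*m")  # colour codes around "OK", if any
STARTING = re.compile(r"^\s*Starting (.+?)\.\.\.\s*$")
DONE = re.compile(r"^\[\s*OK\s*\]\s+(?:Started|Finished) (.+?)\.\s*$")


def pending_units(log: str) -> list[str]:
    """Return unit descriptions that were started but never reported done."""
    pending: list[str] = []
    for raw in log.splitlines():
        line = ANSI.sub("", raw)
        if m := STARTING.match(line):
            pending.append(m.group(1))
        elif (m := DONE.match(line)) and m.group(1) in pending:
            pending.remove(m.group(1))
    return pending


# The lines quoted in this comment, used as sample input.
EXCERPT = """\
[  OK  ] Listening on Load/Save RF …itch Status /dev/rfkill Watch.
         Starting Security Auditing Service...
         Starting Rebuild Journal Catalog...
[  OK  ] Finished Commit a transient machine-id on disk.
         Starting Load/Save RF Kill Switch Status...
"""

for name in pending_units(EXCERPT):
    print("still pending:", name)
```

A unit listed as pending is only a hint: the stall may just as well be in the kernel rather than in the last service that was started.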
Comment 7 Fabian Vogt 2024-05-28 12:21:43 UTC
(In reply to Guillaume GARDET from comment #6)
> (In reply to Fabian Vogt from comment #5)
> > (In reply to Guillaume GARDET from comment #4)
> > > Created attachment 875148 [details]
> > > 15.5 serial boot log with a longer timeout
> > > 
> > > With a longer timeout, it manages to boot until login prompt and starts
> > > jeos-firstboot on serial.
> > >
> > > I think jeos-firstboot should not be started as it blocks the start of the
> > > ssh server.
> > 
> > Why does it start jeos-firstboot? AFAIK it shouldn't be enabled in this
> > image?
> > 
> > In my local test run it did not.
> 
> Ah no, sorry, an additional boot from a Tumbleweed test polluted the serial
> log.
> 
> It hung after:
> **********
> [  OK  ] Listening on Load/Save RF …itch Status /dev/rfkill Watch.
>          Starting Security Auditing Service...
>          Starting Rebuild Journal Catalog...
> [  OK  ] Finished Commit a transient machine-id on disk.
>          Starting Load/Save RF Kill Switch Status...
> **********

Did it completely hang or just take a long time? Can you get the full journal?

Smells like a kernel issue.
Comment 8 Guillaume GARDET 2024-05-28 12:47:26 UTC
(In reply to Fabian Vogt from comment #7)
> (In reply to Guillaume GARDET from comment #6)
> > (In reply to Fabian Vogt from comment #5)
> > > (In reply to Guillaume GARDET from comment #4)
> > > > Created attachment 875148 [details]
> > > > 15.5 serial boot log with a longer timeout
> > > > 
> > > > With a longer timeout, it manages to boot until login prompt and starts
> > > > jeos-firstboot on serial.
> > > >
> > > > I think jeos-firstboot should not be started as it blocks the start of the
> > > > ssh server.
> > > 
> > > Why does it start jeos-firstboot? AFAIK it shouldn't be enabled in this
> > > image?
> > > 
> > > In my local test run it did not.
> > 
> > Ah no, sorry, an additional boot from a Tumbleweed test polluted the serial
> > log.
> > 
> > It hung after:
> > **********
> > [  OK  ] Listening on Load/Save RF …itch Status /dev/rfkill Watch.
> >          Starting Security Auditing Service...
> >          Starting Rebuild Journal Catalog...
> > [  OK  ] Finished Commit a transient machine-id on disk.
> >          Starting Load/Save RF Kill Switch Status...
> > **********
> 
> Did it completely hang or just take a long time? Can you get the full
> journal?

Looks like a hang. I waited more than 10 min after the boot and the serial was unresponsive after those lines. (No login prompt)

I will try to get more traces on serial.
Comment 9 Guillaume GARDET 2024-06-03 07:28:23 UTC
*** Bug 1225787 has been marked as a duplicate of this bug. ***
Comment 10 Robert Munteanu 2024-06-04 17:40:30 UTC
To echo the comments from bug #1225787

- kernel 5.14.21-150500.55.52-default was the last known working kernel for me
- same system does not boot with kernel-default-5.14.21-150500.55.59.1.aarch64
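
The two kernel names above can be ordered to see where the suspected regression window sits. A rough Python sketch; `kernel_key` is a hypothetical helper that only compares the dotted numeric fields and is not a replacement for RPM's own version comparison:

```python
import re

# Matches "<upstream version>-<release>", e.g. "5.14.21-150500.55.52".
_VERSION = re.compile(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)")


def kernel_key(name: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Turn a kernel release or package name into a sortable key."""
    m = _VERSION.search(name)
    if m is None:
        raise ValueError(f"no kernel version found in {name!r}")
    version, release = (tuple(int(p) for p in g.split(".")) for g in m.groups())
    return version, release


reported = [
    "kernel-default-5.14.21-150500.55.59.1.aarch64",  # does not boot
    "5.14.21-150500.55.52-default",                   # last known working
]
for name in sorted(reported, key=kernel_key):
    print(name)
```

Note that the key ignores the flavour suffix ("-default") and treats a trailing rebuild counter such as ".1" as a newer release of the same kernel.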
Comment 11 Fabian Vogt 2024-06-07 08:37:22 UTC
Ping. Raising severity as well.
Comment 12 Zaoliang Luo 2024-06-07 13:18:56 UTC
I tested today for both RPi 3 and 4, no issue at all.


https://paste.opensuse.org/pastes/6733655a325c

or


Mac-mini:.ssh Zaoliang$ ssh zaoliang@192.168.8.187
The authenticity of host '192.168.8.187 (192.168.8.187)' can't be established.
ED25519 key fingerprint is SHA256:xWM/Nt3SdkZ7W1YyOnG4FEbB/WdAXrGDMlGfFopQWYM.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '192.168.8.187' (ED25519) to the list of known hosts.
(zaoliang@192.168.8.187) Password: 
Have a lot of fun...
zaoliang@localhost:~> cat /etc/os-release
NAME="openSUSE Leap"
VERSION="15.5"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.5"
PRETTY_NAME="openSUSE Leap 15.5"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.5"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Leap"
LOGO="distributor-logo-Leap"
Comment 13 Zaoliang Luo 2024-06-07 13:20:00 UTC
openSUSE-Leap-15.5-ARM-JeOS-raspberrypi.aarch64-2023.03.31-Build13.217.raw.xz is used.
Comment 14 Zaoliang Luo 2024-06-07 14:00:48 UTC
I checked this on another device, in this case an RPi 400. It boots from the GRUB menu, but never reaches a prompt or desktop.

This is quite strange.
Comment 15 Andrea della Porta 2024-06-10 07:18:12 UTC
Hi, is there at least one report of a non-booting Rpi4, or does it hang only on Rpi3? Furthermore, does running it on a CM4 make any difference wrt running it on a straight model B?

Thanks
Comment 16 Fabian Vogt 2024-06-10 07:24:51 UTC
(In reply to Andrea della Porta from comment #15)
> Hi, is there at least one report of non-booting Rpi4 or does it hang only
> on Rpi3? Furthermore, does running it on CM4 make any difference wrt
> running it on straight model B?
> 
> Thanks

I'm not aware of any issues on RPi 4, but on my RPi 3 the image works fine so it might just be random...
Comment 17 Andrea della Porta 2024-06-10 09:35:12 UTC
Rpi4 works just fine, rpi3 is hanging somewhat randomly. Some investigation is needed.
Comment 18 Andrea della Porta 2024-06-10 09:40:42 UTC
For the record, I'm testing openSUSE-Leap-15.5-ARM-JeOS-raspberrypi.aarch64-2023.03.31-Build13.217.raw.xz.
Comment 19 Andrea della Porta 2024-06-11 17:03:11 UTC
(In reply to Robert Munteanu from comment #10)
> To echo the comments from bug #1225787
> 
> - kernel 5.14.21-150500.55.52-default was the last known working kernel from
> me
> - same system does not boot with
> kernel-default-5.14.21-150500.55.59.1.aarch64

5.14.21-150500.55.52-default does not work for me either. May I ask how you tested the older kernel? Did you just burn an older Leap raw image to SD, or did you downgrade the kernel via the command line with something like:

zypper install --oldpackage kernel-default=5.14.21-150500.55.52.1

Many thanks
Comment 20 Zaoliang Luo 2024-06-11 18:54:45 UTC
(In reply to Andrea della Porta from comment #19)
> (In reply to Robert Munteanu from comment #10)
> > To echo the comments from bug #1225787
> > 
> > - kernel 5.14.21-150500.55.52-default was the last known working kernel from
> > me
> > - same system does not boot with
> > kernel-default-5.14.21-150500.55.59.1.aarch64
> 
> 5.14.21-150500.55.52-default does not work for me either. May I ask you how
> did you test the older kernel? Did you just burn an older Leap raw image on
> SD or did you just downgrade the kernel via commandline with something like:
> 
> zypper install --oldpackage kernel-default=5.14.21-150500.55.52.1
> 
> Many thanks

openSUSE-Leap-15.5-ARM-JeOS-raspberrypi.aarch64-2023.03.31-Build13.217.raw.xz is working fine, 5.14.21-150500.55.65-default.

Maybe an issue with the SD card?
Comment 21 Andrea della Porta 2024-06-12 09:41:37 UTC
 
> openSUSE-Leap-15.5-ARM-JeOS-raspberrypi.aarch64-2023.03.31-Build13.217.raw.
> xz is working fine, 5.14.21-150500.55.65-default.

Did you test it only on rpi4, or also on rpi3?
Comment 22 Zaoliang Luo 2024-06-12 14:37:43 UTC
(In reply to Andrea della Porta from comment #21)
>  
> > openSUSE-Leap-15.5-ARM-JeOS-raspberrypi.aarch64-2023.03.31-Build13.217.raw.
> > xz is working fine, 5.14.21-150500.55.65-default.
> 
> you tested it on rpi4 or also on rpi3?

yes, both.
Comment 23 Robert Munteanu 2024-06-13 07:29:11 UTC
(In reply to Andrea della Porta from comment #19)
> (In reply to Robert Munteanu from comment #10)
> > To echo the comments from bug #1225787
> > 
> > - kernel 5.14.21-150500.55.52-default was the last known working kernel from
> > me
> > - same system does not boot with
> > kernel-default-5.14.21-150500.55.59.1.aarch64
> 
> 5.14.21-150500.55.52-default does not work for me either. May I ask you how
> did you test the older kernel? Did you just burn an older Leap raw image on
> SD or did you just downgrade the kernel via commandline with something like:
> 
> zypper install --oldpackage kernel-default=5.14.21-150500.55.52.1
> 
> Many thanks

I had the old kernel installed, it was not cleaned up. There was no manipulation of the SD card or reinstallation.
Comment 24 John Paul Adrian Glaubitz 2024-06-13 09:12:00 UTC
FWIW, this is fixed by upgrading the system to openSUSE Leap 15.6 which updates the kernel to version 6.4.x.
Comment 25 Andrea della Porta 2024-06-17 10:29:26 UTC
I confirm that build 13.224 does not work (it hangs as described in this ticket), just like 13.217: in fact they share the exact same kernel.
Some further info: adding modprobe.blacklist=vc4 avoids the issue and lets the Rpi3 boot correctly (albeit without monitor support). On rpi4, everything is ok and the vc4 module does not hang.

Also, on rpi3, this is the ftrace stack when it hangs (on vc4 module only: "echo '*:mod:vc4' > set_ftrace_filter"):

2)              | vc4_drm_register [vc4]() {
 2) + 12.813 us  |   vc4_hvs_dev_probe [vc4]();
 2) + 30.521 us  |   vc4_hdmi_dev_probe [vc4]();
 2) + 54.844 us  |   vc4_vec_dev_probe [vc4]();
 2) + 21.146 us  |   vc4_txp_probe [vc4]();
 2) + 18.855 us  |   vc4_crtc_dev_probe [vc4]();
 2)  7.656 us   |   vc4_crtc_dev_prob

There is a report in this ticket that 15.6 works (I have not tried it myself yet), and between 15.5 (kernel 5.14) and 15.6 (kernel 6.4) there are plenty of kernel commits related to vc4, at least two of which may be worth our attention since they fix crashes: 797d72ce8e0f8 and c86b41214362e8e.

Reassigning to the HW enablement team. 
Many thanks,
Andrea
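
For anyone wanting to repeat the trace: a minimal Python sketch of one way to set up function-graph tracing restricted to vc4, equivalent to the echo above plus selecting the tracer. It assumes tracefs is mounted at /sys/kernel/tracing, that the script runs as root, and that the system was booted with modprobe.blacklist=vc4 so the module can be loaded by hand afterwards. `configure_vc4_trace` is an illustrative name, and this is not necessarily how the trace above was captured.

```python
import tempfile
from pathlib import Path

TRACEFS = Path("/sys/kernel/tracing")  # assumption: default tracefs mount point


def configure_vc4_trace(tracefs: Path = TRACEFS) -> None:
    """Limit ftrace to functions in the vc4 module and enable function_graph."""
    # Same effect as: echo '*:mod:vc4' > set_ftrace_filter
    (tracefs / "set_ftrace_filter").write_text("*:mod:vc4\n")
    (tracefs / "current_tracer").write_text("function_graph\n")
    (tracefs / "tracing_on").write_text("1\n")


if __name__ == "__main__":
    # Dry run against a scratch directory so the sketch can be tried without root.
    with tempfile.TemporaryDirectory() as scratch:
        configure_vc4_trace(Path(scratch))
        for entry in sorted(Path(scratch).iterdir()):
            print(entry.name, "=", entry.read_text().strip())
```

On the affected board one would call configure_vc4_trace() with the default path, run "modprobe vc4" and then read the trace file in the same directory.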
Comment 26 Patrik Jakobsson 2024-06-17 11:26:47 UTC
At a quick glance this might be a mismatch between kernel and DTB. Perhaps the RPi firmware package changed (or needs to be updated)? The firmware lives on the SD-card so I'm not sure how this is updated in QA.

Ivan, since you know the platform better, do you have any thoughts?
Comment 28 Ivan Ivanov 2024-06-18 06:52:07 UTC
Devicetrees for RPis are provided by the raspberrypi-firmware-dt package,
which hasn't been updated in a while (5 months).