Bug 1222290 - snapper list hangs indefinitely with kernel-rt
Summary: snapper list hangs indefinitely with kernel-rt
Status: REOPENED
Alias: None
Product: SUSE Linux Enterprise Real Time 15 SP6
Classification: SUSE Linux Enterprise Real Time Extension
Component: Kernel
Version: unspecified
Hardware: x86-64 SLES 15
Priority: P2 - High, Severity: Normal
Target Milestone: ---
Assignee: Mel Gorman
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-04 05:44 UTC by Petr Cervinka
Modified: 2024-04-11 07:21 UTC
CC List: 5 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
journal (244.60 KB, application/gzip)
2024-04-04 05:44 UTC, Petr Cervinka
Description Petr Cervinka 2024-04-04 05:44:07 UTC
Created attachment 874044
journal

We started validating the 15-SP6 RT product and noticed strange behavior with the snapper list command. It simply hangs:

# snapper list
.... hangs

I left the virtual machine untouched for a couple of hours and nothing changed.
The logs are filled with:

Apr 04 01:23:42 susetest kernel: BTRFS warning (device vda2): qgroup rescan is already in progress
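
To gauge how often this warning is logged, the matching lines can be counted. This is a minimal sketch, assuming the kernel log is supplied on standard input; the function name count_rescan_warnings is illustrative and not part of any existing tool.

```shell
#!/bin/bash
# Count kernel log lines that carry the btrfs warning quoted above.
# Reads log text on stdin and prints the number of matching lines.
count_rescan_warnings() {
    # grep -c prints 0 and exits with status 1 when nothing matches;
    # "|| true" keeps the function usable under "set -e".
    grep -c 'qgroup rescan is already in progress' || true
}
```

On an affected machine this could be fed from the journal, for example: journalctl -k | count_rescan_warnings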

I tried kernel-default on the same system and snapper list finished in a few seconds.
When I booted kernel-rt again (after kernel-default), snapper list worked just fine.

Tested kernel versions:
kernel-default 6.4.0-150600.10-default
kernel-rt 6.4.0-150600.5-rt

I found a similar older issue, https://bugzilla.suse.com/show_bug.cgi?id=1211459, but that one was not reported for RT.
Comment 1 Mel Gorman 2024-04-04 09:54:54 UTC
(In reply to Petr Cervinka from comment #0)
> Tested kernel versions:
> kernel-default 6.4.0-150600.10-default
> kernel-rt 6.4.0-150600.5-rt
> 

rpm-6.4.0-150600.5 is a SLE kernel, not an SLERT kernel. Where was the RPM taken from and what commit ID was it based on? SLERT development had not officially started when that tag was created and I would be surprised if it was included in any SLERT build.

> I found older similar issue
> https://bugzilla.suse.com/show_bug.cgi?id=1211459, but that one was not
> reported for RT.

I tried this using the kernel that should be the Beta1 kernel for SLERT and got this:

rotom:~/:[130]# uname -a && time snapper list
Linux rotom 6.4.0-150600.10-default #1 SMP PREEMPT_DYNAMIC Fri Mar 15 09:32:30 UTC 2024 (6ecedba) x86_64 x86_64 x86_64 GNU/Linux
  # | Type   | Pre # | Date                         | User | Used Space | Cleanup | Description           | Userdata     
----+--------+-------+------------------------------+------+------------+---------+-----------------------+--------------
 0  | single |       |                              | root |            |         | current               |              
 1* | single |       | Sat 23 Mar 2024 20:17:20 CET | root | 531.05 MiB |         | first root filesystem |              
 2  | single |       | Sat 23 Mar 2024 20:28:54 CET | root |  98.42 MiB | number  | after installation    | important=yes
 9  | pre    |       | Sat 23 Mar 2024 20:32:16 CET | root | 384.00 KiB | number  | zypp(zypper)          | important=no 
10  | post   |     9 | Sat 23 Mar 2024 20:32:19 CET | root |   1.95 MiB | number  |                       | important=no 
11  | pre    |       | Sat 23 Mar 2024 20:32:24 CET | root |  64.00 KiB | number  | zypp(zypper)          | important=no 
12  | post   |    11 | Sat 23 Mar 2024 20:32:31 CET | root |   2.57 MiB | number  |                       | important=no 
13  | pre    |       | Sat 23 Mar 2024 20:33:09 CET | root | 812.00 KiB | number  | zypp(zypper)          | important=yes
14  | post   |    13 | Sat 23 Mar 2024 20:36:38 CET | root |  47.84 MiB | number  |                       | important=yes
15  | pre    |       | Sat 23 Mar 2024 20:39:24 CET | root | 256.00 KiB | number  | zypp(zypper)          | important=no 
16  | post   |    15 | Sat 23 Mar 2024 20:39:27 CET | root | 356.00 KiB | number  |                       | important=no 
17  | pre    |       | Sat 23 Mar 2024 20:39:29 CET | root | 128.00 KiB | number  | zypp(zypper)          | important=no 
18  | post   |    17 | Sat 23 Mar 2024 20:39:31 CET | root | 476.00 KiB | number  |                       | important=no 

real    0m12.897s
user    0m0.004s
sys     0m0.011s
rotom:~/:[0]# 
Broadcast message from root@rotom on pts/2 (Thu 2024-04-04 17:51:22 CEST):

The system will reboot now!

Connection to rotom closed by remote host.
Connection to rotom closed.
marvin@perf-vm-lp:~ > ssh root@rotom
Temporary motd until orthos regenerates
Last login: Thu Apr  4 17:05:20 2024 from 10.100.128.112
rotom:~/:[0]# uname -a && time snapper list
Linux rotom 6.4.0-rt-e5213efd8503 #1 SMP PREEMPT_RT Thu Apr  4 17:40:47 CEST 2024 x86_64 x86_64 x86_64 GNU/Linux
  # | Type   | Pre # | Date                         | User | Used Space | Cleanup | Description           | Userdata     
----+--------+-------+------------------------------+------+------------+---------+-----------------------+--------------
 0  | single |       |                              | root |            |         | current               |              
 1* | single |       | Sat 23 Mar 2024 20:17:20 CET | root |   8.32 GiB |         | first root filesystem |              
 2  | single |       | Sat 23 Mar 2024 20:28:54 CET | root |  98.42 MiB | number  | after installation    | important=yes
 9  | pre    |       | Sat 23 Mar 2024 20:32:16 CET | root | 384.00 KiB | number  | zypp(zypper)          | important=no 
10  | post   |     9 | Sat 23 Mar 2024 20:32:19 CET | root |   1.95 MiB | number  |                       | important=no 
11  | pre    |       | Sat 23 Mar 2024 20:32:24 CET | root |  64.00 KiB | number  | zypp(zypper)          | important=no 
12  | post   |    11 | Sat 23 Mar 2024 20:32:31 CET | root |   2.57 MiB | number  |                       | important=no 
13  | pre    |       | Sat 23 Mar 2024 20:33:09 CET | root | 812.00 KiB | number  | zypp(zypper)          | important=yes
14  | post   |    13 | Sat 23 Mar 2024 20:36:38 CET | root |  47.84 MiB | number  |                       | important=yes
15  | pre    |       | Sat 23 Mar 2024 20:39:24 CET | root | 256.00 KiB | number  | zypp(zypper)          | important=no 
16  | post   |    15 | Sat 23 Mar 2024 20:39:27 CET | root | 356.00 KiB | number  |                       | important=no 
17  | pre    |       | Sat 23 Mar 2024 20:39:29 CET | root | 128.00 KiB | number  | zypp(zypper)          | important=no 
18  | post   |    17 | Sat 23 Mar 2024 20:39:31 CET | root | 476.00 KiB | number  |                       | important=no 

real    0m11.133s
user    0m0.013s
sys     0m0.006s

The default and RT kernels both ran "snapper list" to completion at roughly the same speed.
Comment 2 Petr Cervinka 2024-04-04 11:00:57 UTC
We use the repositories https://download.suse.de/ibs/SUSE/Products/SLE-Module-RT/15-SP6/x86_64/product/ and https://dist.suse.de/ibs/SUSE/Products/SLE-Product-RT/15-SP6/x86_64/product/ . They are similar to the 15-SP5 repositories that we used last year.

I assumed the same pattern as for 15-SP5, as development was again pushed so that it would not be delayed much after GM.

If these are not the correct repositories, which ones should be used?
Comment 3 Petr Cervinka 2024-04-04 11:08:51 UTC
(In reply to Mel Gorman from comment #1)
> I tried this using the kernel that should be the Beta1 kernel for SLERT and
> got this
> 
> rotom:~/:[130]# uname -a && time snapper list
> Linux rotom 6.4.0-150600.10-default #1 SMP PREEMPT_DYNAMIC Fri Mar 15

This is kernel-default 6.4.0-150600.10-default, not kernel-rt.

> Linux rotom 6.4.0-rt-e5213efd8503 #1 SMP PREEMPT_RT Thu Apr  4 17:40:47 CEST

This is kernel-rt, but if you booted the kernels in this order, you reproduced the scenario that works (as noted in the description).


If you try kernel-rt first, it will hang. If you boot kernel-default, snapper will not hang. If you then boot kernel-rt (after default), it will work. The problem occurs when kernel-rt is booted by default on a freshly installed system and snapper list is run.
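
To tell a hang apart from a slow run without waiting for hours, the command can be run under a deadline. This is a minimal sketch, assuming GNU coreutils timeout is available; the function name classify_run and the deadline value are illustrative choices, not part of the test setup described here.

```shell
#!/bin/bash
# Run a command under a deadline and classify the result.
# Usage: classify_run SECONDS COMMAND [ARGS...]
# Prints "ok", "hang" (deadline exceeded) or "error" (other non-zero exit).
classify_run() {
    local deadline=$1
    shift
    # GNU timeout exits with status 124 when the deadline expires.
    timeout "$deadline" "$@" >/dev/null 2>&1
    case $? in
        0)   echo ok ;;
        124) echo hang ;;
        *)   echo error ;;
    esac
}
```

For example, classify_run 300 snapper list would report "hang" after five minutes, which is generous given that the working runs in comment 1 took roughly 11 to 13 seconds. If the process is blocked in uninterruptible sleep inside the kernel, the signal sent by timeout may not take effect and the wrapper can block as well.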
Comment 4 Mel Gorman 2024-04-04 12:10:08 UTC
(In reply to Petr Cervinka from comment #3)
> If you try kernel-rt first, it will hang. If you boot kernel-default,
> snapper will not hang. If you boot kernel-rt(after default) it will work.
> Problem is, when you boot kernel-rt by default on freshly installed system
> and try snapper list.

Given that the installer typically installs with the default kernel and then switches, reproducing this exactly may be problematic. I'll make an attempt when Beta1 is released. It's hard to imagine how this is RT-specific, although RT may make it easier to reproduce a bug within btrfs. For example, qgroup rescan is particularly slow if it's lock-intensive, although indefinite starvation is also a possibility.
Comment 5 Mel Gorman 2024-04-04 12:13:15 UTC
Rescanning itself doesn't seem to be overly problematic unless the very first scan is somehow different.

rotom:~/:[0]# time btrfs quota rescan /; time btrfs quota rescan -s /; time btrfs quota rescan -w /
quota rescan started

real    0m0.015s
user    0m0.001s
sys     0m0.008s
rescan operation running (current key 15437825)

real    0m0.003s
user    0m0.000s
sys     0m0.002s

real    0m11.220s
user    0m0.003s
sys     0m0.000s

The last one is the time spent waiting for the rescan to complete. dmesg agrees:

[ 9710.121688] BTRFS warning (device sda3): qgroup rescan is already in progress
[ 9721.338704] BTRFS info (device sda3): qgroup scan completed (inconsistency flag cleared)
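
The two dmesg timestamps can be cross-checked against the wall-clock measurement: 9721.338704 - 9710.121688 is about 11.217 seconds, in line with the 11.220 seconds reported by time for the waiting rescan. A minimal sketch of that subtraction, assuming the default "[seconds.microseconds]" dmesg prefix; the function name dmesg_elapsed is illustrative.

```shell
#!/bin/bash
# Print the seconds elapsed between the first and the last timestamped
# dmesg line read from stdin. Lines without a timestamp prefix are ignored.
dmesg_elapsed() {
    awk '
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the surrounding brackets; "+ 0" forces a numeric value
            # and tolerates the padding spaces after "[".
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (!seen) { first = ts; seen = 1 }
            last = ts
        }
        END { if (seen) printf "%.3f\n", last - first }
    '
}
```

Feeding it the two lines above prints 11.217.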
Comment 6 Petr Cervinka 2024-04-04 12:26:57 UTC
(In reply to Mel Gorman from comment #4)
> 
> Given that the installer typically installs with the default kernel and then
> switches, reproducing this exactly may be problematic. I'll make an attempt
> when Beta1 is released. It's hard to imagine how this is RT-specific
> although RT may make it easier to reproduce a bug within btrfs. For example,
> qgroup rescan is particularly slow if it's lock intensive although
> indefinite starvation is a possibility.

We use image [1], produced by openQA and generated by autoyast, which is probably why we hit this: kernel-rt is already the default there. On the other hand, the same scenario was used for previous SP5/SP4 testing without any issue.

[1] https://openqa.suse.de/tests/13936149/asset/hdd/SLERT-15-SP6-x86_64-Build73.1@64bit-gnome_rt.qcow2
Comment 7 Jeffrey Cheung 2024-04-04 12:31:26 UTC
(In reply to Petr Cervinka from comment #2)
> We use repository
> https://download.suse.de/ibs/SUSE/Products/SLE-Module-RT/15-SP6/x86_64/
> product/ and
> https://dist.suse.de/ibs/SUSE/Products/SLE-Product-RT/15-SP6/x86_64/product/
> . It is similar to 15-SP5 repository, which we used last year.
> 
> I assumed same pattern like for 15-SP5 as development was pushed again to be
> not much delayed after GM.
> 
> If it is not correct repository, which one should be used?

The packages in these repos are outdated. For example, the 15 SP6 GA repo has lttng-module but the RT repo does not. I will ping the Autobuild Team to refresh them.
Comment 9 Petr Cervinka 2024-04-09 07:35:53 UTC
It looks like the problem was caused by unsynced repositories in IBS: the first test run was done with an uncertain kernel version that was missing many fixes.

The project repositories have been synced and the test looks fine: https://openqa.suse.de/tests/13972352#step/installation_snapshots/3

The kernel version is now 6.4.0-150600.1-rt and snapper doesn't hang on snapshot list.

It was not a product or test issue, just a small glitch in the project setup.
Comment 10 Petr Cervinka 2024-04-09 12:11:27 UTC
I think I set it to resolved too early. Unfortunately, the issue was reproduced on the next test run: https://openqa.suse.de/tests/13974430#step/installation_snapshots/6

Later this week I will try to collect more information on how often it happens.
Comment 11 Petr Cervinka 2024-04-10 06:15:36 UTC
I set up a short openQA test scenario that just boots, verifies that we are really running the rt kernel, and lists the snapshots.

The result: 7 of 50 scheduled jobs passed (14%). The issue is sporadic, with a high chance of failure.

Pass: https://openqa.suse.de/tests/13991840
Fail: https://openqa.suse.de/tests/13991846

The failure does not depend on hardware (it occurs across AMD and Intel CPUs of different versions); a job can fail on a worker and the next run on that worker can be fine.
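
The kernel check in such a scenario can be approximated from the uname version string, which contains PREEMPT_RT on the rt kernel and PREEMPT_DYNAMIC on kernel-default in the output shown in comment 1. This is a minimal sketch and not the actual openQA test code; the function name is_rt_kernel is illustrative.

```shell
#!/bin/bash
# Succeed if the given kernel version string belongs to a PREEMPT_RT kernel.
# With no argument, the running kernel's "uname -v" output is inspected.
is_rt_kernel() {
    local version=${1:-$(uname -v)}
    case $version in
        *PREEMPT_RT*) return 0 ;;
        *)            return 1 ;;
    esac
}
```

A reproduction attempt could then be guarded with: is_rt_kernel && snapper list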
Comment 12 Jeffrey Cheung 2024-04-11 07:21:48 UTC
(In reply to Petr Cervinka from comment #9)
> It looks that problem was caused by not-synced repositories in IBS and first
> test run was done with uncertain kernel version, which was missing many
> fixes.
> 
> Project repositories were synced and test looks fine:
> https://openqa.suse.de/tests/13972352#step/installation_snapshots/3
> 
> Kernel version is now 6.4.0-150600.1-rt and snapper doesn't hang on snapshot
> list.
> 
Yes, the kernel version changed because the repo is now refreshed from SUSE:SLE-15-SP6:Update:Products:SLERT instead of SUSE:SLE-15-SP6:GA.

We need to split SLERT into a new project to allow a separate repo refresh.