|
Bugzilla – Full Text Bug Listing |
| Summary: | snapper list hangs indefinitely with kernel-rt | ||
|---|---|---|---|
| Product: | [SUSE Linux Enterprise Real Time Extension] SUSE Linux Enterprise Real Time 15 SP6 | Reporter: | Petr Cervinka <pcervinka> |
| Component: | Kernel | Assignee: | Mel Gorman <mgorman> |
| Status: | REOPENED --- | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P2 - High | CC: | fweisbecker, jcheung, mgorman, pcervinka, tiwai |
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | SLES 15 | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | journal | ||
|
Description
Petr Cervinka
2024-04-04 05:44:07 UTC
(In reply to Petr Cervinka from comment #0)
> Tested kernel versions:
> kernel-default 6.4.0-150600.10-default
> kernel-rt 6.4.0-150600.5-rt

rpm-6.4.0-150600.5 is a SLE kernel, not an SLERT kernel. Where was the RPM taken from and what commit ID was it based on? SLERT development had not officially started when that tag was created and I would be surprised if it was included in any SLERT build.

> I found older similar issue
> https://bugzilla.suse.com/show_bug.cgi?id=1211459, but that one was not
> reported for RT.

I tried this using the kernel that should be the Beta1 kernel for SLERT and got this

    rotom:~/:[130]# uname -a && time snapper list
    Linux rotom 6.4.0-150600.10-default #1 SMP PREEMPT_DYNAMIC Fri Mar 15 09:32:30 UTC 2024 (6ecedba) x86_64 x86_64 x86_64 GNU/Linux
      # | Type   | Pre # | Date                         | User | Used Space | Cleanup | Description           | Userdata
    ----+--------+-------+------------------------------+------+------------+---------+-----------------------+--------------
     0  | single |       |                              | root |            |         | current               |
     1* | single |       | Sat 23 Mar 2024 20:17:20 CET | root | 531.05 MiB |         | first root filesystem |
     2  | single |       | Sat 23 Mar 2024 20:28:54 CET | root |  98.42 MiB | number  | after installation    | important=yes
     9  | pre    |       | Sat 23 Mar 2024 20:32:16 CET | root | 384.00 KiB | number  | zypp(zypper)          | important=no
    10  | post   |     9 | Sat 23 Mar 2024 20:32:19 CET | root |   1.95 MiB | number  |                       | important=no
    11  | pre    |       | Sat 23 Mar 2024 20:32:24 CET | root |  64.00 KiB | number  | zypp(zypper)          | important=no
    12  | post   |    11 | Sat 23 Mar 2024 20:32:31 CET | root |   2.57 MiB | number  |                       | important=no
    13  | pre    |       | Sat 23 Mar 2024 20:33:09 CET | root | 812.00 KiB | number  | zypp(zypper)          | important=yes
    14  | post   |    13 | Sat 23 Mar 2024 20:36:38 CET | root |  47.84 MiB | number  |                       | important=yes
    15  | pre    |       | Sat 23 Mar 2024 20:39:24 CET | root | 256.00 KiB | number  | zypp(zypper)          | important=no
    16  | post   |    15 | Sat 23 Mar 2024 20:39:27 CET | root | 356.00 KiB | number  |                       | important=no
    17  | pre    |       | Sat 23 Mar 2024 20:39:29 CET | root | 128.00 KiB | number  | zypp(zypper)          | important=no
    18  | post   |    17 | Sat 23 Mar 2024 20:39:31 CET | root | 476.00 KiB | number  |                       | important=no

    real    0m12.897s
    user    0m0.004s
    sys     0m0.011s
    rotom:~/:[0]#
    Broadcast message from root@rotom on pts/2 (Thu 2024-04-04 17:51:22 CEST):

    The system will reboot now!

    Connection to rotom closed by remote host.
    Connection to rotom closed.
    marvin@perf-vm-lp:~ > ssh root@rotom
    Temporary motd until orthos regenerates
    Last login: Thu Apr 4 17:05:20 2024 from 10.100.128.112
    rotom:~/:[0]# uname -a && time snapper list
    Linux rotom 6.4.0-rt-e5213efd8503 #1 SMP PREEMPT_RT Thu Apr 4 17:40:47 CEST 2024 x86_64 x86_64 x86_64 GNU/Linux
      # | Type   | Pre # | Date                         | User | Used Space | Cleanup | Description           | Userdata
    ----+--------+-------+------------------------------+------+------------+---------+-----------------------+--------------
     0  | single |       |                              | root |            |         | current               |
     1* | single |       | Sat 23 Mar 2024 20:17:20 CET | root |   8.32 GiB |         | first root filesystem |
     2  | single |       | Sat 23 Mar 2024 20:28:54 CET | root |  98.42 MiB | number  | after installation    | important=yes
     9  | pre    |       | Sat 23 Mar 2024 20:32:16 CET | root | 384.00 KiB | number  | zypp(zypper)          | important=no
    10  | post   |     9 | Sat 23 Mar 2024 20:32:19 CET | root |   1.95 MiB | number  |                       | important=no
    11  | pre    |       | Sat 23 Mar 2024 20:32:24 CET | root |  64.00 KiB | number  | zypp(zypper)          | important=no
    12  | post   |    11 | Sat 23 Mar 2024 20:32:31 CET | root |   2.57 MiB | number  |                       | important=no
    13  | pre    |       | Sat 23 Mar 2024 20:33:09 CET | root | 812.00 KiB | number  | zypp(zypper)          | important=yes
    14  | post   |    13 | Sat 23 Mar 2024 20:36:38 CET | root |  47.84 MiB | number  |                       | important=yes
    15  | pre    |       | Sat 23 Mar 2024 20:39:24 CET | root | 256.00 KiB | number  | zypp(zypper)          | important=no
    16  | post   |    15 | Sat 23 Mar 2024 20:39:27 CET | root | 356.00 KiB | number  |                       | important=no
    17  | pre    |       | Sat 23 Mar 2024 20:39:29 CET | root | 128.00 KiB | number  | zypp(zypper)          | important=no
    18  | post   |    17 | Sat 23 Mar 2024 20:39:31 CET | root | 476.00 KiB | number  |                       | important=no

    real    0m11.133s
    user    0m0.013s
    sys     0m0.006s

The default and RT kernel ran "snapper list" to completion at roughly the same speed.

We use the repositories https://download.suse.de/ibs/SUSE/Products/SLE-Module-RT/15-SP6/x86_64/product/ and https://dist.suse.de/ibs/SUSE/Products/SLE-Product-RT/15-SP6/x86_64/product/ . They are similar to the 15-SP5 repositories, which we used last year.

I assumed the same pattern as for 15-SP5, as development was pushed again so as not to be much delayed after GM.

If this is not the correct repository, which one should be used?

(In reply to Mel Gorman from comment #1)
> I tried this using the kernel that should be the Beta1 kernel for SLERT and
> got this
>
> rotom:~/:[130]# uname -a && time snapper list
> Linux rotom 6.4.0-150600.10-default #1 SMP PREEMPT_DYNAMIC Fri Mar 15

This is kernel-default 6.4.0-150600.10-default, not kernel-rt.

> Linux rotom 6.4.0-rt-e5213efd8503 #1 SMP PREEMPT_RT Thu Apr 4 17:40:47 CEST

This is kernel-rt, but if you did it in this order, you reproduced the scenario which works (as stated in the description).

If you try kernel-rt first, it will hang. If you boot kernel-default, snapper will not hang. If you boot kernel-rt (after default), it will work. The problem occurs when you boot kernel-rt by default on a freshly installed system and try snapper list.

(In reply to Petr Cervinka from comment #3)
> If you try kernel-rt first, it will hang. If you boot kernel-default,
> snapper will not hang. If you boot kernel-rt(after default) it will work.
> Problem is, when you boot kernel-rt by default on freshly installed system
> and try snapper list.

Given that the installer typically installs with the default kernel and then switches, reproducing this exactly may be problematic. I'll make an attempt when Beta1 is released. It's hard to imagine how this is RT-specific although RT may make it easier to reproduce a bug within btrfs. For example, qgroup rescan is particularly slow if it's lock intensive although indefinite starvation is a possibility.

Rescanning itself doesn't seem to be overly problematic unless the very first scan is somehow different.

    rotom:~/:[0]# time btrfs quota rescan /; time btrfs quota rescan -s /; time btrfs quota rescan -w /
    quota rescan started

    real    0m0.015s
    user    0m0.001s
    sys     0m0.008s
    rescan operation running (current key 15437825)

    real    0m0.003s
    user    0m0.000s
    sys     0m0.002s

    real    0m11.220s
    user    0m0.003s
    sys     0m0.000s

The last one is the time waiting for the rescan to complete. dmesg agrees with

    [ 9710.121688] BTRFS warning (device sda3): qgroup rescan is already in progress
    [ 9721.338704] BTRFS info (device sda3): qgroup scan completed (inconsistency flag cleared)

(In reply to Mel Gorman from comment #4)
>
> Given that the installer typically installs with the default kernel and then
> switches, reproducing this exactly may be problematic. I'll make an attempt
> when Beta1 is released. It's hard to imagine how this is RT-specific
> although RT may make it easier to reproduce a bug within btrfs. For example,
> qgroup rescan is particularly slow if it's lock intensive although
> indefinite starvation is a possibility.

We use image [1] produced by openQA and generated by autoyast; that is probably why we hit it, as we already have kernel-rt as the default. On the other hand, the same scenario was used for previous SP5/SP4 testing and didn't have any issue.

[1] https://openqa.suse.de/tests/13936149/asset/hdd/SLERT-15-SP6-x86_64-Build73.1@64bit-gnome_rt.qcow2

(In reply to Petr Cervinka from comment #2)
> We use repository
> https://download.suse.de/ibs/SUSE/Products/SLE-Module-RT/15-SP6/x86_64/
> product/ and
> https://dist.suse.de/ibs/SUSE/Products/SLE-Product-RT/15-SP6/x86_64/product/
> . It is similar to 15-SP5 repository, which we used last year.
>
> I assumed same pattern like for 15-SP5 as development was pushed again to be
> not much delayed after GM.
>
> If it is not correct repository, which one should be used?

The packages in these repos are outdated. For example, the 15 SP6 GA repo has lttng-module but the RT repo does not. I will ping the Autobuild Team to refresh them.

It looks like the problem was caused by unsynced repositories in IBS, and the first test run was done with an uncertain kernel version that was missing many fixes.

The project repositories were synced and the test looks fine:
https://openqa.suse.de/tests/13972352#step/installation_snapshots/3

The kernel version is now 6.4.0-150600.1-rt and snapper doesn't hang on snapshot list.

It was not a product or test issue, just a small glitch in the project setup.

I think I set it to resolved too early. Unfortunately, the issue was reproduced on the next test run: https://openqa.suse.de/tests/13974430#step/installation_snapshots/6.

I will try to collect more information on how often it happens later this week.

I set up a short openQA test scenario which just boots, checks that we really run the rt kernel and does a snapshot list.

The ratio is 7 passed jobs out of 50 scheduled jobs. The issue is sporadic with a high chance of failure.

Pass: https://openqa.suse.de/tests/13991840
Fail: https://openqa.suse.de/tests/13991846

Failure does not depend on hardware (it occurs across all AMD and Intel CPUs and different versions); a job can fail on a worker and the next run on the same worker can be fine.

(In reply to Petr Cervinka from comment #9)
> It looks that problem was caused by not-synced repositories in IBS and first
> test run was done with uncertain kernel version, which was missing many
> fixes.
>
> Project repositories were synced and test looks fine:
> https://openqa.suse.de/tests/13972352#step/installation_snapshots/3
>
> Kernel version is now 6.4.0-150600.1-rt and snapper doesn't hang on snapshot
> list.
>

Yes, the kernel version changed because the repo is now refreshed based on SUSE:SLE-15-SP6:Update:Products:SLERT instead of SUSE:SLE-15-SP6:GA. We need to split SLERT into a new project to facilitate the separate repo refresh. |
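
For anyone trying to reproduce the sporadic hang outside openQA, one option is to run the command under a time limit so that a hang is reported instead of blocking the session. The following is only a sketch under stated assumptions, not the openQA test used in this bug: `check_hang` is a hypothetical helper name, the 60-second limit in the usage comment is arbitrary, and the only tool relied on is coreutils `timeout`, which exits with 124 when the limit expires and 137 if it had to send SIGKILL.

```shell
#!/bin/sh
# Hypothetical helper (not part of snapper or openQA): run a command under a
# time limit and classify the outcome as pass, hang, or fail.
# Usage: check_hang <limit-in-seconds> <command> [args...]
check_hang() {
    limit="$1"
    shift
    # timeout(1) sends SIGTERM after $limit seconds and exits with 124.
    # -k 5 follows up with SIGKILL 5 seconds later if the command is still
    # running, in which case timeout exits with 137.
    timeout -k 5 "$limit" "$@" >/dev/null 2>&1
    status=$?
    case "$status" in
        0)       echo "pass" ;;
        124|137) echo "hang" ;;
        *)       echo "fail ($status)" ;;
    esac
}

# Example on the system under test (60 s is an assumed limit, well above the
# 11 to 13 seconds measured earlier in this bug):
#   uname -r; check_hang 60 snapper list
```

Because the failure appears to depend on the boot rather than on the individual invocation, the check would need to run once per fresh boot of kernel-rt to say anything about the failure rate. Note also that if the process is stuck in uninterruptible sleep inside the kernel, even SIGKILL may not take effect, so the helper itself could block; in that case a task dump from the journal is likely more useful than the exit status.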