Bug 1222346

Summary: snapperd can't work
Product: [openSUSE] openSUSE Tumbleweed Reporter: Dmitry Markov <werwolf131313>
Component: Kernel:FilesystemsAssignee: Wenruo Qu <wqu>
Status: REOPENED --- QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: aschnell, rgoldwyn, werwolf131313
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: Other   
See Also: https://bugzilla.suse.com/show_bug.cgi?id=1206814
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Dmitry Markov 2024-04-05 04:07:51 UTC
This morning I noticed that my home PC is loading 5 out of 20 cores without running application software.
In htop I found that the only consumer of CPU time is snapperd.
The same lines appear in the /var/log/snapper.log file several times a second:
> 2024-04-05 08:57:46 WAR libsnapper(1375) BtrfsUtils.cc(quota_rescan):474 - waiting for old quota rescan to finish
> 2024-04-05 08:57:46 WAR libsnapper(1375) BtrfsUtils.cc(quota_rescan):474 - waiting for old quota rescan to finish
> 2024-04-05 08:57:46 WAR libsnapper(1375) BtrfsUtils.cc(quota_rescan):474 - waiting for old quota rescan to finish

I tried:
> [werwolf@home] ~  
> ❯ sudo btrfs quota disable / && sudo systemctl restart snapperd

the CPU load ended, but after a couple of hours I again found exactly the same behavior with the same messages in the log.
Comment 1 Dmitry Markov 2024-04-05 04:28:06 UTC
additionally: the sudo snapper list command has been running for about 20 minutes and is not showing any results.
Comment 2 Dmitry Markov 2024-04-08 07:05:02 UTC
unfortunately, after 2 days the problem went away and no longer appears, I still could not understand what this problem was connected with.
I also don’t know what to do with this bug report now.
Comment 3 Dmitry Markov 2024-04-15 03:24:46 UTC
The problem didn't happen again, I'm closing it
Comment 4 Dmitry Markov 2024-05-07 05:43:32 UTC
the problem repeated itself.
in a few minutes /var/log/snapper.log began to take up 70 gigabytes and occupied the entire system partition.
Comment 5 Arvin Schnell 2024-05-07 08:48:22 UTC
I have already added a sleep in snapper to avoid the excessive logging
(see https://github.com/openSUSE/snapper/issues/892 - not included in
Tumbleweed yet).

Anyway, btrfs seems to have a problem: snapper does a BTRFS_IOC_QUOTA_RESCAN
and if that reports EINPROGRESS it waits using BTRFS_IOC_QUOTA_RESCAN_WAIT.
But AFAIS the RESCAN_WAIT returns immediately.

Also, the rescan itself takes apparently more that 20 minutes. snapper
is waiting for the rescan to finish. Whether such a long time to rescan
is a bug in btrfs I cannot say.

Also see https://bugzilla.suse.com/show_bug.cgi?id=1206814#c11.
Comment 6 Wenruo Qu 2024-05-21 22:26:56 UTC
I'm wondering if snapper is not handling rescan correctly.

Yes, qgroup is very heavy, but that's so far the only way to get accurate accounting numbers, considering how many tricks we put to get super fast snapshotting.
All the super fast snapshotting has put a lot of complexity to qgroups.
Not to mention the extra complexity from multi-level qgroups, utilized by snapper.

But for the RESCAN_WAIT returns immeidately problem, I think it's something from the snapper side.
Snapper should check if qgroup is even enabled and rescan is running.

Or RESCAN_WAIT ioctl immediately return 0 if qgroup rescan is not running nor qgroup is enabled.
Any clue how snapper triggers qgroup rescan? Did it check if qgroup is enabled or handles qgroup rescan return value correctly?
Comment 7 Arvin Schnell 2024-05-22 07:06:20 UTC
snapper does a BTRFS_IOC_QUOTA_RESCAN and if that fails and errno is
EINPROGRESS it does a BTRFS_IOC_QUOTA_RESCAN_WAIT. And only if that
succeeds it calls BTRFS_IOC_QUOTA_RESCAN again. For details:

https://github.com/openSUSE/snapper/blob/master/snapper/BtrfsUtils.cc#L464