Bug 1222313 - btrfs storage discrepancy between used and free space of about 1/3 of the total disk capacity
Summary: btrfs storage discrepancy between used and free space of about 1/3 of the tot...
Status: NEW
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel:Filesystems
Version: Current
Hardware: Other OS: Other
Priority: P5 - None Severity: Normal
Target Milestone: ---
Assignee: Wenruo Qu
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-04 12:27 UTC by Felix Niederwanger
Modified: 2024-04-09 09:01 UTC
CC: 11 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
btrfs fi df (201 bytes, text/plain)
2024-04-04 12:27 UTC, Felix Niederwanger
btrfs fi du (951 bytes, text/plain)
2024-04-04 12:27 UTC, Felix Niederwanger
btrfs subvolume list (911 bytes, text/plain)
2024-04-04 12:28 UTC, Felix Niederwanger
btrfs fi usage (762 bytes, text/plain)
2024-04-04 12:28 UTC, Felix Niederwanger
df -h (1.01 KB, text/plain)
2024-04-04 12:28 UTC, Felix Niederwanger
du -hs (301 bytes, text/plain)
2024-04-04 12:28 UTC, Felix Niederwanger
ls -al (1.06 KB, text/plain)
2024-04-04 12:28 UTC, Felix Niederwanger
lsblk (1.34 KB, text/plain)
2024-04-04 12:29 UTC, Felix Niederwanger
snapper ls (1.15 KB, text/plain)
2024-04-04 12:29 UTC, Felix Niederwanger

Description Felix Niederwanger 2024-04-04 12:27:04 UTC
I have a btrfs filesystem of about 1 TiB on my laptop where I'm missing about 1/3 of the disk's capacity.

According to the output of `du`, my data should occupy about 540 GiB of storage space. This includes snapper snapshots and ignores possible savings from shared extents, so it is a generous upper limit. However, `df` reports that 778 GiB are currently in use, leaving a discrepancy of about 240 GiB, or almost 1/4 of the total SSD capacity, that I am unable to account for. If I take the used space reported by snapper instead of the `du` output of /.snapshots, the discrepancy increases to 314 GiB, or about 1/3 of the total SSD capacity.
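For the record, the back-of-the-envelope accounting behind those numbers (a sketch using the rounded figures from the attached `du` and `df` output):

```shell
#!/bin/sh
# Rough accounting of the reported discrepancy, in GiB.
# Figures are the rounded values from the du/df attachments.
du_data=459        # du of / excluding /.snapshots
du_snapshots=80    # du of /.snapshots
df_used=778        # "Used" column reported by df

echo "du total:    $((du_data + du_snapshots)) GiB"
echo "discrepancy: $((df_used - du_data - du_snapshots)) GiB"
```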

I wrote to the research@suse.de mailing list on Tuesday (https://mailman.suse.de/mlarch/SuSE/research/2024/research.2024.04/msg00000.html) and we were unable to find where the missing storage went. Two other users confirmed that they hit a similar issue, one on-list and one in a private conversation. In both cases they report increased disk usage with no obvious consumer present.

To exclude the known btrfs metadata bug in kernel 6.7, I ran a full balance this week and also checked the output of `btrfs fi df`, which shows that metadata occupies only 4 GiB. The balance didn't change the overall picture.

I will attach all collected logs and information as requested in the mail thread.

## System description

Running Tumbleweed 20240402 with Kernel 6.8.1-1-default.

I'm using the full-disk-encryption layout as suggested by the installer in November 2023.
This means a single btrfs filesystem atop a LUKS-encrypted LVM volume.

I run scrubs manually and otherwise left the btrfs maintenance scripts in their default configuration (balance/trim enabled).
Comment 1 Felix Niederwanger 2024-04-04 12:27:30 UTC
Created attachment 874057 [details]
btrfs fi df
Comment 2 Felix Niederwanger 2024-04-04 12:27:53 UTC
Created attachment 874058 [details]
btrfs fi du
Comment 3 Felix Niederwanger 2024-04-04 12:28:05 UTC
Created attachment 874059 [details]
btrfs subvolume list
Comment 4 Felix Niederwanger 2024-04-04 12:28:19 UTC
Created attachment 874060 [details]
btrfs fi usage
Comment 5 Felix Niederwanger 2024-04-04 12:28:32 UTC
Created attachment 874061 [details]
df -h
Comment 6 Felix Niederwanger 2024-04-04 12:28:45 UTC
Created attachment 874062 [details]
du -hs
Comment 7 Felix Niederwanger 2024-04-04 12:28:56 UTC
Created attachment 874063 [details]
ls -al
Comment 8 Felix Niederwanger 2024-04-04 12:29:05 UTC
Created attachment 874064 [details]
lsblk
Comment 9 Felix Niederwanger 2024-04-04 12:29:14 UTC
Created attachment 874065 [details]
snapper ls
Comment 10 Felix Niederwanger 2024-04-04 12:43:30 UTC
du: 459GiB + 80GiB (snapshots)

> 80G	/.snapshots
> 0	/dev
> 142G	/home
> 224K	/opt
> 0	/proc
> 23M	/root
> 2.6M	/run
> 304G	/srv
> 0	/sys
> 4.0K	/tmp
> 13G	/var

df: 778G

> /dev/mapper/system-root  932G  778G  148G  85% /.snapshots
> /dev/mapper/system-root  932G  778G  148G  85% /boot/grub2/i386-pc
> /dev/mapper/system-root  932G  778G  148G  85% /boot/grub2/x86_64-efi
> /dev/mapper/system-root  932G  778G  148G  85% /home
> /dev/mapper/system-root  932G  778G  148G  85% /opt
> /dev/mapper/system-root  932G  778G  148G  85% /usr/local
> /dev/mapper/system-root  932G  778G  148G  85% /srv
> /dev/mapper/system-root  932G  778G  148G  85% /root
> /dev/mapper/system-root  932G  778G  148G  85% /var

btrfs fi df: 769GiB

> Data, single: total=794.00GiB, used=769.06GiB
> System, DUP: total=32.00MiB, used=128.00KiB
> Metadata, DUP: total=7.00GiB, used=4.03GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

snapper:


>    # | Type   | Pre # | Date                     | User | Used Space | Cleanup | Description                      | Userdata    
> -----+--------+-------+--------------------------+------+------------+---------+----------------------------------+-------------
>   0  | single |       |                          | root |            |         | current                          |             
>   1* | single |       | Tue Nov  7 14:02:39 2023 | root |   1.22 MiB |         | first root filesystem            |             
>   3  | single |       | Tue Nov  7 14:23:38 2023 | root |   5.38 GiB |         | Fresh                            |             
> 298  | single |       | Tue Apr  2 08:52:47 2024 | root | 286.58 MiB |         | TW 20240329 - after libzma vuln  |             
> 301  | pre    |       | Thu Apr  4 08:03:38 2024 | root |  85.66 MiB | number  | zypp(zypper)                     | important=no
> 302  | post   |   301 | Thu Apr  4 08:04:52 2024 | root |  19.06 MiB | number  |                                  | important=no
> 303  | single |       | Thu Apr  4 13:37:29 2024 | root | 944.00 KiB |         | TW 20240402 - after liblzma vuln |             

However I count it, a considerable portion of the disk's storage capacity is always being eaten away by something.
Comment 11 Wenruo Qu 2024-04-04 21:24:42 UTC
If you have some random IO workload, it's very possible that btrfs bookend extents are causing the problem.

Furthermore, if you have truncated files (which were previously very large or preallocated), they can also take tons of unexpected space.
Another point: preallocation (fallocate) is very btrfs-unfriendly. If you have something like VM images, you'd be much better off disabling snapshots of that subvolume and setting the NOCOW flag on them.
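The bookend effect can be sketched with a toy calculation (invented numbers; a simplified model, not btrfs's actual accounting): an extent written in one go stays fully allocated as long as any file still references even a small slice of it.

```shell
#!/bin/sh
# Simplified model of a bookend extent (illustrative numbers only):
# a large extent remains allocated in full until the last byte of it
# is unreferenced, even if most of it was later overwritten.
extent_mib=128      # size of the originally written extent
referenced_mib=8    # slice still referenced after random overwrites
pinned=$((extent_mib - referenced_mib))
echo "still allocated but unreachable: ${pinned} MiB"
```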

But since you have snapshots, the normal way to solve the problem (defrag) is not suitable, as it would break the shared extents and cause extra space usage.

It's recommended to delete all unnecessary snapshots and try `btrfs fi defrag` to see if it helps (a sync is needed after a full defrag).
There is a limitation, though: truncated files may not be defragged that well.
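A dry-run sketch of that remediation (the snapshot range is a hypothetical example, and the `run` wrapper only prints each command so the script is safe to execute anywhere; drop the wrapper to run for real):

```shell
#!/bin/sh
# Dry-run sketch of the suggested remediation. Snapshot numbers are
# hypothetical; 'run' prints each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

run snapper delete 3-302                   # drop unnecessary snapshots first
run btrfs filesystem defragment -r -v /    # then defragment recursively
run sync                                   # sync after a full defrag
```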

If regular defrag (after deleting all snapshots) is not helping, you may want to try the following patches:

- kernel part to enhance defrag:
  https://lore.kernel.org/linux-btrfs/cover.1710213625.git.wqu@suse.com/

- btrfs-progs support
  https://lore.kernel.org/linux-btrfs/cover.1710214834.git.wqu@suse.com/
Comment 12 Felix Niederwanger 2024-04-08 08:35:54 UTC
(In reply to Wenruo Qu from comment #11)
> If you have some random IO workload, it's very possible that btrfs bookend
> extents are causing the problem.
> 
> Furthermore, if you have truncated files (which were previously very large or
> preallocated), they can also take tons of unexpected space.
> Another point: preallocation (fallocate) is very btrfs-unfriendly. If you
> have something like VM images, you'd be much better off disabling snapshots
> of that subvolume and setting the NOCOW flag on them.

I believe this could be the reason. I store my VM disk images on /srv, which by default does not have the NOCOW flag set.

I'll try to delete and restore the VM disk images in question after setting the flag, and then report back; this is going to take some time.
Comment 13 Felix Niederwanger 2024-04-08 09:43:55 UTC
(In reply to Wenruo Qu from comment #11)
> If you have some random IO workload, it's very possible that btrfs bookend
> extents are causing the problem.

Is there a way to check the bookend extents of a certain file? I have a bunch of VM images that could be the issue, but I'd like to check this hypothesis.
Comment 14 Wenruo Qu 2024-04-08 10:13:17 UTC
(In reply to Felix Niederwanger from comment #13)
> (In reply to Wenruo Qu from comment #11)
> > If you have some random IO workload, it's very possible that btrfs bookend
> > extents are causing the problem.
> 
> Is there a way to check the bookend extents of a certain file? I have a bunch
> of VM images that could be the issue, but I'd like to check this hypothesis.

Pretty hard; we do not have any good way to check that.
There are tools like compsize, which uses the TREE_SEARCH ioctl to examine each file extent, but compsize is not designed to check the bookend wasted bytes, so it doesn't help much here.

A more convenient way is to defrag that subvolume (as long as that subvolume is not snapshotted).

We're moving towards enhancing the fiemap ioctl to export more info, but that may take years.
Meanwhile we may want to develop a tool to do the bookend accounting soon, since it's not the first time an end user has complained about it.
Comment 15 Felix Niederwanger 2024-04-08 13:29:10 UTC
Moving the disk images to an external medium and back made a HUGE difference - I got almost 200 GiB back:

> /dev/mapper/system-root  932G  586G  338G  64% /.snapshots
> /dev/mapper/system-root  932G  586G  338G  64% /boot/grub2/i386-pc
> /dev/mapper/system-root  932G  586G  338G  64% /boot/grub2/x86_64-efi
> /dev/mapper/system-root  932G  586G  338G  64% /home
> /dev/mapper/system-root  932G  586G  338G  64% /opt
> /dev/mapper/system-root  932G  586G  338G  64% /root
> /dev/mapper/system-root  932G  586G  338G  64% /srv
> /dev/mapper/system-root  932G  586G  338G  64% /usr/local
> /dev/mapper/system-root  932G  586G  338G  64% /var

I have now also disabled COW for the directory in question via `chattr +C -R /srv` and hope this prevents the problem from recurring.
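One caveat worth noting (a sketch with hypothetical paths): `chattr +C` only takes full effect on files that are empty when the flag is set, so existing images need to be copied into a directory that already carries the flag. The `run` wrapper below only prints the commands, so the script is safe to execute anywhere:

```shell
#!/bin/sh
# Dry-run sketch: +C on a directory is inherited by files created inside it,
# while setting +C on an existing non-empty file does not convert its data.
# Paths are hypothetical; 'run' prints each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

run mkdir /srv/vm-images.nocow
run chattr +C /srv/vm-images.nocow                        # new files inherit NOCOW
run cp --reflink=never /backup/vm1.qcow2 /srv/vm-images.nocow/
```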

The "storage hole" was my VM disk images, which I keep on this laptop for testing. All VMs are updated once per day during lunchtime, and I keep them until the products reach EOL.

We apply the +C attribute to the /var partition by default, and this is also the case here. I wonder whether it would make sense to apply the same default to /srv, which likely holds similar kinds of data to /var.
Comment 16 Martin Wilck 2024-04-09 09:01:06 UTC
I recall that btrfs is generally not recommended for storing VM images. I'm using XFS for this kind of thing.