Bug 1219829

Summary: EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
Product: [openSUSE] openSUSE Tumbleweed Reporter: Ruediger Oertel <ro>
Component: Kernel Assignee: Luis Henriques <lhenriques>
Status: NEW --- QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: ada.lovelace, dmueller, ihno, jack, marcela.maslanova, rgoldwyn, ro, tiwai
Version: Current   
Target Milestone: ---   
Hardware: S/390-64   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Ruediger Oertel 2024-02-12 10:43:12 UTC
trying to run the IBS LPARs on 6.7.4

Linux version 6.7.4-2.gc38a620-default (geeko@buildhost) (gcc (SUSE Linux) 13.2.1 20240125 [revision fc7d87e0ffadca49bec29b2107c1efd0da6b6ded]

# worker creates a filesystem:
mke2fs -t ext4 -O ^has_journal -F /dev/mapper/$DMTARG

# creates the fstab entry and mounts it:
echo "/dev/mapper/$DMTARG $OBS_WORKER_DIRECTORY ext4 noatime,nodiratime,discard,nobarrier,async 1 2" >> /etc/fstab

mkdir -p $OBS_WORKER_DIRECTORY
mount $OBS_WORKER_DIRECTORY

almost directly afterwards you can see fs errors:

[  171.720733] EXT4-fs (dm-5): mounting with "discard" option, but the device does not support discard
[  171.720743] EXT4-fs (dm-5): mounted filesystem b286e1d2-cc6a-4990-9f4a-6a66bb7df2ca r/w without journal. Quota mode: none.
[  171.722381] sysrq: Changing Loglevel
[  171.722385] sysrq: Loglevel set to 7
[  225.920112] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  225.920565] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  225.957830] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  225.958456] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  225.958901] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  225.959304] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  225.968457] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  226.056499] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  226.118631] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  226.134514] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
[  226.654747] loop0: detected capacity change from 0 to 104857600
[  226.657919] EXT4-fs (loop0): mounted filesystem 92fdfcd8-945b-46e6-9851-d28bbbb112f7 r/w without journal. Quota mode: none.
Comment 1 Luis Henriques 2024-02-12 14:37:00 UTC
> # worker creates a filesystem:
> mke2fs -t ext4 -O ^has_journal -F /dev/mapper/$DMTARG

Can you provide details on what this $DMTARG is?  Is it a real device, luks, ...?

> [  171.720733] EXT4-fs (dm-5): mounting with "discard" option, but the
> device does not support discard
> [  171.720743] EXT4-fs (dm-5): mounted filesystem
> b286e1d2-cc6a-4990-9f4a-6a66bb7df2ca r/w without journal. Quota mode: none.
> [  171.722381] sysrq: Changing Loglevel
> [  171.722385] sysrq: Loglevel set to 7
> [  225.920112] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  225.920565] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  225.957830] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  225.958456] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  225.958901] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  225.959304] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  225.968457] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  226.056499] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  226.118631] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  226.134514] EXT4-fs error (device dm-5) in ext4_mb_clear_bb:6517: error 95
> [  226.654747] loop0: detected capacity change from 0 to 104857600

If $DMTARG is this^^^ loop device, then this looks odd, because the line above should have been seen *before* anything else, right?

Anyway, I'll see if I can reproduce it locally.

> [  226.657919] EXT4-fs (loop0): mounted filesystem
> 92fdfcd8-945b-46e6-9851-d28bbbb112f7 r/w without journal. Quota mode: none.
Comment 2 Ruediger Oertel 2024-02-12 15:40:21 UTC
no the loop device resides in files on top of the filesystem on dm-5 ($DMTARG)

the layout looks like this:
s390zl31:~ # multipath -ll 3600507630bffd216000000000000201a
3600507630bffd216000000000000201a dm-3 IBM,2107900
size=512G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:2:2 sdd 8:48  active ready running
  `- 1:0:2:2 sdl 8:176 active ready running
s390zl31:~ # df -hT /var/cache/obs/worker/
Filesystem                                    Type  Size  Used Avail Use% Mounted on
/dev/mapper/3600507630bffd216000000000000201a ext4  504G   42G  438G   9% /var/cache/obs/worker


2-4 paths to a multipath device map, and the filesystem lives directly on that block device. inside the filesystem we create directories and files, and some of these then work as loopback files (used as root/swap for VMs of the build processes).
Comment 3 Jan Kara 2024-02-12 16:12:43 UTC
Well, this one looks simple (and harmless). You have mount options "noatime,nodiratime,discard,nobarrier,async" - in particular the "discard" option is important. This means that without a journal ext4_mb_clear_bb() tries to issue discard requests for each freed extent. And the underlying storage apparently doesn't support discard and so we end up with the error 95 which is EOPNOTSUPP.
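
A minimal sketch of how a worker setup script could check for discard support before adding the option, assuming the block layer reports it through queue/discard_max_bytes in sysfs (0 meaning discards are not supported). The function name supports_discard and its optional second argument (an alternate sysfs root, only there so the check can be tried against a fake directory tree) are hypothetical and not part of the worker scripts:

```shell
# supports_discard DEV [SYSFS_ROOT]
# Returns 0 if DEV (e.g. dm-5) advertises a non-zero discard_max_bytes,
# non-zero otherwise (including when the attribute cannot be read).
# SYSFS_ROOT defaults to /sys/block; it is a hypothetical test hook.
supports_discard() {
    sd_file="${2:-/sys/block}/$1/queue/discard_max_bytes"
    [ -r "$sd_file" ] && [ "$(cat "$sd_file")" -gt 0 ]
}
```

For example, `supports_discard dm-5 || echo "no discard on dm-5"` would let the script skip the option on storage like the multipath device here.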

Now I agree ext4 should not hog the log with these errors, but ultimately the easiest fix is to drop 'discard' from the mount options. Generally I seriously doubt the 'discard' option is a good choice for your storage, because for most storage types these small discards hurt performance instead of helping it. Calling fstrim once a day or so tends to be a much better choice.
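
As a rough illustration of that suggestion, the mount options from the description could be filtered before they are written to /etc/fstab. This is only a sketch: strip_discard is a hypothetical helper name, and the periodic fstrim call in the comment is one possible replacement, not something the worker scripts are known to do:

```shell
# strip_discard OPTS
# Prints the comma-separated mount option list OPTS with any "discard"
# entry removed; all other options are kept in their original order.
strip_discard() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -vxF 'discard' | paste -sd, -
}

# Hypothetical usage in the fstab step:
#   opts=$(strip_discard "noatime,nodiratime,discard,nobarrier,async")
#   echo "/dev/mapper/$DMTARG $OBS_WORKER_DIRECTORY ext4 $opts 1 2" >> /etc/fstab
# and, on storage that does support discard, a periodic trim instead:
#   fstrim -v "$OBS_WORKER_DIRECTORY"
```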
Comment 4 Ruediger Oertel 2024-02-12 16:46:32 UTC
well, we added discard all across the board, but since s390 has its own file creating the fs and writing the mount options I can just drop this.

so the only real change is that the EOPNOTSUPP is logged and before it was just being ignored ... thanks for looking at this!
Comment 5 Jan Kara 2024-02-12 19:15:15 UTC
OK, I'd question why "discard" was added across the board. Do you have some evaluation showing it actually benefits anything? Because I have a hard time remembering where the "discard" mount option was actually a net win over all those years.

Regarding EOPNOTSUPP not being logged before - I'm not sure what the "before" state was. Without the "discard" option, no discard was sent to the underlying device so sure, no error was reported. Similarly, if the ext4 filesystem uses journalling, the discard actually happens at a different place and the EOPNOTSUPP error happens to be silent.

I'll send a fix upstream to make this consistent in ext4 (i.e. silence the EOPNOTSUPP error).
Comment 6 Ruediger Oertel 2024-02-13 10:16:15 UTC
well, our usage pattern is that any use of this filesystem is mostly write-only
we copy all the packages into that fs, have loopback files on there and at the end of the buildjob just the resulting rpms are extracted and all the rest is thrown away.

depending on the hardware the "physical device" underneath is either:
- a multipath scsi device from some storage like on s390
- a single or multipath scsi disk for machines where we have nothing better
- these days usually a nvme for all platforms where we could get these
- tmpfs (basically gone due to slower performance than nvme and RAM prices
  being high, almost all replaced by nvme today)

As far as I remember, Dirk Mueller was the one that proposed using "discard"
for our "build" filesystems. Dirk, do you remember the background?
Comment 7 Dirk Mueller 2024-02-27 08:42:55 UTC
The issue was that on aarch64 and x86_64 machines, the underlying storage devices were rate-limiting writes to reach the MTBF endurance ratings. without discard, we were down to 3-4MB/s of write performance. after mounting all the layers with discard (which afaik is the default anyhow meanwhile in newer kernels), it went back up to the expected 500MB/s+ write performance. 

now a fstrim could in *theory* do something similar, however the usage pattern here is that we create huge filesystems every few minutes to seconds. if they're not trimmed, then the NVME sees a completely full disk all the time. which isn't true. 

I guess we could make a more clever discard once the build job is completed and zap the entire filesystem that was allocated, but I never got the time to implement that.
Comment 8 Jan Kara 2024-03-04 10:49:11 UTC
(In reply to Dirk Mueller from comment #7)
> The issue was that on aarch64 and x86_64 machines, the underlying storage
> devices were rate-limiting writes to reach the MTBF endurance ratings.
> without discard, we were down to 3-4MB/s of write performance. after
> mounting all the layers with discard (which afaik is the default anyhow
> meanwhile in newer kernels), it went back up to the expected 500MB/s+ write
> performance.

Doh, weird. I've never heard about such behavior in the past :) 'discard' mount option definitely is not the default with any recent kernels as there were quite a few reports of it being detrimental to the performance.

> now a fstrim could in *theory* do something similar, however the usage
> pattern here is that we create huge filesystems every few minutes to
> seconds. if they're not trimmed, then the NVME sees a completely full disk
> all the time. which isn't true. 

So mkfs.ext4 can discard the whole device before creating the filesystem, but I guess this is not very useful for you because AFAIU you create the big filesystem, fill it with the build, then it gets emptied as we copy out the RPMs and the build artifacts are removed - and this is the moment when you'd like to tell the disk, with discard, that most of the blocks are actually uninteresting.

> I guess we could make a more clever discard once the build job is completed
> and zap the entire filesystem that was allocated, but I never got the time
> to implement that.

Yeah, e.g. running mkfs on the device once you're done with the filesystem will do the job as mkfs.ext4 by default discards the device.
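
A sketch of what such an end-of-job step might look like, reusing the mke2fs invocation from the description. recycle_build_fs and the RUN variable are hypothetical; by default the function only prints the commands, and it assumes any loopback files on the filesystem have already been detached:

```shell
# recycle_build_fs DEVICE MOUNTPOINT
# Unmounts the finished build filesystem and recreates it. mke2fs discards
# the device by default, so on storage that supports discard this also
# tells the device that the old blocks are no longer in use.
# Dry run by default: commands are echoed. Set RUN= (empty) to execute.
recycle_build_fs() {
    rb_run=${RUN-echo}
    $rb_run umount "$2"
    $rb_run mke2fs -t ext4 -O '^has_journal' -F "$1"
}
```

Calling `recycle_build_fs "/dev/mapper/$DMTARG" "$OBS_WORKER_DIRECTORY"` prints the two commands for inspection; running it with `RUN=` set to the empty string would execute them.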