Bug 1215138

Summary: mdadm 4.2 can't start raid array with external journal
Product: [openSUSE] openSUSE Distribution
Reporter: Lars Altenhain <lars>
Component: Basesystem
Assignee: Coly Li <colyli>
Status: NEW ---
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: nfbrown
Version: Leap 15.5
Target Milestone: ---
Hardware: x86-64
OS: openSUSE Leap 15.5
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Fix for 0043-super1-report-truncated-device.patch

Description Lars Altenhain 2023-09-07 23:20:55 UTC
User-Agent:       Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0
Build Identifier: 

After updating from Leap 15.4 to 15.5, mdadm could no longer start my RAID5 array with an external journal and reported an error that the journal disk could not be found. With the mdadm-4.1 package from Leap 15.4, the array starts without issues.

I narrowed it down to the patch "0043-super1-report-truncated-device.patch" by rebuilding the mdadm package with different patches removed; this is the one patch that causes the issue.

The check "dsize < (__le64_to_cpu(super->data_offset) + __le64_to_cpu(super->size))" is triggered on the journal disk because this device is much smaller than the data disks, but super->size holds the value from the actual data disks even on the journal disk.
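
To make the failure mode concrete, here is a minimal, self-contained sketch of the comparison. It is not mdadm's code: `struct sb_sketch`, `is_journal`, and both function names are hypothetical, values are kept in host byte order (so the `__le64_to_cpu()` conversions are dropped), and the sizes used in the usage note are only illustrative. The second function shows one possible way to avoid the false positive, which may differ from the attached patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the superblock fields that
 * matter here. All sizes are in 512-byte sectors, host byte order. */
struct sb_sketch {
    uint64_t data_offset; /* sectors before the data area starts */
    uint64_t size;        /* per-device data size recorded in the superblock */
    bool is_journal;      /* assumption: device role was decoded elsewhere */
};

/* The comparison as quoted in the report: returns true when the device
 * looks too small for the data described in its superblock. */
bool truncated_as_quoted(uint64_t dsize, const struct sb_sketch *sb)
{
    return dsize < sb->data_offset + sb->size;
}

/* One possible adjustment (an assumption, not necessarily the attached
 * patch): skip the comparison for journal devices, since their
 * superblock carries the data disks' size rather than their own. */
bool truncated_skip_journal(uint64_t dsize, const struct sb_sketch *sb)
{
    if (sb->is_journal)
        return false;
    return truncated_as_quoted(dsize, sb);
}
```

With a journal device of 2097152 sectors (1 GiB) whose superblock records a size of 67108864 sectors (32 GiB), `truncated_as_quoted()` returns true and the device is rejected, while `truncated_skip_journal()` returns false. Data disks are still checked in both variants.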



Reproducible: Always
Comment 1 Coly Li 2024-01-21 11:21:13 UTC
(In reply to Lars Altenhain from comment #0)
> 
> The check "dsize < (__le64_to_cpu(super->data_offset) +
> __le64_to_cpu(super->size))" is triggered on the journal disk because this
> device is much smaller than the data disks but super->size has the value
> from the actual data disk even on the journal disk.
> 

Thanks for the information. I assume this issue is still reproducible, right? I don't see a specific fix in mdadm upstream.

Could you please give me detailed steps to build a similar RAID configuration with an extra journal device, as in your environment? Then I can take a look and try to find a fix.

Thanks in advance.

Coly Li
Comment 2 Lars Altenhain 2024-01-21 14:06:29 UTC
Created attachment 872046 [details]
Fix for 0043-super1-report-truncated-device.patch
Comment 3 Lars Altenhain 2024-01-21 14:08:06 UTC
I can still reproduce the issue with the latest mdadm version available in the update repos for Leap 15.5 (mdadm-4.2-150500.6.3.1). I also built the latest mdadm package from Factory for Leap 15.5 and got the same result.

For testing I added a disk image to a virtual machine, created some partitions on it (sdb1 to sdb3 with 32GB each as data disks and sdb4 with 1GB for the journal) and then created a RAID5 with a journal:

mdadm --create /dev/md0 --level=5 --raid-disks=3  --write-journal=/dev/sdb4 /dev/sdb1 /dev/sdb2 /dev/sdb3

Booting the system with mdadm from the update repos results in a read-only array because it doesn't find the journal disk:
[So Jan 21 14:38:31 2024] md/raid:md0: device sdb3 operational as raid disk 2
[So Jan 21 14:38:31 2024] md/raid:md0: device sdb2 operational as raid disk 1
[So Jan 21 14:38:31 2024] md/raid:md0: device sdb1 operational as raid disk 0
[So Jan 21 14:38:31 2024] md/raid:md0: journal disk is missing, force array readonly
[So Jan 21 14:38:31 2024] md/raid:md0: raid level 5 active with 3 out of 3 devices, algorithm 2
[So Jan 21 14:38:31 2024] md0: detected capacity change from 0 to 134209536


These are the resulting log entries when I boot the system with my patched mdadm package installed:
Jan 21 15:02:34 cargohold1 kernel: md/raid:md0: device sdb3 operational as raid disk 2
Jan 21 15:02:34 cargohold1 kernel: md/raid:md0: device sdb2 operational as raid disk 1
Jan 21 15:02:34 cargohold1 kernel: md/raid:md0: device sdb1 operational as raid disk 0
Jan 21 15:02:34 cargohold1 kernel: md/raid:md0: raid level 5 active with 3 out of 3 devices, algorithm 2
Jan 21 15:02:34 cargohold1 kernel: md/raid:md0: starting from clean shutdown
Jan 21 15:02:34 cargohold1 kernel: md0: detected capacity change from 0 to 134209536

I have attached the small patch I added to my self-compiled mdadm package.

Lars