Bug 1214004 - Rebuilding MD-RAID brings system almost to a halt
Summary: Rebuilding MD-RAID brings system almost to a halt
Status: NEW
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Kernel
Version: Leap 15.5
Hardware: x86-64 Other
Importance: P5 - None : Major
Target Milestone: ---
Assignee: Coly Li
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-04 21:22 UTC by Ulrich Windl
Modified: 2023-09-07 10:34 UTC
CC List: 2 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Ulrich Windl 2023-08-04 21:22:27 UTC
(This problem is not new, but since it is still there, I thought it was time to report it.)
My system has one Intel on-board SATA RAID controller:
09: PCI 1f.2: 0104 RAID bus controller                          
  [Created at pci.386]
  Unique ID: w7Y8.IzwgmJk+U9F
  SysFS ID: /devices/pci0000:00/0000:00:1f.2
  SysFS BusID: 0000:00:1f.2
  Hardware Class: storage
  Model: "Intel SATA Controller [RAID mode]"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x2822 "SATA Controller [RAID mode]"
  SubVendor: pci 0x1043 "ASUSTeK Computer Inc."
  SubDevice: pci 0x8534 
  Revision: 0x05
  Driver: "ahci"
  Driver Modules: "ahci"
  I/O Ports: 0xf070-0xf077 (rw)
  I/O Ports: 0xf060-0xf063 (rw)
  I/O Ports: 0xf050-0xf057 (rw)
  I/O Ports: 0xf040-0xf043 (rw)
  I/O Ports: 0xf020-0xf03f (rw)
  Memory Range: 0xf7f16000-0xf7f167ff (rw,non-prefetchable)
  IRQ: 29 (744001 events)
  Module Alias: "pci:v00008086d00002822sv00001043sd00008534bc01sc04i00"
  Driver Info #0:
    Driver Status: ahci is active
    Driver Activation Cmd: "modprobe ahci"
  Config Status: cfg=no, avail=yes, need=no, active=unknown

The system has four disks (and a Blu-ray burner), forming two RAID1 pairs:
Model: "WDC WD20EZRZ-00Z"
Model: "WDC WD20EZRZ-00Z"
Model: "HGST HTS541010A9"
Model: "HGST HTS541010A9"

Linux is on one RAID; the other RAID holds some Windows data (unused in Linux).

When the system hung on reboot after upgrading from Leap 15.4 to 15.5, I pressed the reset button after having waited a long time.
Unfortunately, during boot that caused the OS RAID to be flagged "unclean", forcing a rebuild:
Aug 05 00:23:22 pc kernel: md/raid1:md126: not clean -- starting background reconstruction
Aug 05 00:23:22 pc kernel: md/raid1:md126: active with 2 out of 2 mirrors

md126 : active raid1 sda[1] sdb[0]
      1953497088 blocks super external:/md127/0 [2/2] [UU]
      [==>..................]  resync = 12.4% (243003648/1953497088) finish=236.9min speed=120296K/sec
md127 : inactive sdb[1](S) sda[0](S)
      10402 blocks super external:imsm

While the MD-RAID is rebuilding, some commands take almost forever and also produce nonsensical results.
For example:
> rpm -ql cpupower-5.14
warning: waiting for shared lock on /usr/lib/sysimage/rpm/Packages
error: cannot get shared lock on /usr/lib/sysimage/rpm/Packages
error: cannot open Packages index using db4 - Operation not permitted (1)
error: cannot open Packages database in /usr/lib/sysimage/rpm
warning: waiting for shared lock on /usr/lib/sysimage/rpm/Packages
error: cannot get shared lock on /usr/lib/sysimage/rpm/Packages
error: cannot open Packages index using db4 - Operation not permitted (1)
error: cannot open Packages database in /usr/lib/sysimage/rpm
package cpupower-5.14 is not installed
> rpm -ql cpupower-5.14
warning: waiting for shared lock on /usr/lib/sysimage/rpm/Packages
error: cannot get shared lock on /usr/lib/sysimage/rpm/Packages
error: cannot open Packages index using db4 - Operation not permitted (1)
error: cannot open Packages database in /usr/lib/sysimage/rpm
warning: waiting for shared lock on /usr/lib/sysimage/rpm/Packages
/usr/bin/cpupower
/usr/bin/intel-speed-select
/usr/bin/turbostat
...

(First it said the package is not installed; then it obviously was.)
Of course that is most likely a bug in RPM, too.

Anyway, I think the default rebuild speed should reserve a larger amount of bandwidth for application use (i.e. not for rebuilding).

I improved the situation by limiting the rebuild speed to 50 MB/s or so (see also https://superuser.com/a/625724/964771):
sudo sh -c "echo 50000 > /proc/sys/dev/raid/speed_limit_max"
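For reference, the same limit can presumably also be set per array via sysfs, or made persistent across reboots with a sysctl drop-in (the file name below is just an example, nothing shipped by the distribution):
sudo sh -c "echo 50000 > /sys/block/md126/md/sync_speed_max"   # per-array limit in KiB/s
echo "dev.raid.speed_limit_max = 50000" | sudo tee /etc/sysctl.d/90-md-resync.conf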
Comment 3 Hannes Reinecke 2023-09-07 06:26:02 UTC
Well, I am really not sure if we can improve matters much here.
First, there is the physics: you are using two 2TB drives via SATA, so for a complete rebuild you will have to transfer _all_ data from the drives.
Assuming a typical I/O speed of 120 MB/s and an ideal rebuild, where data is transferred to each disk only once and both drives operate in parallel, we are still looking at roughly 4.8 hours before the rebuild is complete.
That's not something we can change.
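For reference, a quick sanity check of that figure against the mdstat output above (the ~120 MB/s sustained speed is an assumption):
echo $((1953497088 / 120000 / 60))   # array size in KiB / 120000 KiB/s / 60
271
That is about 271 minutes, i.e. roughly 4.5 hours of pure transfer time, in line with the finish=236.9min estimate shown once 12.4% was already done.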
And while the rebuild is ongoing, the RAID code has to lock the sectors currently being rebuilt, so operations touching these sectors might be rejected/retried while the rebuild is in progress.
Again, not really something we can change.

What is it you want us to do?

Adjusting the rebuild rate is also quite tricky; user-facing installations like laptops might want to devote more bandwidth to user processes, while server-like installations might want to devote more bandwidth to the rebuild.
We might be able to come up with a clever algorithm, but really this should be done with a systemd service or udev rule that tweaks it per installation.
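As a rough illustration of the udev-rule approach (the file name and the 50000 KiB/s value are placeholders only, not anything we ship):
# /etc/udev/rules.d/90-md-sync-limit.rules (example only)
ACTION=="add|change", KERNEL=="md*", SUBSYSTEM=="block", TEST=="md/sync_speed_max", ATTR{md/sync_speed_max}="50000"
After "udevadm control --reload" and re-triggering the md devices this would take effect, and each installation could pick its own value.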
Comment 4 Ulrich Windl 2023-09-07 10:34:17 UTC
(In reply to Hannes Reinecke from comment #3)

According to the BIOS message during boot, the RAID was in "verify mode", so nothing actually needed writing or fixing unless a mismatch was detected.
However, I don't know how MD-RAID actually handles that.

What could be done?

If it's possible to detect that kind of problem (read stalls, obviously), the default speed_limit_max could be reduced automatically until the situation improves. Obviously such a mechanism should be "located near MD-RAID activation".
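Purely as a sketch of what I mean (nothing like this exists today; the thresholds and values are made up):
#!/bin/sh
# hypothetical throttle loop: lower the global resync limit while iowait is high
LIMIT=/proc/sys/dev/raid/speed_limit_max
iowait() { awk '/^cpu /{print $6}' /proc/stat; }   # cumulative iowait jiffies
prev=$(iowait)
while grep -q resync /proc/mdstat; do
    sleep 10
    cur=$(iowait)
    if [ $((cur - prev)) -gt 500 ]; then   # "500 jiffies per 10s" is a guess
        echo 20000 > "$LIMIT"              # throttle while the system struggles
    else
        echo 200000 > "$LIMIT"             # otherwise let the resync run at full speed
    fi
    prev=$cur
done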

Being practically unable to work with the system for hours cannot be the solution. (I'm aware that a "change bitmap" on the array would be the best solution, but the hardware doesn't support that, AFAIK.)