Bug 1177595 - many device resets and I/O errors during mdadm scrub
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.2
Hardware: x86-64 openSUSE Leap 15.2
Priority: P5 - None
Severity: Critical
Assigned To: openSUSE Kernel Bugs E-mail List
Reported: 2020-10-12 14:59 UTC by Peter van Hoof
Modified: 2021-01-03 23:09 UTC
CC: 5 users

Found By: Community User


Attachments
the kernel message log (116.17 KB, application/x-xz)
2020-10-19 16:37 UTC, Peter van Hoof

Description Peter van Hoof 2020-10-12 14:59:30 UTC
We have a disk server with a Supermicro S3008 L8e SAS controller and 6 SAS drives of 12 TB each in an mdadm RAID5 software raid configuration. When starting a scrub of the RAID array with

echo check > /sys/block/md0/md/sync_action

after about 0.5 - 1.5 hours of running the scrub, large numbers of error messages start appearing in the syslog. Most of them are cryptic messages like this:

kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)

these are interspersed with other error messages about device resets and I/O errors:

kernel: sd 6:0:2:0: Power-on or device reset occurred

kernel: blk_update_request: I/O error, dev sdc, sector 5160938280 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 
kernel: sd 6:0:2:0: [sdc] tag#1073 CDB: Read(10) 28 00 26 73 b5 65 00 00 01 00 
kernel: sd 6:0:2:0: [sdc] tag#1073 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK 
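
As a sanity check, the CDB in the failed read is consistent with the sector number in the blk_update_request line, assuming the standard Read(10) layout; the factor of 8 suggests the drives use 4096-byte logical blocks while the block layer reports 512-byte sectors:

```shell
# Read(10) CDB from the log: 28 00 26 73 b5 65 00 00 01 00
#   byte 0     : opcode 0x28 (Read(10))
#   bytes 2..5 : logical block address, big-endian
#   bytes 7..8 : transfer length in logical blocks
lba=$(( 0x2673b565 ))
echo "LBA: $lba"                        # LBA: 645117285
# blk_update_request counts 512-byte sectors, the drive uses 4096-byte
# logical blocks, hence the factor of 8:
echo "512-byte sector: $(( lba * 8 ))"  # 512-byte sector: 5160938280
```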

These errors happen on all 6 disks in the RAID array (only sdc is shown here, but the problems on the other disks are essentially identical).
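
For reference, the 32-bit log_info word in the mpt3sas messages can be unpacked by hand; this sketch assumes the field layout implied by what the driver prints (a 4-bit bus type, 4-bit originator, 8-bit code, and 16-bit sub-code):

```shell
# unpack log_info(0x3112011a) into its fields
v=$(( 0x3112011a ))
printf 'bus_type=0x%x originator=0x%x code=0x%02x sub_code=0x%04x\n' \
    $(( (v >> 28) & 0xF ))  \
    $(( (v >> 24) & 0xF ))  \
    $(( (v >> 16) & 0xFF )) \
    $((  v        & 0xFFFF ))
# prints: bus_type=0x3 originator=0x1 code=0x12 sub_code=0x011a
# which matches the originator(PL), code(0x12), sub_code(0x011a) in the log
```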

I have also seen I/O errors in the output of smartctl -a (while the scrub was ongoing), but that may simply be due to the device being reset during the call...

Initially we thought these were hardware problems and we had the server thoroughly checked by the manufacturer. They swapped out all the hardware, but the problems would not go away. They concluded that it must be a software (i.e., driver) issue. I cannot be completely certain, but it looks like the problems started after upgrading openSUSE 15.1 -> 15.2. The kernel was fully patched at the time we detected the problems on 29 September. Tests showed that the previously installed kernel version also exhibited the same problem. It is likely that all kernel versions shipped with openSUSE 15.2 show this problem.

We currently mount the RAID5 array in read-only mode to prevent the I/O errors from corrupting the file system. This severely limits the functionality of the server.
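
For completeness, the scrub state can be watched through the same sysfs interface used to start it, and the array itself can be switched to read-only; a sketch, assuming the md0 device name from above:

```shell
cat /sys/block/md0/md/sync_action          # "check" while the scrub runs, "idle" afterwards
grep -A 2 '^md0' /proc/mdstat              # progress of the running check
cat /sys/block/md0/md/mismatch_cnt         # non-zero if the scrub found inconsistencies
echo idle > /sys/block/md0/md/sync_action  # abort a running scrub
mdadm --readonly /dev/md0                  # mark the whole array read-only
```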
Comment 1 Coly Li 2020-10-15 03:48:14 UTC
(In reply to Peter van Hoof from comment #0)

I have heard of similar issues when the hard drives were device-managed SMR.

What are the exact models of these hard drives?

Thanks.

Coly Li
Comment 2 Peter van Hoof 2020-10-15 10:21:40 UTC
The drives are HGST HUH721212AL4200 12 TB SAS drives with firmware revision A3D0.
Comment 3 Peter van Hoof 2020-10-16 14:47:20 UTC
I decided to use the time to test some other kernels. First I downloaded some kernels from kernel.org: 4.19.151, 5.4.71, and 5.9.0; all of them showed the same problems as described in my report. Kernel 4.14.201 did not boot.

I also reinstalled a kernel from openSUSE 15.1: kernel-default-4.12.14-lp151.27.3. This one behaved differently. I started the mdadm scrub at 19:08. There was one burst of messages from mpt3sas_cm0 at 01:27, followed by a single reset of sde, and another burst of messages from mpt3sas_cm0 at 06:39, followed by a single reset of sdc. The scrub ended successfully at 13:27. So the differences are that the error messages and device resets started much later and were far less frequent than with the later kernels. There were no I/O errors reported. In my initial report I thought that openSUSE 15.1 kernels would be free of these errors. I cannot confirm that, but the problems are clearly far less severe with the 15.1 kernels. Since mdcheck only runs for 6 hours on a single day, the system may never or rarely have reached the point where these messages were triggered...
Comment 4 Hannes Reinecke 2020-10-19 14:52:41 UTC
Can you please post the kernel message log?
Comment 5 Peter van Hoof 2020-10-19 16:37:47 UTC
Created attachment 842799
the kernel message log

Attached is an excerpt of the kernel message log between 2020-09-28T01:00:00 and 2020-09-29T01:00:00. The problems start at 01:33.
Comment 6 Peter van Hoof 2020-12-29 17:53:53 UTC
Let me first correct a statement I made in the initial report. I stated that it looked like the problems did not occur under openSUSE Leap 15.1. More careful analysis of the log messages showed that the problems did occur under 15.1 as well, though there were far fewer log messages, which is why I overlooked them initially.

I started working on a hunch that this may be a firmware issue. So I first upgraded the firmware of the hard drives to version A925. That did not solve the problems. Next I upgraded the firmware of the SAS controller card to version 16.00.10.00 (the IT version; the firmware comes in two variants, one with simple RAID support and another, IT, that simply exports the disks without implementing any RAID functionality). This seems to have solved the problems. There have been no more error messages since the firmware upgrade.
Comment 7 Neil Brown 2021-01-03 23:09:54 UTC
> There have been no more error messages since the firmware upgrade.

That's great news - thanks for letting us know.