Bugzilla – Bug 1177595
many device resets and I/O errors during mdadm scrub
Last modified: 2021-01-03 23:09:54 UTC
We have a disk server with a Supermicro S3008 L8e SAS controller and 6 SAS drives of 12 TB each in an mdadm RAID5 software RAID configuration. When starting a scrub of the RAID array with

echo check > /sys/block/md0/md/sync_action

a lot of error messages start appearing in the syslog after about 0.5 - 1.5 hours of running the scrub. Mostly there are lots of cryptic messages like this:

kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)

These are interspersed with other error messages about device resets and I/O errors:

kernel: sd 6:0:2:0: Power-on or device reset occurred

kernel: blk_update_request: I/O error, dev sdc, sector 5160938280 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
kernel: sd 6:0:2:0: [sdc] tag#1073 CDB: Read(10) 28 00 26 73 b5 65 00 00 01 00
kernel: sd 6:0:2:0: [sdc] tag#1073 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK

These errors happen on all 6 disks in the RAID array (only sdc is shown here, but the problems on the other disks are essentially identical).

I have also seen I/O errors in the output of smartctl -a (while the scrub was ongoing), but that may simply be due to the device being reset during the call...

Initially we thought these were hardware problems and we had the server thoroughly checked by the manufacturer. They swapped out all the hardware, but the problems would not go away. They concluded that it must be a software (i.e., driver) issue. I cannot be completely certain, but it looks like the problems started after upgrading openSUSE 15.1 -> 15.2. The kernel was fully patched at the time we detected the problems on 29 September. Tests showed that the previously installed kernel version had the same problem. It is likely that all kernel versions shipped with openSUSE 15.2 show this problem.

We currently mount the RAID5 array in read-only mode to prevent the I/O errors from corrupting the file system. This severely limits the functionality of the server.
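The "cryptic" log_info value above can be unpacked by hand. The following is a minimal sketch, assuming the 32-bit word is laid out as the mpt3sas driver appears to print it: bus type in the top 4 bits, then a 4-bit originator, an 8-bit code, and a 16-bit sub_code. The function name and the originator table are illustrative only, not part of any tool.

```python
# Sketch: split an mpt3sas log_info word into the fields shown in the log.
# Assumed layout: bits 31-28 bus type, 27-24 originator,
# 23-16 code, 15-0 sub_code.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}  # assumed originator names


def decode_log_info(value):
    """Return the individual fields of a log_info word as a dict."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


fields = decode_log_info(0x3112011A)
print(fields["originator"], hex(fields["code"]), hex(fields["sub_code"]))
```

For the value in the report this reproduces the fields the kernel printed next to it: originator PL, code 0x12, sub_code 0x011a. It does not say what that sub_code means; that still requires the controller vendor's documentation.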
(In reply to Peter van Hoof from comment #0)

I have heard of similar issues when the hard drives were device-managed SMR. What are the exact models of these hard drives?

Thanks.

Coly Li
The drives are HGST HUH721212AL4200 12 TB SAS drives with firmware revision A3D0.
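One more detail about these drives can be read off the sdc error in comment #0. The sketch below parses the Read(10) CDB from the log and compares its logical block address with the sector number reported by blk_update_request, which the kernel counts in 512-byte units. The helper name is hypothetical, and the block size is inferred from the ratio, not taken from the drive's data sheet.

```python
# Sketch: compare the LBA in the logged Read(10) CDB with the kernel's
# 512-byte sector number from the same I/O error.
def parse_read10(cdb_hex):
    """Return (lba, transfer_length_in_blocks) for a READ(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


lba, blocks = parse_read10("28 00 26 73 b5 65 00 00 01 00")
kernel_sector = 5160938280          # from the blk_update_request line
ratio, remainder = divmod(kernel_sector, lba)
block_size = 512 * ratio            # inferred logical block size in bytes
print(lba, blocks, ratio, remainder, block_size)
```

The kernel sector is exactly 8 times the CDB's LBA, which is consistent with the drives using 4096-byte logical blocks, and the failed request was a read of a single block.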
I decided to use the time to test some other kernels. First I downloaded some kernels from kernel.org: 4.19.151, 5.4.71, and 5.9.0 all showed the same problems as described in my report. Kernel 4.14.201 did not boot.

I also reinstalled a kernel from openSUSE 15.1: kernel-default-4.12.14-lp151.27.3. This one behaved differently. I started the mdadm scrub at 19:08. There was one burst of messages from mpt3sas_cm0 at 01:27, followed by a single reset of sde, and another burst of messages from mpt3sas_cm0 at 06:39, followed by a single reset of sdc. The scrub ended successfully at 13:27. So the differences are that the error messages and device resets started much later and were far less frequent than with the later kernels. There were no I/O errors reported.

In my initial report I thought that openSUSE 15.1 kernels would be free of these errors. I cannot confirm that, but the problems are clearly far less severe with the 15.1 kernels. Since mdcheck only runs for 6 hours on a single day, the system may never or rarely have reached the point where these messages were triggered...
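The timing argument above can be checked with a little arithmetic. This is a minimal sketch, assuming the scrub started at 19:08 on one day and that every later timestamp falls within the following 24 hours (the comment gives no dates). The 6-hour window is the mdcheck run time mentioned above.

```python
from datetime import datetime, timedelta


def elapsed(start, later):
    """Time from start to later (both "HH:MM"), assuming at most one midnight rollover."""
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(later, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)  # the later timestamp is on the next day
    return t1 - t0


scrub_start = "19:08"
first_burst = elapsed(scrub_start, "01:27")   # first mpt3sas_cm0 burst
second_burst = elapsed(scrub_start, "06:39")  # second burst
total = elapsed(scrub_start, "13:27")         # scrub finished
mdcheck_window = timedelta(hours=6)

print(first_burst, second_burst, total)
print(first_burst > mdcheck_window)
```

Under these assumptions the first burst came 6 h 19 min into the scrub, the second after 11 h 31 min, and the full scrub took 18 h 19 min. The first burst therefore falls just outside a 6-hour mdcheck window, which fits the suggestion that the regular mdcheck runs rarely or never reached that point.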
Can you please post the kernel message log?
Created attachment 842799
the kernel message log

Attached is an excerpt of the kernel message log between 2020-09-28T01:00:00 and 2020-09-29T01:00:00. The problems start at 01:33.
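A log excerpt like this is easier to compare across kernels when the events are counted per device. The following is a rough sketch that tallies device resets, I/O errors, and log_info codes. The regular expressions are written against the message formats quoted in comment #0, and the sample lines are copied from that comment, not from the attachment.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in comment #0.
RESET_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred")
IOERR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")
LOGINFO_RE = re.compile(r"mpt3sas_cm\d+: log_info\((0x[0-9a-fA-F]+)\)")


def summarize(lines):
    """Count resets per SCSI address, I/O errors per device, and log_info codes."""
    resets, io_errors, log_infos = Counter(), Counter(), Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
        match = IOERR_RE.search(line)
        if match:
            io_errors[match.group(1)] += 1
        match = LOGINFO_RE.search(line)
        if match:
            log_infos[match.group(1)] += 1
    return resets, io_errors, log_infos


sample = [
    "kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)",
    "kernel: sd 6:0:2:0: Power-on or device reset occurred",
    "kernel: blk_update_request: I/O error, dev sdc, sector 5160938280 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0",
    "kernel: sd 6:0:2:0: [sdc] tag#1073 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK",
]
resets, io_errors, log_infos = summarize(sample)
print(dict(resets), dict(io_errors), dict(log_infos))
```

To run it on a saved log, pass an open file object to summarize() instead of the sample list. Comparing the three counters between two kernels gives a quick measure of how much less frequent the messages are.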
Let me first correct a statement I made in the initial report. I stated that it looked like the problems did not occur under openSUSE Leap 15.1. More careful analysis of the log messages showed that the problems did occur under 15.1 as well, though there were far fewer log messages, which is why I overlooked them initially.

I started working on a hunch that this may be a firmware issue. So I first upgraded the firmware of the hard drives to version A925. That did not solve the problems. Next I upgraded the firmware of the SAS controller card to version 16.00.10.00 (IT version). There are two versions of the firmware: one with simple RAID support, and one that simply exports the disks without implementing any RAID functionality; the IT version is the latter. This seems to have solved the problems. There have been no more error messages since the firmware upgrade.
> There have been no more error messages since the firmware upgrade.

That's great news - thanks for letting us know.