Bug 115890

Summary: aacraid: Dell PE4640 + PERC 3/DI: Fatal SCSI errors
Product: [openSUSE] SUSE LINUX 10.0 Reporter: Klaus Wagner <kgw>
Component: Kernel Assignee: John Hull <john_hull>
Status: RESOLVED INVALID QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P5 - None CC: john_hull, matt_domsch, mistinie
Version: Beta 4 Plus   
Target Milestone: ---   
Hardware: x86   
OS: SUSE Other   
Whiteboard:
Found By: Development Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: tarball of PE4600 startup messages

Description Klaus Wagner 2005-09-08 15:14:07 UTC
Installation: SUSE 10.0 beta4plus

Zert184.suse.de (Dell PE4640) does not survive the HWCert 24 hour stress test.
After a few hours of runtime the kernel emits the following messages:

kernel: aacraid: Host adapter reset request. SCSI hang ?
kernel: aacraid: SCSI bus appears hung
kernel: end_request: I/O error, dev sda, sector 77906021
kernel: Buffer I/O error on device sda5, logical block 7637746
kernel: lost page write due to I/O error on sda5
kernel: scsi2 (0:0): rejecting I/O to offline device
kernel: scsi2 (0:0): rejecting I/O to offline device
  ... (more of the same) ...
kernel: ReiserFS: sda5: warning: vs-13070: reiserfs_read_locked_inode: i/o   
        failure occurred trying to find stat data of  [5467659 5712953 0x0 SD]
  ... (more of the same) ...

Nothing short of a power cycle seems to recover the system.

I should add that this system has acquired a reputation for being
reliable and fast. It has been certified for several SLES-[78] type products.
Comment 2 Jens Axboe 2005-09-08 18:23:00 UTC
Report the incident to Dell so they can look at aacraid; there's not much we can
do about it. BTW, I don't think an aacraid bug is a blocker for SL10.0.
Comment 3 Chris L Mason 2005-09-09 02:45:27 UTC
I agree this shouldn't block things.  Does this hardware pass the test on other kernels? 
Comment 4 Andreas Jaeger 2005-09-09 04:56:56 UTC
AACRAID is not a blocker.
Comment 7 Klaus Wagner 2005-09-09 10:31:06 UTC
SLES-9 SP1 testing is in progress.

Gerald, Marc: please inform DELL.
Comment 9 Marc Ruehrschneck 2005-09-09 14:21:25 UTC
Done. Waiting for Dell's feedback.
Comment 10 Klaus Wagner 2005-09-09 14:39:53 UTC
SLES-9 SP1 stress test is still running smoothly. To be on safe ground, 
QA nevertheless advises letting the test continue for an extended period.
Status update on Mon Sep 12 planned.
Comment 11 Andreas Jaeger 2005-09-10 05:06:18 UTC
2.6.13.1 contains the following, but I fear this is unrelated - isn't it?

[PATCH] aacraid: 2.6.13 aacraid bad BUG_ON fix
    
    This was noticed by Doug Bazamic and the fix found by Mark Salyzyn at
    Adaptec.
    
    There was an error in the BUG_ON() statement that validated the
    calculated fib size which can cause the driver to panic.
Comment 12 Jens Axboe 2005-09-10 12:15:29 UTC
No, it's not that bug.
Comment 13 Klaus Wagner 2005-09-12 16:13:17 UTC
Update to Comment #10:

Good: Stress test ran over the entire weekend on the SLES-9 SP1 installation
      (kernel 2.6.5-7.139-smp #1 SMP Fri Jan 14 15:41:33 UTC 2005 i686,
       SP1 UPDATE kernel-smp-2.6.5-7.139-smp ).
      It is still in progress.

Bad:  Under SLES-9 SP1 rescue system (kernel 2.6.5-7.139-default), copying big
      directory trees from one disk partition to another quickly produces the
      same type of SCSI hang as described in the original bug description.
      (It usually takes only a few minutes to reproduce.)

The restore of SLES-9 SP1 mentioned in comment #5 was therefore carried
through with a SLES-8 SP4 rescue system. No problems then.

Hope that helps in narrowing things down.
Comment 14 Frank Balzer 2005-09-12 20:34:06 UTC
It seems, though I'm not sure yet, that an IBM customer is hitting the same problem.
The customer is using SLES 9 SP1 on xSeries. I will try to get more detailed
information.
Comment 15 John Hull 2005-09-13 13:55:07 UTC
What firmware version is on the controller?
Comment 16 Klaus Wagner 2005-09-13 15:35:45 UTC
This is what the "afacli" utility reports (SUSE Linux 10.0 beta4plus running):

AFA0> 
COMMAND: controller details 
Executing: controller details 
Controller Information
----------------------
         Remote Computer: .
             Device Name: AFA0
         Controller Type: PERC 3/Di
             Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = B410D3
         Number of Buses: 1
         Devices per Bus: 15
          Controller CPU: i960 R series
    Controller CPU Speed: 100 Mhz
       Controller Memory: 126 Mbytes
           Battery State: Ok

Component Revisions
-------------------
                CLI: 2.7-1 (Build #4944)
                API: 2.7-1 (Build #4944)
    Miniport Driver: 1.1-4 (Build #9999)
Controller Software: 2.7-0 (Build #3153)
    Controller BIOS: 2.7-0 (Build #3153)
Controller Firmware: (Build #3153)

Comment 17 Matt Domsch 2005-09-13 21:34:54 UTC
This firmware is ancient, and has known bugs which will cause the controller to
appear to be offline when in fact it's just stuck in an infinite cache flush loop.

Firmware 2.8 Build 6091 or higher, available on support.dell.com, is necessary.
If failures still occur after this, please report.
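
For anyone checking other machines against this requirement, here is a minimal sketch in Python. It assumes the firmware line looks like the "controller details" output pasted in comment 16; the function names and the constant are made up for illustration and are not part of afacli or any Dell tool.

```python
import re
from typing import Optional

# Minimum build quoted in this report ("Firmware 2.8 Build 6091 or higher").
MIN_FIRMWARE_BUILD = 6091

# Matches a line such as "Controller Firmware: (Build #3153)".
# Assumption: other afacli versions print the line in the same shape.
_FIRMWARE_RE = re.compile(r"Controller Firmware:.*\(Build #(\d+)\)")

# Excerpt of the output pasted in comment 16.
SAMPLE = """\
Controller Software: 2.7-0 (Build #3153)
    Controller BIOS: 2.7-0 (Build #3153)
Controller Firmware: (Build #3153)
"""


def firmware_build(controller_details: str) -> Optional[int]:
    """Return the firmware build number found in the text, or None."""
    match = _FIRMWARE_RE.search(controller_details)
    return int(match.group(1)) if match else None


def firmware_is_current(controller_details: str) -> bool:
    """True if a build number was found and it is at least MIN_FIRMWARE_BUILD."""
    build = firmware_build(controller_details)
    return build is not None and build >= MIN_FIRMWARE_BUILD


print(firmware_build(SAMPLE), firmware_is_current(SAMPLE))
```

With the comment 16 excerpt this reports build 3153, which is below the required build.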
Comment 18 Klaus Wagner 2005-09-14 14:24:26 UTC
Thanks very much, this sounds promising. Status update:

-  Manually terminated SLES-9 SP1 stress test after it had run smoothly
   for six days.
-  Updated: System BIOS:    A06                -->  A13
            PERC 3/Di BIOS: v2.7-0 Build 3153  -->  v2.8-0 Build 6095
-  Restored the SUSE 10.0 beta4plus installation (don't want to change too much
   in one step, therefore no new RC? installation yet).
-  Re-started stress test on SUSE 10.0 beta4plus.

Desired result: the test is still running tomorrow morning without
fatal errors.
We might then even have time to test again with the most recent RC.

QA: do you agree with this timing?


Comment 19 Klaus Wagner 2005-09-15 07:38:36 UTC
System CRASHED again.

Test started at 09/14 16:13. A few hours later the familiar messages popped up:

Sep 14 21:14:00 zert184 Zert184 kernel: aacraid: SCSI bus appears hung
Sep 14 21:14:00 zert184 Zert184 kernel: end_request: I/O error, dev sda, sector 62175357
Sep 14 21:14:00 zert184 Zert184 kernel: Buffer I/O error on device sda5, logical block 5671413
Sep 14 21:14:00 zert184 Zert184 kernel: lost page write due to I/O error on sda5
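
To compare runs, the time from test start to the first hang can be worked out from the log. A minimal sketch in Python, assuming classic syslog timestamps as above and an English locale for month names; the marker strings are copied from this report, while the helper name is hypothetical.

```python
from datetime import datetime

# Kernel messages that mark the hang in this report.
HANG_MARKERS = (
    "aacraid: Host adapter reset request",
    "aacraid: SCSI bus appears hung",
)


def first_hang_time(log_lines, year):
    """Return the timestamp of the first line containing a hang marker, or None.

    Syslog timestamps like "Sep 14 21:14:00" carry no year, so the caller
    has to supply one.
    """
    for line in log_lines:
        if any(marker in line for marker in HANG_MARKERS):
            stamp = " ".join(line.split()[:3])
            return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return None


log = [
    "Sep 14 21:14:00 zert184 Zert184 kernel: aacraid: SCSI bus appears hung",
    "Sep 14 21:14:00 zert184 Zert184 kernel: lost page write due to I/O error on sda5",
]
test_start = datetime(2005, 9, 14, 16, 13)  # "Test started at 09/14 16:13"
hang = first_hang_time(log, 2005)
print(hang - test_start)  # 5:01:00
```

So this run lasted about five hours before the controller went offline.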
Comment 21 Klaus Wagner 2005-09-15 10:30:55 UTC
Status update:

    -  More firmware updates (could not do this yesterday because the running
       test would have been lost):

       Embedded Server Management (ESM) Firmware: --> Rev A31. In particular:
       BMC: 1.35 --> 1.77,  SB: 0.25 --> 1.01,  SDR:  0.21 --> 0.33.
       (Could it be that e.g. the SB (System Backplane) firmware would affect
       this bug?)

    -  Installed SUSE Linux 10.0 (RC4). Re-started stress test at 12:16.
Comment 23 Klaus Wagner 2005-09-15 10:38:31 UTC
Created attachment 50009 [details]
tarball of PE4600 startup messages

These startup messages document the active BIOS/Fw
revisions before Sep 14, at Sep 14 and at Sep 15
(today).
Comment 24 Klaus Wagner 2005-09-16 08:09:40 UTC
System CRASHED again (just managed to run a bit longer, 10 hours):

Sep 15 22:01:24 zert184 Zert184 kernel: aacraid: Host adapter reset request. SCSI hang ?
... [familiar followups snipped] ...

So even the most recent firmware and SUSE Linux 10.0 kernel don't help.

As for Dell's wish about how to reproduce: I'm going to try a simple big
disk copy with the SUSE Linux 10.0 rescue system running (SLES-9 SP1 rescue
would reproduce, see above comment #13. Maybe 10.0 GA (=RC4) will, too.).

Comment 25 Klaus Wagner 2005-09-20 07:47:13 UTC
No success reproducing the problem via big disk copies using the rescue systems of
   -  either SUSE Linux 10.0 (GA = RC4)
   -  or SLES-9 SP1

In both cases, the system was busy for many hours copying/erasing files
without crashing. So the firmware updates did change something after all.

Since I can't back out of the FW updates (the flash util didn't offer to back
up all needed old versions) it seems that the only known method remaining 
to reproduce is our big "package building" stress test. I'll run it another time
to be sure.
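
For reference, the copy/erase load can be approximated with a loop like the one below. This is only a sketch in Python under stated assumptions, not the script actually used on zert184: the function names are hypothetical, and the demo runs on a tiny temporary tree, whereas the real attempts copied big directory trees between partitions.

```python
import shutil
import tempfile
from pathlib import Path


def copy_erase_cycles(source: Path, scratch: Path, cycles: int) -> int:
    """Copy `source` into `scratch` and delete the copy, `cycles` times.

    Returns the number of completed cycles. An OSError (for example an I/O
    error from a device that went offline) propagates to the caller.
    """
    completed = 0
    for i in range(cycles):
        target = scratch / f"copy-{i}"
        shutil.copytree(source, target)  # copy the whole tree
        shutil.rmtree(target)            # erase it again
        completed += 1
    return completed


def demo() -> int:
    """Run two cycles on a small throwaway tree (illustration only)."""
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        (Path(src) / "sub").mkdir()
        (Path(src) / "sub" / "file.bin").write_bytes(b"\0" * 4096)
        return copy_erase_cycles(Path(src), Path(dst), cycles=2)


print(demo())
```

For a real run, `source` and `scratch` would point at directories on different partitions of the array, with a tree large enough that the data does not just sit in the page cache.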


Comment 27 Marc Ruehrschneck 2005-11-07 14:07:01 UTC
John, what are your thoughts on this issue?
Do you still want to try and reproduce this with earlier releases, or should we close it for now?
Comment 28 Marc Ruehrschneck 2006-01-16 15:55:53 UTC
I'm closing this for now as invalid, as reliable reproduction does not seem possible. Feel free to reopen if issues appear again with this controller/driver.