Bugzilla – Full Text Bug Listing

| Summary: | aacraid: Dell PE4640 + PERC 3/DI: Fatal SCSI errors | ||
|---|---|---|---|
| Product: | [openSUSE] SUSE LINUX 10.0 | Reporter: | Klaus Wagner <kgw> |
| Component: | Kernel | Assignee: | John Hull <john_hull> |
| Status: | RESOLVED INVALID | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | ||
| Priority: | P5 - None | CC: | john_hull, matt_domsch, mistinie |
| Version: | Beta 4 Plus | ||
| Target Milestone: | --- | ||
| Hardware: | x86 | ||
| OS: | SUSE Other | ||
| Whiteboard: | |||
| Found By: | Development | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | tarball of PE4600 startup messages | ||
Report the incident to Dell so they can look at aacraid; there's not much we can do about it. BTW, I don't think an aacraid bug is a blocker for SL10.0.

I agree this shouldn't block things. Does this hardware pass the test on other kernels?

AACRAID is not a blocker. SLES-9 SP1 testing is in progress.

Gerald, Marc: please inform Dell.

Done. Waiting for Dell's feedback.

The SLES-9 SP1 stress test is still running smoothly. To be on safe ground, QA nevertheless advises letting the test continue for an extended period. Status update planned for Mon Sep 12.

2.6.13.1 contains the following, but I fear this is unrelated - isn't it?
[PATCH] aacraid: 2.6.13 aacraid bad BUG_ON fix
This was noticed by Doug Bazamic and the fix found by Mark Salyzyn at
Adaptec.
There was an error in the BUG_ON() statement that validated the
calculated fib size which can cause the driver to panic.
No, it's not that bug.

Update to Comment #10:

Good: The stress test ran over the entire weekend on the SLES-9 SP1 installation (kernel 2.6.5-7.139-smp #1 SMP Fri Jan 14 15:41:33 UTC 2005 i686, SP1 UPDATE kernel-smp-2.6.5-7.139-smp). It is still in progress.

Bad: Under the SLES-9 SP1 rescue system (kernel 2.6.5-7.139-default), copying big directory trees from one disk partition to another quickly produces the same type of SCSI hang as described in the original bug description. (It usually takes only a few minutes to reproduce.) The restore of SLES-9 SP1 mentioned in comment #5 was therefore carried out with a SLES-8 SP4 rescue system. No problems then. Hope that helps in narrowing things down.

It seems, though I'm not sure yet, that an IBM customer is hitting the same problem. The customer is using SLES 9 SP1 on xSeries. I will try to get more detailed information.

What firmware version is on the controller?

This is what the "afacli" utility reports (SUSE Linux 10.0 beta4plus running):
AFA0>
COMMAND: controller details
Executing: controller details
Controller Information
----------------------
Remote Computer: .
Device Name: AFA0
Controller Type: PERC 3/Di
Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = B410D3
Number of Buses: 1
Devices per Bus: 15
Controller CPU: i960 R series
Controller CPU Speed: 100 Mhz
Controller Memory: 126 Mbytes
Battery State: Ok
Component Revisions
-------------------
CLI: 2.7-1 (Build #4944)
API: 2.7-1 (Build #4944)
Miniport Driver: 1.1-4 (Build #9999)
Controller Software: 2.7-0 (Build #3153)
Controller BIOS: 2.7-0 (Build #3153)
Controller Firmware: (Build #3153)
This firmware is ancient, and has known bugs which will cause the controller to appear to be offline when in fact it's just stuck in an infinite cache flush loop. Firmware 2.8 Build 6091 or higher, available on support.dell.com, is necessary. If failures still occur after this, please report.

Thanks very much, this sounds promising. Status update:
- Manually terminated SLES-9 SP1 stress test after it had run smoothly
for six days.
- Updated: System BIOS: A06 --> A13
PERC 3/Di BIOS: v2.7-0 Build 3153 --> v2.8-0 Build 6095
- Restored the SUSE 10.0 beta4plus installation (don't want to change too much
in one step, therefore no new RC? installation yet).
- Re-started stress test on SUSE 10.0 beta4plus.
Desired good result: test still running tomorrow morning without
fatal errors.
We might then even have time to test again with the most recent RC.
QA: do you agree with this timing?
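
The firmware requirement stated above (Build 6091 or higher) can be checked mechanically against saved "controller details" output. The following is only a minimal sketch: it assumes the text layout matches the afacli output quoted earlier in this thread, and the function names `firmware_build` and `firmware_is_current` are hypothetical helpers, not part of afacli.

```python
import re
from typing import Optional

# Minimum PERC 3/Di firmware build named in this thread.
MIN_FIRMWARE_BUILD = 6091


def firmware_build(controller_details: str) -> Optional[int]:
    """Return the build number from the 'Controller Firmware' line, if present."""
    match = re.search(
        r"^\s*Controller Firmware:.*\(Build #(\d+)\)",
        controller_details,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None


def firmware_is_current(controller_details: str) -> bool:
    """True only if a build number was found and it meets the minimum."""
    build = firmware_build(controller_details)
    return build is not None and build >= MIN_FIRMWARE_BUILD


# Lines copied from the afacli output quoted in this thread.
sample = """\
Controller Software: 2.7-0 (Build #3153)
Controller BIOS: 2.7-0 (Build #3153)
Controller Firmware: (Build #3153)
"""

print(firmware_build(sample), firmware_is_current(sample))
```

With the pre-update output from this thread the check reports build 3153 as too old; whether newer firmware prints the line in the same layout is an assumption.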
System CRASHED again. Test started at 09/14 16:13. A few hours later the familiar messages popped up:
Sep 14 21:14:00 zert184 Zert184 kernel: aacraid: SCSI bus appears hung
Sep 14 21:14:00 zert184 Zert184 kernel: end_request: I/O error, dev sda, sector 62175357
Sep 14 21:14:00 zert184 Zert184 kernel: Buffer I/O error on device sda5, logical block 5671413
Sep 14 21:14:00 zert184 Zert184 kernel: lost page write due to I/O error on sda5

Status update:
- More firmware updates (could not do this yesterday because the running
test would have been lost):
Embedded Server Management (ESM) Firmware: --> Rev A31. In particular:
BMC: 1.35 --> 1.77, SB: 0.25 --> 1.01, SDR: 0.21 --> 0.33.
(Could it be that e.g. the SB (System Backplane) firmware would affect
this bug?)
- Installed SUSE Linux 10.0 (RC4). Re-started stress test at 12:16.
Created attachment 50009
tarball of PE4600 startup messages
These startup messages document the active BIOS/Fw
revisions before Sep 14, at Sep 14 and at Sep 15
(to-day).
System CRASHED again (it just managed to run a bit longer, 10 hours):
Sep 15 22:01:24 zert184 Zert184 kernel: aacraid: Host adapter reset request. SCSI hang ?
... [familiar followups snipped] ...
So even the most recent firmware and SUSE Linux 10.0 kernel don't help. As for Dell's wish about how to reproduce: I'm going to try a simple big disk copy with the SUSE Linux 10.0 rescue system running (the SLES-9 SP1 rescue system would reproduce it, see comment #13 above; maybe 10.0 GA (= RC4) will, too).

Could not reproduce via big disk copies using the rescue systems of
- either SUSE Linux 10.0 (GA = RC4)
- or SLES-9 SP1
In both cases, the system was busy for many hours copying/erasing files without crashing. So the firmware updates did change something after all. Since I can't back out of the FW updates (the flash utility didn't offer to back up all needed old versions), it seems that the only known method remaining to reproduce is our big "package building" stress test. I'll run it another time to be sure.

John, what are your thoughts on this issue? Do you still want to try and reproduce this with earlier releases, or should we close it for now?

I'm closing this for now as invalid, as reproduction seems not possible in a reliable way. Feel free to reopen if issues appear again with this controller/driver.

Installation: SUSE 10.0 beta4plus

Zert184.suse.de (Dell PE4640) does not survive the HWCert 24 hour stress test. After a few hours of runtime the kernel emits the following messages:
kernel: aacraid: Host adapter reset request. SCSI hang ?
kernel: aacraid: SCSI bus appears hung
kernel: end_request: I/O error, dev sda, sector 77906021
kernel: Buffer I/O error on device sda5, logical block 7637746
kernel: lost page write due to I/O error on sda5
kernel: scsi2 (0:0): rejecting I/O to offline device
kernel: scsi2 (0:0): rejecting I/O to offline device
... (more of the same) ...
kernel: ReiserFS: sda5: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [5467659 5712953 0x0 SD]
... (more of the same) ...

To recover, nothing short of a power-cycle seems available. I should add that this system has acquired a reputation for being reliable and fast. It has been certified for several SLES-[78] type products.
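
For triage, the kernel messages quoted in this report can be picked out of a log with a small script. This is a rough sketch rather than a supported tool: the signature strings are copied from the messages above, the helper names `scan` and `implied_partition_start` are hypothetical, and the sector-to-block arithmetic assumes 512-byte sectors and 4 KiB blocks, which the report does not state.

```python
import re

# Substrings taken verbatim from the kernel messages quoted in this report.
HANG_SIGNATURES = (
    "aacraid: Host adapter reset request",
    "aacraid: SCSI bus appears hung",
    "rejecting I/O to offline device",
)
SECTOR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
BLOCK_RE = re.compile(r"Buffer I/O error on device (\w+), logical block (\d+)")

# Assumption (not stated in the report): 512-byte sectors, 4 KiB blocks.
SECTORS_PER_BLOCK = 4096 // 512


def scan(lines):
    """Collect hang messages plus the failed sector and block numbers."""
    result = {"hangs": [], "sectors": [], "blocks": []}
    for line in lines:
        if any(sig in line for sig in HANG_SIGNATURES):
            result["hangs"].append(line.strip())
        sector = SECTOR_RE.search(line)
        if sector:
            result["sectors"].append((sector.group(1), int(sector.group(2))))
        block = BLOCK_RE.search(line)
        if block:
            result["blocks"].append((block.group(1), int(block.group(2))))
    return result


def implied_partition_start(sector, block):
    """Disk sector where the partition would start, assuming `sector`
    is the first sector of `block` (otherwise off by up to 7 sectors)."""
    return sector - block * SECTORS_PER_BLOCK


# Sample lines copied from the messages quoted in this thread.
log = [
    "kernel: aacraid: Host adapter reset request. SCSI hang ?",
    "kernel: aacraid: SCSI bus appears hung",
    "kernel: end_request: I/O error, dev sda, sector 77906021",
    "kernel: Buffer I/O error on device sda5, logical block 7637746",
    "kernel: lost page write due to I/O error on sda5",
    "kernel: scsi2 (0:0): rejecting I/O to offline device",
    "Sep 14 21:14:00 zert184 Zert184 kernel: end_request: I/O error, dev sda, sector 62175357",
    "Sep 14 21:14:00 zert184 Zert184 kernel: Buffer I/O error on device sda5, logical block 5671413",
]

report = scan(log)
for (_, sector), (_, block) in zip(report["sectors"], report["blocks"]):
    print(sector, block, implied_partition_start(sector, block))
```

For the two sector/block pairs quoted in this thread the computed offset is the same (16804053), which is what one would expect if both errors fall in the same partition and the block-size assumption holds. It says nothing about the cause of the hang.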