Bug 135637

Summary: dac960 - after rebuild status remains in "ALERT"
Product: [openSUSE] SUSE LINUX 10.0 Reporter: carlo menozzi <scs>
Component: Kernel Assignee: Chris L Mason <mason>
Status: VERIFIED WONTFIX QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: hare
Version: unspecified   
Target Milestone: ---   
Hardware: i386   
OS: SuSE Linux 10.0   
Whiteboard:
Found By: Customer Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description carlo menozzi 2005-11-28 07:47:24 UTC
THE PROBLEM IS with the driver DAC960 linux kernel 2.6.x
These are the controllers we used for the tests
-AcceleRAID 170
-AcceleRAID 160 (AcceleRAID 170LP)
-AcceleRAID 352
This is the configuration:
2 hard disks IBM 18gb - RAID1
no standby drives
THIS IS THE PROBLEM:
While I/O is active, Physical Drive 0:1 is disconnected, simulating a drive failure.
-The status message in /proc/rd/status changes from "OK" to "ALERT".
-We launch the commands:
echo "rebuild 0:1" > /proc/rd/c0/user_command
cat /proc/rd/c0/user_command
Rebuild of Physical Drive 0:1 Initiated
At this point everything works ok and the rebuilding starts
cat /proc/rd/c0/current_status
DOES NOT PRODUCE THIS OUTPUT:
.....................................................
DAC960#0: Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) x% completed
BUT PRODUCES:
.....................................................
Disk status Rebuild
.....................................................
Logical Drive 0 (/dev/rd/c0d0) Manual Rebuild Started.

SO, NO REBUILD PROGRESS IS VISUALISED.
THEN, WHEN THE REBUILD PROCESS IS FINISHED, 
cat /proc/rd/c0/user_command produces
Rebuild Completed 
BUT
/proc/rd/status remains in "ALERT"
AND ONLY AFTER A reboot GOES BACK TO "OK"
ALL THIS HAPPENS WITH:
-kernel 2.6

WHILE WITH 
-kernel 2.4.x
EVERYTHING WORKS OK.
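
A minimal sketch of how the progress line quoted above could be checked for programmatically. It only assumes the message format shown in this report ("Rebuild in Progress: Logical Drive 0 (/dev/rd/c0d0) x% completed"); the function names are hypothetical, and the exact wording on other driver versions may differ.

```python
import re

# Pattern built from the progress line quoted in this report.
# Assumption: other driver versions use the same wording.
PROGRESS_RE = re.compile(
    r"Rebuild in Progress: Logical Drive (\d+) \(([^)]+)\) (\d+)% completed"
)


def parse_rebuild_progress(status_text):
    """Return (logical_drive, device, percent), or None if no progress line."""
    match = PROGRESS_RE.search(status_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), int(match.group(3))


def read_current_status(path="/proc/rd/c0/current_status"):
    """Read the status file for controller 0 (path as used in this report)."""
    with open(path) as handle:
        return handle.read()
```

On the affected 2.6 kernel, parse_rebuild_progress(read_current_status()) would be expected to return None during a rebuild, since only the "Manual Rebuild Started" line appears.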
Comment 1 Olaf Kirch 2005-11-28 08:30:28 UTC
Thank you for your bug report.

I agree that it would be a bug if the controller still
displays a status of "alert" even after a successful
rebuild.

But why do you think it is a bug that the rebuild process
isn't "visualized"? In which way did 2.4 visualize the
rebuild process?
Comment 5 carlo menozzi 2005-11-29 08:35:48 UTC
If the rebuild progress is not displayed and a hardware problem blocks the process, I cannot see that it is blocked and I keep waiting for it to end. As the rebuild can take hours, I could wait for hours without knowing that it has stalled because of a hardware problem.
In fact this has actually happened to me. I swapped in what I thought was a working new disk, but it was faulty. I waited for hours for the rebuild to end when it was in fact blocked by the faulty disk.
That time, instead of using the Linux software, I used the controller's built-in software, which let me see that the rebuild was blocked.
Kernel 2.4 showed the percentage the rebuild had reached, so I could follow the process exactly.
Thanks
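
A rough sketch of the kind of stall check described here, assuming the percentage is available (as reported for kernel 2.4). The function name, the sample format, and the idle threshold are all hypothetical; this is an illustration, not part of the DAC960 driver.

```python
def is_stalled(samples, max_idle_seconds):
    """Decide whether a rebuild looks stalled.

    samples: list of (timestamp_seconds, percent) tuples, oldest first.
             percent is None when no progress value could be read.
    Returns True if the latest percentage has not changed for at least
    max_idle_seconds. Returns False when there is no usable data.
    """
    if not samples:
        return False
    last_time, last_percent = samples[-1]
    if last_percent is None:
        # No progress information at all: cannot tell stalled from running.
        return False
    # Walk backwards to find when the current value was first seen.
    first_seen = last_time
    for timestamp, percent in reversed(samples):
        if percent != last_percent:
            break
        first_seen = timestamp
    return last_time - first_seen >= max_idle_seconds
```

With no percentage exported, as on the 2.6 kernel in this report, every sample would carry None and the check cannot distinguish a blocked rebuild from a running one.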
Comment 6 Lars Marowsky-Bree 2005-11-29 11:54:47 UTC
I looked through the code, and this makes sense, at least if your description is complete.

You disconnect a drive, and then initiate a rebuild, which proceeds correctly. However, you never re-connect the drive! So the status remains as ALERT; this will only revert if no drives are critical, failed or offline. Until you re-add the drive, you'll still be missing one though.

Do you in fact re-add the drive? If so, at what step and how?

The progress bar is not displayed for more recent versions of the DAC960 firmware, which no longer exports this bit of information to the kernel, so this can no longer work. We won't/can't fix this.
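
To make the rule above concrete, here is a small sketch of the condition as described in this comment: the status stays at ALERT while any drive is critical, failed or offline. The state names and the function are hypothetical and come from the wording of the comment, not from the driver source.

```python
# Hypothetical state names, taken from the wording of the comment above,
# not from the DAC960 driver source.
DEGRADED_STATES = {"critical", "failed", "offline"}


def expected_status(drive_states):
    """Return "ALERT" if any drive is in a degraded state, else "OK"."""
    for state in drive_states:
        if state.lower() in DEGRADED_STATES:
            return "ALERT"
    return "OK"
```

Under this reading, a completed rebuild onto a re-connected drive should leave no drive in a degraded state, so the status would be expected to return to OK.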

Comment 7 carlo menozzi 2005-12-01 08:10:00 UTC
Perhaps I did not explain things well.
When the drive has failed, we switch off the PC, replace the failed drive with a new one, switch the machine on again and initiate a rebuild. Therefore we DO re-connect the drive.

The problem is also that at the end of the rebuild the status remains in ALERT and if we want it to return to OK, we need to switch off the machine and then switch it on again.

Another thing: it's very useful to be able to see the progress bar when the rebuild itself lasts more than 30 minutes.


Regarding what you say about the progress bar not being displayed for more recent versions of the DAC960 firmware, that is not our case, because we use AC170, AC160 and AC352 controllers, and with kernel 2.4 the progress bar is displayed perfectly.

Comment 11 Lars Marowsky-Bree 2005-12-13 14:42:10 UTC
Ihno, you're taking care of storage features. Could you please decide how important the dac960 is for SLES10/SLES9? Is this a RESOLVED WONTFIX or do we need to reproduce locally & fix?
Comment 12 carlo menozzi 2005-12-15 09:00:50 UTC
In my opinion it is VERY IMPORTANT to fix this bug because Mylex controllers are normally installed in important servers.
And these types of servers are usually switched on 24 hours a day.
Comment 15 Chris L Mason 2005-12-20 15:26:58 UTC
Unfortunately, the DAC960 is not one of the cards we explicitly support.  We don't have sufficient customer demand to have the cards on hand in house, or support contracts with partners in place that require the card.

This bug is also against SL10.0, where we don't provide the level of support needed to dive into the driver and make the required modifications.
Comment 16 Ihno Krumreich 2007-06-04 16:46:11 UTC
Closed.