Bug 115115

Summary: irqbalance causes system lockup
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Paul Beltrani <echo>
Component: Other
Assignee: Stefan Fent <stefan.fent>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
Version: Beta 4
Target Milestone: ---
Hardware: i686
OS: SUSE Other
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Output from hwinfo
             contents of /proc/interrupts

Description Paul Beltrani 2005-09-03 01:23:44 UTC
Shortly after starting /usr/sbin/irqbalance, the system locks up.

Hardware : Asus P4S800D-E Motherboard, Intel P4 CPU, hyper-threading enabled
Package  : irqbalance, version 0.09, release 42
Kernel   : Linux 2.6.13-3-smp #1 SMP Mon Aug 29 19:48:23 UTC 2005

The system would lock up shortly after irqbalance was started during boot.
Booting in safe mode disabled SMP and irqbalance, which allowed the system to boot.

-----
Disabled start of irqbalance on boot: "chkconfig --del irq_balancer"

System was then able to boot normally with hyper-threading enabled

System ran normally until irqbalance was started manually with "rcirq_balancer
start".  Shortly afterward the system would lock up.

This is repeatable.
Comment 1 Andreas Kleen 2005-09-03 17:51:24 UTC
What kind of lockup? When you run in console mode and do klogconsole -l8 -r0
first do you see something? Does the system still react to console switches
then?

Also please attach hwinfo output.

Comment 2 Paul Beltrani 2005-09-05 23:18:31 UTC
Created attachment 48861
Output from hwinfo
Comment 3 Andreas Kleen 2005-09-05 23:34:34 UTC
Hmm, you seem to have an SIS chipset. Maybe those don't like setting the IRQ
affinity and have some APIC bugs.

Did earlier releases work? 

I guess on a HT system like this we can just disable it because irq balancing
only really makes sense on a real multi socket SMP system.
Comment 4 Paul Beltrani 2005-09-05 23:49:38 UTC
> What kind of lockup?  Does the system still react to console switches then?

Within a few seconds of starting irqbalance: System fails to respond to keyboard
inputs. Unable to switch between VTTYs.  System stops displaying output to
attached monitor.

Further testing shows existing SSH connections continue to function and NEW
connections may be established.

Attempted to execute "top" from an SSH connection and the session appeared
to hang.  Had to kill the "top" process from another session to get back to a
command prompt.

> When you run in console mode and do klogconsole -l8 -r0 first do you see
> something?

No.  Booted system with "console=ttyS0,115200n8" kernel arg.  Then ran
"klogconsole -l8 -r0" before starting irqbalance.  No output to VTTY0, serial
port, dmesg or /var/log/messages.

> Did earlier releases work?
Unknown.  Did not try with earlier 10.0x releases
Comment 5 Andreas Kleen 2005-09-06 00:16:38 UTC
Ok. Wild theory: irqbalanced touches the keyboard interrupt and that chipset
doesn't like that.

What happens when you don't start irqbalanced but just run the following over
an ssh connection as root:

cut -d: -f1 /proc/interrupts | while read i ; do
    echo $i 1 
    echo 1 > /proc/irq/$i/smp_affinity
    sleep 1
    echo $i 2 
    echo 2 > /proc/irq/$i/smp_affinity
    sleep 1
done 

Does that lock up too? What is the last output you see? Also 
attach /proc/interrupts.
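
Note that the first row of /proc/interrupts is the CPU header, and some rows
(for example NMI, LOC, ERR) carry non-numeric labels, so the loop above will
also try a few paths that do not exist. A minimal sketch of a variant that
only visits numeric IRQs might look like the following. The function names and
the AFFINITY_ROOT override are hypothetical, added only so the loop can be
dry-run against a saved copy of the table; this is not what irqbalance itself
does.

```shell
#!/bin/sh
# Print only the numeric IRQ labels from an interrupts table.
# The "CPU0 CPU1" header and rows such as NMI/LOC/ERR are skipped.
list_numeric_irqs() {
    # $1: path to an interrupts table (defaults to /proc/interrupts)
    sed -n 's/^ *\([0-9][0-9]*\):.*/\1/p' "${1:-/proc/interrupts}"
}

# For each numeric IRQ, try CPU masks 1 and 2 in turn. The "irq mask" line is
# printed before each write, so the last line shown identifies the IRQ that
# was being changed if the machine hangs.
# AFFINITY_ROOT is a hypothetical override for dry runs; the real tree is
# /proc/irq. Writing there needs root.
probe_affinity() {
    root="${AFFINITY_ROOT:-/proc/irq}"
    list_numeric_irqs "$1" | while read -r irq; do
        for mask in 1 2; do
            echo "$irq $mask"
            # Skip IRQs whose affinity file is missing or not writable.
            [ -w "$root/$irq/smp_affinity" ] || continue
            echo "$mask" > "$root/$irq/smp_affinity"
            sleep 1
        done
    done
}
```

Calling "probe_affinity" with no argument walks the live table. Pointing
AFFINITY_ROOT at a directory that does not exist only prints the IRQ/mask
pairs it would have tried.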

>Unknown.  Did not try with earlier 10.0x releases
I meant releases before 10 like 9.3 or earlier

Comment 6 Paul Beltrani 2005-09-06 00:32:21 UTC
Re: Behavior on pre 10.0 releases
Unknown.  irqbalance was not installed with 9.3 on this hardware.  As 9.3 is no
longer installed on this hardware this would be difficult to test.


Re: test script
linux:~ # ./test.sh
CPU0 CPU1 1
./test.sh: line 4: /proc/irq/CPU0       CPU1/smp_affinity: No such file or directory
CPU0 CPU1 2
./test.sh: line 7: /proc/irq/CPU0       CPU1/smp_affinity: No such file or directory
0 1
0 2

<Hung.  Needed to kill test.sh process from another ssh session>

Ran a second time, output was:
linux:~ # ./test.sh
CPU0 CPU1 1
./test.sh: line 4: /proc/irq/CPU0       CPU1/smp_affinity: No such file or directory

<Hung.  Needed to kill test.sh process from another ssh session>
Comment 7 Paul Beltrani 2005-09-06 00:33:17 UTC
Created attachment 48864
contents of /proc/interrupts
Comment 10 Andreas Kleen 2005-09-07 15:23:42 UTC
It looks like it locks up when trying to change irq 0.   
 
Stefan, can you just make irqbalanced ignore irq 0?  
Comment 11 Stefan Fent 2005-09-09 07:54:59 UTC
Paul, 
does this work with RC1?
In RC1, irqbalance is not installed on P4.
(doesn't make sense with 1 HT CPU anyways)
Comment 12 Andreas Jaeger 2005-09-09 08:44:19 UTC
The exact rule is: It is only installed on 64-bit x86-64 SMP systems.

So, after an rpm -e irqbalance, an update should not install it again.

Lowering priority.
Comment 13 Paul Beltrani 2005-09-09 11:34:34 UTC
> The exact rule is: It is only installed on 64-bit x86-64 SMP systems.
In that case part of the fault is with the installer as irqbalance was installed
as part of a "Standard system with KDE" install.

As an error proofing measure would it be difficult to modify irqbalance to
verify it is on appropriate hardware before continuing to run?

I plan on doing a clean install of RC1 later today on the same hardware and will
report back on what the installer does with irqbalance.

Comment 14 Andreas Kleen 2005-09-09 11:40:11 UTC
That it causes a lockup on your system is a hardware bug in your chipset.
It is hard for irqbalanced to predict hardware bugs like this.

irqbalanced itself is not to blame.
Comment 15 Paul Beltrani 2005-09-09 12:02:12 UTC
> irqbalanced itself is not to blame.

I agree, irqbalance should not have been installed or run on this hardware to
begin with so perhaps this is better described as an installer bug.

However as you mentioned above, it doesn't make much sense to run it on a single
CPU system. What I was suggesting is that perhaps irqbalanced could do a sanity
check for this condition before it runs.  I am NOT suggesting that there should
be some sort of test/exception table for every chipset.  
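
As a rough illustration of the kind of sanity check meant here, a start script
could count distinct physical CPU packages and skip the daemon when there is
only one. This is only a sketch: it assumes /proc/cpuinfo exposes a
"physical id" field on this kernel, and the function names are hypothetical
rather than anything irqbalance or the init script provides.

```shell
#!/bin/sh
# Count distinct physical CPU packages listed in a cpuinfo-style file.
# Hyper-threaded siblings share one "physical id", so a single HT P4
# should report 1 (assumption: the field is present on this kernel).
count_packages() {
    # $1: path to a cpuinfo file (defaults to /proc/cpuinfo)
    grep '^physical id' "${1:-/proc/cpuinfo}" | sort -u | wc -l | tr -d ' '
}

# Succeeds only when more than one physical package is present.
# If the field is missing entirely the count is 0 and this fails,
# which errs on the side of not starting the balancer.
should_balance() {
    [ "$(count_packages "$1")" -gt 1 ]
}
```

A start script could then guard the daemon with something like
"should_balance || exit 0" before launching it.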

In any event, I appreciate everyone's time on this and will report back on how
RC1 deals with the install issue.

Comment 16 Andreas Kleen 2005-09-09 12:04:07 UTC
There are plans to make irqbalanced hyperthreading/dual core aware,
but not for 10.0
Comment 17 Paul Beltrani 2005-09-10 12:42:13 UTC
I have reinstalled the system with 10.0-RC1 using the same AutoYaST control file
as before.  This time irqbalance was NOT installed.  The problem appears to have
been resolved.

Thanks to everyone for their time and effort on this issue.
Comment 18 Stefan Fent 2005-09-12 11:01:42 UTC
Thanks for testing again. As this seems to be the only system with this problem,
I'll close it now.