Bug 465039 - ksoftirqd takes 100% cpu, unable to reboot properly
Status: RESOLVED NORESPONSE
Duplicates: 540550 543235
Alias: None
Product: openSUSE 11.3
Classification: openSUSE
Component: Kernel
Version: Final
Hardware: i586 openSUSE 11.3
Priority: P3 - Medium, Severity: Critical
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-09 22:04 UTC by Michal Veselenyi
Modified: 2011-08-30 19:29 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
chart of /proc/interrupts (26.71 KB, image/png)
2009-01-11 18:48 UTC, Michal Veselenyi
proc.interrupts (2.11 MB, application/x-tar)
2009-03-13 11:33 UTC, Elmar Stellnberger
var.log.messages (2.10 MB, text/plain)
2009-03-13 11:35 UTC, Elmar Stellnberger
/var/log/messages file (695.30 KB, text/plain)
2009-03-13 16:08 UTC, Ákos Szőts
erroneous partgui that kann trigger the ksoftirqd bug (1.02 MB, application/x-tbz)
2009-03-15 17:03 UTC, Elmar Stellnberger
proc.interrupts for partgui triggered overhang (1.56 KB, text/plain)
2009-03-15 17:14 UTC, Elmar Stellnberger
still far from being resolved (14 bytes, text/plain)
2009-04-21 14:58 UTC, Elmar Stellnberger
another /proc/interrupts table (30.00 KB, application/x-tar)
2009-10-21 08:29 UTC, Elmar Stellnberger
subsequent /proc/interrupts snapshots, os11.2 RC1 (40.00 KB, application/x-tar)
2009-10-28 10:48 UTC, Elmar Stellnberger
subsequent snapshots (delay:none, 0.5s), os11.2 RC1; nicy cpuirqd (40.00 KB, application/x-tar)
2009-10-28 15:29 UTC, Elmar Stellnberger
2 proc interrupts evolutions during 8min (52.34 KB, image/png)
2009-10-30 11:37 UTC, Michal Veselenyi
three 20x snapshots a 0.5s (190.00 KB, application/x-tar)
2009-11-20 11:29 UTC, Elmar Stellnberger
clcoksource=jiffies, 2.6.37-8.99.14-desktop, 10x a 1s + stacktraces (379.92 KB, application/x-bzip2)
2011-01-20 20:29 UTC, Elmar Stellnberger
tasklet debug patch (1.72 KB, patch)
2011-02-20 15:53 UTC, Jiri Slaby
tasklet debug patch (1.81 KB, patch)
2011-02-20 16:45 UTC, Jiri Slaby

Description Michal Veselenyi 2009-01-09 22:04:07 UTC
After using my notebook for a while, ksoftirqd begins to use all my remaining CPU time.
The only way to return to a normal state is to reboot, but the reboot also fails: it hangs somewhere I cannot see, because I'm using nvidia's video driver and the console goes black as soon as X starts. (I could set the nv or vesa driver in xorg.conf and then see it.)
The worst part is that it hangs during shutdown before the HDDs are unmounted. Neither Ctrl+Alt+Del nor SysRq works, so I have to do a hard power-off. (That is why I set this bug to critical; otherwise it could be major.)

I have no idea what causes it. Maybe the network? Around that time /var/log/messages shows only DHCP doing its stuff, and knetworkmanager sometimes shows up in the top list.
It hangs on both WLAN and LAN.

I searched for a while. Someone else has a similar problem with a Clevo notebook with very similar specs.
I had no such problem with previous suse (11.0)


Any ideas?
My ideas for now are to try these:
- use for a while only the vesa/nv driver
- install vanilla kernel
- download & install 
- disable dhcp daemon (easiest I think)


Here is the head of the top output:

top - 22:35:21 up 1 day,  4:05,  7 users,  load average: 1.66, 1.76, 1.52
Tasks: 142 total,   3 running, 139 sleeping,   0 stopped,   0 zombie
Cpu0  :  2.0%us,  6.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 92.0%si,  0.0%st
Cpu1  :  4.0%us,  0.0%sy,  0.3%ni, 95.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4088564k total,  3940616k used,   147948k free,    66132k buffers
Swap:  4409800k total,       28k used,  4409772k free,  3204660k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    4 root      15  -5     0    0    0 R   99  0.0  24:24.67 ksoftirqd/0
 3049 root      20   0  360m  69m 9772 S    4  1.7  31:08.22 X
 9720 micho     39  19  128m  47m  14m S    4  1.2  14:14.83 operapluginwrap
 9628 micho     20   0  511m 382m  17m S    2  9.6  12:52.75 opera
10710 micho     20   0  156m  56m  29m S    2  1.4  11:58.14 amarokapp
 3853 micho     20   0 35884  13m 8840 S    1  0.3   2:55.25 knetworkmanager
 8838 root      15  -5     0    0    0 S    1  0.0   0:00.96 events/1
 3984 micho     20   0 37716  16m  10m R    0  0.4   1:51.73 konsole
11588 root      20   0 99112  57m  24m S    0  1.4   0:51.00 y2base
    1 root      20   0  1008  356  308 S    0  0.0   0:01.20 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.14 migration/0
    7 root      15  -5     0    0    0 S    0  0.0   0:05.14 events/0


/var/log/messages around the fatal time (I think it occurred somewhere after the "MARK"):
Jan  9 21:46:30 linux-6vsc dhclient: DHCPREQUEST on eth0 to 192.168.1.1 port 67
Jan  9 21:46:31 linux-6vsc dhclient: DHCPACK from 192.168.1.1
Jan  9 21:46:31 linux-6vsc dhclient: bound to 192.168.1.3 -- renewal in 1625 seconds.
Jan  9 22:06:31 linux-6vsc -- MARK --
Jan  9 22:13:36 linux-6vsc dhclient: DHCPREQUEST on eth0 to 192.168.1.1 port 67
Jan  9 22:13:36 linux-6vsc dhclient: DHCPACK from 192.168.1.1
Jan  9 22:13:36 linux-6vsc dhclient: bound to 192.168.1.3 -- renewal in 1726 seconds.



My system:

Linux linux-6vsc 2.6.27.7-9-pae #1 SMP 2008-12-04 18:10:04 +0100 i686 i686 i386 GNU/Linux
I used kde4. Now I have kde3.

sys_vendor   = "CLEVO CO."
sys_product  = "M570TU"
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz
4GB ddr3 ram
nvidia 9800gt

Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
Network controller: Intel Corporation PRO/Wireless 3945ABG [Golan] Network Connection (rev 02)

I hope somebody can help. I have no idea what ksoftirqd is for.

Thanks in advance.
Comment 1 Marcus Meissner 2009-01-10 09:07:01 UTC
Can you get the output of:
cat /proc/interrupts

to see if an interrupt is triggered very often?
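
A minimal sketch of one way to do that comparison, assuming Python is available on the affected machine: parse two saved copies of /proc/interrupts and rank the IRQ lines by how much their counters grew in between. The names `parse_interrupts` and `fastest_growing` are illustrative only and not part of any existing tool.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: [count_per_cpu, ...]}."""
    table = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # the "CPU0 CPU1" header row has no colon
        counts = []
        for field in rest.split():
            if not field.isdigit():
                break  # counters end where the controller/device text begins
            counts.append(int(field))
        table[label.strip()] = counts
    return table


def fastest_growing(before, after, top=3):
    """Rank IRQ labels by total counter increase between two parsed snapshots."""
    growth = {
        irq: sum(counts) - sum(before.get(irq, []))
        for irq, counts in after.items()
    }
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)[:top]
```

Save one copy while the system is healthy and one while ksoftirqd is spinning, pass both texts through `parse_interrupts`, and hand the results to `fastest_growing`; a line that grows far faster than the rest is the first suspect.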
Comment 2 Michal Veselenyi 2009-01-11 09:28:15 UTC
Hi, here it is (I've been running nearly the entire day and ksoftirqd is normal for now).
I'll provide another one (maybe also a graph) when the ksoftirqd problem occurs (I suppose that is what is really needed).

> cat /proc/interrupts
           CPU0       CPU1
  0:   26863620          0   IO-APIC-edge      timer
  1:        254          0   IO-APIC-edge      i8042
  6:          0          0   IO-APIC-edge      lirc_ite8709
  8:          1          0   IO-APIC-edge      rtc0
  9:       8972          0   IO-APIC-fasteoi   acpi
 12:        564          0   IO-APIC-edge      i8042
 16:     224750          0   IO-APIC-fasteoi   uhci_hcd:usb2, nvidia
 18:    1191186          0   IO-APIC-fasteoi   uhci_hcd:usb8, jmb38x_ms:slot0, ohci1394, mmc0
 19:    1121921          0   IO-APIC-fasteoi   ata_piix, ata_piix, ehci_hcd:usb1, uhci_hcd:usb4, uhci_hcd:usb7
 21:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 22:    3042821          0   IO-APIC-fasteoi   HDA Intel
 23:          2          0   IO-APIC-fasteoi   ehci_hcd:usb5, uhci_hcd:usb6
216:    2369317          0   PCI-MSI-edge      iwl3945
217:    6094614          0   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
LOC:    4722382   16643134   Local timer interrupts
RES:    1373132    3216824   Rescheduling interrupts
CAL:    1511391    2413558   function call interrupts
TLB:      32209      34157   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0
Comment 3 Michal Veselenyi 2009-01-11 18:46:22 UTC
Hi again.
Here are the latest interrupt counts from before I rebooted.

   0:   37799577          0   IO-APIC-edge      timer
   1:        309          0   IO-APIC-edge      i8042
   6:      47784          0   IO-APIC-edge      lirc_ite8709
   8:          1          0   IO-APIC-edge      rtc0
   9:      11984          0   IO-APIC-fasteoi   acpi
  12:       1074          0   IO-APIC-edge      i8042
  16:     276919          0   IO-APIC-fasteoi   uhci_hcd:usb2, nvidia
  18:    1315728          0   IO-APIC-fasteoi   uhci_hcd:usb8, jmb38x_ms:slot0, ohci1394, mmc0
  19:    1579638          0   IO-APIC-fasteoi   ata_piix, ata_piix, ehci_hcd:usb1, uhci_hcd:usb4, uhci_hcd:usb7
  21:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
  22:    3305783          0   IO-APIC-fasteoi   HDA Intel
  23:          2          0   IO-APIC-fasteoi   ehci_hcd:usb5, uhci_hcd:usb6
 216:    3033526          0   PCI-MSI-edge      iwl3945
 217:    9101163          0   PCI-MSI-edge      eth0
 NMI:          0          0   Non-maskable interrupts
 LOC:    6434802   22944923   Local timer interrupts
 RES:    1790955    4134157   Rescheduling interrupts
 CAL:    1564485    2469460   function call interrupts
 TLB:      43600      45062   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 SPU:          0          0   Spurious interrupts
 ERR:          0
 MIS:          0


I'm also attaching a chart, created with OOo, of /proc/interrupts collected each minute over ~870 minutes (with a simple script).
Basically I see nothing very unusual in it.

Just to note: the 870th minute is around 18:30. The resume from s2ram is at the 375th minute (9:15 in the morning). I closed all browsers and went out skiing at around 12:30 (I left only Azureus open), so the ksoftirqd problem appeared while I was out.
There is little change in slope at minute 450 (11:30), except for the "eth0" curve (3rd from top). In /var/log/messages there is nothing unusual, only the classic dhcprequest stuff (same as in the previous post). I also realized just now that the minutes aren't very accurate (just a sleep 60), which can add some error.

regards.
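
Because a plain sleep 60 loop drifts, a sampler can record the time each snapshot was actually read and divide by the real elapsed time. The following is a sketch under that assumption, not the script used for the chart; `irq_totals`, `sample` and `rates` are hypothetical names.

```python
import time


def irq_totals(text):
    """Sum the per-CPU counters on each /proc/interrupts row."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # header row (CPU0 CPU1 ...) has no colon
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # first non-numeric field starts the description
            total += int(field)
        totals[label.strip()] = total
    return totals


def sample(n, interval=60.0, path="/proc/interrupts"):
    """Take n snapshots, each tagged with the time it was actually read."""
    snapshots = []
    for i in range(n):
        with open(path) as f:
            snapshots.append((time.monotonic(), irq_totals(f.read())))
        if i < n - 1:
            time.sleep(interval)
    return snapshots


def rates(prev, cur):
    """Interrupts per second for each IRQ between two timestamped snapshots."""
    (t0, c0), (t1, c1) = prev, cur
    elapsed = t1 - t0
    return {irq: (c1[irq] - c0.get(irq, 0)) / elapsed for irq in c1}
```

Plotting `rates` for consecutive snapshots instead of the raw counters makes a change in slope show up as a step, which is easier to spot than a bend in a cumulative curve.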
Comment 4 Michal Veselenyi 2009-01-11 18:48:11 UTC
Created attachment 264360 [details]
chart of /proc/interrupts

Temporal chart of /proc/interrupts
Comment 5 Michal Veselenyi 2009-02-14 13:14:21 UTC
Hello.
I have some good news.
I downloaded/compiled/installed the newest kernel from kernel.org (2.6.28.2) and I'm running on it for some week now without any problem. In fact, there were lots of changes concerning softirqd in the 2.6.28 release.
Maybe an upgrade of the current opensuse kernel (2.6.27.7-9.1) in repositories would fix this for anybody else.

Regards.
Comment 6 Elmar Stellnberger 2009-03-13 11:33:54 UTC
Created attachment 279389 [details]
proc.interrupts
Comment 7 Elmar Stellnberger 2009-03-13 11:35:56 UTC
Created attachment 279390 [details]
var.log.messages
Comment 8 Ákos Szőts 2009-03-13 16:08:12 UTC
Created attachment 279486 [details]
/var/log/messages file

I also suffer from this bug.

uname -ir:
2.6.27.19-3.2-default x86_64

/proc/interrupts:
           CPU0       CPU1                      
  0:      71832      72828   IO-APIC-edge      timer
  1:          5          7   IO-APIC-edge      i8042
  8:          1          0   IO-APIC-edge      rtc0
  9:          0          1   IO-APIC-fasteoi   acpi
 12:         72         64   IO-APIC-edge      i8042
 14:       1690       1616   IO-APIC-edge      ata_piix
 15:          0          0   IO-APIC-edge      ata_piix
 16:        581        168   IO-APIC-fasteoi   nvidia
 17:      26663      10586   IO-APIC-fasteoi   ata_piix, eth0, b43
 18:          0          0   IO-APIC-fasteoi   mmc0
 19:          1          1   IO-APIC-fasteoi   ohci1394
 20:       3236       2221   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb4, ehci_hcd:usb7
 21:       8821       2932   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb5, HDA Intel
 22:          0          0   IO-APIC-fasteoi   ehci_hcd:usb3, uhci_hcd:usb6
NMI:          0          0   Non-maskable interrupts
LOC:      53608      57620   Local timer interrupts
RES:      13893      16887   Rescheduling interrupts
CAL:       1241        296   function call interrupts
TLB:        216        229   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          0

lspci:
00:00.0 Host bridge: Intel Corporation Mobile PM965/GM965/GL960 Memory Controller Hub (rev 0c)
00:01.0 PCI bridge: Intel Corporation Mobile PM965/GM965/GL960 PCI Express Root Port (rev 0c) 
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2 (rev 02)
00:1c.3 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 4 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HEM (ICH8M) LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) IDE Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation GeForce 8400M GS (rev a1)
03:00.0 Ethernet controller: Broadcom Corporation BCM4401-B0 100Base-TX (rev 02)
03:01.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 05)
03:01.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 22)
03:01.2 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 12)
03:01.3 System peripheral: Ricoh Co Ltd xD-Picture Card Controller (rev 12)
0c:00.0 Network controller: Broadcom Corporation BCM4311 802.11b/g WLAN (rev 01)

I attached the /var/log/messages file.

I have a dual core Intel Core2 CPU, and one of the cores was totally used by ksoftirqd and the network died after some time (`ping` also had an error with some sort of sendmessage buffer).
Comment 9 Ákos Szőts 2009-03-13 16:28:21 UTC
I can reproduce this bug with Eclipse and Aptana (and probably with Last.fm).

I have Eclipse installed with Aptana, and the latter wants to download a few MB of updates. At 21% the download stalls, and Last.fm stalls as well.

After closing Eclipse and Last.fm, the CPU usage caused by ksoftirqd/1 decreases to 0-1%.
Comment 10 Michal Veselenyi 2009-03-14 14:09:23 UTC
Hi. Did you try to update to the newest kernel (2.6.28.x)? It solved this for me (at least I didn't have any problems so far).
Comment 11 Elmar Stellnberger 2009-03-15 16:55:44 UTC
  Indeed, kernel 2.6.28-next-20090107-20090107.18-default from http://download.opensuse.org/repositories/Kernel:/linux-next/openSUSE_11.1/ seems to resolve the issue. If I start the self-compiled partgui (/usr/sbin/piguicqt), it simply returns an error with the new kernel, whereas with the old kernel it has always caused a light variant of the 100%-cpu-ksoftirqd bug, fortunately without triggering any disk access (which makes things much worse). However, linux-next is not an option for me, since it does not wake up from s2ram on my machine, as pm-suspend.log revealed.
Comment 12 Elmar Stellnberger 2009-03-15 17:03:46 UTC
Created attachment 279654 [details]
erroneous partgui that kann trigger the ksoftirqd bug

  Here I have uploaded an erroneously self-compiled version of partgui that can trigger a light version of the ksoftirqd bug featuring 100% cpu load but no disk access. Note that the cause for the ksoftirqd overload during normal operation will be different from that kind of artificially triggered one (and more severe because of hdd-access overload). To test with it type:
> make install (on a 64bit machine)
> /usr/sbin/piguicqt (as root)
Comment 13 Elmar Stellnberger 2009-03-15 17:14:37 UTC
Created attachment 279655 [details]
proc.interrupts for partgui triggered overhang
Comment 14 Elmar Stellnberger 2009-04-21 14:58:08 UTC
Created attachment 287170 [details]
still far from being resolved
Comment 15 Elmar Stellnberger 2009-04-21 15:06:10 UTC
This time it is a permanent hang (sometimes it goes away by itself).
It occurs on both platforms: i586 and x86_64.
It should perhaps have been a shipment blocker.
Why don't they offer us a downgrade?
Comment 16 Elmar Stellnberger 2009-04-21 15:09:31 UTC
Of what use will the 'next' version be if it comes with its very own set of unacceptable bugs? I believe it should be resolved for OpenSuse11.1.
Comment 17 Greg Kroah-Hartman 2009-04-21 15:26:07 UTC
Only Novell is allowed to set the priority.
Comment 18 Greg Kroah-Hartman 2009-04-21 15:27:27 UTC
If no one can duplicate this without the nvidia driver loaded, there is not going to be anything that we can do about this.

So, can someone run without the nvidia driver and still see this?
Comment 19 Michal Veselenyi 2009-04-22 07:24:21 UTC
Hi
For now I'm using the vanilla 2.6.28.2 for 2 months without problems.

I could try booting 2.6.27 (the original kernel from openSUSE 11.1, which I still have in grub) without the nvidia driver.
That is fairly easy but will take some time (the bug sometimes appears after a few minutes, but sometimes only after several hours or a day).
Comment 20 Elmar Stellnberger 2009-04-22 09:55:38 UTC
  Perhaps I forgot to mention that this occurs with the ati radeonhd driver on x86_64 platforms as well. Unfortunately linux-next (2.6.28) is not an option as long as the s2ram problems are not resolved there (Bug 496954), though the ksoftirqd issue does not seem to affect linux-next (2.6.28).
  Has anyone tried to trigger the overload with 2.6.27 kernel and my partgui test compilation?
Comment 21 Elmar Stellnberger 2009-04-22 09:56:29 UTC
.
Comment 22 Jeff Mahoney 2009-05-21 16:29:31 UTC
(In reply to comment #16)
> Of what use will the 'next' version be if it comes with its very own set of
> unacceptable bugs? I believe it should be resolved for OpenSuse11.1.

The linux-next kernel isn't an official openSUSE release. The description itself indicates where to report bugs while using it.

If you're comfortable building and testing kernels, I can give you some tips on how to track down the bug more quickly. Once you've identified the upstream fix, then we can backport it to the openSUSE 11.1 kernel.
Comment 23 Elmar Stellnberger 2009-06-20 20:12:12 UTC
Could you give me some advice on how to activate Apparmor for the 2.6.30 kernel provided at ftp.suse.com/pub/projects/kernel/kotd/master? There is still no replacement for the 2.6.27 kernel series which keeps suffering from the ksoftirq-bug! For me 2.6.30 is now working best, better than linux-next (no s2ram) and of course better than 2.6.27.
Comment 24 Jeff Mahoney 2009-06-20 20:34:20 UTC
AppArmor hasn't been forward-ported to 2.6.30 yet.
Comment 25 Jeff Mahoney 2009-09-17 21:31:03 UTC
AppArmor has since been forward ported to the master kernel and has been available since 11.2 M4. I still haven't been able to reproduce on 11.1.
Comment 26 Stephan Kulow 2009-09-22 18:26:21 UTC
*** Bug 540550 has been marked as a duplicate of this bug. ***
Comment 27 Stephan Kulow 2009-10-01 08:29:30 UTC
moving to 11.2 as Elmar sees it there too.
Comment 28 Camaleon -- 2009-10-01 21:20:34 UTC
*** Bug 543235 has been marked as a duplicate of this bug. ***
Comment 29 Elmar Stellnberger 2009-10-21 08:29:37 UTC
Created attachment 323402 [details]
another /proc/interrupts table
Comment 30 Elmar Stellnberger 2009-10-28 10:48:30 UTC
Created attachment 324468 [details]
subsequent /proc/interrupts snapshots, os11.2 RC1
Comment 31 Michal Veselenyi 2009-10-28 12:42:47 UTC
Hi all.
I can confirm this ksoftirq bug happens also on 11.2 RC1
On 11.1 with vanilla kernel (2.6.28, 29) it did not happen anymore. I suppose it will be same case here.
I have now 2 versions of /proc/interrupts monitored so I'll look at them if I can see any difference between them (I'll reply then).

Could it be that opensuse patches to kernel could cause this?
Comment 32 Elmar Stellnberger 2009-10-28 15:29:20 UTC
Created attachment 324507 [details]
subsequent snapshots (delay:none, 0.5s), os11.2 RC1; nicy cpuirqd

There are many different types of ksoftirqd problems:
* full unnicy cpu usage (user load, no disk a.)
* full cpu usage as nice load (no disk a.)
* half cpu usage on dual core systems (no disk a.)
* massive disk access issues
Comment 33 Michal Veselenyi 2009-10-30 11:37:05 UTC
Created attachment 324847 [details]
2 proc interrupts evolutions during 8min

Attaching /proc/interrupts chart measured each minute during approx 8 minutes.
col1 and col2 are 1st and 2nd columns from /proc/interrupts - but I had to call them rather cpu0 and cpu1.
Case OK is the normal state of PC, case BAD is when ksoftirqd takes 100% of one cpu.
This was taken on opensuse 11.2 RC1 with kernel: Linux linux-7bt6 2.6.31.3-1-desktop #1 SMP PREEMPT 2009-10-08 00:27:25 +0200 i686 i686 i386 GNU/Linux

You can clearly see that LOC(cpu0), LOC(cpu1), 0(cpu0) go crazy during ksoftirqd
madness.

I see there is new kernel update to 2.6.31.5. I'll look if it still occurs there.
Comment 34 Elmar Stellnberger 2009-11-20 11:29:40 UTC
Created attachment 328648 [details]
three 20x snapshots a 0.5s

 Horrible, it is just horrible. Instead of being resolved after two full releases, this problem has worsened, and openSUSE 11.2 has failed. I will have to look for another distro or cease to use Linux. Things are just unacceptable as they are now. It happens very often, although such a thing is not supposed to happen at all ... and no one seems to feel responsible for it.
Comment 35 Elmar Stellnberger 2009-11-20 11:31:13 UTC
P5-None is a provocation; it needs to be P1. Accept it - or never see me again at openSUSE!
Comment 36 Elmar Stellnberger 2009-11-20 12:00:26 UTC
Why can't we simply drop ksoftirqd and leave all softirqs unhandled?
Comment 37 Elmar Stellnberger 2009-11-22 18:32:41 UTC
 Sorry for the harsh criticism. There have been some quality issues with openSUSE 11.2 which I did not raise in time (although this is not the right place to complain).
 The problem is not openSUSE specific, as it occurs on Debian, Ubuntu, RedHat/Fedora and Mandriva as well. Nonetheless it would be really great if you could do anything about it!!
 Kernel downgrading does not seem to be an option, since kernel 2.6.25 (no ksoftirqd problem) seems to inhibit some powersave scripts, with the root cause on my side. Isn't it simply possible to diff all changes from kernel 2.6.25 to 2.6.27? That is a lot of work, sure, but it needs to be done, as the problem has not resolved itself over time. The ksoftirqd hangs are often bearable on a dual core machine, though they sometimes cause such massive disk access that a hard reset is the only escape. Perhaps kernel developers of other distros could help us. It should be possible with a joint effort.
Comment 38 Jeff Mahoney 2009-11-22 19:26:52 UTC
Please stop touching the priorities. At least look up what the priorities MEAN before setting them.

Realistically, we are not going to go through every single change between 2.6.25 and 2.6.27. There are over *21000* of them and the fact remains that we are still unable to reproduce it.

Since you seem to be able to reproduce it reliably, we can show you how to bisect the vanilla kernel down to the exact change that causes the issue. This is really the only way we're going to be able to track this down. The good news is that you should only need to test it a maximum of 15 times.
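
The estimate of 15 follows from halving the candidate range at every step: about 21000 changes need roughly log2(21000) ≈ 14.4 rounds, rounded up. A quick check, assuming a simple linear history (a real `git bisect` on a branching history may report a slightly different count; `bisect_steps` is an illustrative name):

```python
def bisect_steps(n_changes):
    """Upper bound on test builds when each test halves the candidate range.

    Equivalent to ceil(log2(n_changes)), computed with integers to avoid
    floating-point rounding.
    """
    if n_changes < 1:
        raise ValueError("need at least one candidate change")
    return (n_changes - 1).bit_length()


print(bisect_steps(21000))  # 15
```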
Comment 39 Elmar Stellnberger 2009-11-22 19:58:56 UTC
 The problem is that I will most likely only be able to tell you the lowest version that still shows the ksoftirqd problem, but not the highest that does not. It occurs very irregularly: sometimes multiple times a day, and sometimes not a single time in a whole month (perhaps depending on the kernel version).
  Isn't there really any possibility of finding the changes that most likely affect ksoftirqd, i.e. those that take place in certain modules or that refer to variables or call procedures of a certain module?
  If not, could you please provide me with ready-to-install builds of the intermediate kernel versions via the buildservice? I will at least try to run the posted partgui compilation that triggered the ksoftirqd problem on my machine. Personally I would prefer to run the kernels from a USB stick (you never know; the posted compilation could contain a backdoor, as I am not sure whether my system had been cracked at that time; I should perhaps have mentioned that here). Do you have a link on booting from USB sticks for me?
Comment 40 Elmar Stellnberger 2009-11-22 20:07:23 UTC
 ... if the posted compilation should not do its job (well for a certain kernel and system setup it certainly did.) can you imagine anything that could trigger the ksoftirqd-problem? We could try different things out; perhaps we can find another program that can trigger it. That would ease tracing down the problem considerably. Any ideas? We should ask multiple developers!
Comment 41 Elmar Stellnberger 2009-11-23 16:45:16 UTC
 Please provide me with the respective kernel versions so that I can start testing!!
 The situation is intolerable as it is now. My whole system always slows down so much that I cannot work with it, and I cannot reboot all the time either.
Comment 42 Jeff Mahoney 2009-11-23 20:29:50 UTC
The important bit is "system setup." I haven't been able to reproduce it and I'm not going to try to reproduce your setup. If you think your system has been compromised, it's a good idea to reinstall it anyway. Since there have been other reports of this, I don't expect that's the root cause.

That said, openSUSE is community supported with best-effort support from Novell engineers. I understand that your problem is making it difficult for you to get work done but if you look at how many other open kernel bugs there are (for openSUSE and other community distros), you'll understand why I don't have time to create a bisect tree for you.

The problem has been observed across distributions and you've narrowed it down to differences between two releases. I can give you an RPM containing 2.6.26-vanilla, but beyond that you're going to need to build test kernels yourself.
Comment 43 Elmar Stellnberger 2009-11-23 21:00:38 UTC
  Well, I am already working on a fresh installation (which could already be compromised again; you never know). If I had links to the source tar.bz2s for all versions to test, I could simply copy the kernel package with osc and swap in the .tar.bz2 myself to get the respective versions built.
Comment 44 Michal Veselenyi 2009-11-23 21:16:24 UTC
So here I am back. The new 2.6.31.5 SUSE kernel also hangs in ksoftirqd. I downloaded the vanilla kernel and took the .config from /usr/src/linux-2.6.31.5-0.1-obj/i386/desktop/.config
Recompiled and booted. The result is very good: After 20 days not a single ksoftirqd problem.

I looked at the 2 .config files and there are a few differences.
Here is roughly what is activated in the openSUSE kernel (compared to vanilla):
CONFIG_SUSE_KERNEL=y
CONFIG_SPLIT_PACKAGE=y
CONFIG_NF_CONNTRACK_SLP=m
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_CIPHER_TWOFISH=m
CONFIG_DM_RAID=m
CONFIG_DM_RAID45=m
CONFIG_TOUCHSCREEN_ELOUSB=m
CONFIG_CRASHER=m
CONFIG_BOOTSPLASH=y
CONFIG_SND_HDA_PATCH_LOADER=y
CONFIG_SND_HDA_CODEC_CIRRUS=y
CONFIG_SAMSUNG_LAPTOP=m
CONFIG_EXT3_DEFAULTS_TO_BARRIERS_ENABLED=y
CONFIG_EXT3_FS_NFS4ACL=y
CONFIG_REISERFS_DEFAULTS_TO_BARRIERS_ENABLED=y
CONFIG_FS_NFS4ACL=y
CONFIG_XFS_DMAPI=m
CONFIG_DMAPI=m
CONFIG_NOVFS=m
CONFIG_UNWIND_INFO=y
CONFIG_STACK_UNWIND=y
CONFIG_KDB=y
CONFIG_KDB_MODULES=m
CONFIG_KDB_OFF=y
CONFIG_KDB_CONTINUE_CATASTROPHIC=0
CONFIG_KDB_USB=y
CONFIG_KDB_KDUMP=y
CONFIG_SECURITY_DEFAULT="apparmor"
CONFIG_SECURITY_APPARMOR=y
CONFIG_SECURITY_APPARMOR_NETWORK=y
CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1
CONFIG_SECURITY_APPARMOR_DISABLE=y
CONFIG_KVM_KMP=y

What I found in my config:
CONFIG_SCHED_OMIT_FRAME_POINTER=y

No XEN is present in any config.
I just copied .config file to vanilla kernel, started menuconfig and exited.

Could it be that some SUSE patches cause the ksoftirqd problems?
I'm a bit concerned about the many new FS options and anything that contains DMA in its name. I also see there is something new about SND_HDA (Intel HDA, which I have).

I think I'll try to compile the SUSE kernel directly with the i386/desktop/.config and then try disabling all the options listed above, and see what happens.

regards.
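
For anyone repeating this comparison, here is a minimal sketch of diffing two kernel .config files, assuming the usual format of `CONFIG_X=value` lines and `# CONFIG_X is not set` comments. It is not how the list above was produced; `parse_config` and `config_diff` are hypothetical names, and an option missing from one file is treated as "n" purely for comparison.

```python
def parse_config(text):
    """Map each CONFIG_* option to its value; 'is not set' comments map to 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def config_diff(a, b):
    """Return {option: (value_in_a, value_in_b)} for options whose values differ.

    An option absent from one side is treated as 'n' (assumption: the symbol
    does not exist in that tree, so it cannot be enabled there).
    """
    diff = {}
    for name in sorted(set(a) | set(b)):
        va, vb = a.get(name, "n"), b.get(name, "n")
        if va != vb:
            diff[name] = (va, vb)
    return diff
```

Reading both files with `open(path).read()` and passing the parsed results to `config_diff` gives a dictionary that can be printed line by line; options that only exist because of distribution patches show up as `('y', 'n')` or `('m', 'n')`.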
Comment 45 Elmar Stellnberger 2009-11-24 12:30:07 UTC
... but if you look at how many other open kernel bugs there are ...
 The ksoftirqd problem is clearly the worst, the most annoying and by now the most frequently occurring bug of all. It applies to all Linux users and should therefore take precedence over other minor issues. Please do push a resolution forward!
Comment 46 Elmar Stellnberger 2009-11-24 12:45:59 UTC
  Please provide me with the respective kernel rpms or tell me how to create them (at best with the buildservice). Where to download the sources and suse-patches?
Comment 47 Elmar Stellnberger 2009-11-26 09:59:15 UTC
  It is really a shame that kernel developers are simply unwilling to care about this problem! The fact that it is hard to reproduce is no excuse.
Comment 48 Elmar Stellnberger 2009-12-17 13:55:35 UTC
 Better with 2.6.32.1. However, only time can tell whether the problem has gone completely. Perhaps we should mark this as resolved and re-open it as soon as it is seen again.
Comment 49 Elmar Stellnberger 2010-01-14 13:15:34 UTC
  I want to mark this as resolved since it has not occurred for a while now.
However, please do have a look at another nasty property of current kernels:
Bug 566391, s2disk fails.
I am using 2.6.32.3-0.0.15.68cba77-desktop in the meantime.
Comment 50 Michal Veselenyi 2010-12-25 08:41:13 UTC
Well, it is really embarrassing, but this bug persists in the newest openSUSE 11.3 with kernel 2.6.34.7 as well.

My current uname -a:
Linux linux-wew7.site 2.6.34.7-0.5-desktop #1 SMP PREEMPT 2010-10-25 08:40:12 +0200 i686 i686 i386 GNU/Linux
Comment 51 Elmar Stellnberger 2011-01-03 16:25:43 UTC
Things have actually already improved for many users!
The problem luckily hasn't plagued me lately.

Michal, what kind of ksoftirqd problem was it?
CPU usage only, or with massive disk access and a totally unresponsive system?
100% CPU usage on both CPUs or only on one?
Did the problem go away by itself, or was a reboot the only escape?
How often has it occurred so far?
What kind of system are you using (hardware, modules)? Perhaps someone can tell us what to look at.
Comment 52 Elmar Stellnberger 2011-01-08 10:48:37 UTC
Oops! The problem simply hasn't occurred for me because I was using the clocksource=jiffies boot option. However, this isn't ideal.
Comment 53 Jeff Mahoney 2011-01-17 16:48:42 UTC
Bumping product to 11.3 since it still exists. I'm tossing this one back into the open bug queue because it's not my area of expertise.
Comment 54 Elmar Stellnberger 2011-01-20 20:29:28 UTC
Created attachment 409384 [details]
clcoksource=jiffies, 2.6.37-8.99.14-desktop, 10x a 1s + stacktraces

  Help! Now not even clocksource=jiffies helps. I just got 100% CPU usage on both cores with a 2.6.37-8.99.14.138eeaa-desktop kernel. A short while after the snapshots (/proc/interrupts + stack dumps) were taken, massive disk access followed.

** novelty **  The first time for the ksoftirqd 100% cpu usage problem several stack dumps were taken (by Alt-PrnScr-L) to let you see in which execution state the CPU was. So just have a look at this.
Comment 55 Brandon Philips 2011-02-16 17:51:59 UTC
Can you please try the Kernel of the Day? 
 http://en.opensuse.org/openSUSE:Kernel_of_the_day

If it still happens we should report it upstream so that it can get upstream attention. Also can you attach the output from `hwinfo --all` to this bug?

Thanks, Brandon
Comment 56 Jiri Slaby 2011-02-20 15:53:41 UTC
Created attachment 415179 [details]
tasklet debug patch

(In reply to comment #54)
> ** novelty **  The first time for the ksoftirqd 100% cpu usage problem several
> stack dumps were taken (by Alt-PrnScr-L) to let you see in which execution
> state the CPU was. So just have a look at this.

The 2.6.37 traces are useless. The 2.6.34.7 ones are helpful though. Also /proc/softirqs clearly shows that some kind of shit schedules a tasklet way too often.

I'm attaching a patch to track that down. Also I'm building a kernel to test and it will appear at:
http://labs.suse.cz/jslaby/bug-465039

Watch for tasklet_action in the logs when this happens. Maybe there will be false positives. Then I'll increase the limit. Let's see.
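
Independently of the patched kernel, the softirq counters can be compared across two snapshots to see which row is running away. A minimal sketch, assuming the usual /proc/softirqs layout (a CPU header row followed by `NAME: count count ...` rows); `parse_softirqs` and `busiest` are illustrative names, and this only confirms which softirq class is hot, not which driver schedules it.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs text into {name: [count_cpu0, count_cpu1, ...]}."""
    rows = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # the CPU header row has no colon
        rows[name.strip()] = [int(field) for field in rest.split()]
    return rows


def busiest(before, after):
    """Return (name, cpu_index, increase) for the counter that grew the most."""
    best = None
    for name, counts in after.items():
        old = before.get(name, [0] * len(counts))
        for cpu, (prev, cur) in enumerate(zip(old, counts)):
            if best is None or cur - prev > best[2]:
                best = (name, cpu, cur - prev)
    return best
```

Taking two snapshots a second apart during a hang and calling `busiest` on them should point at TASKLET if a tasklet is indeed being rescheduled continuously.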
Comment 57 Jiri Slaby 2011-02-20 16:45:53 UTC
Created attachment 415180 [details]
tasklet debug patch

s/time_after/time_before/ indeed. Rebuilding.
Comment 58 Brandon Philips 2011-03-03 16:31:16 UTC
Michal- Can you please test Jiri's Kernel?
Comment 59 Jiri Slaby 2011-03-13 15:34:11 UTC
(In reply to comment #58)
> Michal- Can you please test Jiri's Kernel?

Or maybe Elmar?
Comment 60 Elmar Stellnberger 2011-03-28 18:16:08 UTC
  Well, this is increasingly hard to test nowadays. I might run the patched kernel for three months without being able to tell whether the ksoftirqd bug has vanished, because it occurs so rarely and irregularly. Unfortunately I have been away recently and thus was not able to test. What we need is something that can trigger the ksoftirqd bug.
  Michal, could you try to run partgui as provided by attachment 5, "erroneous partgui that kann trigger the ksoftirqd bug"? Then let us see if we can still trigger it.
Comment 61 Michal Veselenyi 2011-03-29 08:50:24 UTC
(In reply to comment #59)
> (In reply to comment #58)
> > Michal- Can you please test Jiri's Kernel?
> 
> Or maybe Elmar?

I'll try to find some time for it.
Even for me it was hard to reproduce. But it has occurred for me at least once on openSUSE 11.3.

From my observations, it can happen under heavy, long-lasting network load. More precisely, it happened when I left Azureus running for several hours (with the desktop locked), but it also happened while I was working on the computer.
Comment 62 Greg Kroah-Hartman 2011-08-30 19:29:15 UTC
Closing due to lack of response.  If this is still an issue, please reopen with the requested information.