Bug 148176

Summary: Intermittent BUG: soft lockup detected on CPU#
Product: [openSUSE] SUSE Linux 10.1 Reporter: Garry Forsgren <Garry.Forsgren>
Component: HotplugAssignee: Hannes Reinecke <hare>
Status: VERIFIED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: kernel01
Version: Beta 3   
Target Milestone: ---   
Hardware: x86-64   
OS: SuSE Linux 10.0   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Garry Forsgren 2006-02-03 23:09:26 UTC
Detail: The problem occurs during hotplug cpu testing when bringing a previously offlined CPU online.

=========================================================================
Log output:

Feb  3 13:16:06 linux klogd: Booting processor 3/4 APIC 0x7
Feb  3 13:16:06 linux klogd: Initializing CPU#3
Feb  3 13:16:06 linux klogd: Calibrating delay using timer specific routine.. 5985.52 BogoMIPS (lpj=11971045)
Feb  3 13:16:06 linux klogd: CPU: Trace cache: 12K uops, L1 D cache: 16K
Feb  3 13:16:06 linux klogd: CPU: L2 cache: 2048K
Feb  3 13:16:06 linux klogd: CPU 3 -> Node 0
Feb  3 13:16:06 linux klogd: CPU: Physical Processor ID: 3
Feb  3 13:16:06 linux klogd:                   Intel(R) Xeon(TM) CPU 3.00GHz stepping 03
Feb  3 13:16:06 linux klogd: APIC error on CPU3: 00(40)
Feb  3 13:16:06 linux klogd: migration_cost=8,1792
Feb  3 13:16:06 linux klogd: BUG: soft lockup detected on CPU#3!
Feb  3 13:16:06 linux klogd: CPU 3:
Feb  3 13:16:06 linux klogd: Modules linked in: joydev xt_pkttype ipt_LOG xt_limit freq_table radeon drm edd button battery ac ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables x_tables ipv6 apparmor aamatch_pcre loop dm_mod e1000 generic ehci_hcd uhci_hcd e752x_edac edac_mc ide_cd cdrom i8xx_tco shpchp pci_hotplug usbcore hw_random floppy reiserfs fan thermal processor aic7xxx scsi_transport_spi sg megaraid_mbox megaraid_mm piix sd_mod scsi_mod ide_disk ide_core
Feb  3 13:16:06 linux klogd: Pid: 0, comm: swapper Tainted: G     U 2.6.16-rc1-git3-7-smp #1
Feb  3 13:16:06 linux klogd: RIP: 0010:[<ffffffff80109aff>] <ffffffff80109aff>{mwait_idle+54}
Feb  3 13:16:06 linux klogd: RSP: 0018:ffff810037f1fef8  EFLAGS: 00000246
Feb  3 13:16:06 linux klogd: RAX: 0000000000000000 RBX: 00000000000424d4 RCX: 0000000000000000
Feb  3 13:16:06 linux klogd: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff803bd518
Feb  3 13:16:06 linux klogd: RBP: 000011d30dde85cd R08: ffff810037f1fea0 R09: 0000000000000000
Feb  3 13:16:06 linux klogd: R10: 0000000000000296 R11: ffff810037f1fef0 R12: ffff810037fb60c0
Feb  3 13:16:06 linux klogd: R13: ffff810037fb60c0 R14: ffff810037fb62e8 R15: 0000000000000001
Feb  3 13:16:06 linux klogd: FS:  0000000000000000(0000) GS:ffff810037fc1c80(0000) knlGS:0000000000000000
Feb  3 13:16:06 linux klogd: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Feb  3 13:16:06 linux klogd: CR2: 00002b73ec458d70 CR3: 00000000334ec000 CR4: 00000000000006a0
Feb  3 13:16:06 linux klogd:
Feb  3 13:16:06 linux klogd: Call Trace: <ffffffff80109aa6>{cpu_idle+151} <ffffffff8011875a>{start_secondary+1144}
=========================================================================

To Reproduce:  Run the following script on a 4 processor system and check the log periodically for 'BUG'.

#!/bin/sh
#Test offline/online
LIMIT=1000
COUNT=0
while [ $COUNT -lt $LIMIT ]
do
echo LOOP $COUNT
date
#offline processors
        for i in 3 2 1
        do
                echo 0 > /sys/devices/system/cpu/cpu${i}/online
                date
                cat /proc/cpuinfo | grep processor
                sleep 15
        done
#online processors
        for i in 1 2 3
        do
                echo 1 > /sys/devices/system/cpu/cpu${i}/online
                date
                cat /proc/cpuinfo | grep processor
                sleep 15
        done
        COUNT=`expr $COUNT + 1`
done
date
echo $0 test complete

=========================================================================

Proposed Resolution:
Modify watchdog() function in softlockup.c to execute the touch_softlockup_watchdog() function prior to calling msleep_interruptible().  This will update the timestamp before the timer interrupt can call softlockup_tick() which detects a false soft lockup.

--- softlockup.c.orig   2006-02-03 13:16:52.000000000 -0700
+++ softlockup.c        2006-02-03 13:17:12.000000000 -0700
@@ -85,8 +85,8 @@
         * in softlockup_tick():
         */
        while (!kthread_should_stop()) {
-               msleep_interruptible(1000);
                touch_softlockup_watchdog();
+               msleep_interruptible(1000);
        }
        __set_current_state(TASK_RUNNING);

=========================================================================
Comment 1 Hannes Reinecke 2006-03-09 11:44:39 UTC
Well, if it helps ... as the softlockup detection will be switched off for GA this is hardly a crucial fix. Nevertheless, if it's valid we should include it.

I'll ask for a review.
Comment 2 Andreas Kleen 2006-03-09 12:03:19 UTC
Change looks ok to me.
Comment 3 Andreas Kleen 2006-03-09 13:01:22 UTC
I enabled the patch now.

There's also another softlockup change I wanted to do - disable it during
bootup. Will add that to the patch too.
Comment 4 Ihno Krumreich 2007-06-05 10:34:11 UTC
Closed.