Bug 641900

Summary: Xen kernel crash after about 16 hours, network stopped later / sometimes disk controller crashes too
Product: [openSUSE] openSUSE 11.3 Reporter: Paul Pinault <disk_91>
Component: Kernel Assignee: Jan Beulich <jbeulich>
Status: VERIFIED NORESPONSE QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P4 - Low CC: ihno, tonyj
Version: Final   
Target Milestone: ---   
Hardware: x86-64   
OS: openSUSE 11.3   
Whiteboard:
Found By: Community User Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: full kernel log since opensuse 11.3 install ; see after sept 23 for last kernel update
boot.msg normal kernel (no xen)
boot.msg xen kernel
First log booting Dom0 and crashing dom0
Second Log booting dom0 with acpi=on
Third Log booting Dom0 with acpi=off

Description Paul Pinault 2010-09-25 15:48:01 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.2.10) Gecko/20100914 SUSE/3.6.10-0.3.1 Firefox/3.6.10

Sep 25 05:50:21 saturn kernel: [58361.780047] ------------[ cut here ]------------
Sep 25 05:50:21 saturn kernel: [58361.780058] WARNING: at /usr/src/packages/BUILD/kernel-xen-2.6.34.7/linux-2.6.34/net/sched/sch_generic.c:256 dev_watchdog+0x25b/0x270()
Sep 25 05:50:21 saturn kernel: [58361.780060] Hardware name: System Product Name
Sep 25 05:50:21 saturn kernel: [58361.780062] NETDEV WATCHDOG: eth3 (forcedeth): transmit queue 0 timed out
Sep 25 05:50:21 saturn kernel: [58361.780063] Modules linked in: ip6t_LOG xt_tcpudp xt_pkttype xt_physdev ipt_LOG xt_limit usbbk gntdev netbk it87 blkbk blkback_pagemap hwmon_vid snd_pcm_oss blktap domctl xenbus_be coretemp evtchn snd_mixer_oss snd_seq snd_seq_device edd nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs bridge stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables fuse loop snd_hda_codec_realtek firewire_ohci firewire_core crc_itu_t snd_hda_intel snd_hda_codec snd_hwdep snd_pcm ohci1394 ieee1394 8139too snd_timer usblp usbhid i2c_nforce2 hid i2c_core 8139cp forcedeth pcspkr snd soundcore snd_page_alloc sr_mod sg shpchp pci_hotplug ext4 jbd2 crc16 dm_mirror dm_region_hash dm_log ohci_hcd ehci_hcd sd_mod usbcore dm_snapshot dm_mod xenblk cdrom xennet processor ata_generic pata_amd sata_nv libata scsi_mod thermal_sys h
Sep 25 05:50:21 saturn kernel: wmon
Sep 25 05:50:21 saturn kernel: [58361.780122] Pid: 0, comm: swapper Not tainted 2.6.34.7-0.3-xen #1
Sep 25 05:50:21 saturn kernel: [58361.780124] Call Trace:
Sep 25 05:50:21 saturn kernel: [58361.780135]  [<ffffffff80009646>] dump_trace+0x76/0x1a0
Sep 25 05:50:21 saturn kernel: [58361.780139]  [<ffffffff8040a79b>] dump_stack+0x69/0x6f
Sep 25 05:50:21 saturn kernel: [58361.780143]  [<ffffffff80043943>] warn_slowpath_common+0x73/0xb0
Sep 25 05:50:21 saturn kernel: [58361.780146]  [<ffffffff800439e0>] warn_slowpath_fmt+0x40/0x50
Sep 25 05:50:21 saturn kernel: [58361.780149]  [<ffffffff8034d04b>] dev_watchdog+0x25b/0x270
Sep 25 05:50:21 saturn kernel: [58361.780155]  [<ffffffff80053d34>] run_timer_softirq+0x1d4/0x3d0
Sep 25 05:50:21 saturn kernel: [58361.780159]  [<ffffffff8004b8c8>] __do_softirq+0xe8/0x220
Sep 25 05:50:21 saturn kernel: [58361.780162]  [<ffffffff80007efc>] call_softirq+0x1c/0x30
Sep 25 05:50:21 saturn kernel: [58361.780166]  [<ffffffff80009595>] do_softirq+0xa5/0xe0
Sep 25 05:50:21 saturn kernel: [58361.780175]  [<ffffffff8004bafd>] irq_exit+0x8d/0xa0
Sep 25 05:50:21 saturn kernel: [58361.780182]  [<ffffffff802d27d2>] evtchn_do_upcall+0x222/0x270
Sep 25 05:50:21 saturn kernel: [58361.780188]  [<ffffffff80007a4e>] do_hypervisor_callback+0x1e/0x30
Sep 25 05:50:21 saturn kernel: [58361.780207]  [<ffffffff800033aa>] 0xffffffff800033aa
Sep 25 05:50:21 saturn kernel: [58361.780213]  [<ffffffff80009c0c>] xen_safe_halt+0xc/0x10
Sep 25 05:50:21 saturn kernel: [58361.780216]  [<ffffffff8000e763>] xen_idle+0x43/0xc0
Sep 25 05:50:21 saturn kernel: [58361.780220]  [<ffffffff80005255>] cpu_idle+0x55/0xa0
Sep 25 05:50:21 saturn kernel: [58361.780223] ---[ end trace 92ba00751c8e0e8f ]---
Sep 25 05:50:21 saturn kernel: [58361.780226] eth3: Got tx_timeout. irq: 00000036
Sep 25 05:50:21 saturn kernel: [58361.780228] eth3: Ring at 80f8000
Sep 25 05:50:21 saturn kernel: [58361.780229] eth3: Dumping tx registers
Sep 25 05:50:21 saturn kernel: [58361.780235]   0: 00000036 000000df 00000003 0009000d 00000000 00000000 00000000 00000000
Sep 25 05:50:21 saturn kernel: [58361.780241]  20: 00000000 f0000000 00000000 00000000 00000000 00000000 00000000 00000000
Sep 25 05:50:21 saturn kernel: [58361.780246]  40: 0420e20e 0000a455 00002e20 00000000 00000000 00000000 00000000 00000000
Sep 25 05:50:21 saturn kernel: [58361.780254]  60: 00000000 00000000 00000000 0000ffff 0000ffff 0000ffff 0000ffff 00000000

[...]

continue ...
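
When the full /var/log/messages is long, the watchdog events can be picked out with a short script. This is only a minimal sketch, assuming the syslog line layout shown in the excerpt above; the function name find_watchdog_events and the SAMPLE list are hypothetical names introduced for illustration, and the two sample lines are copied from the excerpt.

```python
import re

# Matches the watchdog message as it appears in the excerpt above, e.g.
# "... kernel: [58361.780062] NETDEV WATCHDOG: eth3 (forcedeth): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_events(lines):
    """Return one dict per NETDEV WATCHDOG line found in `lines`."""
    events = []
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if match:
            events.append({
                "stamp": match.group("stamp"),
                "uptime_s": float(match.group("uptime")),
                "iface": match.group("iface"),
                "driver": match.group("driver"),
                "queue": int(match.group("queue")),
            })
    return events


# Two lines taken from the excerpt above; only the first is a watchdog event.
SAMPLE = [
    "Sep 25 05:50:21 saturn kernel: [58361.780062] NETDEV WATCHDOG: "
    "eth3 (forcedeth): transmit queue 0 timed out",
    "Sep 25 05:50:21 saturn kernel: [58361.780226] eth3: Got tx_timeout. irq: 00000036",
]

if __name__ == "__main__":
    for ev in find_watchdog_events(SAMPLE):
        hours = ev["uptime_s"] / 3600
        print(f'{ev["stamp"]}: {ev["iface"]} ({ev["driver"]}) after {hours:.1f} h uptime')
```

To scan a real log, the same function could be fed an open file object, for example find_watchdog_events(open("/var/log/messages")). The uptime field in brackets is what gives the "about 16 hours" figure in the summary.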

saturn:/home/disk # uname -a
Linux saturn 2.6.34.7-0.3-xen #1 SMP 2010-09-20 15:27:38 +0200 x86_64 x86_64 x86_64 GNU/Linux

saturn:/home/disk # lspci
00:00.0 Host bridge: nVidia Corporation C55 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.3 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.4 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.5 RAM memory: nVidia Corporation C55 Memory Controller (rev a2)
00:00.6 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.7 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.0 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.1 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.3 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.4 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.5 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.6 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:02.0 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:02.1 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:02.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:03.0 PCI bridge: nVidia Corporation C55 PCI Express bridge (rev a1)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0a.2 RAM memory: nVidia Corporation MCP51 Memory Controller 0 (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 RAID bus controller: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
01:00.0 PCI bridge: nVidia Corporation Device 05bf (rev a2)
02:00.0 PCI bridge: nVidia Corporation Device 05bf (rev a2)
02:01.0 PCI bridge: nVidia Corporation Device 05bf (rev a2)
02:02.0 PCI bridge: nVidia Corporation Device 05bf (rev a2)
02:03.0 PCI bridge: nVidia Corporation Device 05bf (rev a2)
03:00.0 VGA compatible controller: nVidia Corporation NV44 [GeForce 6200 LE] (rev a1)
07:06.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
07:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
07:08.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev c0)

Hardware : 
Asus P5N-D last bios version
Cpu :
model name	: Intel(R) Core(TM)2 Quad CPU    Q8400  @ 2.66GHz
stepping	: 10
cpu MHz		: 2666.728

Loaded modules:
saturn:/home/disk # lsmod
Module                  Size  Used by
ip6t_LOG                5898  7 
xt_tcpudp               2859  26 
xt_pkttype              1288  3 
xt_physdev              1867  2 
ipt_LOG                 6067  17 
xt_limit                2495  24 
usbbk                  23847  0 
gntdev                  8579  3 
netbk                  41414  0 [permanent]
blkbk                  28814  0 [permanent]
it87                   38738  0 
blkback_pagemap         2806  1 blkbk
snd_pcm_oss            53487  0 
hwmon_vid               3226  1 it87
snd_mixer_oss          18913  1 snd_pcm_oss
blktap                126702  2 [permanent]
domctl                  3227  2 blkbk,blktap
xenbus_be               3706  4 usbbk,netbk,blkbk,blktap
coretemp                6523  0 
snd_seq                67827  0 
evtchn                 38482  4 
snd_seq_device          7834  1 snd_seq
edd                    10176  0 
nfsd                  330017  9 
lockd                  84204  1 nfsd
nfs_acl                 3107  1 nfsd
auth_rpcgss            49079  1 nfsd
sunrpc                255540  15 nfsd,lockd,nfs_acl,auth_rpcgss
exportfs                4715  1 nfsd
bridge                 85911  2 
stp                     2331  1 bridge
llc                     6103  2 bridge,stp
ip6t_REJECT             4828  3 
nf_conntrack_ipv6      21550  4 
ip6table_raw            1627  1 
xt_NOTRACK              1192  4 
ipt_REJECT              2672  3 
xt_state                1618  18 
iptable_raw             1686  1 
iptable_filter          1946  1 
ip6table_mangle         2036  0 
nf_conntrack_netbios_ns     1758  0 
nf_conntrack_ipv4      10379  14 
nf_conntrack           87570  5 nf_conntrack_ipv6,xt_NOTRACK,xt_state,nf_conntrack_netbios_ns,nf_conntrack_ipv4
nf_defrag_ipv4          1673  1 nf_conntrack_ipv4
ip_tables              21762  2 iptable_raw,iptable_filter
ip6table_filter         1887  1 
ip6_tables             23384  4 ip6t_LOG,ip6table_raw,ip6table_mangle,ip6table_filter
x_tables               25752  17 ip6t_LOG,xt_tcpudp,xt_pkttype,xt_physdev,ipt_LOG,xt_limit,ip6t_REJECT,ip6table_raw,xt_NOTRACK,ipt_REJECT,xt_state,iptable_raw,iptable_filter,ip6table_mangle,ip_tables,ip6table_filter,ip6_tables
fuse                   77021  3 
loop                   18239  6 
snd_hda_codec_realtek   324063  1 
firewire_ohci          26970  0 
snd_hda_intel          29229  2 
firewire_core          61434  1 firewire_ohci
crc_itu_t               1747  1 firewire_core
snd_hda_codec         112811  2 snd_hda_codec_realtek,snd_hda_intel
snd_hwdep               7676  1 snd_hda_codec
snd_pcm               107771  3 snd_pcm_oss,snd_hda_intel,snd_hda_codec
snd_timer              27312  2 snd_seq,snd_pcm
snd                    83454  14 snd_pcm_oss,snd_mixer_oss,snd_seq,snd_seq_device,snd_hda_codec_realtek,snd_hda_intel,snd_hda_codec,snd_hwdep,snd_pcm,snd_timer
ohci1394               33542  0 
soundcore               8757  1 snd
8139too                35962  0 
i2c_nforce2             7561  0 
usblp                  13961  0 
pcspkr                  2222  0 
snd_page_alloc          9473  2 snd_hda_intel,snd_pcm
forcedeth              61485  0 
8139cp                 25731  0 
i2c_core               32104  1 i2c_nforce2
ieee1394              104214  1 ohci1394
sr_mod                 16364  0 
shpchp                 34692  0 
sg                     33047  0 
pci_hotplug            31949  1 shpchp
ext4                  399185  2 
jbd2                   98208  1 ext4
crc16                   1715  1 ext4
usbhid                 52713  0 
hid                    85698  1 usbhid
dm_mirror              15871  1 
dm_region_hash         13661  1 dm_mirror
dm_log                 10948  3 dm_mirror,dm_region_hash
ohci_hcd               36442  0 
ehci_hcd               60996  0 
sd_mod                 41170  2 
usbcore               231747  6 usbbk,usblp,usbhid,ohci_hcd,ehci_hcd
dm_snapshot            35225  0 
dm_mod                 86467  17 dm_mirror,dm_log,dm_snapshot
xenblk                 26098  0 
cdrom                  43051  2 sr_mod,xenblk
xennet                 37357  0 
processor              42760  0 
ata_generic             3739  0 
pata_amd               12922  0 
sata_nv                25589  2 
libata                211385  3 ata_generic,pata_amd,sata_nv
scsi_mod              191240  4 sr_mod,sg,sd_mod,libata
thermal_sys            18006  1 processor
hwmon                   2712  3 it87,coretemp,thermal_sys


Other information:
The system is running no specific application on Dom0.
The system is running 2 virtual machines based on openSUSE 11.1.


I hope this will help to fix the bug! Do not hesitate to contact me for more information ... It is easy for me to reproduce: just wait a couple of hours for the crash to happen!

Reproducible: Always

Steps to Reproduce:
1. Start the system
2. Wait
3.
Actual Results:  
The system crashes: usually the Ethernet interface fails and the system becomes unstable. Sometimes I am able to run /etc/rc.d/network restart to get it back for a short time, but not always. Generally the disks disappear afterwards (I can't report those logs currently, as they are lost when the disks go away...)

Expected Results:  
continue to run normally
Comment 1 Paul Pinault 2010-09-27 19:41:14 UTC
It seems that this problem is more related to the bridge: I changed my setup to stop using eth3 and use eth0 and eth1 instead. Now my kernel is not crashing, but the network stops on some interfaces ...
Additional information to help:
saturn:/home/disk # ifconfig 
br0       Link encap:Ethernet  HWaddr 00:48:54:67:E3:F9  
          inet addr:10.0.0.20  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:668 errors:0 dropped:0 overruns:0 frame:0
          TX packets:555 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:316246 (308.8 Kb)  TX bytes:95004 (92.7 Kb)

br1       Link encap:Ethernet  HWaddr 00:48:54:6F:78:AB  
          inet addr:10.0.1.20  Bcast:10.0.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:72 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:7516 (7.3 Kb)  TX bytes:3769 (3.6 Kb)

eth0      Link encap:Ethernet  HWaddr 00:48:54:67:E3:F9  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:744 errors:0 dropped:0 overruns:0 frame:0
          TX packets:646 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:335746 (327.8 Kb)  TX bytes:108131 (105.5 Kb)
          Interrupt:10 Base address:0xc000 

eth1      Link encap:Ethernet  HWaddr 00:48:54:6F:78:AB  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:45 errors:0 dropped:0 overruns:0 frame:0
          TX packets:79 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2885 (2.8 Kb)  TX bytes:11800 (11.5 Kb)
          Interrupt:11 Base address:0x2000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:87 errors:0 dropped:0 overruns:0 frame:0
          TX packets:87 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:9508 (9.2 Kb)  TX bytes:9508 (9.2 Kb)

vif1.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:264 errors:0 dropped:0 overruns:0 frame:0
          TX packets:295 errors:0 dropped:1 overruns:0 carrier:0
          collisions:0 txqueuelen:32 
          RX bytes:27623 (26.9 Kb)  TX bytes:32612 (31.8 Kb)

vif1.1    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:65 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:32 
          RX bytes:7307 (7.1 Kb)  TX bytes:1241 (1.2 Kb)


saturn:/home/disk # brctl show
bridge name	bridge id		STP enabled	interfaces
br0		8000.00485467e3f9	no		eth0
							vif1.0
br1		8000.0048546f78ab	no		eth1
							vif1.1

The configuration does not seem to be the cause, as after a reboot everything works well ... for a short time :(
Comment 2 Tony Jones 2010-09-28 15:14:32 UTC
Jan. Do you want to take a look at this since it's Xen related.   Feel free to reassign back if not appropriate.
Comment 3 Jan Beulich 2010-09-29 08:51:07 UTC
Without seeing the full kernel log we can't really judge whether the netdev watchdog kicking in was just a secondary effect. Please attach the full /var/log/messages fragment(s) of the session(s) in question.
Comment 4 Paul Pinault 2010-09-29 10:25:45 UTC
(In reply to comment #3)
> Without seeing the full kernel log we can't really judge whether the netdev
> watchdog kicking in was just a secondary effect. Please attach the full
> /var/log/messages fragment(s) of the session(s) in question.

I put everything that was interesting in the log; most of the time when it crashes, the log is empty (no more messages than when the system works correctly).
Is there a way to get more log messages that could help?
Comment 5 Jan Beulich 2010-09-29 11:05:25 UTC
(In reply to comment #4)
> Do we have a way to get more log messages that could help ?

Without knowing what we're looking for - no.
Comment 6 Paul Pinault 2010-09-29 15:16:15 UTC
> Without knowing what we're looking for - no.

I'm quite sure it is related to the bridge device, as network traffic is the crash trigger.

Since I changed my config to use my two RTL Ethernet cards instead of the MCP51 one, I get nothing in /var/log/messages, but the network is still crashing: in most cases the internal network (between the VMs and Dom0) works correctly, but external communication (Dom0 or a VM communicating with an external machine) does not work. At this point I can type any command you want for analysis.
In some other cases the whole server simply crashes and I get no access to anything (a reboot is needed); /var/log/messages has no messages related to this.

I can add this point (it may help): it seems that it happens each time on the eth with the highest number: initially eth3, now eth1. Even when I switch the eth0 and eth1 networks (by reallocating br0 on eth0 and br1 on eth1), it is always eth1 that crashes.

I can also add that it appears more frequently when I start a second VM; in this case br0 (eth0) is shared by 3 systems (Dom0, VM1, VM2) instead of 2 (Dom0 + VM1).

Hope this can help ...

Let me know what I can do to help to fix this
Comment 7 Jan Beulich 2010-09-29 15:25:30 UTC
(In reply to comment #6)
> I'm quite sure it is related with Bridge device as network traffic is the crash
> trigger. 

Your newer setup is using bridging just like the older one (just on different NICs), so I can't see how you would want to distinguish the two.

> Since I changed my config to use my two RTL ethernet cards instead of the MPC51
> one, I have no log into /var/log/message but the network is still crashing : in
> most of the case, the internal network (between VM and Dom0) is working
> corectly but the external communication (Dom0 or VM communicating to an
> external machine) is not working. At this point I can type any command you want
> to get analysis.
> In some other cases, the global server simply crash and I get no acces to
> anything (need to reboot) the /var/log/message have no messages related to
> this.

Again, we'll need a full log (up to and including any messages generated during an eventual full machine crash - those typically don't make it to persistent store, so you'll have to set up a serial console, at once allowing you to collect both kernel and hypervisor messages at the same time).
Comment 8 Paul Pinault 2010-09-29 17:08:36 UTC
Created attachment 392185 [details]
full kernel log since opensuse 11.3 install ; see after sept 23 for last kernel update

full kernel log as requested
Comment 9 Paul Pinault 2010-09-29 17:24:33 UTC
looking for a serial cable to activate console trace ... it should be in place tonight
Comment 10 Paul Pinault 2010-09-29 18:49:34 UTC
Unfortunately, no X serial cable :( ... I will have to wait longer for this; I hope the kernel trace will help.
Comment 11 Jan Beulich 2010-09-30 07:40:56 UTC
The log doesn't tell much, but at least it clarifies it's not the problem I was suspecting. Instead, especially the instance on Sep 16 suggests a more general interrupt handling problem, as a SATA device also suffered. Later instances with the 8139 don't, however - did you reconfigure the system in some way (e.g. was the interrupt shared originally, and now it isn't)?

We'll need /var/log/boot.msg for both a native and a Xen kernel boot, and we'll need access to Xen's console (if the system is still usable once this state is reached, the "xm debug-key" and "xm dmesg" commands will do, but if it isn't, a serial console is going to be unavoidable).

One other thing to try would be passing "cpuidle=0" to Xen. And of course I assume you already installed the recently released Xen update, and know the issue is not solved by this.

Finally, it would also be useful to know whether the latest kernel-of-the-day (ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.3/x86_64/, 2.6.36-rc based, but specifically with some rework of the interrupt handling) would help.
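
One way to answer the shared-interrupt question above is to look at /proc/interrupts in Dom0 and list every IRQ that has more than one handler attached. The following is only a sketch: it assumes the usual layout of that file (a CPU header line, then per-IRQ lines with one count per CPU, one controller-name token, and a comma-separated handler list), the function name shared_irqs is hypothetical, and the SAMPLE text is invented for illustration, not taken from the reporter's machine.

```python
def shared_irqs(text):
    """Map IRQ number -> handler names for IRQs with more than one handler.

    Assumed layout per numbered line: "<irq>: <count per CPU>... <controller> <h1>, <h2>".
    """
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header line: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR
        # ncpu counters, one controller token, then the handler list
        fields = rest.split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue  # no handler registered on this line
        handlers = [h.strip() for h in fields[ncpu + 1].split(",") if h.strip()]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared


# Hypothetical sample data: the counts, the "Phys-irq" controller column and
# the eth0/sata_nv sharing are made up to exercise the parser.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  1:          2          0          0          0  Phys-irq  i8042
 10:      74412          0          0          0  Phys-irq  eth0, sata_nv
 11:       4501          0          0          0  Phys-irq  eth1
NMI:          0          0          0          0  Non-maskable interrupts
"""

if __name__ == "__main__":
    for irq, handlers in sorted(shared_irqs(SAMPLE).items()):
        print(f"IRQ {irq} is shared by: {', '.join(handlers)}")
```

On a live system the text would come from reading /proc/interrupts; comparing the result between the native and the Xen boot would show whether the sharing changed.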
Comment 12 Paul Pinault 2010-09-30 08:13:03 UTC
(In reply to comment #11)
> The log doesn't tell much, but at least it clarifies it's not the problem I was
> suspecting. Instead, especially the instance on Sep 16 suggest a more general
> interrupt handling problem, as a SATA device also suffered. Later instances
> with the 8139 don't, however - did you reconfigure the system in some way (e.g.
> was the interrupt shared originally, and now it isn't)?

I did not change anything like this; I just changed my network config to keep my system stable for a longer time. SATA was a secondary side effect: when it crashed, eth3 crashed first, then I stopped and restarted it; it worked for some time, then SATA crashed ... but as you say, this does not seem to be the root cause; these are side effects of something else.


> We'll need /var/log/boot.msg for both a native and a Xen kernel boot,
OK, I'll provide this.

> and we'll
> need access to Xen's console (if the system is still usable once this state is
> reached, "xm debug-key" and "xm dmesg" command will do, but if it isn't a
> serial console is going to be unavoidable).
When only the network has crashed, the VMs continue to work well but without external network (the internal network with Dom0 continues to work) ... until Dom0 crashes.


 
> One other thing to try would be passing "cpuidle=0" to Xen. And of course I
> assume you already installed the recently released Xen update, and know the
> issue is not solved by this.
All the systems, Dom0 and VMs, are patched with the latest version of each system. I have openSUSE 11.3 as Dom0 and openSUSE 11.1 and openSUSE 11.2 as VMs.
cpuidle=0: OK, I will change this.


> Finally, it would also be useful to know whether the latest kernel-of-the-day
> (ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.3/x86_64/, 2.6.36-rc
> based, but specifically with some rework of the interrupt handling) would help.
Something I can do after the other tests ... no problem ...

I hope to find a serial cable by this weekend to be able to reproduce with all the log info ...
Comment 13 Paul Pinault 2010-09-30 16:26:50 UTC
Created attachment 392393 [details]
boot.msg normal kernel (no xen)

Normal boot.msg log
Comment 14 Paul Pinault 2010-09-30 16:27:45 UTC
Created attachment 392394 [details]
boot.msg xen kernel

boot.msg xen kernel log file
Comment 15 Paul Pinault 2010-09-30 17:31:01 UTC
The serial cable is in place ... starting to capture logs ...
Comment 16 Paul Pinault 2010-09-30 17:59:26 UTC
Created attachment 392409 [details]
First log booting Dom0 and crashing dom0

This log was captured from the serial console. It boots the Xen kernel, starts a VM, starts a second VM manually, then I crash the system by generating an NFS transfer on br0/eth0 (it takes less than 5 min to crash). At that point I was not able to use the system anymore (no keyboard, no mouse ... screen up but frozen) and had to reset.
Comment 17 Paul Pinault 2010-09-30 18:01:07 UTC
Created attachment 392410 [details]
Second Log booting dom0 with acpi=on

As I usually boot with acpi=off, I changed this (removing the option); the Xen kernel started booting but crashed before the end of the boot ... the log contains more information on the crash.
Comment 18 Paul Pinault 2010-09-30 18:05:07 UTC
Created attachment 392411 [details]
Third Log booting Dom0 with acpi=off

Third test: back to acpi=off, so the context is the same as in the first log, but this time the boot could not finish before the crash happened.

Right now the fourth log is in progress: I just rebooted and the system finished booting correctly (as in log 1). (Compared to log 2 and log 3, I power-cycled the machine instead of just using the reset button.)

I will try to crash it differently so that I keep keyboard access to type the xm dmesg command.
Comment 19 Paul Pinault 2010-09-30 20:09:58 UTC
Other testing done tonight ...
- I'm able to make it crash easily just by generating traffic on any interface
- I'm currently not able to access the console after a crash to type xm dmesg or simply dmesg ... maybe later
- during each "home made" crash I did not see any interesting logs on the console
- after a crash I usually get Input/Output errors on any command (including dmesg); sometimes I have no keyboard, sometimes I do
- The normal kernel is stable (I transferred about 12G without any issue, whereas I never managed to transfer more than 2G on the Xen kernel (interesting limit ...), and generally less is sufficient)
- I now boot my system with acpi=on and it works as badly as with acpi=off

- I'll try the latest kernel version ... no more test ideas, as I see nothing interesting in the logs ... I hope you will decode the matrix in the one I attached today.
Comment 20 Paul Pinault 2010-09-30 20:25:35 UTC
> Finally, it would also be useful to know whether the latest kernel-of-the-day
> (ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.3/x86_64/, 2.6.36-rc
> based, but specifically with some rework of the interrupt handling) would help.

I only found ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.3/x86_64/kernel-xen-debuginfo-2.6.34.7-0.3.99.8.0873825.x86_64.rpm
But I'll try that one ... expecting more traces!
Comment 21 Paul Pinault 2010-09-30 21:18:57 UTC
So, to finish the tests tonight: I chose the kernel of the day and it crashes in exactly the same way :(

saturn:/home/disk # uname -a
Linux saturn 2.6.34.7-0.3.99.8.0873825-xen #1 SMP 2010-09-27 20:56:41 +0200 x86_64 x86_64 x86_64 GNU/Linux

When it crashed I got the following:

Sep 30 22:51:44 saturn kernel: [  201.068023]   alloc kstat_irqs on node 0
Sep 30 22:52:37 saturn kernel: [  254.200733] br0: port 3(vif2.0) entering disabled state
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/vif-bridge: offline XENBUS_PATH=backend/vif/2/0
Sep 30 22:52:37 saturn kernel: [  254.220087] br0: port 3(vif2.0) entering disabled state
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/vif-bridge: brctl delif br0 vif2.0 failed
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/vif-bridge: ifconfig vif2.0 down failed
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif2.0, bridge br0.
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vkbd/2/0
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/console/2/0
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vfb/2/0
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/51712
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vif/2/0
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/51728
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/2/51760
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/2/51760
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/2/51728
Sep 30 22:52:37 saturn logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/2/51712
Sep 30 22:55:01 saturn /usr/sbin/cron[6050]: (root) CMD (/opt/stats/execstat.sh > /dev/null)
Sep 30 22:55:28 saturn kernel: [  425.812012] ------------[ cut here ]------------
Sep 30 22:55:28 saturn kernel: [  425.812023] WARNING: at /usr/src/packages/BUILD/kernel-xen-2.6.34.7/linux-2.6.34/net/sched/sch_generic.c:256 dev_watchdog+0x25b/0x270()
Sep 30 22:55:28 saturn kernel: [  425.812025] Hardware name: System Product Name
Sep 30 22:55:28 saturn kernel: [  425.812027] NETDEV WATCHDOG: eth1 (8139too): transmit queue 0 timed out
Sep 30 22:55:28 saturn kernel: [  425.812029] Modules linked in: ip6t_LOG xt_tcpudp xt_pkttype xt_physdev ipt_LOG xt_limit usbbk gntdev netbk blkbk blkback_pagemap blktap domctl hwmon_vid xenbus_be snd_pcm_oss evtchn snd_mixer_oss coretemp snd_seq snd_seq_device edd nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs bridge stp llc ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables fuse loop snd_hda_codec_realtek firewire_ohci firewire_core crc_itu_t snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer ohci1394 snd usbhid 8139too soundcore ppdev 8250_pnp hid usblp ieee1394 8139cp forcedeth pcspkr shpchp i2c_nforce2 snd_page_alloc parport_pc sg 8250 sr_mod pci_hotplug parport serial_core floppy asus_atk0110 ext4 jbd2 crc16 dm_mirror dm_region_hash dm_log nouveau ttm drm_kms_helper ohci_hcd drm agpgart i2c_algo_bit i2c_core ehci_hcd sd_m
Sep 30 22:55:28 saturn kernel: od usbcore button dm_snapshot dm_mod xenblk cdrom xennet fan processor ata_generic pata_amd sata_nv libata scsi_mod thermal thermal_sys hwmon
Sep 30 22:55:28 saturn kernel: [  425.812110] Pid: 0, comm: swapper Not tainted 2.6.34.7-0.3.99.8.0873825-xen #1
Sep 30 22:55:28 saturn kernel: [  425.812128]  [<ffffffff8040a79b>] dump_stack+0x69/0x6f
Sep 30 22:55:28 saturn kernel: [  425.812134]  [<ffffffff80043943>] warn_slowpath_common+0x73/0xb0
Sep 30 22:55:28 saturn kernel: [  425.812138]  [<ffffffff800439e0>] warn_slowpath_fmt+0x40/0x50
Sep 30 22:55:28 saturn kernel: [  425.812142]  [<ffffffff8034d04b>] dev_watchdog+0x25b/0x270
Sep 30 22:55:28 saturn kernel: [  425.812149]  [<ffffffff80053d34>] run_timer_softirq+0x1d4/0x3d0
Sep 30 22:55:28 saturn kernel: [  425.812154]  [<ffffffff8004b8c8>] __do_softirq+0xe8/0x220
Sep 30 22:55:28 saturn kernel: [  425.812159]  [<ffffffff80007efc>] call_softirq+0x1c/0x30
Sep 30 22:55:28 saturn kernel: [  425.812163]  [<ffffffff80009595>] do_softirq+0xa5/0xe0
Sep 30 22:55:28 saturn kernel: [  425.812168]  [<ffffffff8004bafd>] irq_exit+0x8d/0xa0
Sep 30 22:55:28 saturn kernel: [  425.812174]  [<ffffffff802d27d2>] evtchn_do_upcall+0x222/0x270
Sep 30 22:55:28 saturn kernel: [  425.812179]  [<ffffffff80007a4e>] do_hypervisor_callback+0x1e/0x30
Sep 30 22:55:28 saturn kernel: [  425.812190]  [<ffffffff800033aa>] 0xffffffff800033aa
Sep 30 22:55:28 saturn kernel: [  425.812199]  [<ffffffff80009c0c>] xen_safe_halt+0xc/0x10
Sep 30 22:55:28 saturn kernel: [  425.812202]  [<ffffffff8000e763>] xen_idle+0x43/0xc0
Sep 30 22:55:28 saturn kernel: [  425.812207]  [<ffffffff80005255>] cpu_idle+0x55/0xa0
Sep 30 22:55:28 saturn kernel: [  425.812213]  [<ffffffff80761b0a>] start_kernel+0x3d2/0x3dd
Sep 30 22:55:28 saturn kernel: [  425.812216] ---[ end trace b6b372b1b3719054 ]---
Sep 30 22:55:31 saturn kernel: [  428.812028] eth1: link up, 100Mbps, full-duplex, lpa 0x45E1
Sep 30 22:59:13 saturn shutdown[6123]: shutting down for system halt
Sep 30 22:59:13 saturn init: Switching to runlevel: 0
Sep 30 22:59:19 saturn sshd[3412]: Received signal 15; terminating.
Sep 30 22:59:19 saturn avahi-daemon[3592]: Leaving mDNS multicast group on interface br1.IPv4 with address 10.0.1.20.
Sep 30 22:59:19 saturn avahi-daemon[3592]: Leaving mDNS multicast group on interface br0.IPv4 with address 10.0.0.20.
Sep 30 22:59:19 saturn auditd[3340]: The audit daemon is exiting.
Sep 30 22:59:19 saturn smartd[4254]: smartd received signal 15: Terminated
Sep 30 22:59:19 saturn smartd[4254]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.ST3250620AS-5QF15S8C.ata.state
Sep 30 22:59:19 saturn smartd[4254]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.ST3250620AS-9QE06V9D.ata.state
Sep 30 22:59:19 saturn smartd[4254]: smartd is exiting (exit status 0)
Sep 30 22:59:19 saturn gnome-keyring-daemon[4985]: dbus failure unregistering from session: Connection is closed
Sep 30 22:59:19 saturn gnome-keyring-daemon[4985]: dbus failure unregistering from session: Connection is closed
Sep 30 22:59:19 saturn polkitd(authority=local): Unregistered Authentication Agent for session /org/freedesktop/ConsoleKit/Session2 (system bus name :1.56, object path /org/gnome/PolicyKit1/AuthenticationAgent, locale fr_FR.utf8) (disconnected from bus)
Sep 30 22:59:19 saturn kernel: [  656.464725] [drm] nouveau 0000:03:00.0: nouveau_channel_free: freeing fifo 2


On the VM side, I got:
mm.c 799:d2 non-privileged(2) attempt to map I/O space 0000...f0

Hope it will help ...
Comment 22 Jan Beulich 2010-10-01 09:35:18 UTC
(In reply to comment #16)
> Created an attachment (id=392409) [details]
> First log booting Dom0 and crashing dom0

Did you see "(XEN) APIC error on CPU3: 00(40)"? Are you having problems with your hardware?
Comment 23 Jan Beulich 2010-10-01 09:37:38 UTC
(In reply to comment #17)
> Created an attachment (id=392410) [details]
> Second Log booting dom0 with acpi=on
> 
> As usually I boot with acpi=off, i changed this (removing the option), xen
> kernel start booting but crashed before the end of the boot ... log contains
> more information on crash.

The log here is completely meaningless. You pressed arbitrary keys on the serial console (or the remote end sent them without you asking for them) - one can't even tell whether the box was hung, or how far the boot progressed.

BUT: if you think you need to disable ACPI, that may be part of your problem. I have yet to understand why you need to...
Comment 24 Jan Beulich 2010-10-01 09:42:39 UTC
(In reply to comment #18)
> Third test : back to acpi=off, so the context is the same as in the first log,
> but this time I was not able to finish to boot before crash appends. 

Just like for the previous one - there's no evidence that the box crashed, you just had it print huge piles of information. If you didn't ask for it yourself, you'll need to tweak your "other end" of the serial cable (also indicated by the extra blank lines inserted, which make the logs quite hard to read).

> I will try to crash it differently to be able to get keyboard access to type
> the xm dmesg command

No need for "xm dmesg" once you have a serial cable. You get all messages there, and you issue debug keys from the serial console (after switching input to Xen).
Comment 25 Jan Beulich 2010-10-01 09:43:58 UTC
(In reply to comment #20)
> Only found
> ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.3/x86_64/kernel-xen-debuginfo-2.6.34.7-0.3.99.8.0873825.x86_64.rpm
> But try that one ... expecting more trace !

Sorry, I really intended to direct you to ftp://ftp.suse.com/pub/projects/kernel/kotd/master/x86_64/.
Comment 26 Jan Beulich 2010-10-01 09:52:16 UTC
Turning off ACPI only for Xen makes things even more suspicious. What's the deal here?

Also, can you reproduce your problems on other, very different hardware?

Finally, one thing you definitely want to try is disabling the use of the nouveau driver in the Xen case.
Comment 27 Paul Pinault 2010-10-01 10:04:22 UTC
(In reply to comment #22)
> (In reply to comment #16)
> > Created an attachment (id=392409) [details] [details]
> > First log booting Dom0 and crashing dom0
> 
> Did you see "(XEN) APIC error on CPU3: 00(40)"? Are you having problems with
> your hardware?

I don't think so; the system does not crash when I choose a non-Xen kernel. The CPU is a new one that has never been overclocked or anything like that. I had a problem with a previous motherboard, but I had this problem before then and I continue to have it ...
Comment 28 Paul Pinault 2010-10-01 10:08:00 UTC
(In reply to comment #23)
> (In reply to comment #17)
> > Created an attachment (id=392410) [details] [details]
> > Second Log booting dom0 with acpi=on
> > 
> > As usually I boot with acpi=off, i changed this (removing the option), xen
> > kernel start booting but crashed before the end of the boot ... log contains
> > more information on crash.
> 
> The log here is completely meaningless. You pressed arbitrary keys on the
> serial console (or the remote end sent them without you asking for them) - one
> can't even tell whether the box was hung, or how far the boot progressed.
> 
> BUT: if you think you need to disable ACPI, that may be part of your problem. I
> have yet to understand why you need to...

In fact, booting with or without ACPI does not change anything. What I found is that with ACPI enabled, the boot crashes when a slow serial console is attached. I don't know why, but my remote UART is set to 9600 bps and can't be set to a higher baud rate. At that baud rate I can't boot the Xen kernel with ACPI on. I do not think this log is really interesting regarding the network problem; I attached it just in case.
Comment 29 Paul Pinault 2010-10-01 10:13:03 UTC
> Turning off ACPI only for Xen makes things even more suspicious. What's the
> deal here?
The deal was to be able to detect my sensors, but right now ACPI is on and my sensors work well, so I removed acpi=off. This does not affect the crash (that was the purpose of the different tests: to rule out any impact of this setting).

 
> Also, can you reproduce your problems on other, very different hardware?
I do not have other hardware currently available for this.


> Finally, one thing you definitely want to try is disabling the use of the
> nouveau driver in the Xen case.
I do not understand what you mean by this. what is the "nouveau driver" ?
Comment 30 Paul Pinault 2010-10-01 10:36:28 UTC
> > Finally, one thing you definitely want to try is disabling the use of the
> > nouveau driver in the Xen case.
> I do not understand what you mean by this. what is the "nouveau driver" ?

Sorry ... got it ! I'll try asap.
Comment 31 Paul Pinault 2010-10-01 20:22:12 UTC
Tonight's tests:
- Blacklisting nouveau: I tried adding "blacklist nouveau" to /etc/modprobe.d/50-blacklist.conf and 00-system.conf, but after a reboot the nouveau module is still loaded. Any idea how to really blacklist it, other than moving nouveau.ko out of the filesystem?

- Kernel patch to 2.6.36:
The system crashed. This time ata1 and then ata2 crashed. I'll try to attach the log file tomorrow.
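
One way to confirm the symptom in the first item is to compare the loaded modules with the blacklist entries. The sketch below is only a minimal Python illustration: the function names and the sample text are hypothetical, and it assumes the usual /proc/modules layout (module name as the first field) and plain "blacklist <name>" lines. Keep in mind that a "blacklist" line only stops alias-based autoloading, so a module that is loaded explicitly (or, possibly, from the initrd) would still show up as loaded.

```python
def loaded_modules(proc_modules_text):
    """Return the module names found in /proc/modules-style text."""
    return {line.split()[0]
            for line in proc_modules_text.splitlines()
            if line.strip()}


def blacklisted_modules(conf_text):
    """Return the names given by 'blacklist <module>' lines in modprobe.d-style text."""
    names = set()
    for line in conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] == "blacklist":
            names.add(fields[1])
    return names


def blacklisted_but_loaded(proc_modules_text, conf_text):
    """Return a sorted list of modules that are blacklisted yet still loaded."""
    return sorted(loaded_modules(proc_modules_text) & blacklisted_modules(conf_text))


if __name__ == "__main__":
    # Hypothetical sample data; on a real system the two strings would be read
    # from /proc/modules and from the files under /etc/modprobe.d/.
    sample_modules = (
        "nouveau 500000 1 - Live 0x0000000000000000\n"
        "ttm 60000 1 nouveau, Live 0x0000000000000000\n"
    )
    sample_conf = "# local additions\nblacklist nouveau\n"
    print(blacklisted_but_loaded(sample_modules, sample_conf))
```

If the result lists nouveau, the blacklist entry was read but something other than alias-based autoloading is pulling the module in.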
Comment 32 Paul Pinault 2010-10-05 20:09:01 UTC
For your information, I have finished migrating my VMs from Xen to qemu-kvm. The system now looks stable: all VMs are running in parallel and working well.
I'm still able to reproduce the crash if you need my assistance to fix it.
Comment 33 Jan Beulich 2010-11-04 15:45:50 UTC
Still missing the log promised in #31.

Also please try disabling IRQ balancing in Xen ("noirqbalance" on the Xen command line) and/or in Linux (disabling the irq balance daemon in case it is enabled).
Comment 34 Paul Pinault 2010-11-05 16:13:15 UTC
(In reply to comment #33)
> Still missing the log promised in #31.
Nothing attached as the log is the same as the previous one. Nothing to see on it.


> Also please try disabling IRQ balancing in Xen ("noirqbalance" on the Xen
> command line) and/or in Linux (disabling the irq balance daemon in case it is
> enabled).
The next time I have to reboot my qemu-kvm setup I will run the test. It has been working for at least one month now, so I am not sure I can answer quickly.
Comment 35 Jan Beulich 2010-11-05 16:55:23 UTC
(In reply to comment #34)
> Nothing attached as the log is the same as the previous one. Nothing to see on
> it.

How so, if you now don't load the nouveau driver, while previously you did?
Comment 36 Jan Beulich 2011-04-26 15:11:06 UTC
Ping?
Comment 37 Jan Beulich 2011-07-06 08:09:53 UTC
No response in over half a year. Feel free to re-open if you're ready to continue providing necessary information.