Bug 300496

Summary: X11 applications hang in XOpenDisplay on Intel Conroe/Core 2 Duo CPUs
Product: [openSUSE] openSUSE 10.3
Reporter: James Oakley <jfunk>
Component: Xen
Assignee: Pat Campbell <plc>
Status: RESOLVED NORESPONSE
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P4 - Low
CC: alex, andreas_graf, carnold, coolo, eich, haicheng.li, jbeulich, jdouglas, marc.ruehrschneck, michel.munnix, mmeeks, sshaw, yongkang.you, yunfeng.zhao
Version: RC 3
Flags: coolo: SHIP_STOPPER-
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Xorg log
hwinfo log
'd' over the serial console while hung
sysrq 't'
sysrq 't'
output from /proc/modules
XOpenDisplay.c

Description James Oakley 2007-08-14 23:09:48 UTC
When I boot the Xen kernel, X (with the intel driver) uses 100% CPU and hangs. The cursor still moves, but very slowly.

Everything works fine with the default kernel.

Unfortunately, there's nothing obvious in /var/log/Xorg.0.log, but I will attach it anyway, along with the output of hwinfo.
Comment 1 James Oakley 2007-08-14 23:12:22 UTC
Created attachment 157546 [details]
Xorg log
Comment 2 James Oakley 2007-08-14 23:13:00 UTC
Created attachment 157547 [details]
hwinfo log
Comment 3 Charles Coffing 2007-08-28 14:56:44 UTC
Stephen, try to reproduce this in the lab.
Comment 4 Stephen Shaw 2007-08-30 16:33:58 UTC
I wonder if this is similar to a bug I have, although I'm on NVIDIA hardware with just the nv X.org driver.

bug #304642

I'll look into this.
Comment 5 Stefan Dirsch 2007-08-30 16:43:23 UTC
*** Bug 304642 has been marked as a duplicate of this bug. ***
Comment 6 Stephen Shaw 2007-09-04 17:18:57 UTC
Not sure what to do with this. There are logs here, but I haven't found time to research this more.  I also haven't had this problem in my lab on 9 or 10 different machines.
Comment 7 Stefan Dirsch 2007-09-04 19:36:03 UTC
Indeed, nothing obvious in the Xserver logfile. Just a wild guess: it could be related to 3D being enabled. Does it still happen with 3D disabled?
Comment 8 Stefan Dirsch 2007-09-13 12:10:55 UTC
Stephen can't reproduce it (any longer) and still no response by the original reporter after 9 days.
Comment 9 Stephen Shaw 2007-09-13 15:11:42 UTC
This might have everything to do with the processor type.  So far I've only seen this on Conroe chips.  I wish I could capture some logs, but my machine locks up.  I recommend that you find a Core 2 Duo and run it on there.

James, is your computer a Core 2 Duo/Conroe?
Comment 10 Stephen Shaw 2007-09-13 15:18:09 UTC
Reopening based on new information
Comment 11 James Oakley 2007-09-13 16:50:02 UTC
Yes, it's a Core 2 Duo 6300.

Sorry, I haven't tested without 3D yet. I've been very busy and haven't had a chance. I'll try it tonight.
Comment 12 Stephen Shaw 2007-09-13 17:19:36 UTC
I'm setting up a machine in my lab and hooking up a serial cable for output.  Hopefully in the next hour or two we'll have some more information
Comment 13 Stephen Shaw 2007-09-13 19:44:20 UTC
What's the NEEDINFO for?
Comment 15 Stefan Dirsch 2007-09-13 19:46:55 UTC
Does it still happen with 3D disabled?
Comment 16 Stephen Shaw 2007-09-13 20:08:35 UTC
Mine has always had 3D disabled.
Comment 17 Stefan Dirsch 2007-09-13 20:19:19 UTC
It is enabled in the log James attached.

> (II) intel(0): direct rendering: Enabled
Comment 18 Stephen Shaw 2007-09-13 20:25:55 UTC
Sorry, I don't have his machine, but I'm having the same problem.  This leads me to believe that this isn't a 3D/non-3D issue.

I did notice that the problem also occurs when starting X with startx.  Does this mean that it has nothing to do with gdm?  You also lose the keyboard.  In pstree I noticed that the next process running was numlock (or something like that).

Are you able to log into that machine I listed above?  You are welcome to do whatever you want with that box.  This would give you access to whatever logs and environment that you might want without having to set it up yourself.  
Comment 20 Stephen Shaw 2007-09-13 20:43:29 UTC
uptime
xen53:~ # uptime
  2:38pm  up   1:11,  4 users,  load average: 0.82, 0.32, 0.38
xen53:~ # 


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                 
 3136 root      24  -1  322m 7500 3596 R  100  0.4   2:03.57 X

After that much time I can't imagine why X would still be using 100% CPU.

Nothing is coming up on the screen other than a grey background and a really slow moving cursor.
Comment 21 Stephen Shaw 2007-09-13 20:44:30 UTC
Making the last comment open. James, do you see this too if you remote into the problem machine?
Comment 22 Stefan Dirsch 2007-09-13 21:00:47 UTC
Thanks for the hint. I figured out with strace that any X11 application hangs when reading cookies from $HOME/.Xauthority. It does not happen when the Xserver is started without authentication (option "-ac").
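For anyone who wants to rule out the cookie file itself, here is a minimal sketch of a reader for it, written in Python for brevity. It assumes the usual libXau record layout (a 2-byte big-endian family followed by four length-prefixed fields: address, display number, protocol name, cookie data). The names `XauthEntry` and `parse_xauthority` are made up for this illustration and are not taken from any attachment in this report.

```python
import struct
from typing import Iterator, NamedTuple, Tuple


class XauthEntry(NamedTuple):
    family: int      # address family code
    address: bytes   # host name or address
    number: bytes    # display number as ASCII, e.g. b"0"
    name: bytes      # authorization protocol, e.g. b"MIT-MAGIC-COOKIE-1"
    data: bytes      # the cookie itself


def _counted(buf: bytes, pos: int) -> Tuple[bytes, int]:
    """Read a 2-byte big-endian length followed by that many bytes."""
    (length,) = struct.unpack_from(">H", buf, pos)
    start = pos + 2
    end = start + length
    if end > len(buf):
        raise ValueError("truncated Xauthority record")
    return buf[start:end], end


def parse_xauthority(buf: bytes) -> Iterator[XauthEntry]:
    """Yield every record found in the raw contents of an Xauthority file."""
    pos = 0
    while pos < len(buf):
        (family,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        address, pos = _counted(buf, pos)
        number, pos = _counted(buf, pos)
        name, pos = _counted(buf, pos)
        data, pos = _counted(buf, pos)
        yield XauthEntry(family, address, number, name, data)
```

If the raw bytes of $HOME/.Xauthority parse cleanly and quickly with something like this, the file contents are probably not what the clients are stuck on.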
Comment 23 Stephen Shaw 2007-09-13 21:03:22 UTC
THANKS!!!  This is a big help.  Any idea why this is happening with a Xen boot and not with a non-Xen boot?

Comment 24 Stefan Dirsch 2007-09-13 21:32:21 UTC
Not really. :-( Definitely this is not related to any graphics driver at all.

# DISPLAY=:0.0 ltrace xsetroot -solid green
__libc_start_main(0x4017d0, 3, 0x7fffde613928, 0x402660, 0x402650 <unfinished ...>
XOpenDisplay(NULL

I wonder why this only happens with conroe/Core 2 Duo chips.
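To test this on several machines without watching each client by hand, a small probe can report whether a command is still running after a deadline. This is only a sketch in Python; `client_hangs` is a hypothetical helper name, and the timeout is an arbitrary choice.

```python
import subprocess
from typing import Optional, Sequence


def client_hangs(argv: Sequence[str], timeout_s: float,
                 env: Optional[dict] = None) -> bool:
    """Return True if the command is still running after timeout_s seconds."""
    try:
        subprocess.run(argv, env=env, timeout=timeout_s,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing lingers
        return True
    return False
```

Called with `["xsetroot", "-solid", "green"]` and an environment that sets DISPLAY to :0.0, a True result would correspond to the hang seen in the ltrace output above.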
Comment 25 Stephen Shaw 2007-09-13 22:14:59 UTC
I'm working on finding a Core 2 Quad (or whatever they are called) to see if I can replicate it there too. I'm off for the day, but will be back tomorrow.

This is really strange.  That machine that I mentioned earlier is free to use.  You can reboot/do whatever you want with it.  It's been set up just for this bug.

It would be great to see this fixed in RC1, but as time permits.  What else can I do to help you out with this?

Chuck, any ideas?
Comment 26 Stefan Dirsch 2007-09-14 08:47:04 UTC
If Xen is an issue for openSUSE 10.3 this is probably a blocker, since Core 2 Duo CPUs are rather common ATM. At least it's critical IMHO.
Comment 27 Stephan Kulow 2007-09-14 12:42:47 UTC
Frank, do you have any resources who can help solving the problem?
Comment 28 Frank Kohler 2007-09-14 14:58:30 UTC
James, JTMS: Is XGL/compiz disabled?
Comment 30 James Oakley 2007-09-14 17:08:03 UTC
Yes, XGL/compiz is disabled on my machine.
Comment 31 Charles Coffing 2007-09-14 20:43:00 UTC
Seems to be related to PAE:

openSUSE 10.3 beta 3 plus, on a Conroe:
kernel-default:  works ok
kernel-xen:      works ok
kernel-xenpae:   100% CPU usage; unusably slow
Comment 32 Charles Coffing 2007-09-14 21:49:37 UTC
When it's stuck in the bad state, the stack of the X process is:  (Sorry for lack of symbols.  I don't see debuginfo packages on my beta 3 plus mirror)

#0  0xf57fe402 in __kernel_vsyscall ()
#1  0xb7d8a9a9 in fork () from /lib/libc.so.6
#2  0x081c14b8 in Popen ()
#3  0x081b6656 in XkbDDXCompileKeymapByNames ()
#4  0x081b68a6 in XkbDDXLoadKeymapByNames ()
#5  0x08194381 in ProcXkbGetKbdByName ()
#6  0x0819501a in ?? ()
#7  0x083c1a28 in ?? ()
#8  0x081e4ff4 in BitOrderInvert ()
#9  0xbfebeb98 in ?? ()
#10 0x08154a6e in ?? ()
#11 0x083c1a28 in ?? ()
#12 0x083c1a28 in ?? ()
#13 0x08201ec0 in ?? ()
#14 0x081e4ff4 in BitOrderInvert ()
#15 0x00000000 in ?? ()

The child forked by X (numlockx) has already started running and is stuck in the kernel similarly:
#0  0xf57fe402 in __kernel_vsyscall ()
#1  0xb7db455d in select () from /lib/libc.so.6
#2  0xb7ce0165 in ?? () from /usr/lib/libxcb.so.1
#3  0x00000004 in ?? ()
#4  0xbfa44fec in ?? ()
#5  0xbfa44f6c in ?? ()
#6  0x00000000 in ?? ()

The X process is still stuck in the clone(); strace -r output is:
     0.000000 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cfb708) = ? ERESTARTNOINTR (To be restarted)
     0.098460 --- SIGALRM (Alarm clock) @ 0 (0) ---
     0.000000 sigreturn()               = ? (mask now [])
     0.000000 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cfb708) = ? ERESTARTNOINTR (To be restarted)
     0.098023 --- SIGALRM (Alarm clock) @ 0 (0) ---
     0.000000 sigreturn()               = ? (mask now [])
Comment 33 Charles Coffing 2007-09-14 22:15:01 UTC
CC-ing Jan.
Jan, do you have any idea what might be going on, keeping in mind that this hangs on PAE i586 and x86_64, but works on non-PAE i586?
Comment 34 Stefan Dirsch 2007-09-16 18:06:24 UTC
Sounds like a kernel issue then.
Comment 36 Stephan Kulow 2007-09-17 09:20:52 UTC
ok, looks pretty specific. So I decided to remove the ship stopper flag
Comment 37 Jan Beulich 2007-09-17 10:16:08 UTC
Re #33: This would pretty much indicate a 32-bit DMA issue (i.e. something trying a DMA32 [or DMA] allocation and using its result, translated to phys/machine, directly, without going through xen_create_contiguous_region). I had screened the agp and drm code in the kernel for such issues quite a while ago, and the problems this revealed are believed to have all been fixed in the meantime.

Also, I'm not clear about the final determination of whether 3D matters here (looking at the machine above, I can't e.g. see any drm module loaded, but the machine also isn't running Xen at the moment, nor is it using the bigsmp/xenpae kernels), as well as what configurations are affected (#33 states non-PAE is unaffected, but the machine pointed to in earlier comments doesn't appear to use or need PAE kernels).

It might help to get a state dump from Xen (perhaps a number of times in case execution turns out to be in user mode) by sending 'd' from the serial console (make sure input focus is with Xen).

Comment 38 Stefan Dirsch 2007-09-17 12:11:40 UTC
agp/drm is not involved here at all. It occurs somewhere in the X authentication code as I've written before.
Comment 39 Jan Beulich 2007-09-17 12:49:49 UTC
But then you mean a generic kernel issue, not a Xen specific one (although with no explanation why it happens on [certain?] Xen kernel only)?
Comment 40 Stefan Dirsch 2007-09-17 13:04:55 UTC
It must be a Xen kernel specific issue, as it only happens with a Xen kernel
on Conroe/Core 2 Duo CPUs.
Comment 41 Jan Beulich 2007-09-17 13:12:24 UTC
I see no proof of this so far, but let's wait until we have feedback on #37.
Comment 42 Charles Coffing 2007-09-17 13:56:06 UTC
Created attachment 172792 [details]
'd' over the serial console while hung

Booted with xen-pae-dbg

This looks to be the most common bit of stack:
(XEN)    [<ff138626>] smp_call_function_interrupt+0x86/0xa0
(XEN)    [<ff12965b>] call_function_interrupt+0x5b/0x70
(XEN)    [<ff17ed3c>] handle_exception+0x5c/0xab
Comment 43 Charles Coffing 2007-09-17 14:13:30 UTC
Sorry, duh, ignore that bit o' stack in the comment.  The hypervisor seems idle.

The xen-pae-dbg I used was stock beta 3 plus, so if needed you can look at xen-syms* in there.  But everything I've seen suggests to me that it's Linux that's hosed, not the hypervisor.

Removing NEEDINFO.
Comment 44 Jan Beulich 2007-09-17 14:32:38 UTC
Yes, this is what I concluded, too, and it is why I asked for a SysRq-t (telling the kernel to dump all tasks' states). I assume the kernel used then is also stock beta 3.
Comment 45 Jan Beulich 2007-09-17 15:10:30 UTC
Oh, please also attach a /proc/modules listing from the point when the dumps were taken.
Comment 46 Jan Beulich 2007-09-17 15:32:41 UTC
Hmm, two of the backtraces in #42 don't seem to match 2.6.22.5-10-xenpae (address c02be131 is not at an instruction boundary).
Comment 47 Charles Coffing 2007-09-17 16:48:22 UTC
I am running "beta 3 plus", not beta 3.  kernel-xenpae is 2.6.22.5-16.
Comment 49 Charles Coffing 2007-09-17 21:04:40 UTC
Created attachment 172878 [details]
sysrq 't'

System is in runlevel 1, with only syslog started.  Then ran "startx" and did sysrq 't' twice.

openSUSE 10.3 beta 3 plus, on xen-pae
Comment 50 Charles Coffing 2007-09-17 21:16:08 UTC
Created attachment 172880 [details]
sysrq 't'

System is in runlevel 1, with only syslog started.  Then ran "startx" and did
sysrq 't' twice.

openSUSE 10.3 beta 3 plus, on xen-pae
Comment 51 Charles Coffing 2007-09-17 21:17:48 UTC
Created attachment 172881 [details]
output from /proc/modules

Matching output from /proc/modules just before startx
Comment 52 Stephan Kulow 2007-09-18 06:28:48 UTC
I forgot to downgrade when I changed the flag
Comment 53 Jan Beulich 2007-09-18 12:37:17 UTC
Oh, sorry, I forgot that Linux doesn't dump the state of running tasks. The output in #50 is thus mostly useless.

The traces from #42, knowing the kernel version, point out that the state dump always happened at a (faulting) access to the m2p map (i.e. an attempt to translate an mfn that doesn't have an entry in the m2p table) - perhaps the frame buffer? But this may mean nothing, X may just be cloning child processes in a rapid fashion...
Comment 54 Frank Kohler 2007-09-19 03:42:33 UTC
This patch should be able to fix the issue:
http://xenbits.xensource.com/xen-unstable.hg?cs=cd89771ba550
The changeset on xen unstable tree is 13028.
Comment 55 Jan Beulich 2007-09-19 06:45:34 UTC
I don't think so, for two reasons:

- 10.3 is based on Xen 3.1, i.e. a changeset much newer than the one you indicate.
- You indicated earlier that the problem is seen on native, too.
Comment 56 Jason Douglas 2007-09-20 00:08:07 UTC
*** Bug 326573 has been marked as a duplicate of this bug. ***
Comment 57 Andreas Graf 2007-09-20 09:14:05 UTC
It seems not to be an X problem.
The following works:
Booting into runlevel 3, login as root, then
X & sleep 3 ; DISPLAY=:0 gnome &

I have a perfect X session without problems. I assume there is a problem within the session management. Neither startx, nor gdm, nor xdm work.

One more comment:
I switched off the second core of my Core 2 Duo in the BIOS; the behavior is exactly the same as with 2 cores.
Comment 58 Andreas Graf 2007-09-20 09:20:56 UTC
Raised severity to blocker.
This problem affects too many mid-sized companies using Xen for development or in small production environments.

BTW, does the same problem exist with dual-core Xeons? I can't check it within the next few days.
Comment 59 Stefan Dirsch 2007-09-20 09:53:53 UTC
(In reply to comment #57 from Andreas Graf)
> It seems not to be an X problem.
> The following works:
> Booting into runlevel 3, login as root, then
> X & sleep 3 ; DISPLAY=:0 gnome &

In that case authentication is no longer used.

> I have a perfect X session without problems. I assume there is a problem
> within the session management. Neither startx, nor gdm, nor xdm work.

See my comment above. See also comments #22-24.
Comment 60 Stephan Kulow 2007-09-20 10:07:26 UTC
Downgrading again. Please leave severities alone. A fix can very well be supplied by an online update
Comment 61 Stefan Dirsch 2007-09-20 17:26:02 UTC
This looks more and more like a reassignment party. I'll attach a minimalistic X11 program, with which you should be able to reproduce the problem as long as X authentication is in use.
Comment 62 Stefan Dirsch 2007-09-20 17:28:08 UTC
Created attachment 173729 [details]
XOpenDisplay.c
Comment 63 Stephen Shaw 2007-09-20 17:35:36 UTC
Is this a XEN problem if this happens with the native kernel (no XEN loaded)?  This in my opinion (not that it matters, but) should be a show stopper since it affects anyone (XEN or no XEN) running a core 2 duo processor.  If this is the case, I'm going to have to run a different distro at home until 11 or whenever this gets resolved.

Not to mention bad news for any companies using XEN on 10.3

Blocker/show stopper or not, as Stefan commented, this should not be a reassignment party.  It doesn't get anything fixed and will only result in loss of users.
Comment 64 Stefan Dirsch 2007-09-20 18:01:18 UTC
I suggest to install all xorg-x11*debuginfo packages, attach gdb to Xorg process (started by startx/gdm/xdm to make sure authentication is in use) and start the sample X11 program. Maybe this gives us some more information, how the issue is triggered. Makes sense?

Unfortunately 151.155.190.53 is no longer available for testing. Otherwise I would have tried this myself.
Comment 65 Stefan Dirsch 2007-09-20 21:38:31 UTC
151.155.190.53 seems to be available again. Can you reboot it into the Xen kernel again?
Comment 68 Stefan Dirsch 2007-09-21 07:19:20 UTC
151.155.190.53 is running Xen kernel again, but I can't reproduce the problem any longer. :-(
Comment 69 Stefan Dirsch 2007-09-21 07:23:54 UTC
(In reply to comment #64 from Stefan Dirsch)
> I suggest to install all xorg-x11*debuginfo packages, attach gdb to Xorg
> process (started by startx/gdm/xdm to make sure authentication is in use)
> and start the sample X11 program. 
Or start the sample X11 program in gdb first to see where it hangs.
Comment 70 Stephen Shaw 2007-09-21 16:52:38 UTC
stefan, I updated the box with the PAE kernel and XEN.  So it should fail again.
Comment 71 Stefan Dirsch 2007-09-21 17:45:51 UTC
My wild guess about libxcb was not that wrong.

Program received signal SIGINT, Interrupt.
0xf57fe402 in __kernel_vsyscall ()
(gdb) bt
#0  0xf57fe402 in __kernel_vsyscall ()
#1  0xb7df955d in select () from /lib/libc.so.6
#2  0xb7d2931f in _xcb_in_read_block (c=0x804b610, buf=0x804d780, len=8)
    at xcb_in.c:248
#3  0xb7d28663 in xcb_connect_to_fd (fd=4, auth_info=0xbf947c30)
    at xcb_conn.c:133
#4  0xb7d2ad11 in xcb_connect (displayname=0x0, screenp=0x0) at xcb_util.c:276
#5  0xb7eb2d2a in _XConnectXCB (dpy=0x804b008, display=0x0, 
    fullnamep=0xbf947d78, screenp=0xbf947d74) at xcb_disp.c:78
#6  0xb7e9b389 in XOpenDisplay (display=0x0) at OpenDis.c:168
#7  0x080485ed in main ()
(gdb) 
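
The backtrace shows the client waiting in select() inside _xcb_in_read_block for len=8 bytes on fd 4, presumably the start of the server's answer to the connection setup. The sketch below is not xcb code; it shows the same read-until-complete pattern in Python with a timeout added, so that a silent peer produces a diagnosable failure rather than an indefinite wait. `read_exact` is a name invented here.

```python
import select
import socket
import time
from typing import Optional


def read_exact(sock: socket.socket, n: int, timeout_s: float) -> Optional[bytes]:
    """Read exactly n bytes, or return None if they do not arrive in time."""
    deadline = time.monotonic() + timeout_s
    buf = b""
    while len(buf) < n:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            return None          # timed out: the peer never answered
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None          # peer closed the connection
        buf += chunk
    return buf
```

With a local socket pair standing in for the X connection, a call returns None when nothing is sent, which is the situation the client above appears to be in.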
Comment 72 Stefan Dirsch 2007-09-26 15:39:08 UTC
I can't reproduce this issue on 151.155.190.53 any more, although there is a xenpae kernel running.
Comment 73 Stephen Shaw 2007-09-28 18:02:24 UTC
It's still a problem. I've installed the 64-bit version of RC3 so that we don't have to worry about what it's booted into or whether PAE is there.

Happy hacking!
Comment 74 Frank Kohler 2007-10-12 05:41:44 UTC
please retest after applying xorg-server patch (Oct11). thx
Comment 75 Stefan Dirsch 2007-10-12 08:03:08 UTC
(In reply to comment #74 from Frank Kohler)
> please retest after applying xorg-server patch (Oct11). thx
xorg-server patch? Patch for which package?

Comment 76 Stefan Dirsch 2007-10-12 10:19:35 UTC
From: Frank Kohler <FKohler@novell.com>

Just had this report. Personally I've seen an x package while doing an online update. Wasn't focused so unfortunately can't tell the name.
Comment 77 Stefan Dirsch 2007-10-12 10:25:33 UTC
Apart from the patch for xorg-x11-server (an OpenOffice-related fix), there were only security updates for X.Org. It is therefore very unlikely that these updates fix this issue. I'm not sure why you think they would.

xorg-x11:
-------------------------------------------------------------------
Mon Oct  1 15:40:00 CEST 2007 - sndirsch@suse.de

- bug-30806x.diff:
  * build_range() Server Integer Overflow Vulnerability in X Font
    Server (Bug #308064, IDEF2708)
  * fixes swap_char2b() Heap Overflow Vulnerability in X Font
    Server (Bug #308066, IDEF2709)

xorg-x11-libs:
-------------------------------------------------------------------
Tue Oct  2 09:08:43 CEST 2007 - sndirsch@suse.de

- libXfont-off_by_one.diff
  * prevent a one character overflow (Bug #327854)

-------------------------------------------------------------------
Wed Oct  3 15:03:27 CEST 2007 - sndirsch@suse.de

- xserver-1.3.0-xkb-and-loathing.patch
  * Ignore (not just block) SIGALRM around calls to Popen()/Pclose().
    Fixes a hang in openoffice when opening menus. (Bug #245711)
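
The last changelog entry distinguishes ignoring SIGALRM from merely blocking it. As a rough illustration of the "ignore" variant, here is a Python sketch (the server itself is C) that sets the signal to SIG_IGN around a fork and restores the previous handler afterwards. This is not the contents of xserver-1.3.0-xkb-and-loathing.patch, `run_child_ignoring_sigalrm` is a made-up name, and whether the pattern has any bearing on the hang in this bug is not established here.

```python
import os
import signal


def run_child_ignoring_sigalrm(argv):
    """Fork and exec argv with SIGALRM ignored in the parent; return exit code."""
    old_handler = signal.signal(signal.SIGALRM, signal.SIG_IGN)
    try:
        pid = os.fork()
        if pid == 0:
            # Child: SIG_IGN survives exec, so put the default back first
            signal.signal(signal.SIGALRM, signal.SIG_DFL)
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)    # only reached if exec failed
        _, status = os.waitpid(pid, 0)
    finally:
        # Restore whatever handler the caller had installed
        if old_handler is not None:
            signal.signal(signal.SIGALRM, old_handler)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

While the signal is ignored, a periodic timer in the parent cannot interrupt the fork or the wait; once the handler is restored, timer ticks are delivered again as before.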
Comment 78 Stephen Shaw 2007-10-24 17:35:36 UTC
So, I'm guessing that Xen on openSUSE 10.3 is not going to happen with Core 2 Duo.  Hopefully this will go away or be fixed in openSUSE 11.0 and not appear in SLES 10 SP2.
Comment 79 Stefan Dirsch 2007-10-28 15:06:43 UTC
*** Bug 328310 has been marked as a duplicate of this bug. ***
Comment 84 Jason Douglas 2008-06-05 21:24:11 UTC
Is there any indication that this is still a problem with openSUSE 11.0?
Comment 85 Andreas Jaeger 2008-10-24 15:32:09 UTC
The requested information has not been provided for over 4 weeks.
The bug gets therefore closed as NORESPONSE.

Please reopen the bug if you can supply the requested information.
I would appreciate if you could test the current openSUSE 11.1 Beta
first and see whether that one fixes the problem.