Bugzilla – Full Text Bug Listing

| Summary: | X11 applications hang in XOpenDisplay on Intel Conroe/Core 2 Duo CPUs | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 10.3 | Reporter: | James Oakley <jfunk> |
| Component: | Xen | Assignee: | Pat Campbell <plc> |
| Status: | RESOLVED NORESPONSE | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P4 - Low | CC: | alex, andreas_graf, carnold, coolo, eich, haicheng.li, jbeulich, jdouglas, marc.ruehrschneck, michel.munnix, mmeeks, sshaw, yongkang.you, yunfeng.zhao |
| Version: | RC 3 | Flags: | coolo: SHIP_STOPPER- |
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | Xorg log; hwinfo log; 'd' over the serial console while hung; sysrq 't'; sysrq 't'; output from /proc/modules; XOpenDisplay.c | ||

Description
James Oakley
2007-08-14 23:09:48 UTC
Created attachment 157546 [details]
Xorg log
Created attachment 157547 [details]
hwinfo log
Stephen, try to reproduce this in the lab.

I wonder if this is similar to my bug that I have, but I'm using nvidia. Just the nv X.org driver. bug #304642

I'll look into this

*** Bug 304642 has been marked as a duplicate of this bug. ***

Not sure what to do with this, there are logs here and I haven't found time to research this more. I also haven't had this problem in my lab on 9 or 10 different machines.

Indeed nothing obvious in the Xserver logfile. Just a wild guess. It could be related to enabled 3D. Does it still happen with 3D disabled?

Stephen can't reproduce it (any longer) and still no response by the original reporter after 9 days.

This might have everything to do with the processor type. So far I've only seen this on conroe chips. I wish I could do some logs, but my machine locks up. I recommend that you find a Core 2 Duo and run it on there. James, is your computer a core 2 duo/ conroe

Reopening based on new information

Yes, it's a Core 2 Duo 6300. Sorry I haven't tested without 3d yet, I've been very busy and haven't had a chance. I'll try it tonight.

I'm setting up a machine in my lab and hooking up a serial cable for output. Hopefully in the next hour or two we'll have some more information

what's the need info for?

Does it still happen with 3D disabled?

mine has always had the 3d disabled

It is enabled in the log James attached.
> (II) intel(0): direct rendering: Enabled
Sorry, I don't have his machine and I'm having the same problem. This would lead me to believe that this isn't a 3D/non-3D issue. I did notice that the problem also occurs when starting X with startx. Does this mean that this has nothing to do with gdm? Also you lose the keyboard. I noticed on a pstree that the next process running was numlock (or something like that).

Are you able to log into that machine I listed above? You are welcome to do whatever you want with that box. This would give you access to whatever logs and environment you might want without having to set it up yourself.

uptime
xen53:~ # uptime
2:38pm up 1:11, 4 users, load average: 0.82, 0.32, 0.38
xen53:~ #
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3136 root 24 -1 322m 7500 3596 R 100 0.4 2:03.57 X
After that much time I can't imagine why X would still be using 100% CPU. Nothing is coming up on the screen other than a grey background and a really slow moving cursor.

Making last comment open. James, do you see this too if you remote into the problem machine?

Thanks for the hint. I figured out with strace that any X11 application hangs when reading cookies from $HOME/.Xauthority. It does not happen when the Xserver is started without authentication (option "-ac").

THANKS!!! This is a big help. Any idea why this is happening to a XEN boot and not to a non-XEN boot?

Not really. :-( Definitely this is not related to any graphics driver at all.
# DISPLAY=:0.0 ltrace xsetroot -solid green
__libc_start_main(0x4017d0, 3, 0x7fffde613928, 0x402660, 0x402650 <unfinished ...>
XOpenDisplay(NULL

I wonder why this only happens with Conroe/Core 2 Duo chips. I'm working on finding a Core 2 Quad or whatever they are called to see if I can replicate it there too. I'm off for the day, but will be back tomorrow. This is really strange.

That machine that I mentioned earlier is free to use. You can reboot/do whatever you want with it. It's been set up just for this bug. It would be great to see this fixed in RC1, but as time permits. What else can I do to help you out with this?

Chuck, any ideas?

If Xen is an issue for openSUSE 10.3 this is probably a blocker, since Core 2 Duo CPUs are rather common ATM. At least it's critical IMHO.

Frank, do you have any resources who can help solve the problem?

James, JTMS: Is XGL/compiz disabled?

Yes, XGL/compiz is disabled on my machine.

Seems to be related to PAE:
openSUSE 10.3 beta 3 plus, on a Conroe:
kernel-default: works ok
kernel-xen: works ok
kernel-xenpae: 100% CPU usage; unusably slow
When it's stuck in the bad state, the stack of the X process is: (Sorry for lack of symbols. I don't see debuginfo packages on my beta 3 plus mirror)
#0 0xf57fe402 in __kernel_vsyscall ()
#1 0xb7d8a9a9 in fork () from /lib/libc.so.6
#2 0x081c14b8 in Popen ()
#3 0x081b6656 in XkbDDXCompileKeymapByNames ()
#4 0x081b68a6 in XkbDDXLoadKeymapByNames ()
#5 0x08194381 in ProcXkbGetKbdByName ()
#6 0x0819501a in ?? ()
#7 0x083c1a28 in ?? ()
#8 0x081e4ff4 in BitOrderInvert ()
#9 0xbfebeb98 in ?? ()
#10 0x08154a6e in ?? ()
#11 0x083c1a28 in ?? ()
#12 0x083c1a28 in ?? ()
#13 0x08201ec0 in ?? ()
#14 0x081e4ff4 in BitOrderInvert ()
#15 0x00000000 in ?? ()
The child forked by X (numlockx) has already started running and is stuck in the kernel similarly:
#0 0xf57fe402 in __kernel_vsyscall ()
#1 0xb7db455d in select () from /lib/libc.so.6
#2 0xb7ce0165 in ?? () from /usr/lib/libxcb.so.1
#3 0x00000004 in ?? ()
#4 0xbfa44fec in ?? ()
#5 0xbfa44f6c in ?? ()
#6 0x00000000 in ?? ()
The X process is still stuck in the clone(); strace -r output is:
0.000000 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cfb708) = ? ERESTARTNOINTR (To be restarted)
0.098460 --- SIGALRM (Alarm clock) @ 0 (0) ---
0.000000 sigreturn() = ? (mask now [])
0.000000 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cfb708) = ? ERESTARTNOINTR (To be restarted)
0.098023 --- SIGALRM (Alarm clock) @ 0 (0) ---
0.000000 sigreturn() = ? (mask now [])
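The pattern in the strace output above (a clone() that returns ERESTARTNOINTR, a SIGALRM, then the same clone() again) can be counted mechanically when the log is longer than a few lines. The following is only a sketch: the function name and the sample constant are made up for illustration, and it assumes the `strace -r` line layout shown above, with a relative timestamp at the start of each line.

```python
import re

# One `strace -r` line: a relative timestamp followed by the event text.
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s+(.*)$")

# The six lines quoted in the comment above, used here as sample input.
SAMPLE = """\
0.000000 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cfb708) = ? ERESTARTNOINTR (To be restarted)
0.098460 --- SIGALRM (Alarm clock) @ 0 (0) ---
0.000000 sigreturn() = ? (mask now [])
0.000000 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7cfb708) = ? ERESTARTNOINTR (To be restarted)
0.098023 --- SIGALRM (Alarm clock) @ 0 (0) ---
0.000000 sigreturn() = ? (mask now [])
"""


def summarize_clone_restarts(strace_text):
    """Return (number of restarted clone() calls, timestamps on SIGALRM lines)."""
    restarted_clones = 0
    alarm_gaps = []
    for line in strace_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a timestamped strace line
        delta, event = float(match.group(1)), match.group(2)
        if event.startswith("clone(") and "ERESTARTNOINTR" in event:
            restarted_clones += 1
        elif event.startswith("--- SIGALRM"):
            alarm_gaps.append(delta)
    return restarted_clones, alarm_gaps
```

On the sample, every clone() is reported as restarted and each SIGALRM line carries a timestamp of roughly 0.098 s, which matches the reading that the X process never gets past the fork.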
CC-ing Jan. Jan, do you have any idea what might be going on, keeping in mind that this hangs on PAE i586 and x86_64, but works on non-PAE i586?

Sounds like a kernel issue then.

OK, looks pretty specific. So I decided to remove the ship stopper flag.

Re #33: This would pretty much indicate a 32-bit DMA issue (i.e. something trying a DMA32 [or DMA] allocation and using its result, translated to phys/machine, directly, i.e. without going through xen_create_contiguous_region). I had screened the agp and drm code in the kernel for such issues quite a while ago, and the problems this revealed are believed to be all fixed meanwhile. Also, I'm not clear about the final determination of whether 3D matters here (looking at the machine above, I can't e.g. see any drm module loaded, but the machine also isn't running Xen at the moment, nor is it using the bigsmp/xenpae kernels), as well as what configurations are affected (#33 states non-PAE is unaffected, but the machine pointed to in earlier comments doesn't appear to use or need PAE kernels). It might help to get a state dump from Xen (perhaps a number of times in case execution turns out to be in user mode) by sending 'd' from the serial console (make sure input focus is with Xen).

agp/drm is not involved here at all. It occurs somewhere in the X authentication code as I've written before.

But then you mean a generic kernel issue, not a Xen specific one (although with no explanation why it happens on [certain?] Xen kernels only)?

It must be a Xen kernel specific issue, as it only happens with a Xen kernel on Conroe/Core 2 Duo CPUs.

I see no proof of this so far, but let's wait until we have feedback on #37.

Created attachment 172792 [details]
'd' over the serial console while hung
Booted with xen-pae-dbg
This looks to be the most common bit of stack:
(XEN) [<ff138626>] smp_call_function_interrupt+0x86/0xa0
(XEN) [<ff12965b>] call_function_interrupt+0x5b/0x70
(XEN) [<ff17ed3c>] handle_exception+0x5c/0xab
Sorry, duh, ignore that bit o' stack in the comment. The hypervisor seems idle. The xen-pae-dbg I used was stock beta 3 plus, so if needed you can look at xen-syms* in there. But everything I've seen suggests to me that it's Linux that's hosed, not the hypervisor. Removing NEEDINFO.

Yes, this is what I concluded, too. And why I asked for doing a SysRq-t (telling the kernel to dump all tasks' states). I assume the kernel used then also is stock beta 3. Oh, please also attach a /proc/modules listing from the point when the dumps were taken.

Hmm, two of the backtraces in #42 don't seem to match 2.6.22.5-10-xenpae (address c02be131 is not at an instruction boundary).

I am running "beta 3 plus", not beta 3. kernel-xenpae is 2.6.22.5-16.

Created attachment 172878 [details]
sysrq 't'
System is in runlevel 1, with only syslog started. Then ran "startx" and did sysrq 't' twice.
openSUSE 10.3 beta 3 plus, on xen-pae
Created attachment 172880 [details]
sysrq 't'
System is in runlevel 1, with only syslog started. Then ran "startx" and did
sysrq 't' twice.
openSUSE 10.3 beta 3 plus, on xen-pae
Created attachment 172881 [details]
output from /proc/modules
Matching output from /proc/modules just before startx
I forgot to downgrade when I changed the flag.

Oh, sorry, I forgot that Linux doesn't dump the state of running tasks. The output in #50 is thus mostly useless. The traces from #42, knowing the kernel version, point out that the state dump always happened at a (faulting) access to the m2p map (i.e. an attempt to translate an mfn that doesn't have an entry in the m2p table) - perhaps the frame buffer? But this may mean nothing, X may just be cloning child processes in a rapid fashion...

This patch should be able to fix the issue: http://xenbits.xensource.com/xen-unstable.hg?cs=cd89771ba550 The changeset on xen unstable tree is 13028.

I don't think so, for two reasons:
- 10.3 is based on Xen 3.1, i.e. a changeset much newer than the one you indicate.
- You indicated earlier that the problem is seen on native, too.

*** Bug 326573 has been marked as a duplicate of this bug. ***

It seems not to be an X problem.
The following works:
Booting into runlevel 3, login as root, then
X & sleep 3 ; DISPLAY=:0 gnome &
I have a perfect X session without problems. I assume there is a problem within the session management. Neither startx, nor gdm, nor xdm work.

One more comment: I switched off the second core of my Core 2 Duo in my BIOS; absolutely the same behavior as with 2 cores.

Raised severity to blocker. This problem touches too many mid-range companies using XEN for development or in small production environments. BTW, does the same problem exist with dual core XEONs? I can't check it within the next few days.

(In reply to comment #57 from Andreas Graf)
> It seems not to be an X problem.
> The following works:
> Booting into runlevel 3, login as root, then
> X & sleep 3 ; DISPLAY=:0 gnome &
Authentication is no longer in use in that case.
> I have a perfect X session without problems. I assume there is a problem
> within the session management. Neither startx, nor gdm, nor xdm work.
See my comment above. See also comments #22-24.

Downgrading again. Please leave severities alone. A fix can very well be supplied by an online update.

This looks more and more like a reassignment party. I'll attach a minimalistic X11 program, with which you should be able to reproduce the problem as long as X authentication is in use.

Created attachment 173729 [details]
XOpenDisplay.c
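The attachment itself is not reproduced in this listing. Since the hang only shows up when X authentication is in use and was traced earlier in the thread to clients reading cookies from $HOME/.Xauthority, one cheap side check is to confirm that the cookie file parses cleanly. The sketch below is not taken from the thread; the function name is hypothetical, and it assumes the usual on-disk layout written by libXau: a big-endian 16-bit family, followed by four length-prefixed fields (address, display number, auth name, auth data).

```python
import struct


def parse_xauthority(blob):
    """Parse the raw bytes of an Xauthority file into a list of dicts."""
    entries = []
    pos = 0

    def counted(pos):
        # Each field is a big-endian 16-bit length followed by that many bytes.
        (length,) = struct.unpack_from(">H", blob, pos)
        start, end = pos + 2, pos + 2 + length
        if end > len(blob):
            raise ValueError("truncated field at offset %d" % pos)
        return blob[start:end], end

    try:
        while pos < len(blob):
            (family,) = struct.unpack_from(">H", blob, pos)
            pos += 2
            address, pos = counted(pos)
            number, pos = counted(pos)
            name, pos = counted(pos)
            data, pos = counted(pos)
            entries.append({
                "family": family,
                "address": address,
                "number": number,
                "name": name,
                "data_len": len(data),  # report only the cookie length, not the cookie
            })
    except struct.error as exc:
        raise ValueError("truncated entry: %s" % exc)
    return entries
```

It would be fed the contents of the file that the client actually uses, typically $HOME/.Xauthority unless the XAUTHORITY environment variable points elsewhere. A clean parse only rules out a damaged cookie file; it says nothing about why the server does not answer.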
Is this a XEN problem if this happens with the native kernel (no XEN loaded)? This in my opinion (not that it matters, but) should be a show stopper since it affects anyone (XEN or no XEN) running a Core 2 Duo processor. If this is the case, I'm going to have to run a different distro at home until 11 or whenever this gets resolved. Not to mention bad news for any companies using XEN on 10.3.

Blocker/show stopper or not, as Stefan commented this should not be a reassigning party. It doesn't get anything fixed and will only result in loss of users.

I suggest installing all xorg-x11*debuginfo packages, attaching gdb to the Xorg process (started by startx/gdm/xdm to make sure authentication is in use) and starting the sample X11 program. Maybe this gives us some more information on how the issue is triggered. Makes sense? Unfortunately 151.155.190.53 is no longer available for testing. Otherwise I would have tried this myself.

151.155.190.53 seems to be available again. Can you reboot into the Xen kernel again?

151.155.190.53 is running the Xen kernel again, but I can't reproduce the problem any longer. :-(

(In reply to comment #64 from Stefan Dirsch)
> I suggest to install all xorg-x11*debuginfo packages, attach gdb to Xorg
> process (started by startx/gdm/xdm to make sure authentication is in use)
> and start the sample X11 program.
Or start the sample X11 program in gdb first to see where it hangs.

Stefan, I updated the box with the PAE kernel and XEN. So it should fail again.

My wild guessing about libxcb was not that wrong.
Program received signal SIGINT, Interrupt.
0xf57fe402 in __kernel_vsyscall ()
(gdb) bt
#0 0xf57fe402 in __kernel_vsyscall ()
#1 0xb7df955d in select () from /lib/libc.so.6
#2 0xb7d2931f in _xcb_in_read_block (c=0x804b610, buf=0x804d780, len=8)
at xcb_in.c:248
#3 0xb7d28663 in xcb_connect_to_fd (fd=4, auth_info=0xbf947c30)
at xcb_conn.c:133
#4 0xb7d2ad11 in xcb_connect (displayname=0x0, screenp=0x0) at xcb_util.c:276
#5 0xb7eb2d2a in _XConnectXCB (dpy=0x804b008, display=0x0,
fullnamep=0xbf947d78, screenp=0xbf947d74) at xcb_disp.c:78
#6 0xb7e9b389 in XOpenDisplay (display=0x0) at OpenDis.c:168
#7 0x080485ed in main ()
(gdb)
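In the backtrace above the client sits in select() inside _xcb_in_read_block with len=8, called from xcb_connect_to_fd, which suggests it is waiting for the first 8 bytes of the server's connection setup reply. A probe that sends the same kind of setup request but gives up after a timeout can show whether the server answers at all, without leaving a hung process behind. This is a sketch under assumptions, not the attached test program: the function names are made up, the socket path /tmp/.X11-unix/X0 is the conventional one for display :0, and the request layout follows the X11 core protocol (byte-order byte, protocol version 11.0, auth name and data lengths, each string padded to a multiple of 4).

```python
import socket
import struct


def pad4(data):
    """Pad a byte string with zeros to a multiple of 4 bytes."""
    return data + b"\x00" * (-len(data) % 4)


def build_setup_request(auth_name=b"", auth_data=b""):
    """Build an X11 connection setup request in little-endian byte order."""
    header = struct.pack(
        "<BxHHHHxx",
        0x6C,             # 'l': least significant byte first
        11, 0,            # protocol major and minor version
        len(auth_name),   # e.g. len(b"MIT-MAGIC-COOKIE-1")
        len(auth_data),   # length of the cookie
    )
    return header + pad4(auth_name) + pad4(auth_data)


def probe_display(path="/tmp/.X11-unix/X0", auth_name=b"", auth_data=b"",
                  timeout=5.0):
    """Send a setup request and report the status byte of the reply.

    Raises socket.timeout if the server does not answer within `timeout`
    seconds, which is the case this bug is about.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(build_setup_request(auth_name, auth_data))
        header = b""
        while len(header) < 8:  # same 8-byte read as in the backtrace
            chunk = sock.recv(8 - len(header))
            if not chunk:
                raise ConnectionError("server closed the connection")
            header += chunk
    return {0: "failed", 1: "success", 2: "authenticate"}.get(header[0], "unknown")
```

Calling probe_display() with an empty auth name and data against a server that requires a cookie should normally come back quickly with "failed"; a timeout instead would point at the server side never replying, consistent with the X process being stuck.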
I can't reproduce this issue on 151.155.190.53 any more, although there is a xenpae kernel running.

It's still a problem. I've installed the 64-bit version of RC3 so that we don't have to worry about what it's booted into and whether or not PAE is there. Happy hacking!

Please retest after applying xorg-server patch (Oct11). thx

(In reply to comment #74 from Frank Kohler)
> please retest after applying xorg-server patch (Oct11). thx
xorg-server patch? Patch for which package?

From: Frank Kohler <FKohler@novell.com>
Just had this report. Personally I've seen an x package while doing an online update. Wasn't focused so unfortunately can't tell the name.

Besides the patch for xorg-x11-server (OpenOffice related fix) there were only security updates for X.Org. Therefore it's very unlikely that these updates fix this issue. I'm not sure why you think they possibly would.

xorg-x11:
-------------------------------------------------------------------
Mon Oct 1 15:40:00 CEST 2007 - sndirsch@suse.de
- bug-30806x.diff:
  * build_range() Server Integer Overflow Vulnerability in X Font Server (Bug #308064, IDEF2708)
  * fixes swap_char2b() Heap Overflow Vulnerability in X Font Server (Bug #308066, IDEF2709)

xorg-x11-libs:
-------------------------------------------------------------------
Tue Oct 2 09:08:43 CEST 2007 - sndirsch@suse.de
- libXfont-off_by_one.diff
  * prevent a one character overflow (Bug #327854)
-------------------------------------------------------------------
Wed Oct 3 15:03:27 CEST 2007 - sndirsch@suse.de
- xserver-1.3.0-xkb-and-loathing.patch
  * Ignore (not just block) SIGALRM around calls to Popen()/Pclose(). Fixes a hang in openoffice when opening menus. (Bug #245711)

So, I'm guessing that XEN on openSUSE 10.3 is not going to happen with Core 2 Duo. Hopefully this will go away or be fixed in openSUSE 11.0 and not appear in SLES 10 SP2.

*** Bug 328310 has been marked as a duplicate of this bug. ***

Is there any indication that this is still a problem with openSUSE 11.0?

The requested information has not been provided for over 4 weeks. The bug therefore gets closed as NORESPONSE. Please reopen the bug if you can supply the requested information. I would appreciate it if you could test the current openSUSE 11.1 Beta first and see whether that one fixes the problem.