Bug 473302 - X.org busy-loops (100% CPU in a select() loop) while desktop is locked
Summary: X.org busy-loops (100% CPU in a select() loop) while desktop is locked
Status: RESOLVED DUPLICATE of bug 472046
Alias: None
Product: openSUSE 11.1
Classification: openSUSE
Component: X.Org (show other bugs)
Version: Final
Hardware: x86 Other
: P3 - Medium : Normal (vote)
Target Milestone: ---
Assignee: Egbert Eich
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-02-06 13:56 UTC by Michael Matz
Modified: 2009-02-28 17:07 UTC (History)
4 users (show)

See Also:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Matz 2009-02-06 13:56:48 UTC
This bug happens for me since I updated to 11.1 final from 11.1 beta4,
and it's extremely annoying for my office mates because my CPU fan is
very loud.  The error mode is a bit strange.  It happens after some time
after the desktop is locked (in my case via kdesktop_lock).  That first
spawns a screen-saver and after some time kills it (to not uselessly use
CPU) and switches off the monitor.

After the monitor is switched off it takes some more time for the bug
to show up.  The symptom of it is that X takes 100% CPU without anyone
doing anything on the system.  The reason is a busy-loop constructed with
zero timeout selects, as can be seen in an strace taken from a remote system
(i.e. without unlocking the desktop).  The relevant excerpt from that strace
looks like so:

13:48:02.254152 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout)
13:48:02.254289 clock_gettime(CLOCK_MONOTONIC, {174665, 645523431}) = 0
13:48:02.254383 setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
13:48:02.254483 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
13:48:02.254581 clock_gettime(CLOCK_MONOTONIC, {174665, 645815070}) = 0
13:48:02.254675 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout)
13:48:02.254812 clock_gettime(CLOCK_MONOTONIC, {174665, 646046604}) = 0
13:48:02.254906 setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
13:48:02.255006 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
13:48:02.255105 clock_gettime(CLOCK_MONOTONIC, {174665, 646338413}) = 0
13:48:02.255198 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout)

Note how select() is called with a zero timeout (i.e. return immediately no
matter if some FD is active or not).  This happens ad nauseam, and causes the
100% CPU load.

It goes away as soon as the desktop is unlocked (then Xorg starts using
reasonable timeout ala 175 seconds again).  It also goes away if someone
moves the mouse (which causes the monitor to be switched on, the graphical
screensaver to be started again, hence we then _have_ an active X11 client
again, which somehow breaks the busy loop).

Unfortunately it's slightly annoying to reproduce as it needs some time
after the monitor is switched off to materialize.  I.e. it doesn't start
immediately after the monitor is switched off, but only after some time
(Richard tells me that he had to move the mouse like every 20 minutes
which would translate to something like 10 minutes after the monitor is
switched off again).

I'm fairly sure this didn't happen in 11.1 beta4.
xorg-x11-server-7.4-17.3
x11-video-nvidiaG02-180.22-4.1
nvidia-gfxG02-kmp-pae-180.22_2.6.27.7_9.1-4.1

(yes, nvidia)
Comment 5 Michael Matz 2009-02-06 15:21:07 UTC
More data points: Ludwig seems to have the same problem (also KDE3, but
intel graphics on his laptop).  Not reproducible with just 'xset dpms ...'
it seems, though on this particular bug it can also mean that we just
didn't wait long enough for it to show up.
Comment 6 Michael Matz 2009-02-06 16:53:27 UTC
I can reproduce it now.  From a remote machine I simply do 
  xset dpms force standby
KDE3 will notice this, and starts kdesktop_lock and the kss module (the program
that paints something nice), immediately SIGSTOPs that kss module (because
due to DPMS it wouldn't be visible anyway), and then the countdown to the bug
starts.

stracing X then reveals that the select() loop starts with a 870 something
seconds timeout, which is reduced with each call to select, by the amount it
took for the select to return plus 20 milli-seconds.  So, for the first
~870 seconds after the screensaver becomes active all is well, as there's
still a time-out so select only returns seldomly.

After these ~870 seconds the timeout value is 0, and _doesn't_ get reset to
something larger again.  From then on X will take 100% CPU.  I can reproduce
this reliably, but it takes some time to show up (exactly the ~870 seconds).

I've tried looking at the patches that went in after beta4.  The only thing
which barely is related to screen-saving seems to be dpms_screensaver.diff,
from 2008-11-28.  I think this was shortly after beta4 was released.  The patch
itself looks innocent (just introduces a call to dixSaveScreens()), but
perhaps it triggers a bug somewhere else (in the timeout calculations for
select).

[probably irrelevant:
grepping for "select" in all diffs of xorg-x11-server reveals some occurences
in xorg-server-xf4vnc.patch and xorg-server-xf4vnc-busyloop.diff, where
at least some in xf4vnc.patch also explicitely use a {0,0} timeout.  That
at least looks dubious, as that usually is not what one wants.  But as this
is vnc, it's most probably unrelated.]
Comment 7 Michael Matz 2009-02-06 17:26:56 UTC
Man!  WaitFor.c is terrible spaghetti code.  I would see several reasons why
the select timeout could be set to zero in WaitForSomething.  In the SMART_SCHEDULE code, if there are any ClientsWithInput, or in general if
there are timers with an expire time in the past.  It might be that when
the screen is off (DPMS on) that either the timers don't go away (even when
they are expired) or that ClientsWithInput never goes empty.

Presumably that would make sense when the DPMS code wants to ensure that
when the display is off that as few CPU is used as possible, and due to a bug
there's possibly too much pruning or early-out so that events which would
normally remove expired timers or clients with input don't take action.

The dpms_screensaver.diff introduces a call to dixSaveScreens() which wasn't
there before (and without error handling) which does fiddle with timers a bit,
so maybe that's a lead.  Or possibly I'm completely off :-)
Comment 8 Ludwig Nussel 2009-02-09 09:32:09 UTC
I could reproduce using "xset dpms force standby" and patient waiting now. The behavior seen in strace is different for me though. select is called with one second timeout {1, 0} and then all of a sudden the timeout is lower {0, 726000} once and in all following calls zero. Just before the broken select call X does this:
writev(40, [{"B\1\232\0\1\0\200\3\0\0\0\0\221\274\r\0\0\0\0\0\217\274\r\0\340\212 \0\1\307\240\267", 32}], 1) = 32

fd 40 is one of the many fds on /tmp/.X11-unix/X0
Comment 9 Michael Matz 2009-02-09 10:41:52 UTC
So that means your initial select timeout is 1 second (where it is about 870
for me)?  That's fully possible, but still would hint at the same problem.
X has a list of timers (with an expire time).  The select timeout is calculated
in such a way that it times out at or before the next timer has to fire.
The list of timers depend on the applications running on the display, so
slight changes there are expected.  But the general theme is still the same.

My assumption is that this list of timers is not cleaned up in the bug case,
so that even already expired timers stay in the list.  Due to compensation code
for negative offsets this would lead exactly to the picture of zero-timeout
selects.

(the other possibility for that to happen is with the SMART_SCHEDULE code
as mentioned above)

Btw. Ludwig: You can find out which application is connected to the socket
corresponding to FD 40 in Xorg in the following way:
# lsof -p $(pidof X) | grep '40u.*unix'
X       6191 root   40u  unix 0xf7c94280      0t0      20618 /tmp/.X11-unix/X0
# lsof | grep 20617
kopete     6598       matz    4u     unix 0xf7c94080      0t0      20617 socket

What you need to look out for is a socket filedescriptor with an inode of one
less than the inode number of the socket on the X11 side (above 20618 on X
side, 20617 on kopete side).  This obviously depends on how the kernel invents
these inode numbers, but for my kernel (32bit kernel-pae-2.6.27.7-9.1) it uses
the above rules.
Comment 10 Ludwig Nussel 2009-02-09 12:32:55 UTC
Thanks for the hint. The receiving process was kded4. That suspicious write didn't show up again though. Yet if I kill kde4 stuff (kded4, kdeinit4) the busy loop stops.
Comment 11 Michael Matz 2009-02-09 14:43:37 UTC
And you can't reproduce it anymore when kde{d,init}4 isn't running anymore?
Then this one triggers the bug in Xorg, maybe a nice hint.
Comment 12 Magnus Boman 2009-02-09 21:50:12 UTC
This bug sounds related (if not identical) to https://bugzilla.novell.com/show_bug.cgi?id=472046 (I have the exact same trace as described in the description). The patch attached in bug#472046 fixes it for me.
Comment 13 Stefan Dirsch 2009-02-28 17:07:49 UTC
dup.

*** This bug has been marked as a duplicate of bug 472046 ***