Bug 157099

Summary: strace: bogus ERESTARTSYS return ...
Product: [openSUSE] SUSE Linux 10.1 Reporter: Michael Meeks <mmeeks>
Component: KernelAssignee: Andreas Kleen <ak>
Status: RESOLVED WONTFIX QA Contact: E-mail List <nld10-bugs-qa>
Severity: Normal    
Priority: P5 - None CC: federico
Version: Beta 7   
Target Milestone: Beta 8   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Michael Meeks 2006-03-10 15:38:32 UTC
I issue a 'read' syscall on a pipe:

static ReadResult
read_data (GString *str,
           gint     fd,
           GError **error)
{
  gssize bytes;        
  gchar buf[4096];    

 again:
  
  bytes = read (fd, buf, 4096);

  if (bytes == 0)
    return READ_EOF;
  else if (bytes > 0)
    {
      g_string_append_len (str, buf, bytes);
      return READ_OK;
    }
  else if (bytes < 0 && errno == EINTR)
    goto again;
  else if (bytes < 0)
    {
      g_set_error (error,
                   G_SPAWN_ERROR,
                   G_SPAWN_ERROR_READ,
                   _("Failed to read data from child process (%s)"),
                   g_strerror (errno));
      
      return READ_FAILED;
    }
  else
    return READ_OK;
}

I get an error message back from this method:

"Failed to read from child pipe (Unknown error 512)"

I notice #define ERESTARTSYS 512 in the relevant include file.

the strace [ *yes* I appreciate it's supposed to show erestartsys messages ] is:

6778  1142002038.398155 waitpid(6789,  <unfinished ...>
6778  1142002038.446495 <... waitpid resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 6789
6778  1142002038.450078 read(43,  <unfinished ...>
6778  1142002039.277533 <... read resumed> 0xb4999aa4, 8) = ? ERESTARTSYS (To be restarted)
6778  1142002039.277786 waitpid(6789,  <unfinished ...>
6778  1142002039.278148 <... waitpid resumed> NULL, 0) = -1 ECHILD (No child processes)
6778  1142002039.282166 close(43 <unfinished ...>
6778  1142002039.284476 <... close resumed> ) = 0
6778  1142002039.286248 close(45 <unfinished ...>

the waitpid/close etc. are the "bad things happened" case from above.

I'm reliably informed that ERESTARTSYS should not get to user-space, and the call should be re-started in the kernel: I have:

Linux linux 2.6.16-rc5-git9-2-smp #1 SMP Tue Mar 7 14:56:34 UTC 2006 i686 i686 i386 GNU/Linux

kernel-smp-2.6.16_rc5_git9-2

Interestingly, this is not an SMP machine - no idea why a yum upgrade would install (what is apparently) an SMP kernel.

HTH.
Comment 1 Michael Meeks 2006-03-10 16:15:13 UTC
Seemingly the problem only manifests itself under strace; so - perhaps an strace issue ?
Comment 2 Andreas Kleen 2006-03-10 16:39:16 UTC
Yes, ptrace sometimes funny stuff like this. But it has been doing this
forever so how is it suddenly critical?
Comment 3 Michael Meeks 2006-03-10 16:50:15 UTC
so - the problem occurs with default kernel too.

Andreas: is it fair to characterize your reply as:

"strace never works, we know it's broken & don't care much" ;-)

if so, that would tally with my strace experience of it hosing my system / applications in strange ways. To be fair, I've never bothered filing a bug before because it was not hard to pin the blame quickly.

In which case - I'll downgrade this to 'Normal' - it's a shame not to have an strace the works reliably though.
Comment 4 Andreas Kleen 2006-03-10 16:58:17 UTC
It usually works - there are just these nasty corner cases
If you can give me a simple self contained reproducer it might be able
to do something about it. But it's low priority for CODE10.
Comment 5 Federico Mena Quintero 2006-03-10 17:13:31 UTC
It's somewhat critical right now because we are using an strace-based tool to do login profiling:

http://primates.ximian.com/~federico/news-2006-03.html#login-time-2

... and the program that gets the ERESTARTSYS really doesn't like it.  We can have a temporary hack to "goto again" if we get ERESTARTSYS, just to make the program not break when being run under the profiling tool.

Andi, do you have any idea of why ptrace does this?
Comment 6 Andreas Kleen 2006-03-10 17:48:46 UTC
I have some guesses, but I won't touch this right now - far too risky.
Comment 7 Andreas Kleen 2006-03-10 17:49:40 UTC
...
Comment 8 Federico Mena Quintero 2006-03-10 18:53:47 UTC
Reopening to keep as a reference.
Comment 9 Andreas Kleen 2006-03-10 19:01:34 UTC
Ok, but you have to keep it then.
Comment 10 Olaf Kirch 2006-03-13 13:24:44 UTC
I tried to reproduce this on a current kernel on ia64, but the
my little test app doesn't see any ERESTARTSYS. It shows up in
the strace output, but for me the read syscall just resumes after
the SIGCHLD signal handler returns.
Comment 11 Andreas Kleen 2006-03-13 16:00:17 UTC
It probably only happens in complex situations - rescheduling at the right
time etc.
Comment 12 Michael Meeks 2006-03-15 11:34:03 UTC
Thanks for looking Olaf :-) the complex scenario FWIW is a full gnome-session on login, running under strace having threaded the activation server (custom libbonobo RPMs from http://go-oo.org/~michael/packages/). That lets us interleave the launch of 10 or so applets concurrently: so - quite some system load etc. either way while 100% reproducible in this setup, admittedly a difficult setup to reproduce & no (visible) problems without strace, though that may be a timing issue.

Thanks.
Comment 13 JP Rosevear 2006-03-22 20:41:45 UTC
Federico, our team is not going to fix this realistically.

Andreas, I'm going to reassign and mark resolved later instead of wontfix because you seem to agree its a bug.  Hopefully thats ok.
Comment 14 JP Rosevear 2006-03-22 21:16:23 UTC
This is for later.
Comment 15 Andreas Kleen 2006-03-23 00:41:28 UTC
From my side it's a WONTFIX
Comment 16 Andreas Kleen 2006-03-23 00:41:52 UTC
wontfix