Bug 536549

Summary: Frequent system freezes after kernel update to 2.6.27.29-0.1-default
Product: [openSUSE] openSUSE 11.1
Reporter: Yorck-Fabian Beensen <beensen>
Component: Kernel
Assignee: Neil Brown <nfbrown>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P2 - High
CC: beensen, jeffm, jnelson-suse, kai.makisara, mmarek
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: openSUSE 11.1
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments:
  output of dmesg
  /var/log/messages
  results of "echo w > /proc/sysrq-trigger" when experiencing the badness

Description Yorck-Fabian Beensen 2009-09-03 12:54:13 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.13) Gecko/2009080200 SUSE/3.0.13-0.1.2 Firefox/3.0.13

After updating the kernel to 2.6.27.29-0.1-default, the system frequently freezes completely (mostly after a delay of 2-3 hours). The system can still be reached from outside (e.g. via ssh), but neither X11 nor the console is available.

We observed that some process begins to grow in CPU usage just before the freeze (a different process each time). As soon as it occupies 100% CPU, the system freezes (even though we have quad-core machines).

This bug is reproducible (we observe it on 5+ identical machines running openSUSE 11.1).


Reproducible: Always

Steps to Reproduce:
1. Write dissertation and forget to save file.

Actual Results:  
 system freezes completely

Expected Results:  
worked flawlessly
Comment 1 Michal Marek 2009-09-04 07:12:32 UTC
Is there anything interesting in 'dmesg' output after the freeze?

Does it work again if you downgrade to the previous kernel (http://download.opensuse.org/update/11.1/rpm/x86_64/?P=kernel-default%2A-2.6.27.25%2Ax86_64.rpm)?
Comment 2 Yorck-Fabian Beensen 2009-09-07 08:35:22 UTC
Created attachment 316990 [details]
output of dmesg
Comment 3 Yorck-Fabian Beensen 2009-09-07 08:35:53 UTC
Created attachment 316991 [details]
/var/log/messages
Comment 4 Yorck-Fabian Beensen 2009-09-07 08:51:29 UTC
Hi Michal,
we do not see this phenomenon with the previous kernel (2.6.27.25).
There appears to be a relationship with NFS. We can still log in remotely as root (which is a local user), but not as a yp user with an NFS-mounted home directory. After the freeze, when logged in remotely as root, we have no access to NFS-mounted volumes, and restarting the NFS client services fails (no output, frozen terminal).
Please find attached the output of dmesg after the freeze and the file /var/log/messages.
Bye
Yorck
Comment 5 Jon Nelson 2009-09-10 03:29:58 UTC
I think I'm having a similar issue. See bug 533610 which may be related, however:

for me the process that grows to 100% is *usually* but not always ksysguardd.
Whichever process it is, attempts to strace it always fail (rather, they always hang and strace is not killable).  The "100% process" is also unkillable. 

The iwlagn driver may also be involved, as 'rmmod iwlagn' never returns.

I have the results of "echo t > /proc/sysrq-trigger"
and "echo w > /proc/sysrq-trigger" available, if necessary.

This kernel (2.6.27.29) has been a big bummer for me: after reverting to 2.6.27.25, the issues just go away.
Comment 6 Jon Nelson 2009-09-10 03:36:56 UTC
Created attachment 317524 [details]
results of "echo w > /proc/sysrq-trigger" when experiencing the badness
Comment 7 Jeff Mahoney 2009-09-10 03:52:04 UTC
So I assume there was no oops, then? You're right on with your analysis of it being wifi related. Sometimes when there are weird mutex hangs, it can be because of an oops that went unnoticed.
Comment 8 Jon Nelson 2009-09-10 04:06:47 UTC
There was no oops this time or previous times (for me).
Comment 9 Yorck-Fabian Beensen 2009-09-10 07:59:57 UTC
We don't have an oops, either.
In our case, it cannot have anything to do with wifi, since the trouble appears also on machines where no wifi is installed.

Our troubling process is also unkillable and it is always a different process (we saw "bash", "tcsh", "sh", "rpciod", "kurl_runner").
Comment 10 Kai Mäkisara 2009-09-14 06:13:00 UTC
(In reply to comment #7)
> So I assume there was no oops, then? You're right on with your analysis of it
> being wifi related. Sometimes when there are weird mutex hangs, it can be
> because of an oops that went unnoticed.

We had similar symptoms on Friday with our server, which has no wifi hardware, after we rebooted it into 2.6.27.29-0.1-default.

We downgraded it to 2.6.27.25-0.1-default and no problems so far. (Earlier it had run 2.6.27.21-0.1-default for months.)

This server has a local system disk but mounts home directories and data disks from other servers (NFS v3, SuSE 10.0, various Tru64 versions).
Comment 11 Jon Nelson 2009-09-22 00:36:09 UTC
I had another hang today on a previously stable machine.
"cp" was copying a small file (about 65K) and sat there and spun the CPU at 100%. It was unkillable, un-strace-able, and in other ways exhibited symptoms similar to the other processes.

However, before the machine hung entirely I got an "echo t > /proc/sysrq-trigger" (or perhaps it was echo w, I don't recall....)

Here it is:


cp            R  running task        0  7793   5978
 000000000000000e ffff88003a14e160 ffff880037014840 ffff88003adbd800
 000000000000000e 0000000000000000 0000000000000000 000000000000000a
 0000000000000000 0000000000000011 ffff88003a14e160 ffff880000000000
Call Trace:
Inexact backtrace:

 [<ffffffff8041e2a2>] kernel_sendmsg+0x39/0x4f
 [<ffffffffa02f6e8b>] xs_send_kvec+0x78/0x7f [sunrpc]
 [<ffffffffa02f6f1b>] xs_sendpages+0x89/0x1a1 [sunrpc]
 [<ffffffffa02f7116>] xs_tcp_send_request+0x44/0x127 [sunrpc]
 [<ffffffffa02f55ca>] xprt_prepare_transmit+0x62/0x8c [sunrpc]
 [<ffffffffa02f3ab1>] rpc_xdr_encode+0xf8/0x155 [sunrpc]
 [<ffffffffa02f9eb2>] __rpc_execute+0x77/0x22d [sunrpc]
 [<ffffffffa02f4482>] rpc_run_task+0x4f/0x57 [sunrpc]
 [<ffffffffa02f4571>] rpc_call_sync+0x3d/0x5a [sunrpc]
 [<ffffffffa034f792>] nfs3_rpc_wrapper+0x19/0x50 [nfs]
 [<ffffffffa034fcf9>] nfs3_proc_access+0x12b/0x18c [nfs]
 [<ffffffff80211dcc>] read_tsc+0x9/0x1c
 [<ffffffff80255c55>] getnstimeofday+0x52/0xad
 [<ffffffff80252cf0>] ktime_get_ts+0x21/0x49
 [<ffffffff8027f457>] delayacct_end+0x7d/0x88
 [<ffffffffa0340f18>] nfs_do_access+0x15d/0x1c4 [nfs]
 [<ffffffffa0341059>] nfs_permission+0xda/0x142 [nfs]
 [<ffffffff802b8e11>] __inode_permission+0x7c/0xca
 [<ffffffff802b8e77>] path_permission+0x18/0x32
 [<ffffffff802ba67f>] __link_path_walk+0x12c/0xd68
 [<ffffffff802bb486>] path_walk+0x5e/0xba
 [<ffffffff802bb644>] do_path_lookup+0x162/0x1b9
 [<ffffffff802ba4f1>] getname+0x13e/0x1a0
 [<ffffffff802bbfdd>] user_path_at+0x48/0x79
 [<ffffffff802b475d>] cp_new_stat+0xe9/0xfc
 [<ffffffff802b4a4e>] vfs_stat_fd+0x18/0x44
 [<ffffffff802b4ad6>] sys_newstat+0x19/0x31
 [<ffffffff8049d4a9>] error_exit+0x0/0x51
 [<ffffffff8020bfbb>] system_call_fastpath+0x16/0x1b


Does this help?
2.6.27.29 is really turning out to be a bummer for me. Is there anything I can do here?  Would one of the other kernels help debug this?
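
One way to read a trace like the one above is to count its frames per kernel module. The following is a minimal sketch, not something anyone on this bug ran: the function name "summarize_modules" is made up for illustration, and it assumes the trace lines have the " [<address>] symbol+off/len [module]" shape shown above. Frames with no [module] suffix are counted as "kernel".

```shell
#!/bin/sh
# Sketch: count backtrace frames per kernel module.
# summarize_modules is a hypothetical helper name; it reads trace text
# from the files given as arguments, or from stdin when none are given.
summarize_modules() {
    awk '
        # Only consider frame lines such as " [<ffffffffa02f6e8b>] sym+0x78/0x7f [sunrpc]"
        /^[ \t]*\[<[0-9a-f]+>\]/ {
            mod = "kernel"                      # default: frame in the core kernel
            if (match($0, /\[[A-Za-z0-9_]+\][ \t]*$/)) {
                # Strip the surrounding brackets from the trailing [module] tag
                mod = substr($0, RSTART + 1, RLENGTH - 2)
                sub(/\].*$/, "", mod)
            }
            count[mod]++
        }
        END { for (m in count) printf "%d %s\n", count[m], m }
    ' "$@" | sort -rn
}
```

With the trace saved to a file (say, a hypothetical trace.txt), this could be run as "summarize_modules trace.txt". In the trace above, every module-tagged frame belongs to sunrpc or nfs, which is consistent with the NFS relationship reported in comment #4.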
Comment 12 Jon Nelson 2009-09-22 00:37:48 UTC
I will also note that using "umount -a -f -t nfs" seemed to kick the process, after which the box *really* hung - (keyboard lights flashing and such) requiring a hard reset.
Comment 13 Jon Nelson 2009-11-01 21:18:43 UTC
This continues to be a regression even with 2.6.27.37.

It would appear that all kernels after 2.6.27.25 suffer from this issue.

Today, attempts to suspend showed two processes:

ksysguardd and keuphoria.kss (screen saver).


keuphoria seemed to exit on its own
ksysguardd cannot be straced (it's stuck in the kernel somewhere)
however, unlike before, ksysguardd *can* be killed (regular 'kill' works)
Comment 15 Neil Brown 2009-11-13 03:09:15 UTC
What would really help to narrow this down would be
several complete trace listings.
i.e. when a problem seems to be occurring, do

  echo t > /proc/sysrq-trigger
  sleep 1
  echo t > /proc/sysrq-trigger
  sleep 1
  echo t > /proc/sysrq-trigger

and then gather all of the kernel logs that were generated and 
attach them.
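
The three-dump sequence above can be wrapped in a small helper so it is quick to run while a machine is misbehaving. This is a minimal sketch under stated assumptions: the function name "collect_task_dumps" and its parameters (trigger path, dump count, delay) are invented for illustration, and writing to /proc/sysrq-trigger requires root.

```shell
#!/bin/sh
# Sketch: request several sysrq task dumps in a row.
# collect_task_dumps is a hypothetical helper; arguments are
#   $1 = sysrq control file (default /proc/sysrq-trigger)
#   $2 = number of dumps to request (default 3)
#   $3 = seconds to wait between dumps (default 1)
collect_task_dumps() {
    trigger=${1:-/proc/sysrq-trigger}
    count=${2:-3}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        # "t" asks the kernel to log the state and stack of every task
        echo t > "$trigger" || return 1
        i=$((i + 1))
        # Pause between dumps, but not after the last one
        if [ "$i" -lt "$count" ]; then
            sleep "$delay"
        fi
    done
    return 0
}
```

For example, "collect_task_dumps && dmesg > sysrq-t.log" (the output file name is arbitrary) would request three dumps one second apart and then save the kernel ring buffer. On a system with many tasks the ring buffer may not hold all of the output, in which case /var/log/messages is likely the more complete source.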

And if I could get several of those from different people experiencing
it on different systems that would help even more.

Thanks.
Comment 16 Neil Brown 2009-11-24 03:06:52 UTC
I really need more data to be able to approach this problem.
See comment #15
So setting needinfo to Jon in the hope that you can help...
Comment 17 Yorck-Fabian Beensen 2009-11-24 07:47:36 UTC
Hi,

We fixed this problem ourselves after waiting a long time with nothing happening.
The problem had been known since February of this year and was related to an NFS client packet storm (the Red Hat people had already fixed it then, Novell only partially). One or two weeks after we fixed the kernel, a new version of the SUSE kernel was released (2.6.27.37), and the problem was fixed there as well (took long enough...).
Updating to the latest kernel release fixes the bug.

Yorck
Comment 18 Neil Brown 2009-11-24 10:31:27 UTC
Great!  Thanks for letting us know.
I'll close the bug as 'fixed'.
Comment 19 Jon Nelson 2009-11-24 13:22:21 UTC
Basically, ditto.  .29 was pretty unstable for me, and .37 was much better.
Since then, I've upgraded to openSUSE 11.2 and most things have gotten better.

Thanks!