Bugzilla – Full Text Bug Listing

| Summary: | Frequent system freezes after kernel update to 2.6.27.29-0.1-default | | |
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.1 | Reporter: | Yorck-Fabian Beensen <beensen> |
| Component: | Kernel | Assignee: | Neil Brown <nfbrown> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | | |
| Priority: | P2 - High | CC: | beensen, jeffm, jnelson-suse, kai.makisara, mmarek |
| Version: | Final | | |
| Target Milestone: | --- | | |
| Hardware: | x86-64 | | |
| OS: | openSUSE 11.1 | | |
| Whiteboard: | | | |
| Found By: | --- | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |

Attachments:
output of dmesg
/var/log/messages
results of "echo w > /proc/sysrq-trigger" when experiencing the badness

Description
Yorck-Fabian Beensen
2009-09-03 12:54:13 UTC

Is there anything interesting in the 'dmesg' output after the freeze? Does it work again if you downgrade to the previous kernel (http://download.opensuse.org/update/11.1/rpm/x86_64/?P=kernel-default%2A-2.6.27.25%2Ax86_64.rpm)?

Created attachment 316990: output of dmesg

Created attachment 316991: /var/log/messages

Hi Michal,

we do not see this phenomenon with the previous kernel (2.6.27.25). There appears to be a relationship with NFS. We can still log in remotely as root (which is a local user), but not as a YP user whose home directory is NFS-mounted. After the freeze, when logged in remotely as root, we have no access to NFS-mounted volumes, and restarting the NFS client services fails (no output, frozen terminal).

Please find attached the output of dmesg after the freeze and the file /var/log/messages.

Bye
Yorck

I think I'm having a similar issue. See bug 533610, which may be related. However, for me the process that grows to 100% is *usually*, but not always, ksysguardd. Whichever process it is, attempts to strace it always fail (rather, they always hang, and strace is not killable). The "100% process" is also unkillable. The iwlagn driver may also be involved, as 'rmmod iwlagn' never returns. I have the results of "echo t > /proc/sysrq-trigger" and "echo w > /proc/sysrq-trigger" available, if necessary. This kernel (2.6.27.29) has been a big bummer for me: after reverting to 2.6.27.25, the issues just go away.

Created attachment 317524: results of "echo w > /proc/sysrq-trigger" when experiencing the badness

So I assume there was no oops, then? You're right on with your analysis of it being wifi related. Sometimes when there are weird mutex hangs, it can be because of an oops that went unnoticed.

There was no oops this time or previous times (for me).

We don't have an oops, either. In our case, it cannot have anything to do with wifi, since the trouble also appears on machines where no wifi is installed. Our troubling process is also unkillable, and it is always a different process (we saw "bash", "tcsh", "sh", "rpciod", "kurl_runner").

(In reply to comment #7)
> So I assume there was no oops, then? You're right on with your analysis of it
> being wifi related. Sometimes when there are weird mutex hangs, it can be
> because of an oops that went unnoticed.

We had similar symptoms with our server on Friday when we rebooted it to 2.6.27.29-0.1-default, without any wifi hardware. We downgraded it to 2.6.27.25-0.1-default and have had no problems so far. (Earlier it had run 2.6.27.21-0.1-default for months.) This is a server with a local system disk that mounts home directories and data disks from other servers (NFS v3, SuSE 10.0, various Tru64 versions).

I had another hang today on a previously stable machine. "cp" was copying a small file (about 65K) and sat there and spun the CPU at 100%. It was unkillable, un-strace-able, and in other ways exhibited symptoms similar to the other processes. However, before the machine hung entirely I got an "echo t > /proc/sysrq-trigger" (or perhaps it was echo w, I don't recall....)
Here it is:

cp R running task 0 7793 5978
 000000000000000e ffff88003a14e160 ffff880037014840 ffff88003adbd800
 000000000000000e 0000000000000000 0000000000000000 000000000000000a
 0000000000000000 0000000000000011 ffff88003a14e160 ffff880000000000
Call Trace:
Inexact backtrace:
 [<ffffffff8041e2a2>] kernel_sendmsg+0x39/0x4f
 [<ffffffffa02f6e8b>] xs_send_kvec+0x78/0x7f [sunrpc]
 [<ffffffffa02f6f1b>] xs_sendpages+0x89/0x1a1 [sunrpc]
 [<ffffffffa02f7116>] xs_tcp_send_request+0x44/0x127 [sunrpc]
 [<ffffffffa02f55ca>] xprt_prepare_transmit+0x62/0x8c [sunrpc]
 [<ffffffffa02f3ab1>] rpc_xdr_encode+0xf8/0x155 [sunrpc]
 [<ffffffffa02f9eb2>] __rpc_execute+0x77/0x22d [sunrpc]
 [<ffffffffa02f4482>] rpc_run_task+0x4f/0x57 [sunrpc]
 [<ffffffffa02f4571>] rpc_call_sync+0x3d/0x5a [sunrpc]
 [<ffffffffa034f792>] nfs3_rpc_wrapper+0x19/0x50 [nfs]
 [<ffffffffa034fcf9>] nfs3_proc_access+0x12b/0x18c [nfs]
 [<ffffffff80211dcc>] read_tsc+0x9/0x1c
 [<ffffffff80255c55>] getnstimeofday+0x52/0xad
 [<ffffffff80252cf0>] ktime_get_ts+0x21/0x49
 [<ffffffff8027f457>] delayacct_end+0x7d/0x88
 [<ffffffffa0340f18>] nfs_do_access+0x15d/0x1c4 [nfs]
 [<ffffffffa0341059>] nfs_permission+0xda/0x142 [nfs]
 [<ffffffff802b8e11>] __inode_permission+0x7c/0xca
 [<ffffffff802b8e77>] path_permission+0x18/0x32
 [<ffffffff802ba67f>] __link_path_walk+0x12c/0xd68
 [<ffffffff802bb486>] path_walk+0x5e/0xba
 [<ffffffff802bb644>] do_path_lookup+0x162/0x1b9
 [<ffffffff802ba4f1>] getname+0x13e/0x1a0
 [<ffffffff802bbfdd>] user_path_at+0x48/0x79
 [<ffffffff802b475d>] cp_new_stat+0xe9/0xfc
 [<ffffffff802b4a4e>] vfs_stat_fd+0x18/0x44
 [<ffffffff802b4ad6>] sys_newstat+0x19/0x31
 [<ffffffff8049d4a9>] error_exit+0x0/0x51
 [<ffffffff8020bfbb>] system_call_fastpath+0x16/0x1b

Does this help? 2.6.27.29 is really turning out to be a bummer for me. Is there anything I can do here? Would one of the other kernels help debug this?
I will also note that using "umount -a -f -t nfs" seemed to kick the process, after which the box *really* hung (keyboard lights flashing and such), requiring a hard reset.

This continues to be a regression even with 2.6.27.37. It would appear that all kernels after 2.6.27.25 suffer from this issue. Today, attempts to suspend showed two processes: ksysguardd and keuphoria.kss (screen saver). keuphoria seemed to exit on its own. ksysguardd cannot be straced (it's stuck in the kernel somewhere); however, unlike before, ksysguardd *can* be killed (regular 'kill' works).

What would really help to narrow this down would be several complete trace listings. I.e., when a problem seems to be occurring, do

    echo t > /proc/sysrq-trigger
    sleep 1
    echo t > /proc/sysrq-trigger
    sleep 1
    echo t > /proc/sysrq-trigger

and then gather all of the kernel logs that were generated and attach them. And if I could get several of those from different people experiencing it on different systems, that would help even more. Thanks.

I really need more data to be able to approach this problem. See comment #15. So setting needinfo to Jon in the hope that you can help...

Hi,

we fixed this problem ourselves after waiting a long time with nothing happening. The problem had been known since February of this year and was related to an NFS client packet storm (the Red Hat people already fixed it then, Novell only partially). One or two weeks after we fixed the kernel, a new version of the SUSE kernel was released (2.6.27.37), and the problem was fixed there as well (took long enough...). Updating to the latest kernel release fixes the bug.

Yorck

Great! Thanks for letting us know. I'll close the bug as 'fixed'.

Basically, ditto. .29 was pretty unstable for me, and .37 was much better. Since then, I've upgraded to openSUSE 11.2 and most things have gotten better. Thanks!
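
For anyone who needs to gather the same data on another system: the capture sequence requested earlier in this thread (several "echo t > /proc/sysrq-trigger" writes one second apart, then collecting the kernel log) could be scripted roughly as below. This is only a sketch. The function name, its arguments, and the use of dmesg to read the log back are assumptions for illustration and do not come from the thread.

```sh
#!/bin/sh
# Sketch only: collect_task_dumps and its arguments are hypothetical names.
#
# collect_task_dumps TRIGGER COUNT OUTFILE [LOGCMD]
#   TRIGGER  file to write the SysRq command to
#            (/proc/sysrq-trigger on a real system, needs root)
#   COUNT    number of task dumps to request
#   OUTFILE  where to save the kernel log afterwards
#   LOGCMD   command that prints the kernel log (assumed default: dmesg)
collect_task_dumps() {
    trigger=$1
    count=$2
    outfile=$3
    logcmd=${4:-dmesg}

    i=0
    while [ "$i" -lt "$count" ]; do
        # 't' asks the kernel to dump the state of all tasks to its log
        echo t > "$trigger" || return 1
        i=$((i + 1))
        # wait one second between dumps, but not after the last one
        if [ "$i" -lt "$count" ]; then
            sleep 1
        fi
    done

    # $logcmd is left unquoted on purpose so "cmd arg" splits into words
    $logcmd > "$outfile"
}
```

On an affected machine this might be run as root as `collect_task_dumps /proc/sysrq-trigger 3 /tmp/sysrq-t.log`. The dmesg ring buffer can be too small to hold three full task dumps, so /var/log/messages may be the more complete source to attach.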