Bugzilla – Full Text Bug Listing

| Summary: | KOTD locks up under heavy I/O load | ||
|---|---|---|---|
| Product: | [openSUSE] SUSE LINUX 10.0 | Reporter: | Manfred Hollstein <mh> |
| Component: | X11 3rd Party | Assignee: | Matthias Hopf <mhopf> |
| Status: | RESOLVED WONTFIX | QA Contact: | Stefan Dirsch <sndirsch> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | aritger, sndirsch |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | Other | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | CPU info, output from lspci; Output from running "strace -p `pidof X`"; Output from running "dmesg" on the SL-10.0 system | ||

Created attachment 61894 [details]
CPU info, output from lspci
Forgot to mention, the file system used to store "file" on both sides is XFS.

Try running these sync commands from a virtual console, after using "klogconsole -l8 -r0" to redirect all kernel messages to the current console. Maybe that shows some BUG() or some such.

Does it work if you use the open source nvidia driver? It's quite possible that the nvidia driver barfs when some memory allocation fails unexpectedly. I'm assigning this over to the X11 folks for the time being. Please reassign to us when there's evidence it's an issue in the kernel itself.

Correction: neither the kernel nor the X server "lock up"/freeze; I just realized that the system can still be reached via network, sorry for that. After remotely logging into the SL-10.0 box, running "top" shows that the X server is utilizing the CPU at approx. 99.9%. Running "strace -p `pidof X`" shows that the X server is in an endless loop (see attachment "X-trace.log"). FWIW, I'll also attach the output from "dmesg". BTW, "klogconsole -l8 -r0" didn't produce any output :-(

Created attachment 61918 [details]
Output from running "strace -p `pidof X`"
Created attachment 61919 [details]
Output from running "dmesg" on the SL-10.0 system
It's still unclear if this problem also exists with the open source nvidia driver.

Just copied the file several times between the two machines while running the "nv" driver; result is that the failure doesn't occur. Now, if I would only be able to watch my DVDs using the open source driver...

In the meantime I found the combo patch "NVIDIA_kernel-1.0-8178-U122205.diff.txt" on NVIDIA's website/user forum; applying it didn't fix anything, same behaviour as before. It really looks like any graphical activity after some real heavy I/O load can/will make the X server run into the same endless loop as described in the attachment from comment #5. In the case today I just moved the mouse from one terminal window to the other - at least, I tried to...

It looks like the GPU encountered several errors (the "Xid" errors in the dmesg output), which then results in the NVIDIA X driver never seeing the GPU become free again (hence the X server busy waiting). I expect those are just symptoms, rather than the cause of the problem. My guess is that there is some interaction problem between the NVIDIA kernel module and something about doing those syncs; maybe the syncs take a long time, and the NVIDIA driver isn't able to service interrupts quickly enough? We should still be able to handle that, but perhaps aren't doing a good job with that in this case.

Manfred: could you please capture an nvidia-bug-report.log? (run `nvidia-bug-report.sh` with the NVIDIA driver installed)

Stefan: could you please collect this information and file a bug in NVIDIA's bug system? Hopefully one of our kernel module engineers can take a look. Thanks.

Matthias, could you please file the bug in NVIDIA's bug system? I'll bounce you the bug report Manfred sent to linux-bugs@nvidia.com. Simply use it for the bug report. Thanks.

(In reply to comment #10)
> Manfred: could you please capture an nvidia-bug-report.log? (run
> `nvidia-bug-report.sh` with the NVIDIA driver installed)

Done. BCC:ed both of you.
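
The "Xid" errors mentioned in the analysis above can be pulled out of a saved dmesg log for the bug report. This is only a minimal sketch: it assumes the driver's messages contain the literal string "Xid", as the comment describes, and the function names and the log file name are made up for illustration.

```shell
# Print every line of a saved dmesg log that mentions "Xid".
# Assumption: the NVIDIA kernel module tags these messages with the word "Xid".
xid_lines() {
    grep 'Xid' "$1"
}

# Print how many such lines the log contains (0 if there are none).
# grep -c exits non-zero when nothing matches, hence the "|| true".
xid_count() {
    grep -c 'Xid' "$1" || true
}
```

For example, after saving the kernel log with `dmesg > dmesg.log` (hypothetical file name), `xid_count dmesg.log` gives a quick indication of whether the GPU reported errors before the X server started spinning.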
NVidia report #204902. I'll keep you posted.

Questions from NVidia:
0) Does this happen on PCI-Express based systems as well as AGP?
1) Are you able to reproduce this problem on more than one system?
2) Does this reproduce on filesystems other than XFS?
3) Does this require rsync to reproduce, or is any network based file copy (NFS, scp, etc) sufficient to reproduce?

(In reply to comment #14)
> Questions from NVidia:

This is what I responded to them in e-mail:

> 0) Does this happen on PCI-Express based systems as well as AGP?

I don't have a PCI-Express system, so this happens on AGP only for me.

> 1) Are you able to reproduce this problem on more than one system?

I have only one x86_64 system.

> 2) Does this reproduce on filesystems other than XFS?

Yes. In the meantime I found out that either a different filesystem or the time when I try to sync (if I can manage that at all...) only changes the time when the X server runs into the endless loop.

> 3) Does this require rsync to reproduce, or is any network based file copy
> (NFS, scp, etc) sufficient to reproduce?

Should be possible with any I/O bound application; in fact I faced these problems all day when I tried to use my own kernel (2.6.14 ... 2.6.14.3). To speed up things I tried rsync'ing a huge file over my GigE network, and it worked out...

FYI: Nvidia closed the bug as 'not an NVidia bug'. What to do now? We cannot look into binary only drivers...

It seems somewhat related to the use of the nvidia driver, since the problem does not occur when the nv driver is in use. Either leave it open (it might get fixed by accident by a newer driver or kernel) or set it to WONTFIX.

Nobody other than me has been able to reproduce it...
It really appears to be a problem with my specific hardware configuration: when I installed a TV-card, the problem went away; when I installed the latest BIOS update from ASUS (which explicitly talked about fixes for the 2.4 Linux kernel, but not about 2.6) and pulled the TV-card, the problem went away. After reshuffling the PCI cards in the system, the situation became even worse. Right now, I found a combination how to insert the cards so that the system appears to work pretty smoothly. As a result, I'll never buy a VIA chipset anymore; next system will have good old Intel chipset again... If they were only supporting AMD CPUs... ;-)

This very much sounds like NVidia is right, and the bug lies somewhere else (VIA :-P ) but is surfaced only by using the NVidia driver. Closing as WONTFIX, cannot do anything here.

Preamble: Performance (local disk and via NFS) appears to be rather bad on SL-10.0 compared with SLES 9; that's why I wanted to check how it works using the latest KOTD (version 2.6.15_rc7_git6-20060102172536). I'll upload details about my system as an additional attachment in a minute.

Symptom: Copying a 2GiB file from one machine (SLES 9) to SL-10.0, followed by a "sync; sync; sync; sync" freezes the X server on the SL-10.0 box.

Steps to reproduce:
1. Create a 2GiB file on some random server:
   dd if=/dev/zero of=/tmp/file bs=1M count=2048
2. Copy the file from random server to the SL-10.0 box:
   rsync -aH -e rsh -v -P --delete some-server:/tmp/file .
3. Sync the file systems:
   sync; sync; sync; sync

Result: X server on the SL-10.0 box is frozen (although the mouse still moves...).

Remarks: NVidia proprietary driver 1.0-81.78 is used (I know this is unsupported, but as most of our customers will use this combination, we should work with NVidia to get this fixed). FWIW, the same behaviour happened to me using various versions of vanilla 2.6.14 up to 2.6.14.3 - after which I gave up... Running a 32-bit installation on the same hardware doesn't result in similar behaviour.
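
For anyone trying to reproduce this, the three steps can be wrapped into small shell functions. This is a sketch rather than the reporter's actual script: the function names are invented for illustration, the block size is written out in bytes instead of "1M", and the host and path arguments are placeholders to be filled in.

```shell
# Step 1: create a zero-filled file of the given size in MiB.
# The report used /tmp/file and 2048 MiB.
make_test_file() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
}

# Step 2: pull the file from the remote host into the current directory,
# using the same rsync options as in the report.
fetch_test_file() {
    rsync -aH -e rsh -v -P --delete "$1:$2" .
}

# Step 3: flush dirty pages to disk; the report ran sync four times in a row.
flush_disks() {
    for i in 1 2 3 4; do
        sync
    done
}
```

On the server one would run `make_test_file /tmp/file 2048`, and on the SL-10.0 box `fetch_test_file some-server /tmp/file && flush_disks`, ideally while watching the X server's CPU usage from a remote login.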