Bug 141241

Summary: KOTD locks up under heavy I/O load
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Manfred Hollstein <mh>
Component: X11 3rd Party
Assignee: Matthias Hopf <mhopf>
Status: RESOLVED WONTFIX
QA Contact: Stefan Dirsch <sndirsch>
Severity: Normal
Priority: P5 - None
CC: aritger, sndirsch
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: Other
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: CPU info, output from lspci
Output from running "strace -p `pidof X`"
Output from running "dmesg" on the SL-10.0 system

Description Manfred Hollstein 2006-01-03 15:34:37 UTC
Preamble: Performance (local disk and via NFS) appears to be rather bad on
          SL-10.0 compared with SLES 9; that's why I wanted to check how it
          works using the latest KOTD (version 2.6.15_rc7_git6-20060102172536).

I'll upload details about my system as an additional attachment in a minute.

Symptom: Copying a 2GiB file from one machine (SLES 9) to SL-10.0, followed by a
         "sync; sync; sync; sync" freezes the X server on the SL-10.0 box.

Steps to reproduce:

  1. Create a 2GiB file on some random server:

       dd if=/dev/zero of=/tmp/file bs=1M count=2048

  2. Copy the file from random server to the SL-10.0 box:

       rsync -aH -e rsh -v -P --delete some-server:/tmp/file .

  3. Sync the file systems:

       sync; sync; sync; sync
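
The steps above can be collected into a small script. This is only a sketch and not part of the original report: the function names make_test_file and copy_and_sync are made up for illustration, and some-server is the placeholder host from step 2.

```shell
#!/bin/sh
# Sketch of the reproduction steps; both function names are hypothetical.

# Step 1: create a zero-filled file of the given size in MiB (default 2048 = 2 GiB).
make_test_file() {
    path=$1
    size_mib=${2:-2048}
    dd if=/dev/zero of="$path" bs=1M count="$size_mib" 2>/dev/null
}

# Steps 2 and 3: pull the file from the remote host, then flush it to disk.
copy_and_sync() {
    remote=$1    # for example some-server:/tmp/file
    rsync -aH -e rsh -v -P --delete "$remote" . || return 1
    sync; sync; sync; sync
}
```

On the server one would run `make_test_file /tmp/file`, and on the SL-10.0 box `copy_and_sync some-server:/tmp/file`.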

Result:

  X server on the SL-10.0 box is frozen (although the mouse still moves...).

Remarks:

  NVidia proprietary driver 1.0-81.78 is used (I know this is unsupported, but
  as most of our customers will use this combination, we should work with NVidia
  to get this fixed). FWIW, the same behaviour happened to me using various
  versions of vanilla 2.6.14 up to 2.6.14.3, after which I gave up...

Running a 32-bit installation on the same hardware doesn't result in similar
behaviour.
Comment 1 Manfred Hollstein 2006-01-03 15:36:46 UTC
Created attachment 61894 [details]
CPU info, output from lspci
Comment 2 Manfred Hollstein 2006-01-03 15:40:17 UTC
Forgot to mention, the file system used to store "file" on both sides is XFS.
Comment 3 Olaf Kirch 2006-01-03 15:44:37 UTC
Try running these sync commands from a virtual console, after using
"klogconsole -l8 -r0" to redirect all kernel messages to the current
console. Maybe that shows some BUG() or some such.

Does it work if you use the open source nvidia driver?

It's quite possible that the nvidia driver barfs when some memory
allocation fails unexpectedly.

I'm assigning this over to the X11 folks for the time being. Please reassign
to us when there's evidence it's an issue in the kernel itself.
Comment 4 Manfred Hollstein 2006-01-03 20:31:37 UTC
Correction: neither the kernel nor the X server "locks up"/freezes. I just realized that the system can still be reached via the network, sorry for that. After remotely logging into the SL-10.0 box, running "top" shows that the X server is using approximately 99.9% of the CPU.

Running "strace -p `pidof X`" shows that the X server is in an endless loop (see attachment "X-trace.log").

FWIW, I'll also attach the output from "dmesg". BTW, "klogconsole -l8 -r0" didn't produce any output :-(
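
One way to confirm from such a trace that the process is spinning is to count how often each system call appears; a busy loop tends to show up as a handful of calls dominating the log. A minimal sketch, assuming the usual strace line format of `name(args) = result`. The helper name summarize_strace is hypothetical, and the contents of X-trace.log are not reproduced here.

```shell
# Count system calls by name in an strace log read from stdin,
# most frequent first. Lines that do not start with "name(" are ignored.
summarize_strace() {
    awk '/^[a-z_0-9]+\(/ {
             name = $0
             sub(/\(.*$/, "", name)   # keep only the text before the first "("
             count[name]++
         }
         END { for (name in count) print count[name], name }' | sort -rn
}
```

It could be used as `summarize_strace < X-trace.log`.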
Comment 5 Manfred Hollstein 2006-01-03 20:32:32 UTC
Created attachment 61918 [details]
Output from running "strace -p `pidof X`"
Comment 6 Manfred Hollstein 2006-01-03 20:33:21 UTC
Created attachment 61919 [details]
Output from running "dmesg" on the SL-10.0 system
Comment 7 Stefan Dirsch 2006-01-03 22:23:23 UTC
It's still unclear if this problem also exists with the open source nvidia driver.
Comment 8 Manfred Hollstein 2006-01-04 09:14:52 UTC
Just copied the file several times between the two machines while running the "nv" driver; result is that the failure doesn't occur. Now, if I would only be able to watch my DVDs using the open source driver...
Comment 9 Manfred Hollstein 2006-01-08 20:38:00 UTC
In the meantime I found the combo patch "NVIDIA_kernel-1.0-8178-U122205.diff.txt" on NVIDIA's website/user forum; applying it didn't fix anything, same behaviour as before; it really looks like any graphical activity after some real heavy I/O load can/will make the X server run into the same endless loop as described in the attachment from comment #5. In the case today I just moved the mouse from one terminal window to the other - at least, I tried to... 
Comment 10 andy ritger 2006-01-08 21:41:48 UTC
It looks like the GPU encountered several errors (the "Xid" errors in the dmesg output), which then results in the NVIDIA X driver never seeing the GPU become free again (hence the X server busy waiting).  I expect those are just symptoms, rather than the cause of the problem.  My guess is that there is some interaction problem between the NVIDIA kernel module and something about doing those syncs; maybe the syncs take a long time, and the NVIDIA driver isn't able to service interrupts quickly enough?  We should still be able to handle that, but perhaps we aren't doing a good job with that in this case.
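
To see how many such GPU errors the kernel log recorded, the Xid lines can be filtered out of the dmesg output. A minimal sketch: the helper name xid_lines is hypothetical, and it simply matches the string "Xid", since the exact message format is not shown in this report.

```shell
# Print only the kernel log lines that mention an Xid error (reads stdin).
# "|| true" keeps the exit status zero when nothing matches.
xid_lines() {
    grep 'Xid' || true
}
```

For example, `dmesg | xid_lines | wc -l` would give the number of matching lines.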

Manfred: could you please capture an nvidia-bug-report.log? (run `nvidia-bug-report.sh` with the NVIDIA driver installed)

Stefan: could you please collect this information and file a bug in NVIDIA's bug system?  Hopefully one of our kernel module engineers can take a look.

Thanks.
Comment 11 Stefan Dirsch 2006-01-09 10:57:02 UTC
Matthias, could you please file the bug in NVIDIA's bug system? I'll bounce you the bug report Manfred sent to linux-bugs@nvidia.com. Simply use it for the bug report. Thanks.
Comment 12 Manfred Hollstein 2006-01-09 10:59:44 UTC
(In reply to comment #10)
> Manfred: could you please capture an nvidia-bug-report.log? (run
> `nvidia-bug-report.sh` with the NVIDIA driver installed)

Done. BCC:ed both of you.
Comment 13 Matthias Hopf 2006-01-09 17:33:00 UTC
NVidia report #204902.

I'll keep you posted.
Comment 14 Matthias Hopf 2006-01-10 10:23:26 UTC
Questions from NVidia:

0) Does this happen on PCI-Express based systems as well as AGP?
1) Are you able to reproduce this problem on more than one system?
2) Does this reproduce on filesystems other than XFS?
3) Does this require rsync to reproduce, or is any network based file copy (NFS, scp, etc) sufficient to reproduce?
Comment 15 Manfred Hollstein 2006-01-10 10:37:37 UTC
(In reply to comment #14)
> Questions from NVidia:

This is what I responded to them in e-mail:

> 0) Does this happen on PCI-Express based systems as well as AGP?

I don't have a PCI-Express system, so this happens on AGP only for me.

> 1) Are you able to reproduce this problem on more than one system?

I have only one x86_64 system.

> 2) Does this reproduce on filesystems other than XFS?

Yes. In the meantime I found out that using a different filesystem, or
varying the moment at which I try to sync (if I can manage that at all...),
only changes the time at which the X server runs into the endless loop.

> 3) Does this require rsync to reproduce, or is any network based file copy
> (NFS, scp, etc) sufficient to reproduce?

Should be possible with any I/O bound application; in fact I faced these
problems all the day when I tried to use my own kernel (2.6.14 ...
2.6.14.3). To speed up things I tried rsync'ing a huge file over my GigE
network, and it worked out...
Comment 16 Matthias Hopf 2006-01-16 11:23:05 UTC
FYI:
Nvidia closed the bug as 'not an NVidia bug'.

What to do now? We cannot look into binary only drivers...
Comment 17 Stefan Dirsch 2006-01-16 11:32:05 UTC
It seems somewhat related to the use of the nvidia driver, since the problem does not occur when the nv driver is in use. Either leave it open (it might get fixed by accident by a newer driver or kernel) or set it to WONTFIX.
Comment 18 Manfred Hollstein 2006-01-16 11:39:12 UTC
Nobody other than me has been able to reproduce it... It really appears to be a problem with my specific hardware configuration: when I installed a TV-card, the problem went away, when I installed the latest update BIOS from ASUS (which explicitly talked about fixes for the 2.4 Linux kernel, but not about 2.6) and pulled the TV-card, the problem went away. After re-shuffling the PCI cards in the system, the situation became even worse. Right now, I have found a combination of card slots with which the system appears to work pretty smoothly.

As a result, I'll never buy a VIA chipset anymore; next system will have good old Intel chipset again... If they were only supporting AMD CPUs... ;-)
Comment 19 Matthias Hopf 2006-01-16 13:13:47 UTC
This very much sounds like NVidia is right, and the bug lies somewhere else (VIA :-P ) but is surfaced only by using the NVidia driver.

Closing as WONTFIX, cannot do anything here.