Bugzilla – Bug 115800
nvidia: XEN support
Last modified: 2008-05-28 14:32:46 UTC
Kurt Garloff wrote: With a patch similar to the one in ... (see following attachment) I got nvidia to work under Xen. What it does is basically remove a confusion between gart and phys addresses. On plain x86, they happen to be the same. Many drm and agp drivers had that wrong, but it got fixed in 2.6.13. I'm pretty confident that the changes are good and won't hurt anyone. But checking is better ... certainly at this point in time. Can we run this on an nvidia machine and see whether it works? Let's first get it working on the plain kernel ... I had a short look at the DRM drivers in 2.6.13; they seemed to have all been fixed.
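The distinction between gart and phys addresses described above can be illustrated with a small userspace sketch. This is not driver code: the table `toy_p2m` and both function names are hypothetical, and the assumption that a Xen guest's "physical" page frames are remapped to different machine frames is a simplified model of why the two address spaces stop coinciding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  ((uint64_t)1 << TOY_PAGE_SHIFT)

/* Hypothetical toy mapping: guest page frame number -> machine page frame
 * number. A real guest would obtain this from the hypervisor. */
static const uint64_t toy_p2m[] = { 7, 2, 9, 4 };
#define TOY_P2M_ENTRIES (sizeof(toy_p2m) / sizeof(toy_p2m[0]))

/* Native x86 model: the address handed to the GART is the physical
 * address itself, so the translation is the identity. */
static uint64_t phys_to_gart_native(uint64_t phys)
{
    return phys;
}

/* Toy paravirtualized model: the page frame is looked up in the table,
 * the offset within the page is preserved. */
static uint64_t phys_to_gart_toy_xen(uint64_t phys)
{
    uint64_t pfn = phys >> TOY_PAGE_SHIFT;
    uint64_t off = phys & (TOY_PAGE_SIZE - 1);

    assert(pfn < TOY_P2M_ENTRIES); /* toy table only covers four pages */
    return (toy_p2m[pfn] << TOY_PAGE_SHIFT) | off;
}
```

Code that treats a physical address as if it were a gart address only works in the first model; in the second, the two values differ for the same page, which is the kind of confusion the patch is described as removing.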
Created attachment 49159 [details] nvidia driver fix for Xen kernel
Sorry, Kurt but this patch doesn't work for me.
    /tmp/NVIDIA-Linux-x86-1.0-7676-pkg1/usr/src/nv/nv-vm.c: In function ‘nv_vm_malloc_pages’:
    /tmp/NVIDIA-Linux-x86-1.0-7676-pkg1/usr/src/nv/nv-vm.c:241: error: implicit declaration of function ‘phys_to_gart’
    make[4]: *** [/tmp/NVIDIA-Linux-x86-1.0-7676-pkg1/usr/src/nv/nv-vm.o] Error 1
    make[3]: *** [_module_/tmp/NVIDIA-Linux-x86-1.0-7676-pkg1/usr/src/nv] Error 2
    make[2]: *** [modules] Error 2
    NVIDIA: left KBUILD.
This is with Kernel 2.6.13-8.
Created attachment 49160 [details] This additional patch seems to fix the build
I currently have the nvidia driver with the two patches applied running. Works so far (GeForce4 Ti 4600, IA32, Kernel 2.6.13-8-default, 1.0-7676 NVIDIA driver). I didn't test this on x86_64 yet. Anyway, I would like to hear a comment from Kurt and Andy about the two patches before we consider including them for RC2.
The patch doesn't build on x86_64 at all:
    nv-linux.h:1057: error: implicit declaration of function ‘virt_to_gart’
    nv-linux.h:1138: error: implicit declaration of function ‘gart_to_virt’
(The same two errors appear five times in the build output.) Kurt, please comment.
Stefan, thanks for testing! Sorry for screwing up the initial patch :-( In agp.h, I read:
    #define virt_to_gart(x) (phys_to_gart(virt_to_phys(x)))
    #define gart_to_virt(x) (phys_to_virt(gart_to_phys(x)))
which is the right implementation. Including asm/agp.h from nv-linux.h is the right solution. But let me do a compile test first ...
OK, got some more coffee. The definition is in drivers/char/agp/agp.h. So a solution that would work with older kernels as well could look like this hunk in nv-linux.h:
    #if defined (CONFIG_AGP) || defined (CONFIG_AGP_MODULE)
    #define AGPGART
    #include <linux/agp_backend.h>
    #include <linux/agpgart.h>
    #include <asm/agp.h>
    #ifndef phys_to_gart
    #define phys_to_gart(x) virt_to_bus(phys_to_virt(x))
    #define gart_to_phys(x) virt_to_phys(bus_to_virt(x))
    #define virt_to_gart(x) virt_to_bus(x)
    #define gart_to_virt(x) bus_to_virt(x)
    #else
    #ifndef virt_to_gart
    #define virt_to_gart(x) (phys_to_gart(virt_to_phys(x)))
    #define gart_to_virt(x) (phys_to_virt(gart_to_phys(x)))
    #endif
    #endif
    #endif
I'll create a patch and (compile-) test it.
Created attachment 49511 [details] nv-fix-gartaddr-xen.diff This fix builds and seems to work. (More tests under Xen required to validate.)
build tests (x86 + x86_64) done. runtime tests (32bit + 64bit/32bit) will follow.
> runtime tests (32bit + 64bit/32bit) will follow.
done. works fine for me.
Stefan, thanks for testing! While the patch is harmless for native x86/x86-64 nVidia, it's unfortunately not enough to make the nVidia driver work with Xen. __nv_disable_caches() and __nv_enable_caches() access cr4 which Xen won't allow you to do. These functions are called from __nv_setup_pat_entries() and __nv_restore_pat_entries(). Passing nv_disable_pat=1 to the module helps to solve this, so the module loads and initializes properly. Upon startup of X11, it still crashes (on x86-64):
    Sep 11 00:23:02 prescott kernel: general protection fault: 0000 [1]
    Sep 11 00:23:02 prescott kernel: CPU 0
    Sep 11 00:23:02 prescott kernel: Modules linked in: nvidia bridge [...]
    Sep 11 00:23:02 prescott kernel: Pid: 17621, comm: X Tainted: P U 2.6.13-10-xen
    Sep 11 00:23:02 prescott kernel: RIP: e030:[<ffffffff884b0ed4>] <ffffffff884b0ed4>{:nvidia:_nv002491rm+0}
    Sep 11 00:23:02 prescott kernel: RSP: e02b:ffff880028d11bc0 EFLAGS: 00010202
    Sep 11 00:23:02 prescott kernel: RAX: 0000000000000000 RBX: ffff880028d11be8 RCX: 00000000bfebfbff
    Sep 11 00:23:02 prescott kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880030712000
    Sep 11 00:23:02 prescott kernel: RBP: ffffffff888b4480 R08: ffff880028d11bdc R09: ffff880028d11bd8
    Sep 11 00:23:02 prescott kernel: R10: ffff880028d11be8 R11: 000000000000001c R12: ffff880030712000
    Sep 11 00:23:02 prescott kernel: R13: ffff880028670000 R14: ffffc2001107b000 R15: ffff88002ba7a800
    Sep 11 00:23:02 prescott kernel: FS: 00002aaaab35f0a0(0000) GS:ffffffff804bbc80(0000) knlGS:0000000000000000
    Sep 11 00:23:02 prescott kernel: CS: e033 DS: 0000 ES: 0000
    Sep 11 00:23:02 prescott kernel: Process X (pid: 17621, threadinfo ffff880028d10000, task ffff880038efa880)
    Sep 11 00:23:02 prescott kernel: Stack: ffffffff88497b5c ffff880028d11be0 ffffffff885d0e6a 000208000000651d
    Sep 11 00:23:02 prescott kernel: 00000f41bfebfbff 49656e69756e6547 bfebfbff6c65746e 0001040f00000000
    Sep 11 00:23:02 prescott kernel: ffff880000000000 ffff88002b87b000
    Sep 11 00:23:02 prescott kernel: Call Trace:<ffffffff88497b5c>{:nvidia:_nv001456rm+376} <ffffffff885d0e6a>{:nvidia:_nv004524rm+48}
    Sep 11 00:23:02 prescott kernel: <ffffffff884aac6c>{:nvidia:_nv003623rm+116} <ffffffff8861e78c>{:nvidia:_nv003247rm+126}
    Sep 11 00:23:02 prescott kernel: <ffffffff885d1920>{:nvidia:_nv004556rm+68} <ffffffff885d16fe>{:nvidia:_nv004385rm+104}
    Sep 11 00:23:02 prescott kernel: <ffffffff884aaad4>{:nvidia:_nv001453rm+96} <ffffffff88582308>{:nvidia:_nv000393rm+20}
    Sep 11 00:23:02 prescott kernel: <ffffffff88582483>{:nvidia:_nv000397rm+125} <ffffffff884ad921>{:nvidia:_nv001426rm+141}
    Sep 11 00:23:02 prescott kernel: <ffffffff884ab512>{:nvidia:_nv001458rm+668} <ffffffff884ae8c4>{:nvidia:rm_init_adapter+104}
    Sep 11 00:23:02 prescott kernel: <ffffffff886a40b7>{:nvidia:nv_kern_open+581} <ffffffff80183553>{chrdev_open+307}
    Sep 11 00:23:02 prescott kernel: <ffffffff80179df6>{dentry_open+246} <ffffffff80179f64>{filp_open+68}
    Sep 11 00:23:02 prescott kernel: <ffffffff801791fa>{get_unused_fd+90} <ffffffff8017a002>{sys_open+82}
    Sep 11 00:23:02 prescott kernel: <ffffffff80111a9d>{system_call+117} <ffffffff80111a28>{system_call+0}
    Sep 11 00:23:02 prescott kernel:
    Sep 11 00:23:02 prescott kernel:
    Sep 11 00:23:02 prescott kernel: Code: 0f 20 e0 c3 0f 20 d8 c3 53 48 89 cf 89 f0 89 d1 0f a2 89 07
    Sep 11 00:23:02 prescott kernel: RIP <ffffffff884b0ed4>{:nvidia:_nv002491rm+0} RSP <ffff880028d11bc0>
    Sep 11 00:23:02 prescott kdm: :0[17622]: IO Error in XOpenDisplay
    Sep 11 00:23:02 prescott kdm[17614]: Display :0 cannot be opened
    Sep 11 00:23:02 prescott kdm[17614]: Unable to fire up local display :0; disabling.
The machine survives, but you lose your console until the next reboot.
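The "Code:" bytes in the oops above can be checked against the cr4 explanation with a small decoder. This is a minimal sketch, assuming the byte dump starts at the faulting RIP; the function name `decode_mov_from_cr` is made up for illustration. It recognizes only the x86 `0f 20 /r` encoding (move from a control register to a general-purpose register) and reports which control register is read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the control register number read by a "mov %crN, reg"
 * instruction encoded at code[0..2], or -1 if the bytes are not the
 * 0f 20 /r opcode. Hypothetical helper for reading oops dumps by hand. */
static int decode_mov_from_cr(const uint8_t *code, size_t len)
{
    assert(code != NULL);

    if (len < 3 || code[0] != 0x0f || code[1] != 0x20)
        return -1;

    /* ModRM byte: bits 5..3 select the control register,
     * bits 2..0 select the destination general-purpose register. */
    return (code[2] >> 3) & 0x7;
}
```

Applied to the first three bytes of the dump (`0f 20 e0`), the helper returns 4, a read of cr4 followed by `c3` (ret). The next function in the dump (`0f 20 d8 c3`) decodes the same way as a read of cr3. A privileged control-register read issued by a paravirtualized guest kernel is consistent with the general protection fault reported at `_nv002491rm+0`.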
Ok. Let's reopen.
Created attachment 49567 [details] nv-fix-gartaddr-xen.diff Suggested patch, will disable PAT support on Xen by default and will make the driver load fail under Xen unless overridden.
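The behaviour described for this patch can be summarized as a small decision function. This is only a sketch of the stated policy, not the contents of the attachment: the struct, the function name, and the `xen_override` flag are hypothetical, and only `nv_disable_pat` corresponds to a module parameter mentioned in this report. The sketch also assumes PAT stays off whenever Xen is detected, which is a simplification of "disabled by default".

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of the (hypothetical) load-time policy check. */
struct load_decision {
    bool load_ok; /* may the module finish loading? */
    bool use_pat; /* should PAT entries be set up? */
};

/* running_on_xen: kernel was built or detected as a Xen kernel
 * nv_disable_pat: value of the nv_disable_pat module parameter
 * xen_override:   hypothetical "load anyway" switch */
static struct load_decision decide_load(bool running_on_xen,
                                        bool nv_disable_pat,
                                        bool xen_override)
{
    struct load_decision d;

    /* Under Xen the load is refused unless explicitly overridden. */
    d.load_ok = !running_on_xen || xen_override;

    /* PAT setup touches cr4, so it is skipped under Xen and whenever
     * the user disabled it. */
    d.use_pat = !running_on_xen && !nv_disable_pat;

    return d;
}
```

On a native kernel with default parameters this yields a normal load with PAT enabled; on Xen without the override the load is refused, and with the override the module loads with PAT left alone.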
Should be safe to use the new patch.
Created attachment 49575 [details] Regenerated consecutive patch I'll take this one.
New package submitted.
Ok. I think it's time to hear a comment from NVIDIA. Andy, is Xen support something NVIDIA is focused on?
Sorry for the slow response. I've been soliciting review from some of our kernel engineers at NVIDIA. I should have more information to post soon. Any real technical issues aside, I'd be concerned about disabling PAT -- how does Xen handle per-page cache attributes? At this point, NVIDIA has no plans to support Xen.
Setting to enhancement.
The last two hunks in nv-fix-gartaddr-xen.diff no longer apply with 1.0-8174, since nv_sg_map_buffer() and nv_sg_load() have moved to nv-vm.c and have changed. Since I'm not familiar with this patch and I don't want to break the driver I'll disable this patch for now.
Of course the consecutive patch no longer works as well. I'll disable it for now.
Kurt, in case you want to look at the patches, please use the nvidia-gfx-1_0_7676 package. I'll submit it ASAP and let you know.
JFYI: There are some efforts to get the nvidia drivers working with xen in the nvidia linux discussion forum: http://www.nvnews.net/vbulletin/showthread.php?t=65198 http://www.nvnews.net/vbulletin/showthread.php?t=60125
Lonni, Andy. Are there really no plans to support Xen in one of the next releases? More and more people begin to use it, also on their desktop machines ...
*** Bug 274597 has been marked as a duplicate of this bug. ***
*** Bug 307510 has been marked as a duplicate of this bug. ***
I found an interesting statement at: http://www.nvnews.net/vbulletin/showthread.php?t=95483 zander, NVIDIA Corporation wrote: --- No, this patch won't be included in future driver releases. Please note that although doing so is unsupported, 100.14.11 can be built against RHEL5's Xen kernels without patches if the IGNORE_XEN_PRESENCE environment variable is set to a non-zero value (you may also need to create an include2 directory in the top-level directory of the kernel development files and place an asm symlink to /usr/src/kernels/2.6.18-8.el5-xen-i686/include/asm-i386 in it). Mileage with other Xen kernels will vary. I do not believe the Xen patches posted for 100.14.11 are correct. I hope to take a look at providing one for Fedora Core 7, etc., at some point in the future. ---
Comment #30 sounds interesting to me. I'm not sure what this means for SLES/openSUSE kernels though.
*** Bug 353513 has been marked as a duplicate of this bug. ***
I finally decided to no longer track proprietary NVIDIA driver bugs against openSUSE. Therefore I'm closing these now as WONTFIX. In case you're using our SLES/SLED products and can reproduce this issue on these products as well, feel free to reopen. These are still tracked, since customers of these products depend on the proprietary driver for newer NVIDIA hardware. Be aware that you need a privileged account to track anything against our SLES/SLED products. So if this is not an option for you, I suggest reporting the problem to the official NVIDIA driver feedback channels (forum/email; see NVIDIA driver download site) and referring to this bug report.