NVIDIA GRID vGPU drivers will fail to load when used with XenServer 7.0 on systems with >512GB of RAM.

Answer ID 4249   |    Updated 02/14/2017 01:34 PM

Symptoms or errors

NVIDIA GRID vGPU drivers will fail to load when used with XenServer 7.0 on systems with >512GB of RAM. Users may experience symptoms such as:

· Error messages such as "NVIDIA Installer cannot continue" and "The graphics driver could not find compatible graphics hardware".

· Within the VM itself, "error 43" may be reported for the display adapter after the NVIDIA drivers are installed.

Root Cause and Debug

With the release of XenServer 7.0, Citrix changed XenServer's use of IOMMU addressing. XenServer 6.5 and earlier had IOMMU addressing disabled by default, so some customers may encounter this issue only after upgrading to XenServer 7.0.

Workaround / Solution

· For systems with between 512GB and 1TB of RAM, vGPU requires a workaround that sets the IOMMU addressing behavior to dom0-passthrough:

o From the command line: /opt/xensource/libexec/xen-cmdline --set-xen iommu=dom0-passthrough

· Alternatively, edit the bootloader configuration (/etc/grub.conf) so that the Xen command line contains:

o iommu=dom0-passthrough

Then reboot the host.
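
After the reboot it can be worth confirming that the option actually reached the hypervisor. The following is a minimal Python sketch, not an official tool. It assumes the Xen command line has already been obtained as a string (for example from the xen_commandline field of `xl info`, which is an assumption and not something this article specifies). The function names and the sample command line are hypothetical.

```python
from typing import List


def iommu_options(xen_cmdline: str) -> List[str]:
    """Return the sub-options of the last iommu= argument, or [] if absent."""
    options: List[str] = []
    for token in xen_cmdline.split():
        if token.startswith("iommu="):
            # Xen accepts a comma-separated list after iommu=
            options = token[len("iommu="):].split(",")
    return options


def has_dom0_passthrough(xen_cmdline: str) -> bool:
    """True if the workaround option is present on the Xen command line."""
    return "dom0-passthrough" in iommu_options(xen_cmdline)


# Hypothetical command line, for illustration only.
sample = "dom0_mem=1024M watchdog iommu=dom0-passthrough"
print(has_dom0_passthrough(sample))  # True
```

If the check returns False after a reboot, the option was most likely not written to the bootloader entry that the host actually booted.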

If the host has more memory than a typical laptop or desktop system, do not rely on dom0 ballooning. Instead, fix the dom0 memory at a value between 1GB and 4GB by adding an option such as dom0_mem=1024M to the Xen command line.

1GB is enough for a fairly large host; more will be needed if you expect your users to use advanced storage types such as ZFS or distributed filesystems.

Dedicating a fixed amount of memory to dom0 is good for two reasons:

  • First, the dom0 Linux kernel calculates various network-related parameters based on the amount of memory available at boot time.
  • Second, Linux needs memory to store memory metadata (per-page info structures), and this allocation is also based on the amount of memory available at boot time.

Now, if you boot the system with all of the memory visible to dom0 and then balloon dom0 down every time you start a new guest, dom0 ends up with only a small fraction of its original (boot-time) memory. The calculated parameters are then no longer correct, and a lot of memory is wasted on metadata for memory that dom0 no longer has. Ballooning down a busy dom0 may also have negative side effects.
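
To give a rough sense of scale for the metadata cost described above, here is a small Python sketch of the arithmetic. It is only an estimate under stated assumptions: 4 KiB pages and 64 bytes of per-page metadata, which is typical for x86-64 Linux but is not stated in this article. The 768 GiB host is a hypothetical example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

PAGE_SIZE = 4096       # assumption: 4 KiB pages
PAGE_INFO_BYTES = 64   # assumption: per-page info structure size on x86-64 Linux


def page_metadata_bytes(boot_mem_bytes: int) -> int:
    """Estimate memory used for per-page metadata, sized from boot-time memory."""
    pages = boot_mem_bytes // PAGE_SIZE
    return pages * PAGE_INFO_BYTES


# dom0 boots seeing all 768 GiB of a hypothetical host, then balloons down.
ballooned = page_metadata_bytes(768 * GIB)
# dom0 is fixed at boot with dom0_mem=1024M.
fixed = page_metadata_bytes(1 * GIB)

print(f"dom0 booted with 768 GiB: {ballooned / GIB:.0f} GiB of page metadata")
print(f"dom0 booted with 1 GiB:   {fixed / MIB:.0f} MiB of page metadata")
```

Under these assumptions the metadata is sized at 1/64 of boot-time memory, so a dom0 that boots with 768 GiB and is later ballooned down to a few GiB would still carry about 12 GiB of metadata, compared with about 16 MiB when dom0 is fixed at 1 GiB.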

· For systems with >1TB of RAM, the workaround does not fix the issue (users see runtime failures). The use of NVIDIA GRID cards with Citrix XenServer on systems with >1TB of RAM is unsupported by Citrix and NVIDIA. Users seeking support for such a system are advised to contact Citrix and NVIDIA support, quoting XenServer engineering reference: NVIDIA-436 or Citrix support reference: SR680982224.
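
The three memory ranges in this section can be summarised as a small decision helper. This is an illustrative Python sketch only: the function name is hypothetical, the thresholds are interpreted in binary units, and the treatment of hosts with exactly 512GB or exactly 1TB is an assumption, since the article only speaks of ">512GB" and ">1TB".

```python
GB = 1024 ** 3
TB = 1024 ** 4  # 1 TiB = 2**40 bytes


def vgpu_guidance(host_ram_bytes: int) -> str:
    """Map host RAM on XenServer 7.0 to the guidance given in this article."""
    if host_ram_bytes <= 512 * GB:
        return "no workaround needed"
    if host_ram_bytes <= 1 * TB:
        return "set iommu=dom0-passthrough on the Xen command line and reboot"
    return "unsupported: contact Citrix and NVIDIA support"


for ram_gb in (256, 768, 1536):
    print(f"{ram_gb} GB: {vgpu_guidance(ram_gb * GB)}")
```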

Other links

· Unsupported users wishing to discuss this issue further are encouraged to use the NVIDIA GRID support forums: http://gridforums.nvidia.com

· Most NVIDIA GPUs are limited to 1TB (40-bit) addressing: http://us.download.nvidia.com/XFree86/Linux-x86/349.12/README/addressingcapabilities.html

Applicable Products

NVIDIA GRID GPU cards, including Kepler and Maxwell cards, e.g. K1, K2, M6, M60, M10

Citrix XenServer 7.0 (current release). This may change in subsequent releases, as NVIDIA and Citrix are actively investigating options to resolve this issue without the need for a workaround.

Users of Dell R720 and R730 servers are advised to ensure their BIOS is up to date, as an additional issue in older BIOS versions may result in similar symptoms. See: http://nvidia.custhelp.com/app/answers/detail/a_id/4163/~/nvidia-grid-vgpu-on-dell-r730-/-r720-servers,-on-upgrade-to-citrix-xenserver
