How to reset GPUs on DGX-1 Systems
NVIDIA provides nvidia-smi, a tool for monitoring and managing the GPUs on the system. It can be used to reset GPUs either individually or as a group.
NOTE: On the DGX-1 and DGX-1V platforms, individual GPUs cannot be reset because they are linked via NVLink, so all the GPUs have to be reset simultaneously.
When a GPU reports Double Bit Errors, or repeated Single Bit Errors at the same location, the affected memory pages are retired. For the retired pages to be blacklisted (made unavailable to users and applications), the GPU needs to be reset. The reset also causes the driver to reload, which prevents applications from using the blacklisted memory.
Before the GPUs can be reset, all applications running on them must be shut down. One way to verify this is to run nvidia-smi as follows:
dgxuser@dgx-1:~$ nvidia-smi -q -d PIDS
Timestamp : Fri Feb 23 11:56:41 2018
Attached GPUs : 8
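The same check can be scripted. The sketch below is one possible approach and not part of the original procedure: it assumes the installed driver's nvidia-smi supports the --query-compute-apps option, and the helper name count_gpu_apps is hypothetical. It counts the non-empty lines of the CSV listing, so a count of 0 suggests that no compute processes are attached.

```shell
#!/bin/sh
# Hypothetical helper: count compute processes from CSV text on stdin.
# Each non-empty line is expected to look like "<pid>, <process name>".
count_gpu_apps() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -c '[^[:space:]]' || true
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    n=$(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | count_gpu_apps)
    echo "compute processes still attached: $n"
fi
```

A failing nvidia-smi call also yields a count of 0, so the result is best treated as a hint rather than proof that the GPUs are idle.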
If DCGM is being used to monitor the GPUs, shut down the host engine (daemon):
dgxuser@dgx-1:~$ sudo nv-hostengine -t
Host engine successfully terminated.
If the DGX-1 system is being monitored by any application or agent that is watching the GPUs, those applications or agents should also be shut down (for example, Nagios).
Once no applications are running on the GPUs, the nvidia-docker and nvidia-persistenced services must be stopped as follows:
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-docker
As a final check that no applications or agents are still using the GPUs, run:
dgxuser@dgx-1:~$ lsof /dev/nvidia*
Stop or kill every process listed (if any) before proceeding to the next step.
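Because a single process usually holds several /dev/nvidia* files open, the lsof listing tends to repeat the same PID many times. The sketch below shows one way to condense it to one line per process. The helper name summarize_holders is hypothetical, and the script assumes lsof's default column layout (COMMAND first, PID second). Running lsof with sudo may be needed to see processes owned by other users.

```shell
#!/bin/sh
# Hypothetical helper: reduce lsof's default table (read from stdin) to one
# "PID COMMAND" line per process, skipping the header row.
summarize_holders() {
    awk 'NR > 1 && !seen[$2]++ { print $2, $1 }'
}

# Run only where lsof and the NVIDIA device files are present.
if command -v lsof >/dev/null 2>&1 && ls /dev/nvidia* >/dev/null 2>&1; then
    lsof /dev/nvidia* 2>/dev/null | summarize_holders
fi
```

An empty result means lsof found no process holding the device files open.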
To reset the GPUs, run the nvidia-smi command as follows:
dgxuser@dgx-1:~$ sudo nvidia-smi -r
GPU 00000000:06:00.0 was successfully reset.
GPU 00000000:07:00.0 was successfully reset.
GPU 00000000:0A:00.0 was successfully reset.
GPU 00000000:0B:00.0 was successfully reset.
GPU 00000000:85:00.0 was successfully reset.
GPU 00000000:86:00.0 was successfully reset.
GPU 00000000:89:00.0 was successfully reset.
GPU 00000000:8A:00.0 was successfully reset.
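After the reset, it may be worth confirming that no GPU still reports pages pending retirement. The sketch below is an optional check and not part of the original procedure. It assumes the driver exposes the retired_pages.pending field through --query-gpu and prints Yes or No for each GPU; the helper name count_pending is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count GPUs whose pending flag reads "Yes".
# Input: one value per line, as printed by the query below.
count_pending() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -ci '^[[:space:]]*yes[[:space:]]*$' || true
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    pending=$(nvidia-smi --query-gpu=retired_pages.pending --format=csv,noheader | count_pending)
    echo "GPUs with page retirement still pending: $pending"
fi
```

The per-GPU detail is also available in the full report from nvidia-smi -q -d PAGE_RETIREMENT.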
Restart the nvidia-persistenced and nvidia-docker services, along with any monitoring agents and applications that were stopped earlier in the process.
dgxuser@dgx-1:~$ sudo systemctl start nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl start nvidia-docker
dgxuser@dgx-1:~$ sudo nv-hostengine dmon #only if DCGM is being used
Started host engine version 1.3.3 using port number: 5555
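The service and reset commands above can be collected into a small wrapper. The sketch below is only an illustration under stated assumptions: the names run, reset_all_gpus, and DRY_RUN are hypothetical, the DCGM host engine and any monitoring agents are left out and must be handled separately, and the script assumes the earlier checks already confirmed that nothing is using the GPUs. DRY_RUN defaults to 1, so by default the commands are only printed.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the services, reset every GPU, then start the services again.
reset_all_gpus() {
    run sudo systemctl stop nvidia-persistenced || return 1
    run sudo systemctl stop nvidia-docker || return 1
    run sudo nvidia-smi -r
    status=$?
    # Restart the services even if the reset itself failed.
    run sudo systemctl start nvidia-persistenced
    run sudo systemctl start nvidia-docker
    return "$status"
}

reset_all_gpus
```

Defaulting to a dry run is a deliberate choice for a disruptive operation: the printed plan can be reviewed first, and the script can then be rerun with DRY_RUN=0 to execute it.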