How to reset GPUs on DGX-1 Systems

NVIDIA provides a tool called nvidia-smi to monitor and manage the GPUs on the system. This tool can reset GPUs either individually or as a group. On the DGX-1 and DGX-1V platforms, individual GPUs cannot be reset because they are linked via NVLink, so all the GPUs must be reset simultaneously.


When a GPU reports Double Bit Errors, or repeated Single Bit Errors at the same location, the affected memory pages are retired. For the retired pages to be blacklisted (made unavailable to the user or application), the GPU must be reset. The reset also causes the driver to reload and prevents applications from using the blacklisted memory. Before the GPUs can be reset, all applications running on them must be shut down. One way to verify that none are running is to run nvidia-smi as follows:

dgxuser@dgx-1:~$ nvidia-smi -q -d PIDS

==============NVSMI LOG==============

Timestamp : Fri Feb 23 11:56:41 2018
Driver Version : 384.111

Attached GPUs : 8
GPU 00000000:06:00.0
Processes : None

GPU 00000000:07:00.0
Processes : None

GPU 00000000:0A:00.0
Processes : None

GPU 00000000:0B:00.0
Processes : None

GPU 00000000:85:00.0
Processes : None

GPU 00000000:86:00.0
Processes : None

GPU 00000000:89:00.0
Processes : None

GPU 00000000:8A:00.0
Processes : None
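
On a system with eight GPUs it is easy to misread this listing, so the check could be scripted. The following is a minimal sketch, not an NVIDIA-provided tool: count_busy_gpus is a hypothetical helper name, and it assumes the report keeps the layout shown above (one "GPU <bus id>" line per device, followed by "Processes : None" when the device is idle).

```shell
# Hypothetical helper: reads the text of "nvidia-smi -q -d PIDS" on stdin
# and prints how many GPUs do NOT report "Processes : None".
# Assumes the report layout shown above; other driver versions may differ.
count_busy_gpus() {
    awk '
        # One line per device, e.g. "GPU 00000000:06:00.0"
        /^[ \t]*GPU [0-9A-Fa-f]+:/ { gpus++ }
        # An idle device reports "Processes : None"
        /Processes[ \t]*:[ \t]*None/ { idle++ }
        END { print gpus - idle }
    '
}
```

For example, running "nvidia-smi -q -d PIDS | count_busy_gpus" should print 0 when it is safe to continue with the reset.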


Once no applications are running on the GPUs, the nvidia-docker and nvidia-persistenced services must be stopped as follows:

dgxuser@dgx-1:~$ sudo systemctl stop nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-docker

To reset the GPUs, run the nvidia-smi command as follows:

dgxuser@dgx-1:~$ sudo nvidia-smi -r
GPU 00000000:06:00.0 was successfully reset.
GPU 00000000:07:00.0 was successfully reset.
GPU 00000000:0A:00.0 was successfully reset.
GPU 00000000:0B:00.0 was successfully reset.
GPU 00000000:85:00.0 was successfully reset.
GPU 00000000:86:00.0 was successfully reset.
GPU 00000000:89:00.0 was successfully reset.
GPU 00000000:8A:00.0 was successfully reset.
All done.
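
Because the reset is what makes pending page retirements take effect, it may be worth confirming afterwards that nothing is still pending. The sketch below assumes the report from "nvidia-smi -q -d PAGE_RETIREMENT" contains one line per GPU whose label starts with "Pending" and whose value is Yes or No. The exact label varies between driver versions, so treat the pattern as an assumption and compare it against the output on your system. count_pending_retirements is a hypothetical helper name.

```shell
# Hypothetical helper: reads the text of "nvidia-smi -q -d PAGE_RETIREMENT"
# on stdin and prints how many GPUs still have a page retirement pending.
# Assumption: each GPU section has a line such as "Pending : No" or
# "Pending : Yes"; the label wording may differ on other drivers.
count_pending_retirements() {
    awk '
        /Pending[^:]*:[ \t]*Yes/ { pending++ }
        END { print pending + 0 }
    '
}
```

For example, "nvidia-smi -q -d PAGE_RETIREMENT | count_pending_retirements" would be expected to print 0 after a successful reset.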

Start the nvidia-persistenced and nvidia-docker services again as follows:

dgxuser@dgx-1:~$ sudo systemctl start nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl start nvidia-docker
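
The whole procedure can be collected into one small script so that the services are restarted even if the reset fails. This is a sketch only. The run and reset_all_gpus function names and the DRY_RUN variable are illustrative assumptions, and the script assumes you have already confirmed that no applications are using the GPUs.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch
# added so the sequence can be reviewed without touching the system).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the services, reset all GPUs, then start the services again.
# Returns the exit status of the stop/reset sequence.
reset_all_gpus() {
    run sudo systemctl stop nvidia-persistenced &&
    run sudo systemctl stop nvidia-docker &&
    run sudo nvidia-smi -r
    reset_status=$?

    # Restart the services whether or not the reset succeeded.
    run sudo systemctl start nvidia-persistenced
    run sudo systemctl start nvidia-docker

    return "$reset_status"
}
```

Setting DRY_RUN=1 before calling reset_all_gpus prints the five commands in order without executing them, which is a cheap way to check the sequence before running it on a production system.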

