How to reset GPUs on DGX-1 Systems

Answer ID 4642   |    Updated 03/26/2018 04:38 AM

Background

NVIDIA provides a tool called nvidia-smi to monitor and manage the GPUs on the system. This tool can be used to reset GPUs either individually or as a group.

NOTE: On the DGX-1 and DGX-1V platforms, individual GPUs cannot be reset because they are linked via NVLink, so all of the GPUs must be reset simultaneously.
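For reference, on systems where the GPUs are not linked by NVLink, a single GPU can typically be targeted for reset by its index using the -i option of nvidia-smi (shown here only as an illustration; this form does not apply to DGX-1/DGX-1V):

sudo nvidia-smi -i 0 -r    # reset only GPU 0; not applicable on DGX-1/DGX-1V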

Troubleshooting

When a GPU reports Double Bit Errors, or repeated Single Bit Errors at the same memory location, the affected pages are retired. For the retired pages to be blacklisted (made unavailable to users and applications), the GPU must be reset. The reset also reloads the driver, which prevents applications from using the blacklisted memory. Before the GPUs can be reset, all applications running on them must be shut down. One way to verify that no processes are using the GPUs is to run nvidia-smi as follows:

dgxuser@dgx-1:~$ nvidia-smi -q -d PIDS

==============NVSMI LOG==============

Timestamp :                         Fri Feb 23 11:56:41 2018
Driver Version :                    384.111

Attached GPUs :                     8
GPU 00000000:06:00.0
Processes :                         None

GPU 00000000:07:00.0
Processes :                         None

GPU 00000000:0A:00.0
Processes :                         None

GPU 00000000:0B:00.0
Processes :                         None

GPU 00000000:85:00.0
Processes :                         None

GPU 00000000:86:00.0
Processes :                         None

GPU 00000000:89:00.0
Processes :                         None

GPU 00000000:8A:00.0
Processes :                         None

dgxuser@dgx-1:~$  

 

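To see which pages are waiting to be blacklisted, the page retirement status can be queried directly (assuming the PAGE_RETIREMENT display option is available in your driver version):

dgxuser@dgx-1:~$ nvidia-smi -q -d PAGE_RETIREMENT

Pages reported as pending in this output become unavailable to applications only after the GPUs are reset.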
If DCGM is being used to monitor GPUs, make sure to shut down the host engine (daemon).

dgxuser@dgx-1:~$ sudo nv-hostengine -t
Host engine successfully terminated.
dgxuser@dgx-1:~$
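If you are unsure whether the host engine is still running, a simple check using standard Linux tools is:

dgxuser@dgx-1:~$ pgrep -a nv-hostengine    # no output means the host engine is not running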

If the DGX-1 system is being monitored by any application or agent that is watching the GPUs, those applications or agents should also be shut down (for example, Nagios).

Once no applications are running on the GPUs, the nvidia-docker and nvidia-persistenced services must be stopped as follows:

dgxuser@dgx-1:~$ sudo systemctl stop nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-docker
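To confirm that both services are actually stopped before continuing, the systemd unit states can be queried, for example:

dgxuser@dgx-1:~$ systemctl is-active nvidia-persistenced nvidia-docker    # each should report "inactive"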

As a final check, verify that no applications or agents are still using the GPU devices by running:

dgxuser@dgx-1:~$ lsof /dev/nvidia*
dgxuser@dgx-1:~$

Make sure any processes listed are stopped or killed before proceeding to the next step.
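If lsof does list processes, stop them through their own service or job manager where possible; as a last resort they can be terminated by PID (the PID below is only a placeholder):

dgxuser@dgx-1:~$ sudo kill <PID>       # replace <PID> with a process ID reported by lsof
dgxuser@dgx-1:~$ lsof /dev/nvidia*     # re-run to confirm the devices are no longer open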

To reset the GPUs, run the nvidia-smi command as follows:

dgxuser@dgx-1:~$ sudo nvidia-smi -r
 GPU 00000000:06:00.0 was successfully reset.
 GPU 00000000:07:00.0 was successfully reset.
 GPU 00000000:0A:00.0 was successfully reset.
 GPU 00000000:0B:00.0 was successfully reset.
 GPU 00000000:85:00.0 was successfully reset.
 GPU 00000000:86:00.0 was successfully reset.
 GPU 00000000:89:00.0 was successfully reset.
 GPU 00000000:8A:00.0 was successfully reset.
 All done. 
dgxuser@dgx-1:~$
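Optionally, confirm the outcome of the reset by querying the page retirement and ECC status again; previously pending retired pages should now be blacklisted, and the volatile ECC error counters should read zero (assuming the same display options as above):

dgxuser@dgx-1:~$ nvidia-smi -q -d PAGE_RETIREMENT,ECC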

Restart the nvidia-persistenced and nvidia-docker services, along with any other monitoring agents and applications that were stopped earlier in the process.

dgxuser@dgx-1:~$ sudo systemctl start nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl start nvidia-docker
dgxuser@dgx-1:~$ sudo nv-hostengine    # only if DCGM is being used
Started host engine version 1.3.3 using port number: 5555
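To verify that the host engine is back up and can see all eight GPUs, it can be queried with the DCGM command-line tool, for example:

dgxuser@dgx-1:~$ dcgmi discovery -l    # lists the GPUs visible to the running host engine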