How to reset GPUs on DGX-1 Systems
NVIDIA provides nvidia-smi, a tool for monitoring and managing the GPUs on a system. It can reset GPUs either individually or as a group. On the DGX-1 and DGX-1V platforms, individual GPUs cannot be reset because they are linked via NVLink, so all of the GPUs must be reset simultaneously.
When a GPU reports double-bit errors, or repeated single-bit errors at the same location, the affected memory pages are retired. For the retired pages to be blacklisted (made unavailable to users and applications), the GPU must be reset. The reset also causes the driver to reload, which prevents applications from using the blacklisted memory.
Before the GPUs can be reset, all applications running on them must be shut down. One way to verify this is to run nvidia-smi as follows:
dgxuser@dgx-1:~$ nvidia-smi -q -d PIDS
Timestamp : Fri Feb 23 11:56:41 2018
Attached GPUs : 8
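The checks described above can also be scripted. The following is a minimal sketch rather than an official NVIDIA procedure: the helper names count_gpu_procs and gpus_pending_retirement are hypothetical, and the second helper assumes that the retired_pages.pending query field prints Yes or No for each GPU, which may vary with the driver version. Note that --query-compute-apps lists compute processes only.

```shell
#!/usr/bin/env bash
# Pre-reset checks. Both helpers read nvidia-smi CSV output on stdin, so they
# can be exercised on a machine without a GPU.

# Print the number of compute processes (one non-empty CSV line per process).
count_gpu_procs() {
  grep -c '[^[:space:]]' || true
}

# Print the index of each GPU whose pending-retirement field is "Yes".
# Assumption: input lines look like "<index>, <Yes|No>".
gpus_pending_retirement() {
  awk -F', *' '$2 == "Yes" { print $1 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  procs=$(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | count_gpu_procs)
  echo "Compute processes still running: $procs"
  echo "GPUs with page retirements pending:"
  nvidia-smi --query-gpu=index,retired_pages.pending --format=csv,noheader | gpus_pending_retirement
fi
```

A process count of 0 suggests the GPUs are idle and the services can be stopped. Any GPU index printed by the second check is one for which a reset would be expected to take effect.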
Once no applications are running on the GPUs, stop the nvidia-docker and nvidia-persistenced services as follows:
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-docker
To reset the GPUs, run the nvidia-smi command as follows:
dgxuser@dgx-1:~$ sudo nvidia-smi -r
GPU 00000000:06:00.0 was successfully reset.
After the reset completes, restart the nvidia-persistenced and nvidia-docker services:
dgxuser@dgx-1:~$ sudo systemctl start nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl start nvidia-docker
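The whole sequence can be wrapped in a small script. This is a sketch under stated assumptions, not a supported tool: the run and reset_gpus helpers and the DRY_RUN variable are hypothetical names introduced for illustration, and the script assumes that the GPUs are already idle. By default it only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Sketch of the full reset sequence from this article.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reset_gpus() {
  # Stop the services that keep the driver in use.
  run sudo systemctl stop nvidia-persistenced || return 1
  run sudo systemctl stop nvidia-docker || return 1

  # Reset all GPUs; remember the exit status so the services are
  # restarted even if the reset itself fails.
  status=0
  run sudo nvidia-smi -r || status=$?

  run sudo systemctl start nvidia-persistenced
  run sudo systemctl start nvidia-docker
  return "$status"
}

reset_gpus
```

Running the script as is prints the five commands in order, which is a convenient way to review the sequence before running it again with DRY_RUN=0 on the DGX-1 itself.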