NVIDIA provides a tool called nvidia-smi to monitor and manage the system's GPUs. nvidia-smi can reset GPUs either individually or all at once; however, on the DGX-1 and DGX-1V platforms the GPUs are linked via NVLink, so individual GPUs cannot be reset and all of the GPUs must be reset at the same time.
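For reference, the GPUs present in the system can be listed with nvidia-smi before attempting a reset. A minimal example (output omitted here; the command prints one line per GPU with its model and UUID):

dgxuser@dgx-1:~$ nvidia-smi -L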
GPU memory pages that encounter errors more than once in the same location are retired. The GPU must be reset in order for the retired pages to be blacklisted, that is, made unavailable to users and applications; the reset reloads the driver, and only then does the blacklisting take effect. All applications using the GPUs must be stopped before the GPUs can be reset. nvidia-smi can be used to check this:
dgxuser@dgx-1:~$ nvidia-smi -q -d PIDS

==============NVSMI LOG==============

Timestamp                           : Fri Feb 23 11:56:41 2018
Driver Version                      : 384.111

Attached GPUs                       : 8
GPU 00000000:06:00.0
    Processes                       : None

GPU 00000000:07:00.0
    Processes                       : None

GPU 00000000:0A:00.0
    Processes                       : None

GPU 00000000:0B:00.0
    Processes                       : None

GPU 00000000:85:00.0
    Processes                       : None

GPU 00000000:86:00.0
    Processes                       : None

GPU 00000000:89:00.0
    Processes                       : None

GPU 00000000:8A:00.0
    Processes                       : None

dgxuser@dgx-1:~$
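The page-retirement state itself can also be inspected with nvidia-smi. A query of the following form reports which pages have been retired and whether any retirements are pending a reset; the exact fields shown vary by driver version:

dgxuser@dgx-1:~$ nvidia-smi -q -d PAGE_RETIREMENT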
Once no applications are running on the GPUs, the nvidia-persistenced and nvidia-docker services must be stopped:
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-docker
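To confirm that both services have actually stopped before issuing the reset, systemctl can be queried directly; a stopped service reports "inactive":

dgxuser@dgx-1:~$ systemctl is-active nvidia-persistenced
dgxuser@dgx-1:~$ systemctl is-active nvidia-docker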
To reset the GPUs, run the nvidia-smi command as follows:
dgxuser@dgx-1:~$ sudo nvidia-smi -r
GPU 00000000:06:00.0 was successfully reset.
GPU 00000000:07:00.0 was successfully reset.
GPU 00000000:0A:00.0 was successfully reset.
GPU 00000000:0B:00.0 was successfully reset.
GPU 00000000:85:00.0 was successfully reset.
GPU 00000000:86:00.0 was successfully reset.
GPU 00000000:89:00.0 was successfully reset.
GPU 00000000:8A:00.0 was successfully reset.
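After the reset completes, the services stopped earlier would normally be started again so that persistence mode and GPU container support are available once more; a minimal sketch:

dgxuser@dgx-1:~$ sudo systemctl start nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl start nvidia-docker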