NVIDIA’s Tesla, Quadro, GRID, and GeForce devices from the Fermi and later architecture families can all be monitored and managed using nvidia-smi (also known as NVSMI). Most features are supported for GeForce Titan series devices, with very little information available for the rest of the GeForce line.
NVSMI is a cross-platform tool that works with all NVIDIA-supported Linux distributions as well as 64-bit Windows versions beginning with Windows Server 2008 R2. Users can read metrics directly on stdout, or have them written out in CSV or XML format for scripting purposes.
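For example, the same device information can be dumped as XML for a monitoring agent or queried as CSV for quick scripts. The fields below are only a small sample; run nvidia-smi --help-query-gpu to see everything your driver version supports.

# Full report as XML
nvidia-smi -q -x > gpu_report.xml

# Selected fields as CSV
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,memory.used --format=csv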
Much of the functionality of NVSMI is provided by the underlying C-based NVML library. The output of NVSMI is not guaranteed to be backwards compatible; however, both NVML and its Python bindings are.
On Linux, you can enable persistence mode on GPUs to keep the NVIDIA driver loaded even if no apps are using them. This is especially beneficial if you have a number of short jobs going at the same time. Persistence mode consumes a few more watts per idle GPU, but it avoids the lengthy pauses that occur when launching a GPU application.
It’s also required if the GPUs have been assigned specific clock speeds or power limits (those changes are lost when the NVIDIA driver is unloaded). Run nvidia-smi -pm 1 on all GPUs to enable persistence mode.
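A minimal sequence for turning persistence mode on and confirming it might look like this (the persistence_mode query field is assumed to be present in your driver; check nvidia-smi --help-query-gpu if in doubt):

# Enable persistence mode on all GPUs (requires root)
sudo nvidia-smi -pm 1

# Confirm the setting per GPU
nvidia-smi --query-gpu=index,persistence_mode --format=csv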
On Windows, nvidia-smi is unable to configure persistence mode. Instead, you should put your compute GPUs into TCC mode, using NVIDIA’s graphical GPU device management panel.
NVIDIA’s SMI utility works with nearly every NVIDIA GPU released since 2011. This includes Tesla, Quadro, and GeForce devices from the Fermi and later architecture families (Kepler, Maxwell, Pascal, Volta, Ampere, and so on).
Ampere: A100, RTX A6000, RTX A5000, RTX A4000.
Tesla: V100, S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100.
Quadro: 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series.
GeForce: Varying levels of support.
GPU Initialization & Info
root@server:~# nvidia-smi
Sat Feb 12 19:36:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   40C    P0    62W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   42C    P0    66W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
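If you prefer this summary to refresh itself instead of re-running the command by hand, nvidia-smi can loop with the -l option; the 5-second interval below is only an example.

# Redisplay the summary every 5 seconds until interrupted with Ctrl+C
nvidia-smi -l 5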
GPU Status Query
root@server:~# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148)
GPU Details
root@server:~# nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
index, name, uuid, serial
0, NVIDIA A100 80GB PCIe, GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f, 1324021047639
1, NVIDIA A100 80GB PCIe, GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148, 1324021046251
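Because the CSV output is designed for scripting, a common pattern is to log a handful of fields at a fixed interval; the file name below is just a placeholder.

# Append utilization and memory usage to a CSV log every 10 seconds
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
           --format=csv,noheader,nounits -l 10 >> gpu_usage_log.csv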
Monitor GPU Usage
root@server:~# nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    62    40    54     0     0     0     0  1512  1410
    1    66    43    57     6     0     0     0  1512  1410
    0    62    40    55     0     0     0     0  1512  1410
    1    66    42    57     0     0     0     0  1512  1410
    0    62    40    54     0     0     0     0  1512  1410
    1    66    42    58     0     0     0     0  1512  1410
    0    62    40    55     0     0     0     0  1512  1410
    1    66    42    57     0     0     0     0  1512  1410
    0    62    40    54     0     0     0     0  1512  1410
    1    66    42    57     0     0     0     0  1512  1410
    0    62    40    54     0     0     0     0  1512  1410
    1    66    42    57     0     0     0     0  1512  1410
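dmon also accepts options to choose the metric groups, sampling interval, and number of samples; the exact metric letters can vary by driver version, so check nvidia-smi dmon -h before relying on them.

# Sample power/temperature (p), utilization (u), and PCIe throughput (t)
# every 5 seconds, 12 times, with date and time columns
nvidia-smi dmon -s put -d 5 -c 12 -o DT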
Monitor GPU Processes
root@server:~# nvidia-smi pmon
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    1          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    1          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    1          -     -     -     -     -     -   -
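pmon takes similar interval and count options, which is handy for a quick look at which processes are loading the GPUs (see nvidia-smi pmon -h for what your driver supports).

# Sample per-process utilization (u) and framebuffer memory (m) once per second, 30 times
nvidia-smi pmon -s um -d 1 -c 30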
List of Available Clocks
root@server:~# nvidia-smi -q -d SUPPORTED_CLOCKS

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:21:18 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    Supported Clocks
        Memory                            : 1512 MHz
            Graphics                      : 1410 MHz
            Graphics                      : 1395 MHz
            Graphics                      : 1380 MHz
            Graphics                      : 1365 MHz
            Graphics                      : 1350 MHz
            Graphics                      : 1335 MHz
            Graphics                      : 1320 MHz
            Graphics                      : 1305 MHz
            Graphics                      : 1290 MHz
            Graphics                      : 1275 MHz
Current GPU Clock Speed
root@server:~# nvidia-smi -q -d CLOCK

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:23:25 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1275 MHz
    Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Default Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
GPU Performance
root@server:~# nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:27:57 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
        HW Thermal Slowdown               : Not Active
        HW Power Brake Slowdown           : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active

GPU 00000000:CA:00.0
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
        HW Thermal Slowdown               : Not Active
        HW Power Brake Slowdown           : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
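The same throttle reasons can be polled in CSV form, which makes it easy to spot power or thermal capping while a job runs. The field names below come from nvidia-smi --help-query-gpu and may differ slightly between driver versions.

# Watch the performance state and the most common throttle reasons every 5 seconds
nvidia-smi --query-gpu=index,pstate,clocks_throttle_reasons.active,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown --format=csv -l 5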
GPU Topology
root@server:~# nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-23,48-71      0
GPU1    SYS      X      24-47,72-95     1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NVLink Status
root@server:~# nvidia-smi nvlink --status
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f)
         Link 0: <inactive>
         Link 1: <inactive>
         Link 2: <inactive>
         Link 3: <inactive>
         Link 4: <inactive>
         Link 5: <inactive>
         Link 6: <inactive>
         Link 7: <inactive>
         Link 8: <inactive>
         Link 9: <inactive>
         Link 10: <inactive>
         Link 11: <inactive>
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148)
         Link 0: <inactive>
         Link 1: <inactive>
         Link 2: <inactive>
         Link 3: <inactive>
         Link 4: <inactive>
         Link 5: <inactive>
         Link 6: <inactive>
         Link 7: <inactive>
         Link 8: <inactive>
         Link 9: <inactive>
         Link 10: <inactive>
         Link 11: <inactive>
Display GPU Details
root@server:~# nvidia-smi -i 0 -q

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:41:51 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    Product Name                          : NVIDIA A100 80GB PCIe
    Product Brand                         : NVIDIA
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1324021047639
    GPU UUID                              : GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f
    Minor Number                          : 0
    VBIOS Version                         : 92.00.68.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x3100
    GPU Part Number                       : 900-21001-0020-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : 1001.0230.00.03
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 470.103.01
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x31
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x20B510DE
        Bus Id                            : 00000000:31:00.0
        Sub System Id                     : 0x153310DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
        HW Thermal Slowdown               : Not Active
        HW Power Brake Slowdown           : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 80994 MiB
        Used                              : 0 MiB
        Free                              : 80994 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 40 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 55 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 62.86 W
        Power Limit                       : 300.00 W
        Default Power Limit               : 300.00 W
        Enforced Power Limit              : 300.00 W
        Min Power Limit                   : 150.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1275 MHz
    Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Default Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 893.750 mV
    Processes                             : None
GPU App Details
root@server:~# nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:44:55 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    FB Memory Usage
        Total                             : 80994 MiB
        Used                              : 0 MiB
        Free                              : 80994 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    GPU Utilization Samples
        Duration                          : 1562.92 sec
        Number of Samples                 : 42
        Max                               : 5 %
        Min                               : 0 %
        Avg                               : 0 %
    Memory Utilization Samples
        Duration                          : 1562.92 sec
        Number of Samples                 : 42
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    ENC Utilization Samples
        Duration                          : 1562.92 sec
        Number of Samples                 : 42
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    DEC Utilization Samples
        Duration                          : 1562.92 sec
        Number of Samples                 : 42
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 62.66 W
        Power Limit                       : 300.00 W
        Default Power Limit               : 300.00 W
        Enforced Power Limit              : 300.00 W
        Min Power Limit                   : 150.00 W
        Max Power Limit                   : 300.00 W
    Power Samples
        Duration                          : 2.39 sec
        Number of Samples                 : 119
        Max                               : 66.78 W
        Min                               : 57.45 W
        Avg                               : 60.17 W
    Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1275 MHz
    Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Default Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
Change GPU Performance Level
Use the following command to check the current performance state of your GPU:
nvidia-smi -q -d PERFORMANCE
# nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Timestamp                                 : Mon Feb 21 16:04:44 2022
Driver Version                            : 511.65
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:65:00.0
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
        HW Thermal Slowdown               : Not Active
        HW Power Brake Slowdown           : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
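To watch the performance state and clocks change in real time, for example while a CUDA job starts up, the query interface can poll them; clocks.gr and clocks.mem are the documented aliases for the current graphics and memory clocks.

# Poll performance state and current clocks every 2 seconds
nvidia-smi --query-gpu=index,pstate,clocks.gr,clocks.mem --format=csv -l 2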
To verify that your NVIDIA GPUs run at maximum performance under compute workloads such as OpenCL or CUDA, you need to find out the maximum frequencies the card supports in the P0 power state. The command to use for this is as follows.
nvidia-smi -q -d SUPPORTED_CLOCKS | more
# nvidia-smi -q -d SUPPORTED_CLOCKS | more

==============NVSMI LOG==============

Timestamp                                 : Mon Feb 21 16:08:23 2022
Driver Version                            : 511.65
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:65:00.0
    Supported Clocks
        Memory                            : 8001 MHz
            Graphics                      : 2100 MHz
            Graphics                      : 2085 MHz
            Graphics                      : 2070 MHz
            Graphics                      : 2055 MHz
            Graphics                      : 2040 MHz
            Graphics                      : 2025 MHz
            Graphics                      : 2010 MHz
            Graphics                      : 1995 MHz
            Graphics                      : 1980 MHz
            Graphics                      : 1965 MHz
            Graphics                      : 1950 MHz
            Graphics                      : 1935 MHz
            Graphics                      : 1920 MHz
            Graphics                      : 1905 MHz
            Graphics                      : 1890 MHz
            Graphics                      : 1875 MHz
            Graphics                      : 1860 MHz
            Graphics                      : 1845 MHz
            Graphics                      : 1830 MHz
            Graphics                      : 1815 MHz
            Graphics                      : 1800 MHz
            Graphics                      : 1785 MHz
            Graphics                      : 1770 MHz
            Graphics                      : 1755 MHz
            Graphics                      : 1740 MHz
            Graphics                      : 1725 MHz
            Graphics                      : 1710 MHz
            Graphics                      : 1695 MHz
            Graphics                      : 1680 MHz
            Graphics                      : 1665 MHz
            Graphics                      : 1650 MHz
            Graphics                      : 1635 MHz
            Graphics                      : 1620 MHz
            Graphics                      : 1605 MHz
            Graphics                      : 1590 MHz
            Graphics                      : 1575 MHz
            Graphics                      : 1560 MHz
            Graphics                      : 1545 MHz
-- More --
There is no need to examine the entire list, because it includes all of the supported frequencies for the various power states your video card may use. What we need are the memory and graphics frequencies at the top of the list. In this case we are using an RTX A6000, and the values we require are 8001 MHz for the VRAM and 2100 MHz for the GPU. We will need these frequencies for the next step.
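If you only want the top values rather than the full list, the maximum supported clocks can usually be queried directly, which avoids paging through SUPPORTED_CLOCKS (field names per nvidia-smi --help-query-gpu):

# Report the maximum memory and graphics clocks for each GPU
nvidia-smi --query-gpu=index,name,clocks.max.memory,clocks.max.graphics --format=csv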
The next step is to force the video card to use the highest performance operating frequency by setting the power state to P0. We’ll need to run the following command to accomplish this.
nvidia-smi -ac 8001,2100
Note that the above command will apply the settings to all GPUs in your system; this should not be an issue for most GPU servers because they often include a number of cards of the same model, but there are some exceptions. As a result, you may need to examine each video card’s particular settings and apply the appropriate values for each one independently.
To do so, simply include the card ID on the command line so the option is applied only to the specified video card. This is accomplished by adding the -i <id> argument, where <id> is a number starting at 0 for the first GPU and counting up from there.
In the example above we have two different GPUs in the system, so we need to apply their P0 clock settings with two separate commands, one for each card.
nvidia-smi -i 0 -ac 8001,2100
nvidia-smi -i 1 -ac 8001,2085
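On Linux this per-card step can be scripted. The sketch below is only an outline: it assumes the clocks.max.* query fields are available, that the cards support application clocks at all (GeForce boards generally do not), and that you have root privileges. Application clocks can later be reverted to their defaults with nvidia-smi -rac, optionally combined with -i for a single card.

#!/bin/bash
# For every GPU, read its maximum memory and graphics clocks and
# apply them as application clocks with nvidia-smi -ac.
num_gpus=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)
for i in $(seq 0 $((num_gpus - 1))); do
    read -r mem_clk gr_clk < <(nvidia-smi -i "$i" \
        --query-gpu=clocks.max.memory,clocks.max.graphics \
        --format=csv,noheader,nounits | tr -d ',')
    echo "GPU $i: setting application clocks to ${mem_clk},${gr_clk}"
    sudo nvidia-smi -i "$i" -ac "${mem_clk},${gr_clk}"
done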