Nvidia GPU

Created 3/16/2023
Updated 4/12/2023
Revision 2
Grafana Version >=9.0.1
Datasources Prometheus

Intended for use with the gpu-operator Helm chart. Known limitations of the available metrics:

  • The GPU utilization graph does not take MIG partition size into account.
  • The usage table does not show usage correctly for time-sliced MIG.

This was tested using node-wide definitions only (no modes or individual cards were tested).

Contributions welcome! Send contributions to dy090.guerra@gmail.com

CHANGELOG:

Revision 2

  • Corrected the memory panels to use memory metrics (instead of bandwidth)
  • Replaced the fan speed graph with SM metrics
  • Added profiling metrics for FP64, FP32 and FP16, together with Tensor core

The added and refactored metrics require a custom dcgmExporter ConfigMap that exports the following metrics in addition to the defaults:

  • DCGM_FI_PROF_PIPE_FP64_ACTIVE
  • DCGM_FI_PROF_PIPE_FP32_ACTIVE
  • DCGM_FI_PROF_PIPE_FP16_ACTIVE
  • DCGM_FI_PROF_SM_ACTIVE
  • DCGM_FI_PROF_SM_OCCUPANCY
  • DCGM_FI_DEV_FB_TOTAL
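
As a rough sketch of what the extra entries could look like, the snippet below renders the fields listed above as rows in the "field, Prometheus type, help text" CSV layout used by dcgm-exporter metric files. The helper name render_metrics_csv and the help strings are assumptions made for illustration; they are not taken from the dashboard or from the default exporter configuration. The output would still need to be placed into your own ConfigMap alongside the default counters.

```python
# Extra DCGM fields required by revision 2 of the dashboard.
# The help texts are illustrative assumptions, not official descriptions.
EXTRA_FIELDS = [
    ("DCGM_FI_PROF_PIPE_FP64_ACTIVE", "gauge", "Ratio of cycles the FP64 pipe is active"),
    ("DCGM_FI_PROF_PIPE_FP32_ACTIVE", "gauge", "Ratio of cycles the FP32 pipe is active"),
    ("DCGM_FI_PROF_PIPE_FP16_ACTIVE", "gauge", "Ratio of cycles the FP16 pipe is active"),
    ("DCGM_FI_PROF_SM_ACTIVE", "gauge", "Ratio of cycles an SM has at least one warp assigned"),
    ("DCGM_FI_PROF_SM_OCCUPANCY", "gauge", "Ratio of resident warps per SM"),
    ("DCGM_FI_DEV_FB_TOTAL", "gauge", "Total framebuffer memory in MiB"),
]


def render_metrics_csv(fields):
    """Render (field, type, help) tuples as CSV rows, one metric per line."""
    lines = ["# Format: DCGM field, Prometheus metric type, help message"]
    for name, kind, help_text in fields:
        # Keep each row to exactly three columns.
        if "," in help_text:
            raise ValueError(f"help text for {name} must not contain a comma")
        lines.append(f"{name}, {kind}, {help_text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics_csv(EXTRA_FIELDS), end="")
```

Running the script prints one comment line followed by six metric rows that can be appended to an existing metrics file.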

NOTE: consider using DCGM_FI_DEV_FB_TOTAL instead of (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED) in memory dashboards.
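
To make the note concrete, here is a minimal sketch that builds the two alternative PromQL expressions for "framebuffer memory used, in percent". The function name memory_used_percent_expr is hypothetical, and the expressions are not copied from the dashboard's panels; they only illustrate the difference in the denominator. The two variants may give different results when part of the framebuffer is counted as neither free nor used.

```python
def memory_used_percent_expr(use_total=True):
    """Return a PromQL expression for framebuffer memory used, in percent.

    use_total=True divides by DCGM_FI_DEV_FB_TOTAL (the variant the note
    recommends); use_total=False divides by the sum of free and used memory.
    """
    if use_total:
        denominator = "DCGM_FI_DEV_FB_TOTAL"
    else:
        denominator = "(DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)"
    return f"100 * DCGM_FI_DEV_FB_USED / {denominator}"


if __name__ == "__main__":
    print(memory_used_percent_expr(use_total=True))
    print(memory_used_percent_expr(use_total=False))
```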

Revision 1

First release

Contributors:

  • Diana Gaponcic
  • Diogo Guerra

Used Metrics 19

  • DCGM_FI_PROF_GR_ENGINE_ACTIVE
  • DCGM_FI_DEV_FB_USED
  • DCGM_FI_DEV_FB_FREE
  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
  • DCGM_FI_PROF_PIPE_FP64_ACTIVE
  • DCGM_FI_PROF_PIPE_FP32_ACTIVE
  • DCGM_FI_PROF_PIPE_FP16_ACTIVE
  • GPU
  • GPU_I_PROFILE
  • modelName
  • DCGM_FI_PROF_SM_ACTIVE
  • DCGM_FI_PROF_SM_OCCUPANCY
  • kube_node_status_allocatable
  • kube_pod_container_resource_limits
  • DCGM_FI_DEV_GPU_TEMP
  • DCGM_FI_DEV_POWER_USAGE
  • DCGM_FI_PROF_PCIE_TX_BYTES
  • DCGM_FI_PROF_PCIE_RX_BYTES
  • DCGM_FI_PROF_DRAM_ACTIVE