Better NVIDIA DCGM Dashboard

Created: 12/12/2024
Updated: 12/12/2024
Revision: 1
Categories: Host Metrics
Grafana Version: >=11.3.0
Datasources: Prometheus

This dashboard is based on the original DCGM-Exporter dashboard by NVIDIA, but comes with an improved layout and a few additional visualizations.

Changes over upstream dashboard

  • Better layout, thinner lines
  • Uses the Hostname label for the host variable instead of instance
  • Legend labels are prefixed with hostnames
  • Displays cumulative energy draw over the last 1h and the last 24h
  • Larger range on the total GPU power gauge (adjust this to the combined maximum wattage of your GPUs)
  • Displays GPU memory usage as percentage in addition to absolute values
  • Power, GPU, and memory utilization graphs use stacked y axes
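
The energy and memory-percentage panels listed above boil down to simple arithmetic on the exported values. The following is a minimal sketch of that arithmetic, not the dashboard's actual queries. It assumes that DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a counter in millijoules, that DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE are reported in MiB, and that used plus free approximates total GPU memory. The function names are hypothetical.

```python
def energy_kwh(first_mj: float, last_mj: float) -> float:
    """Energy drawn between two counter samples, in kWh.

    Assumption: both samples come from an energy counter in millijoules
    (e.g. the first and last sample inside a 1h or 24h window).
    """
    delta_mj = last_mj - first_mj
    if delta_mj < 0:
        # A negative delta suggests the counter was reset inside the window.
        raise ValueError("counter reset inside the window")
    joules = delta_mj / 1000.0   # mJ -> J
    return joules / 3.6e6        # J -> kWh (1 kWh = 3.6e6 J)


def memory_used_percent(used_mib: float, free_mib: float) -> float:
    """GPU memory usage as a percentage of used + free.

    Assumption: used + free is close to total memory; any memory the
    driver reports separately is ignored here.
    """
    total = used_mib + free_mib
    if total <= 0:
        return 0.0
    return 100.0 * used_mib / total


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    print(f"{energy_kwh(1.0e9, 2.8e9):.3f} kWh")
    print(f"{memory_used_percent(20480, 61440):.1f} %")
```

In a Prometheus-backed panel the same delta would normally be taken by the query itself over the chosen window, which also handles counter resets; the sketch only shows the unit conversions involved.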

Used Metrics (8)

  • DCGM_FI_DEV_GPU_TEMP
  • DCGM_FI_DEV_POWER_USAGE
  • DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION
  • DCGM_FI_DEV_FB_USED
  • DCGM_FI_DEV_FB_FREE
  • DCGM_FI_DEV_GPU_UTIL
  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
  • DCGM_FI_DEV_SM_CLOCK
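
To check that a Prometheus datasource actually exposes these metrics with the Hostname label before importing the dashboard, one option is an instant query against the Prometheus HTTP API. The sketch below only builds the request URL. The base URL, the hostname value, and the helper name are assumptions for illustration; the `/api/v1/query` endpoint and its `query` parameter are part of the standard Prometheus HTTP API.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, metric: str, hostname: str) -> str:
    """Build an instant-query URL for one metric filtered by Hostname.

    Assumption: the exporter attaches a `Hostname` label to each series,
    which is the label the dashboard's host variable relies on.
    """
    promql = f'{metric}{{Hostname="{hostname}"}}'
    # urlencode percent-escapes the braces, quotes and equals sign.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical Prometheus address and host name.
    print(build_query_url("http://localhost:9090", "DCGM_FI_DEV_GPU_TEMP", "gpu-node-1"))
```

Opening the printed URL in a browser or fetching it with any HTTP client returns a JSON document. An empty result list would suggest that the metric or the Hostname label is missing from the scrape.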