NVIDIA DCGM Exporter Dashboard
615,914

Created 5/2/2020
Updated 5/6/2020
Revision 1
Grafana Version >=6.7.3
Datasources
Prometheus

Description

This dashboard monitors NVIDIA GPU health and performance by visualizing DCGM metrics scraped by Prometheus. DCGM_FI_DEV_GPU_TEMP tracks GPU temperature and DCGM_FI_DEV_POWER_USAGE tracks power draw. DCGM_FI_DEV_SM_CLOCK and DCGM_FI_DEV_MEM_CLOCK report the SM and memory clock frequencies, DCGM_FI_DEV_GPU_UTIL reports GPU utilization, DCGM_FI_DEV_MEM_COPY_UTIL reports memory copy utilization, and DCGM_FI_DEV_FB_USED reports framebuffer memory in use. Panels combine gauges and time-series plots so that hotspots, bottlenecks, and utilization trends can be identified quickly across GPUs.
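
As a rough illustration of the kind of hotspot check the panels support, the sketch below parses a few lines in the Prometheus text exposition format and lists the GPUs whose temperature exceeds a threshold. The sample text is made up for illustration and is not real exporter output; the `gpu` label name, the helper names `parse_samples` and `gpus_over`, and the 80 threshold are all assumptions.

```python
import re

# One sample line in the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpus_over(samples, metric, threshold):
    """Return the sorted 'gpu' label values whose reading for `metric` exceeds `threshold`."""
    return sorted(
        {
            labels.get("gpu", "?")  # 'gpu' label name is an assumption
            for name, labels, value in samples
            if name == metric and value > threshold
        }
    )


# Hypothetical input, invented for this example only.
EXAMPLE = """
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 61
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 84
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 142.5
"""

hot = gpus_over(parse_samples(EXAMPLE), "DCGM_FI_DEV_GPU_TEMP", 80)
```

In practice Grafana evaluates such conditions through PromQL against the Prometheus datasource; the parser here only makes the shape of the data concrete.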

Source Grafana.com

Used Metrics 7

  • DCGM_FI_DEV_FB_USED

  • DCGM_FI_DEV_GPU_TEMP

  • DCGM_FI_DEV_GPU_UTIL

  • DCGM_FI_DEV_MEM_CLOCK

  • DCGM_FI_DEV_MEM_COPY_UTIL

  • DCGM_FI_DEV_POWER_USAGE

  • DCGM_FI_DEV_SM_CLOCK
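
A minimal sketch of how the seven metrics above might be queried outside Grafana, assuming a Prometheus server that scrapes the exporter. It builds a per-GPU PromQL aggregation and the URL for Prometheus's instant-query endpoint (`/api/v1/query`). The `gpu` grouping label, the base URL, and the helper names are assumptions, and the units in the comments should be checked against the exporter version in use.

```python
from urllib.parse import urlencode

# The seven metrics used by the dashboard, with the units they are
# commonly reported in (an assumption; verify against your exporter).
METRICS = {
    "DCGM_FI_DEV_FB_USED": "MiB",
    "DCGM_FI_DEV_GPU_TEMP": "C",
    "DCGM_FI_DEV_GPU_UTIL": "%",
    "DCGM_FI_DEV_MEM_CLOCK": "MHz",
    "DCGM_FI_DEV_MEM_COPY_UTIL": "%",
    "DCGM_FI_DEV_POWER_USAGE": "W",
    "DCGM_FI_DEV_SM_CLOCK": "MHz",
}


def per_gpu_query(metric, agg="avg", label="gpu"):
    """Build a PromQL expression aggregating `metric` per GPU.

    `label` is the grouping label; 'gpu' is assumed here.
    """
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    return f"{agg} by ({label}) ({metric})"


def instant_query_url(base_url, promql):
    """Return the URL for a Prometheus instant query of `promql`."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address, used only for illustration.
query = per_gpu_query("DCGM_FI_DEV_GPU_TEMP", agg="max")
url = instant_query_url("http://localhost:9090/", query)
```

The same expressions can be pasted into a Grafana panel's query field; only the URL helper is specific to calling the HTTP API directly.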
