NVIDIA DCGM Exporter

Created: 10/16/2021
Updated: 9/26/2022
Revision: 2
Categories: Host Metrics
Grafana Version: >=9.1.6
Datasources: Prometheus

This dashboard displays GPU metrics that Prometheus scrapes from the metrics endpoint of NVIDIA dcgm-exporter. A separate endpoint is added to Prometheus via a ServiceMonitor.
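
To see what Prometheus receives from that endpoint, the exporter's output can be inspected by hand. The sketch below is a hypothetical helper (gpu_values is not part of dcgm-exporter) that reads Prometheus exposition text on stdin and prints one "gpu-index value" pair per sample of a chosen metric. It assumes each sample line carries a gpu="N" label and no trailing timestamp.

```shell
#!/bin/bash
# gpu_values METRIC: read Prometheus exposition text on stdin and print
# "<gpu index> <value>" for every sample of METRIC.
# Hypothetical helper for illustration; not part of dcgm-exporter.
gpu_values() {
	awk -v metric="$1" '
		# A sample line starts with the metric name followed by "{".
		index($0, metric "{") == 1 {
			gpu = "?"
			if (match($0, /gpu="[^"]*"/)) {
				# Skip the 5 characters of gpu=" and drop the closing quote.
				gpu = substr($0, RSTART + 5, RLENGTH - 6)
			}
			# The sample value is the last whitespace-separated field.
			print gpu, $NF
		}
	'
}

# Example (assumes the exporter listens on port 9400, as in the script below):
#   curl -s http://localhost:9400/metrics | gpu_values DCGM_FI_DEV_GPU_TEMP
```

Comment lines (# HELP, # TYPE) never start with the metric name, so they are skipped without special handling.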

Management Node (download and build dcgm-exporter):

[yaoge123]$ git clone https://github.com/NVIDIA/dcgm-exporter.git
[yaoge123]$ cd dcgm-exporter
[yaoge123]$ make binary

Compute Node shell script:

#!/bin/bash
# Only set up the exporter on nodes that have an NVIDIA device.
if /sbin/lspci | /usr/bin/grep -q NVIDIA; then
	wget -q -O /usr/local/sbin/dcgm-exporter http://mgmt/dcgm-exporter/cmd/dcgm-exporter/dcgm-exporter
	mkdir -p /etc/dcgm-exporter
	wget -q -O /etc/dcgm-exporter/default-counters.csv http://mgmt/dcgm-exporter/etc/default-counters.csv
	wget -q -O /etc/dcgm-exporter/dcp-metrics-included.csv http://mgmt/dcgm-exporter/etc/dcp-metrics-included.csv
	chmod +x /usr/local/sbin/dcgm-exporter
	
	# Run the exporter for two seconds; use the DCP metric set only if it reports DCP support.
	if timeout 2s /usr/local/sbin/dcgm-exporter 2>&1 | grep -q '"Collecting DCP Metrics"'; then
		collectors="dcp-metrics-included.csv"
	else
		collectors="default-counters.csv"
	fi

	cat > /etc/systemd/system/dcgm-exporter.service <<EOF
[Unit]
Description=Prometheus DCGM exporter
Wants=network-online.target nvidia-dcgm.service
After=network-online.target nvidia-dcgm.service

[Service]
Type=simple
Restart=always
ExecStart=/usr/local/sbin/dcgm-exporter -f /etc/dcgm-exporter/$collectors

[Install]
WantedBy=multi-user.target
EOF
	systemctl daemon-reload
	systemctl enable dcgm-exporter.service
	systemctl restart dcgm-exporter.service
	# Register the exporter with Consul. Double quotes let ${HOSTNAME} expand.
	curl -X PUT -d "{\"id\": \"${HOSTNAME}_dcgm-exporter\",\"name\": \"dcgm_exporter\",\"address\": \"${HOSTNAME}\",\"port\": 9400,\"tags\": [\"prometheus\",\"hpc\",\"compute\"],\"checks\": [{\"http\": \"http://${HOSTNAME}:9400/metrics\",\"interval\": \"60s\"}]}" http://consul:8500/v1/agent/service/register
fi
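
Hand-escaped JSON inside a shell string is easy to get wrong, so the Consul registration payload can also be generated with a here-document. This is a minimal sketch, not the author's method: consul_payload is a hypothetical function name, while the field names, tags, port 9400, and the 60s check interval mirror the curl call in the script above.

```shell
#!/bin/bash
# consul_payload HOST [PORT]: print the service-registration JSON for HOST.
# Hypothetical helper; field values mirror the curl call in the script above.
consul_payload() {
	local host="$1" port="${2:-9400}"
	# Unquoted EOF: the shell expands ${host} and ${port}, while the
	# double quotes stay literal and need no escaping.
	cat <<EOF
{
  "id": "${host}_dcgm-exporter",
  "name": "dcgm_exporter",
  "address": "${host}",
  "port": ${port},
  "tags": ["prometheus", "hpc", "compute"],
  "checks": [
    {"http": "http://${host}:${port}/metrics", "interval": "60s"}
  ]
}
EOF
}

# Example (sends the payload read from stdin):
#   consul_payload "$HOSTNAME" | curl -X PUT -d @- http://consul:8500/v1/agent/service/register
```

Printing the payload before sending it makes it possible to check that the host name was substituted and that the JSON is well formed.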

Used Metrics (15)

  • DCGM_FI_DEV_GPU_TEMP

  • gpu

  • DCGM_FI_DEV_MEMORY_TEMP

  • DCGM_FI_DEV_GPU_UTIL

  • DCGM_FI_DEV_MEM_COPY_UTIL

  • DCGM_FI_DEV_POWER_USAGE

  • DCGM_FI_DEV_FB_USED

  • DCGM_FI_PROF_GR_ENGINE_ACTIVE

  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

  • DCGM_FI_PROF_DRAM_ACTIVE

  • DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL

  • DCGM_FI_PROF_PCIE_TX_BYTES

  • DCGM_FI_PROF_PCIE_RX_BYTES

  • DCGM_FI_DEV_SM_CLOCK

  • DCGM_FI_DEV_MEM_CLOCK
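
Panels stay empty when the exporter does not emit a metric the dashboard queries, so it can help to compare an exporter's output against the list above. The sketch below is a hypothetical helper (missing_metrics and DASHBOARD_METRICS are illustrative names). It assumes the "gpu" entry is a label rather than a metric and leaves it out.

```shell
#!/bin/bash
# Metric names copied from the "Used Metrics" list above. The "gpu" entry
# is assumed to be a label, not a metric, and is omitted.
DASHBOARD_METRICS="
DCGM_FI_DEV_GPU_TEMP
DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_POWER_USAGE
DCGM_FI_DEV_FB_USED
DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
DCGM_FI_PROF_PCIE_TX_BYTES
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_DEV_SM_CLOCK
DCGM_FI_DEV_MEM_CLOCK
"

# missing_metrics: read exposition text on stdin and print every dashboard
# metric that has no sample line. Returns 1 if any metric is missing.
missing_metrics() {
	local text metric rc=0
	text=$(cat)
	for metric in $DASHBOARD_METRICS; do
		# A sample line starts with the metric name followed by "{" or a space.
		if ! printf '%s\n' "$text" | grep -Eq "^${metric}(\{| )"; then
			echo "$metric"
			rc=1
		fi
	done
	return "$rc"
}

# Example (assumes the exporter listens on port 9400):
#   curl -s http://localhost:9400/metrics | missing_metrics
```

An empty result means every metric the dashboard uses has at least one sample in the exporter's output.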