NVIDIA DCGM Exporter
10/16/2021
9/26/2022
2
Host Metrics
>=9.1.6
Prometheus
This dashboard displays GPU metrics collected from NVIDIA dcgm-exporter. Prometheus scrapes them from the exporter's metrics endpoint, which is added to Prometheus as a separate target, for example via a ServiceMonitor.
Management Node: (download and build dcgm-exporter)
[yaoge123]$ git clone https://github.com/NVIDIA/dcgm-exporter.git
[yaoge123]$ cd dcgm-exporter
[yaoge123]$ make binary
Compute Node shell script:
#!/bin/bash
# Install and start dcgm-exporter only on nodes that have an NVIDIA device.
if [[ $(/sbin/lspci | /usr/bin/grep NVIDIA) ]]; then
    # Fetch the binary and counter lists built on the management node.
    wget -q -O /usr/local/sbin/dcgm-exporter http://mgmt/dcgm-exporter/cmd/dcgm-exporter/dcgm-exporter
    mkdir -p /etc/dcgm-exporter
    wget -q -O /etc/dcgm-exporter/default-counters.csv http://mgmt/dcgm-exporter/etc/default-counters.csv
    wget -q -O /etc/dcgm-exporter/dcp-metrics-included.csv http://mgmt/dcgm-exporter/etc/dcp-metrics-included.csv
    chmod +x /usr/local/sbin/dcgm-exporter
    # Run the exporter briefly; if its log mentions "Collecting DCP Metrics",
    # use the DCP counter list, otherwise fall back to the default counters.
    if [[ "$(timeout 2s /usr/local/sbin/dcgm-exporter 2>&1 | grep DCP)" =~ "\"Collecting DCP Metrics\"" ]]; then
        collectors="dcp-metrics-included.csv"
    else
        collectors="default-counters.csv"
    fi
    # The heredoc delimiter is unquoted so that $collectors is expanded.
    cat > /etc/systemd/system/dcgm-exporter.service <<EOF
[Unit]
Description=Prometheus DCGM exporter
Wants=network-online.target nvidia-dcgm.service
After=network-online.target nvidia-dcgm.service
[Service]
Type=simple
Restart=always
ExecStart=/usr/local/sbin/dcgm-exporter -f /etc/dcgm-exporter/$collectors
[Install]
WantedBy=multi-user.target
EOF
    systemctl daemon-reload
    systemctl enable dcgm-exporter.service
    systemctl restart dcgm-exporter.service
    # Register the exporter with Consul. The payload is double-quoted so that
    # ${HOSTNAME} expands and the escaped quotes produce valid JSON.
    curl -X PUT -d "{\"id\": \"${HOSTNAME}_dcgm-exporter\", \"name\": \"dcgm_exporter\", \"address\": \"${HOSTNAME}\", \"port\": 9400, \"tags\": [\"prometheus\",\"hpc\",\"compute\"], \"checks\": [{\"http\": \"http://${HOSTNAME}:9400/metrics\", \"interval\": \"60s\"}]}" http://consul:8500/v1/agent/service/register
fi
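The two most fragile steps of the compute node script are choosing the counter list and quoting the Consul payload. The sketch below factors them into small functions that can be checked without a GPU or a Consul agent. The function names select_counters and consul_payload are hypothetical and not part of dcgm-exporter or Consul; the JSON field names, tags, and port are copied from the script, and the sample log line in the usage note is only an assumed shape of the exporter's output.

```shell
# Choose the counter list from the exporter's start-up log text ($1).
# Mirrors the script's check for the quoted string "Collecting DCP Metrics".
select_counters() {
    case $1 in
        *'"Collecting DCP Metrics"'*) echo "dcp-metrics-included.csv" ;;
        *) echo "default-counters.csv" ;;
    esac
}

# Build the Consul service-registration payload for one node.
# Arguments: $1 = host name, $2 = exporter port (9400 in the script).
# printf fills the placeholders, so no shell quoting leaks into the JSON.
consul_payload() {
    printf '{"id": "%s_dcgm-exporter", "name": "dcgm_exporter", "address": "%s", "port": %s, "tags": ["prometheus","hpc","compute"], "checks": [{"http": "http://%s:%s/metrics", "interval": "60s"}]}\n' \
        "$1" "$1" "$2" "$1" "$2"
}

# Example (needs a reachable Consul agent, so it is left commented out):
# consul_payload "$HOSTNAME" 9400 | curl -X PUT -d @- http://consul:8500/v1/agent/service/register
```

For instance, select_counters 'msg="Collecting DCP Metrics"' prints dcp-metrics-included.csv, and any other text yields default-counters.csv. Printing the payload with consul_payload before sending it makes quoting mistakes visible.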
Used Metrics (15)
DCGM_FI_DEV_GPU_TEMP
gpu
DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_POWER_USAGE
DCGM_FI_DEV_FB_USED
DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
DCGM_FI_PROF_DRAM_ACTIVE
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
DCGM_FI_PROF_PCIE_TX_BYTES
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_DEV_SM_CLOCK
DCGM_FI_DEV_MEM_CLOCK
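A panel stays empty if its metric is not enabled in the counter list passed to dcgm-exporter with -f. As a minimal sketch, the hypothetical helper below (missing_metrics is not part of dcgm-exporter) prints the metric names that have no active entry in a counters CSV. It assumes the CSV layout used by the files shipped in the repository: one "FIELD_NAME, type, help text" entry per line, with disabled entries commented out with #.

```shell
# Print every metric name (arguments after the first) that has no active
# entry in the counters CSV given as $1.
missing_metrics() {
    csv=$1
    shift
    for metric in "$@"; do
        # An active entry starts with the field name followed by a comma;
        # lines starting with '#' are comments and do not match.
        grep -q "^[[:space:]]*${metric}[[:space:]]*," "$csv" || printf '%s\n' "$metric"
    done
}
```

For example, missing_metrics /etc/dcgm-exporter/default-counters.csv DCGM_FI_DEV_GPU_TEMP DCGM_FI_PROF_GR_ENGINE_ACTIVE lists whichever of the two names is absent from that file.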