AI智算-GPU资源监控概览-20241127 (AI Intelligent Computing: GPU Resource Monitoring Overview)
1,016
11/27/2024
11/30/2024
3
>=10.3.3
Prometheus
Description
Provides an overview of NVIDIA GPU resources for monitoring in AI intelligent-computing scenarios. Required components: dcgm-exporter:3.3.0-3.2.0-ubuntu22.04, prometheus:v2.39.1, grafana:10.3.3
Used Metrics (15)
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_GPU_TEMP
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_CLOCK
DCGM_FI_DEV_POWER_USAGE
DCGM_FI_DEV_SM_CLOCK
Hostname
container_cpu_usage_seconds_total
container_memory_working_set_bytes
host_ip
kube_node_info
kube_node_status_allocatable
kube_pod_info
node
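
The metrics listed above can also be queried directly through the Prometheus HTTP API. The following is a minimal sketch, not the dashboard's actual panel queries: the PromQL expressions, the `QUERIES` names, the `build_query_url` helper, and the `http://localhost:9090` address are all assumptions made for illustration. It only builds instant-query URLs and sends no requests.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with your own server.
PROMETHEUS_URL = "http://localhost:9090"

# Illustrative PromQL expressions built from the metrics listed above.
# These are assumptions, not the queries shipped with the dashboard.
QUERIES = {
    # Average GPU utilization per host (Hostname is a dcgm-exporter label).
    "gpu_util_avg_by_host": "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)",
    # Share of framebuffer memory in use, as a percentage.
    "fb_used_percent": (
        "100 * DCGM_FI_DEV_FB_USED"
        " / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)"
    ),
    # Hottest GPU per host.
    "gpu_temp_max_by_host": "max by (Hostname) (DCGM_FI_DEV_GPU_TEMP)",
}


def build_query_url(name, base_url=PROMETHEUS_URL):
    """Return the /api/v1/query URL for one of the named expressions."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": QUERIES[name]})


if __name__ == "__main__":
    for name in QUERIES:
        print(name, build_query_url(name))
```

Each printed URL can be opened with any HTTP client to check that the corresponding metric is being scraped before importing the dashboard.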