GPU Monitoring
Monitor NVIDIA GPU utilization, memory, and costs across your Kubernetes clusters.
If DCGM Exporter is running in your cluster, Kubeadapt collects GPU metrics with zero configuration. The agent discovers it automatically.
Supported GPUs
| GPU Vendor | Exporter Required | Status |
|---|---|---|
| NVIDIA (A100, H100, V100, T4, L4, P100, and any DCGM-compatible GPU) | DCGM Exporter | Supported |
| AMD (MI250, MI300) | - | Not supported |
| Intel (Gaudi, Flex) | - | Not supported |
Prerequisites
- NVIDIA GPUs in your cluster nodes
- DCGM Exporter running as a DaemonSet, typically installed via GPU Operator or standalone
- Kubeadapt agent installed (Quick Start)
Kubeadapt does not install or manage GPU Operator or DCGM Exporter. These are prerequisites you manage separately. Once running, the agent picks them up automatically.
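To sanity-check these prerequisites before looking for data in Kubeadapt, confirm that nodes advertise GPUs and that the exporter is serving metrics. This is a sketch: the label selector and namespace below are typical of GPU Operator installs and may differ in yours.

```bash
# List nodes and their advertised NVIDIA GPU capacity (standard device plugin resource)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'

# Find DCGM Exporter pods (label and namespace vary by install method)
kubectl get pods -A -l app.kubernetes.io/name=dcgm-exporter

# Spot-check the metrics endpoint on one exporter pod (default port 9400)
kubectl port-forward -n gpu-operator pod/<dcgm-exporter-pod> 9400:9400 &
curl -s localhost:9400/metrics | grep -m1 DCGM_FI_DEV_GPU_UTIL
```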
Check the Dashboard
Sign in to app.kubeadapt.io and navigate to your cluster. GPU nodes display:
- GPU count per node
- GPU model name (e.g., "NVIDIA A100-SXM4-80GB")
- GPU utilization percentage
- GPU memory usage
GPU Sharing Limitations
With GPU time-slicing or MPS, DCGM Exporter reports the same aggregate value for the physical GPU to every container sharing the device. The GPU hardware does not expose per-process utilization counters in shared mode.
With MIG (Multi-Instance GPU), DCGM Exporter reports metrics at the GPU Instance level (GPU_I_PROFILE, GPU_I_ID), but container-level attribution (pod, namespace, container labels) has known bugs (#272, #577).
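To make these limitations concrete, here is a hypothetical /metrics excerpt for two pods time-slicing one physical GPU, plus one MIG instance. The metric names and label keys are real DCGM Exporter output; the label values and numbers are illustrative.

```bash
# Hypothetical DCGM Exporter output under time-slicing: both pods carry
# gpu="0" and the same aggregate utilization, so per-pod usage is unknown.
#
#   DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="train-a",namespace="ml"} 63
#   DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="train-b",namespace="ml"} 63
#
# Under MIG, metrics are keyed to the GPU Instance, not the container:
#
#   DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="1g.10gb",GPU_I_ID="3"} 0.41
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_(DEV_GPU_UTIL|PROF_GR_ENGINE_ACTIVE)'
```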
Because of these exporter limitations, GPU right-sizing works at the node level only in shared GPU configurations. Kubeadapt can detect underutilized GPUs on a node, but per-workload attribution requires the exporter to expose that data.
Container-level GPU tracking via the eBPF agent is planned. This will enable per-pod utilization and MIG instance attribution without depending on DCGM Exporter.
Configuration
GPU metrics collection is enabled by default:
```yaml
agent:
  config:
    gpuMetricsEnabled: true  # default
    dcgmPort: 9400           # default
    dcgmNamespace: ""        # auto-detect across all namespaces
```
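For example, if your DCGM Exporter serves metrics on a port other than 9400, you can point the agent at it using the same Helm pattern shown below (the port value here is illustrative):

```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.dcgmPort=9401  # illustrative non-default port
```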
Override Namespace
If the agent cannot find DCGM Exporter pods, restrict the search to a specific namespace:
```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.dcgmNamespace="gpu-operator"
```
Disable GPU Metrics
```yaml
agent:
  config:
    gpuMetricsEnabled: false
```
For the full list of Helm values, see the kubeadapt-helm chart on GitHub.
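If you manage settings with `--set` flags rather than a values file, the same upgrade pattern used above applies this toggle in place:

```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.gpuMetricsEnabled=false
```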