GPU Monitoring

Monitor NVIDIA GPU utilization, memory, and costs across your Kubernetes clusters.

If DCGM Exporter is running in your cluster, the Kubeadapt agent discovers it automatically and collects GPU metrics with zero configuration.


Supported GPUs

| GPU Vendor | Exporter Required | Status |
| --- | --- | --- |
| NVIDIA (A100, H100, V100, T4, L4, P100, and any DCGM-compatible GPU) | DCGM Exporter | Supported |
| AMD (MI250, MI300) | — | Not supported |
| Intel (Gaudi, Flex) | — | Not supported |

Prerequisites

Note

Kubeadapt does not install or manage GPU Operator or DCGM Exporter. These are prerequisites you manage separately. Once running, the agent picks them up automatically.


Check the Dashboard

Sign in to app.kubeadapt.io and navigate to your cluster. GPU nodes display:

  • GPU count per node
  • GPU model name (e.g., "NVIDIA A100-SXM4-80GB")
  • GPU utilization percentage
  • GPU memory usage
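The utilization and memory fields above come from standard DCGM Exporter metrics (DCGM_FI_DEV_GPU_UTIL for utilization percent, DCGM_FI_DEV_FB_USED for framebuffer memory in MiB). A minimal sketch of what a scrape of the exporter's /metrics endpoint looks like, using invented sample readings:

```shell
# Sample DCGM Exporter exposition. The metric names are DCGM Exporter
# defaults; the label values and readings are invented for illustration.
cat > /tmp/dcgm_sample.prom <<'EOF'
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA A100-SXM4-80GB",Hostname="node-1"} 87
DCGM_FI_DEV_FB_USED{gpu="0",modelName="NVIDIA A100-SXM4-80GB",Hostname="node-1"} 40960
EOF

# Pull out utilization (%) and framebuffer memory used (MiB) per GPU,
# the same series the agent scrapes to populate the dashboard.
awk '/^DCGM_FI_DEV_GPU_UTIL/ {print "util:", $NF "%"}
     /^DCGM_FI_DEV_FB_USED/  {print "fb used:", $NF " MiB"}' /tmp/dcgm_sample.prom
```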

GPU Sharing Limitations

Warning

With GPU time-slicing or MPS, DCGM Exporter reports the same aggregate physical GPU value for every container sharing a device. The GPU hardware does not expose per-process utilization counters in shared mode.

With MIG (Multi-Instance GPU), DCGM Exporter reports metrics at the GPU Instance level (GPU_I_PROFILE, GPU_I_ID), but container-level attribution (pod, namespace, container labels) has known bugs (#272, #577).

Because of these exporter limitations, GPU right-sizing works at the node level only in shared GPU configurations. Kubeadapt can detect underutilized GPUs on a node, but per-workload attribution requires the exporter to expose that data.
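The aggregate-reporting behavior is visible in the exporter output itself. In this invented sample, two pods time-slice GPU 0, and both series carry the identical device-level reading, so no per-pod split can be recovered:

```shell
# Invented DCGM Exporter output for two pods time-slicing one physical GPU.
# Both series report the same device-level value, because the hardware does
# not expose per-process counters in shared mode.
cat > /tmp/dcgm_shared.prom <<'EOF'
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="train-job-a",namespace="ml"} 91
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="train-job-b",namespace="ml"} 91
EOF

# Count distinct utilization values across the sharing pods: always 1.
awk '/^DCGM_FI_DEV_GPU_UTIL/ && !seen[$NF]++ {n++} END {print n, "distinct value(s)"}' /tmp/dcgm_shared.prom
```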

Planned: per-pod GPU profiling

Container-level GPU tracking via the eBPF agent is planned. This will enable per-pod utilization and MIG instance attribution without depending on DCGM Exporter.


Configuration

GPU metrics collection is enabled by default:

```yaml
agent:
  config:
    gpuMetricsEnabled: true  # default
    dcgmPort: 9400           # default
    dcgmNamespace: ""        # auto-detect across all namespaces
```

Override Namespace

If the agent cannot find DCGM Exporter pods, restrict the search to a specific namespace:

```bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.dcgmNamespace="gpu-operator"
```

Disable GPU Metrics

```yaml
agent:
  config:
    gpuMetricsEnabled: false
```

For the full list of Helm values, see the kubeadapt-helm chart on GitHub.