
How-to Guides

GPU Monitoring

Monitor NVIDIA GPU utilization, memory, and costs across your Kubernetes clusters.


If DCGM Exporter is running in your cluster, Kubeadapt collects GPU metrics with zero configuration. The agent discovers it automatically.


Supported GPUs

| GPU Vendor | Exporter Required | Status |
|---|---|---|
| NVIDIA (A100, H100, V100, T4, L4, P100, and any DCGM-compatible GPU) | DCGM Exporter | Supported |
| AMD (MI250, MI300) | - | Not supported |
| Intel (Gaudi, Flex) | - | Not supported |

Prerequisites

  • NVIDIA GPUs in your cluster nodes
  • DCGM Exporter running as a DaemonSet, typically installed via GPU Operator or standalone
  • Kubeadapt agent installed (see Quick Start)
Note

Kubeadapt does not install or manage GPU Operator or DCGM Exporter. These are prerequisites you manage separately. Once running, the agent picks them up automatically.
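
Before relying on auto-discovery, it can help to confirm that a DCGM Exporter DaemonSet is actually present. The following is a minimal sketch, not an official Kubeadapt tool. The function name `filter_dcgm_daemonsets` is hypothetical, and it assumes the DaemonSet name contains "dcgm", which is common for GPU Operator and standalone installs but not guaranteed.

```shell
# Keep the header row plus any DaemonSet whose NAME column mentions "dcgm".
# Reads the output of `kubectl get daemonsets --all-namespaces` on stdin,
# where column 1 is NAMESPACE and column 2 is NAME.
# Assumption: the exporter DaemonSet has "dcgm" in its name.
filter_dcgm_daemonsets() {
  awk 'NR == 1 || tolower($2) ~ /dcgm/'
}

# Usage (requires cluster access):
#   kubectl get daemonsets --all-namespaces | filter_dcgm_daemonsets
```

If only the header row comes back, no matching DaemonSet was found, and the exporter is either missing or named differently.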


Check the Dashboard

Sign in to app.kubeadapt.io and navigate to your cluster. GPU nodes display:

  • GPU count per node
  • GPU model name (e.g., "NVIDIA A100-SXM4-80GB")
  • GPU utilization percentage
  • GPU memory usage
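
To cross-check the dashboard's memory figures against the raw exporter output, a rough sketch like the one below may be useful. It is an illustration only and does not describe how Kubeadapt computes its values. The function name `gpu_mem_percent` is hypothetical, it assumes the exporter emits the `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE` metrics with a `UUID` label, and it ignores any reserved framebuffer memory.

```shell
# Print "<GPU UUID> <percent of framebuffer used>" for each GPU found in
# DCGM Exporter metrics text on stdin, computed as used / (used + free).
# Assumption: each metric line carries a UUID="..." label and ends with the value.
gpu_mem_percent() {
  awk '
    /^DCGM_FI_DEV_FB_(USED|FREE)[{]/ {
      match($0, /UUID="[^"]*"/)
      uuid = substr($0, RSTART + 6, RLENGTH - 7)   # strip UUID=" and closing quote
      value = $NF                                   # sample value is the last field
      if ($0 ~ /^DCGM_FI_DEV_FB_USED/) used[uuid] = value
      else free[uuid] = value
    }
    END {
      for (u in used) {
        total = used[u] + free[u]
        if (total > 0) printf "%s %.1f\n", u, 100 * used[u] / total
      }
    }
  ' | sort
}

# Usage (requires cluster access; pod name and namespace are placeholders):
#   kubectl port-forward -n <namespace> pod/<dcgm-exporter-pod> 9400:9400 &
#   curl -s http://localhost:9400/metrics | gpu_mem_percent
```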

GPU Sharing Limitations

Warning

With GPU time-slicing or MPS, DCGM Exporter reports the same aggregate physical GPU value for every container sharing a device. The GPU hardware does not expose per-process utilization counters in shared mode.

With MIG (Multi-Instance GPU), DCGM Exporter reports metrics at the GPU Instance level (GPU_I_PROFILE, GPU_I_ID), but container-level attribution (pod, namespace, container labels) has known bugs (#272, #577).

Because of these exporter limitations, in shared GPU configurations GPU right-sizing works at the node level only. Kubeadapt can detect underutilized GPUs on a node, but per-workload attribution requires the exporter to expose that data.
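
To see whether this limitation applies to a given node, one option is to look for GPUs that report utilization for more than one pod. The sketch below is an assumption-laden illustration: the function name `shared_gpus` is hypothetical, and it assumes the exporter emits one `DCGM_FI_DEV_GPU_UTIL` line per pod with the `UUID` label appearing before the `pod` label.

```shell
# Print "<GPU UUID> <pod count>" for every GPU that reports utilization
# for more than one distinct pod. Reads DCGM Exporter metrics text on stdin.
# Assumption: UUID="..." precedes pod="..." within each metric line.
shared_gpus() {
  grep '^DCGM_FI_DEV_GPU_UTIL{' |
    sed -n 's/.*UUID="\([^"]*\)".*pod="\([^"]*\)".*/\1 \2/p' |
    sort -u |
    awk '{ n[$1]++ } END { for (u in n) if (n[u] > 1) print u, n[u] }' |
    sort
}

# Usage (with the exporter port forwarded to localhost:9400):
#   curl -s http://localhost:9400/metrics | shared_gpus
```

Any GPU listed here is shared, so the utilization value shown for each of its pods is the device-wide aggregate rather than that pod's own usage.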

Planned: per-pod GPU profiling

Container-level GPU tracking via the eBPF agent is planned. This will enable per-pod utilization and MIG instance attribution without depending on DCGM Exporter.


Configuration

GPU metrics collection is enabled by default:

yaml
agent:
  config:
    gpuMetricsEnabled: true # default
    dcgmPort: 9400 # default
    dcgmNamespace: "" # auto-detect across all namespaces

Override Namespace

If the agent cannot find DCGM Exporter pods, restrict the search to a specific namespace:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.dcgmNamespace="gpu-operator"
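
After the upgrade, it may be worth confirming that the override was recorded on the release. The sketch below assumes the release name and namespace from the command above. The helper `dcgm_namespace_from_values` is hypothetical and does a naive line match rather than real YAML parsing.

```shell
# Print the dcgmNamespace value from Helm values YAML on stdin.
# Naive line match on the key; surrounding double quotes are stripped.
dcgm_namespace_from_values() {
  awk '$1 == "dcgmNamespace:" { gsub(/"/, "", $2); print $2 }'
}

# Usage (requires Helm and cluster access):
#   helm get values kubeadapt --namespace kubeadapt | dcgm_namespace_from_values
```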

Disable GPU Metrics

To turn off GPU metrics collection, set gpuMetricsEnabled to false in your Helm values:

yaml
agent:
  config:
    gpuMetricsEnabled: false

For the full list of Helm values, see the kubeadapt-helm chart on GitHub.

Related

  • Cost Attribution
  • Rightsizing
  • Quick Start