GPU Monitoring Setup

Overview

Kubeadapt supports GPU cost monitoring through integration with NVIDIA DCGM (Data Center GPU Manager) Exporter. This allows you to:

  • Track GPU costs alongside CPU and memory
  • Monitor GPU utilization per node and workload
  • Optimize GPU usage with rightsizing recommendations
  • Identify idle GPUs for cost savings

GPU Metrics Collected:

  • DCGM_FI_DEV_GPU_UTIL - GPU compute utilization percentage
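
For reference, this metric is exposed by DCGM Exporter in the Prometheus text format. The sample below is illustrative; the exact label set varies by DCGM Exporter version and configuration:

text
# Example sample from the exporter's /metrics endpoint (labels and values are illustrative)
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",Hostname="node-1"} 85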

Prerequisites

Before enabling GPU monitoring, ensure:

  • NVIDIA GPUs in your cluster nodes
  • NVIDIA device plugin installed (or let the GPU Operator install it for you)
  • Helm 3.x for Kubeadapt installation
  • Cluster admin permissions

Supported GPU types:

  • NVIDIA A100, V100, T4, P100
  • Any NVIDIA GPU with DCGM support
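
A quick way to confirm the prerequisites is to check whether your nodes advertise the nvidia.com/gpu resource (this requires the device plugin to be running; the command below is a convenience check, not part of the Kubeadapt install):

bash
# Nodes with GPUs exposed by the device plugin show a count in the GPU column
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"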

Installation Options

You have two options for enabling GPU monitoring:

Option 1: Install with GPU Operator

Use this if you don't have the GPU Operator installed yet.

Pros:

  • Single Helm install for everything
  • Automatic DCGM Exporter deployment
  • GPU Operator manages NVIDIA drivers and device plugins
  • Simpler configuration

Cons:

  • Installs additional components (GPU Operator stack)

Option 2: Use Existing DCGM Exporter

Use this if you already have DCGM Exporter running.

Pros:

  • Reuses existing infrastructure
  • Lighter weight (no additional deployments)

Cons:

  • Requires manual scrape configuration
  • Need to know your DCGM Exporter namespace/labels

Option 1: Install with GPU Operator

Step 1: Enable GPU Operator in Helm

Install (or upgrade) Kubeadapt with GPU Operator enabled:

bash
helm install kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --create-namespace \
  --set agent.enabled=true \
  --set agent.config.token=YOUR_TOKEN \
  --set gpu-operator.enabled=true

Or, if Kubeadapt is already installed, upgrade it instead:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true

Step 2: Enable DCGM Scraping

Create a gpu-monitoring-values.yaml file with the DCGM scrape configuration:

yaml
gpu-operator:
  enabled: true
  operator:
    defaultRuntime: containerd
  dcgmExporter:
    enabled: true
    resources:
      requests:
        memory: "128Mi"
        cpu: "50m"
      limits:
        memory: "512Mi"
        cpu: "250m"

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Enable DCGM Exporter scraping
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - kubeadapt
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$

Apply the configuration:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f gpu-monitoring-values.yaml
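
Before moving on, you can sanity-check that the new values are part of the release. This is an optional check, not a required step:

bash
# The gpu-operator and DCGM scrape settings should appear in the user-supplied values
helm get values kubeadapt -n kubeadapt | grep -E -A 2 'gpu-operator|dcgm'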

Step 3: Verify GPU Operator Installation

Check that GPU Operator components are running:

bash
kubectl get pods -n kubeadapt | grep -E 'gpu-operator|dcgm'

Expected output:

text
gpu-operator-6b8f9d7c4d-x7k9m          1/1   Running   0   2m
nvidia-dcgm-exporter-abcde             1/1   Running   0   2m
nvidia-device-plugin-daemonset-fghij   1/1   Running   0   2m

Step 4: Verify GPU Metrics

After configuration, GPU metrics should appear in the Kubeadapt dashboard.

Sign in to app.kubeadapt.io and navigate to your cluster to verify GPU cost data is visible.

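You can also query the bundled Prometheus directly to confirm the metric is being scraped. The sketch below assumes the chart's Prometheus server service is named kubeadapt-prometheus-server and that jq is installed; verify the actual service name with kubectl get svc -n kubeadapt:

bash
# Port-forward the bundled Prometheus server (service name is an assumption)
kubectl port-forward -n kubeadapt svc/kubeadapt-prometheus-server 9090:80 &
sleep 2  # give the port-forward a moment to establish

# Query the GPU utilization metric
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result'
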
The query should return output similar to:

json
[
  {
    "metric": {
      "__name__": "DCGM_FI_DEV_GPU_UTIL",
      "gpu": "0",
      "instance": "10.0.1.42:9400",
      "job": "dcgm-exporter"
    },
    "value": [1705315200, "85.5"]
  }
]

Option 2: Use Existing DCGM Exporter

If you already have DCGM Exporter running in your cluster:

Step 1: Identify Your DCGM Exporter

Find the namespace and labels of your existing DCGM Exporter:

bash
kubectl get pods --all-namespaces -l app=dcgm-exporter

Example output:

text
NAMESPACE    NAME                   READY   STATUS    RESTARTS   AGE
gpu-system   dcgm-exporter-abc123   1/1     Running   0          10d
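
If nothing is returned, your exporter may use a different label; a broader name-based search also works:

bash
# Fall back to searching by pod name if the app label differs
kubectl get pods --all-namespaces | grep -i dcgm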

Step 2: Configure Prometheus Scraping

Create an existing-dcgm-values.yaml file with your DCGM Exporter details:

yaml
# existing-dcgm-values.yaml

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Scrape existing DCGM Exporter
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - gpu-system  # Change to your namespace
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter  # Change if your label is different
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$

Apply the configuration:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f existing-dcgm-values.yaml

Step 3: Verify Scraping

Verify DCGM Exporter pods are running:

bash
# Use the namespace and label you identified in Step 1
kubectl get pods -n gpu-system -l app=dcgm-exporter

All pods should show Running status.
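
To confirm Prometheus is actually scraping the exporter, check its active targets. This is a sketch: the service name kubeadapt-prometheus-server is an assumption based on the chart's bundled Prometheus, and jq is used for readability:

bash
# Port-forward the bundled Prometheus server (service name is an assumption)
kubectl port-forward -n kubeadapt svc/kubeadapt-prometheus-server 9090:80 &
sleep 2  # give the port-forward a moment to establish

# The dcgm-exporter job should be listed with health "up"
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "dcgm-exporter") | .health'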


Agent Configuration

Enable GPU Monitoring in Agent

To enable GPU cost tracking in the Kubeadapt agent, add the GPU monitoring flag:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true

Or in values.yaml:

yaml
agent:
  enabled: true
  config:
    token: "YOUR_TOKEN"
    enableGpuMonitoring: true

Verify Agent Configuration

Check agent logs to confirm GPU monitoring is enabled:

bash
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu

Expected output:

text
INFO: GPU monitoring enabled
INFO: Discovered 4 GPU nodes in cluster
INFO: Collecting metrics from DCGM exporter

Viewing GPU Costs in Dashboard

Once configured, GPU metrics will appear in your Kubeadapt dashboard:

Dashboard Features

1. GPU Cost Overview

  • Total GPU spend per month
  • GPU utilization percentage
  • GPU-enabled nodes count

2. GPU Utilization by Node

  • Per-node GPU usage graphs
  • Idle GPU identification
  • GPU memory utilization

3. Workload GPU Usage

  • GPU allocation per pod
  • GPU request vs. actual usage
  • Rightsizing recommendations for GPU workloads

GPU Metrics Available

Node-level:

  • GPU count per node
  • GPU model and memory capacity

Workload-level:

  • GPU requests (nvidia.com/gpu); see the cluster-side check after this list
  • GPU compute utilization percentage
  • GPU idle time detection
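
To cross-check GPU requests on the cluster side, the sketch below lists every pod that requests nvidia.com/gpu (jq is assumed to be installed; this is a convenience query, not a Kubeadapt command):

bash
# List pods that request nvidia.com/gpu
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select([.spec.containers[].resources.requests["nvidia.com/gpu"] // empty] | length > 0)
  | "\(.metadata.namespace)/\(.metadata.name)"'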

Troubleshooting

DCGM Exporter pods not running

Symptoms:

bash
kubectl get pods -n kubeadapt | grep dcgm
# No pods or CrashLoopBackOff

Common causes:

  1. No GPU nodes in cluster
  2. NVIDIA device plugin not installed
  3. GPU Operator failed to install

Solution:

bash
# Check for GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu" != null)'

# Check GPU Operator logs
kubectl logs -n kubeadapt deployment/gpu-operator

# Reinstall GPU Operator
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=false
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true

No GPU metrics appearing

Symptoms:

GPU cost data not visible in Kubeadapt dashboard.

Common causes:

  1. DCGM Exporter not scraped by Prometheus
  2. Incorrect namespace or labels in scrape config
  3. DCGM Exporter not exposing metrics

Solution:

bash
# Verify DCGM Exporter is running
# (adjust the namespace and labels to where your DCGM Exporter actually runs)
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Check DCGM Exporter logs
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter

# Verify Kubeadapt agent logs
kubectl logs -n kubeadapt -l app=kubeadapt-agent | grep -i gpu

GPU costs not showing in dashboard

Symptoms:

  • DCGM metrics available in Prometheus
  • GPU costs not visible in Kubeadapt dashboard

Common causes:

  1. Agent GPU monitoring not enabled
  2. Agent not collecting GPU metrics
  3. GPU pricing not configured

Solution:

bash
# Enable GPU monitoring in agent
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true

# Check agent logs
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu

# Verify GPU pricing is configured in dashboard
# Navigate to Settings → Cloud Providers → GPU Pricing

MIG Mode Limitation

IMPORTANT: DCGM Exporter in Kubernetes mode does NOT support container-level GPU utilization mapping when MIG (Multi-Instance GPU) is enabled.

If using MIG:

  • Node-level GPU metrics: Available
  • Container-level GPU metrics: Not available
  • Future: eBPF-based agent will support MIG mode

Workaround:

  • Use GPU node labels for cost allocation (example after this list)
  • Manual GPU cost distribution based on GPU requests
  • Wait for eBPF agent support (roadmap)
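
For the first workaround, labeling MIG-enabled nodes gives you a coarse handle for cost allocation even without container-level metrics. The label key and value below are illustrative, not a Kubeadapt convention:

bash
# Label a MIG-enabled GPU node for coarse cost attribution
# (label key/value are illustrative; use your own scheme)
kubectl label node <gpu-node> cost-center=ml-research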

Best Practices

1. Right-size GPU Requests

Monitor GPU utilization and adjust requests:

yaml
# Before (over-provisioned)
resources:
  requests:
    nvidia.com/gpu: 1  # GPU utilization: 25%

# After (right-sized)
resources:
  requests:
    nvidia.com/gpu: 0  # Moved to CPU-only node
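
To find over-provisioned GPU workloads, you can query average utilization from Prometheus. This is a sketch: it assumes Prometheus is reachable on localhost:9090 (for example via the port-forward shown in the verification steps) and that your DCGM Exporter attaches a pod label to its samples; label names vary by setup:

bash
# Average GPU utilization per pod over the last 24 hours
# (assumes samples carry a "pod" label; adjust to your label names)
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg by (pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]))' | jq '.data.result'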

2. Use GPU Node Taints

Prevent non-GPU workloads from running on expensive GPU nodes:

bash
# Taint GPU nodes
kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule

yaml
# GPU workloads need toleration
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule

3. Enable GPU Time-Slicing (Optional)

For multiple workloads sharing a single GPU:

yaml
# GPU Operator configuration
gpu-operator:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 4  # 4 containers can share 1 GPU
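
After time-slicing takes effect, each physical GPU is advertised as multiple schedulable nvidia.com/gpu resources. A quick way to confirm is to check the node's reported capacity:

bash
# With replicas: 4, a node with 1 physical GPU should report nvidia.com/gpu: 4
kubectl describe node <gpu-node> | grep nvidia.com/gpu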

GPU Pricing Configuration

Configure GPU Costs in Dashboard

  1. Navigate to Settings → Cloud Providers
  2. Select your cloud provider (AWS, GCP, Azure)
  3. GPU Pricing section:
    • Set hourly cost per GPU type
    • Or enable automatic pricing from cloud provider API

Example GPU pricing:

text
NVIDIA A100 (80GB): $3.67/hour
NVIDIA V100:        $2.48/hour
NVIDIA T4:          $0.95/hour

On-Premises GPU Pricing

For on-prem clusters, calculate GPU cost based on:

text
GPU Hourly Cost = (Hardware Cost / Depreciation Period) / Hours per Year

Example:
- Hardware: $10,000 per GPU
- Depreciation: 3 years
- Hourly cost: $10,000 / (3 × 365 × 24) = $0.38/hour
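
The same arithmetic as a one-liner, if you want to plug in your own numbers:

bash
# $10,000 hardware cost depreciated over 3 years, expressed per hour
awk 'BEGIN { printf "$%.2f/hour\n", 10000 / (3 * 365 * 24) }'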
