GPU Monitoring Setup

Overview

Kubeadapt supports GPU cost monitoring through integration with NVIDIA DCGM (Data Center GPU Manager) Exporter. This allows you to:

  • Track GPU costs alongside CPU and memory
  • Monitor GPU utilization per node and workload
  • Optimize GPU usage with rightsizing recommendations
  • Identify idle GPUs for cost savings

GPU Metrics Collected:

  • DCGM_FI_DEV_GPU_UTIL - GPU compute utilization percentage
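
For reference, this metric is exposed by DCGM Exporter in the Prometheus text format. The sample below is illustrative; the exact label set varies by DCGM Exporter version and configuration:

text
# Example sample from the exporter's /metrics endpoint (labels and values are illustrative)
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",Hostname="node-1"} 85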

Prerequisites

Before enabling GPU monitoring, ensure:

  • NVIDIA GPUs in your cluster nodes
  • NVIDIA device plugin installed (or let the GPU Operator install it for you)
  • Helm 3.x for Kubeadapt installation
  • Cluster admin permissions

Supported GPU types:

  • NVIDIA A100, V100, T4, P100
  • Any NVIDIA GPU with DCGM support
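
A quick way to confirm the prerequisites is to check whether your nodes advertise the nvidia.com/gpu resource (this requires the device plugin to be running; the command below is a convenience check, not part of the Kubeadapt install):

bash
# Nodes with GPUs exposed by the device plugin show a count in the GPU column
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"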

Installation Options

You have two options for enabling GPU monitoring:

Option 1: Install with GPU Operator

Use this if you don't have the GPU Operator installed yet.

Pros:

  • Single Helm install for everything
  • Automatic DCGM Exporter deployment
  • GPU Operator manages NVIDIA drivers and device plugins
  • Simpler configuration

Cons:

  • Installs additional components (GPU Operator stack)

Option 2: Use Existing DCGM Exporter

Use this if you already have DCGM Exporter running.

Pros:

  • Reuses existing infrastructure
  • Lighter weight (no additional deployments)

Cons:

  • Requires manual scrape configuration
  • Need to know your DCGM Exporter namespace/labels

Option 1: Install with GPU Operator

Step 1: Enable GPU Operator in Helm

Install (or upgrade) Kubeadapt with GPU Operator enabled:

bash
helm install kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --create-namespace \
  --set agent.enabled=true \
  --set agent.config.token=YOUR_TOKEN \
  --set gpu-operator.enabled=true

Or, if Kubeadapt is already installed, upgrade it instead:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true

Step 2: Enable DCGM Scraping

Create a gpu-monitoring-values.yaml file with the DCGM scrape configuration:

yaml
gpu-operator:
  enabled: true
  operator:
    defaultRuntime: containerd
  dcgmExporter:
    enabled: true
    resources:
      requests:
        memory: "128Mi"
        cpu: "50m"
      limits:
        memory: "512Mi"
        cpu: "250m"

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Enable DCGM Exporter scraping
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - kubeadapt
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$

Apply the configuration:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f gpu-monitoring-values.yaml
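
Before moving on, you can sanity-check that the new values are part of the release. This is an optional check, not a required step:

bash
# The gpu-operator and DCGM scrape settings should appear in the user-supplied values
helm get values kubeadapt -n kubeadapt | grep -E -A 2 'gpu-operator|dcgm'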

Step 3: Verify GPU Operator Installation

Check that GPU Operator components are running:

bash
kubectl get pods -n kubeadapt | grep -E 'gpu-operator|dcgm'

Expected output:

text
gpu-operator-6b8f9d7c4d-x7k9m          1/1   Running   0   2m
nvidia-dcgm-exporter-abcde             1/1   Running   0   2m
nvidia-device-plugin-daemonset-fghij   1/1   Running   0   2m

Step 4: Verify GPU Metrics

After configuration, GPU metrics should appear in the Kubeadapt dashboard.

Sign in to app.kubeadapt.io and navigate to your cluster to verify GPU cost data is visible.

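You can also query the bundled Prometheus directly to confirm the metric is being scraped. The sketch below assumes the chart's Prometheus server service is named kubeadapt-prometheus-server and that jq is installed; verify the actual service name with kubectl get svc -n kubeadapt:

bash
# Port-forward the bundled Prometheus server (service name is an assumption)
kubectl port-forward -n kubeadapt svc/kubeadapt-prometheus-server 9090:80 &
sleep 2  # give the port-forward a moment to establish

# Query the GPU utilization metric
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | jq '.data.result'
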
The query should return output similar to:

json
[
  {
    "metric": {
      "__name__": "DCGM_FI_DEV_GPU_UTIL",
      "gpu": "0",
      "instance": "10.0.1.42:9400",
      "job": "dcgm-exporter"
    },
    "value": [1705315200, "85.5"]
  }
]

Option 2: Use Existing DCGM Exporter

If you already have DCGM Exporter running in your cluster:

Step 1: Identify Your DCGM Exporter

Find the namespace and labels of your existing DCGM Exporter:

bash
kubectl get pods --all-namespaces -l app=dcgm-exporter

Example output:

text
NAMESPACE    NAME                   READY   STATUS    RESTARTS   AGE
gpu-system   dcgm-exporter-abc123   1/1     Running   0          10d
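
If nothing is returned, your exporter may use a different label; a broader name-based search also works:

bash
# Fall back to searching by pod name if the app label differs
kubectl get pods --all-namespaces | grep -i dcgm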

Step 2: Configure Prometheus Scraping

Create an existing-dcgm-values.yaml file with your DCGM Exporter details:

yaml
# existing-dcgm-values.yaml

prometheus:
  serverFiles:
    prometheus.yml:
      scrape_configs:
        # Scrape existing DCGM Exporter
        - job_name: "dcgm-exporter"
          kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - gpu-system  # Change to your namespace
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: dcgm-exporter  # Change if your label is different
            - source_labels: [__meta_kubernetes_pod_container_port_name]
              action: keep
              regex: metrics
          metric_relabel_configs:
            - source_labels: [__name__]
              action: keep
              regex: ^DCGM_FI_DEV_GPU_UTIL$

Apply the configuration:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  -f existing-dcgm-values.yaml

Step 3: Verify Scraping

Verify DCGM Exporter pods are running:

bash
# Use the namespace and label you identified in Step 1
kubectl get pods -n gpu-system -l app=dcgm-exporter

All pods should show Running status.
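
To confirm Prometheus is actually scraping the exporter, check its active targets. This is a sketch: the service name kubeadapt-prometheus-server is an assumption based on the chart's bundled Prometheus, and jq is used for readability:

bash
# Port-forward the bundled Prometheus server (service name is an assumption)
kubectl port-forward -n kubeadapt svc/kubeadapt-prometheus-server 9090:80 &
sleep 2  # give the port-forward a moment to establish

# The dcgm-exporter job should be listed with health "up"
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "dcgm-exporter") | .health'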


Agent Configuration

Enable GPU Monitoring in Agent

To enable GPU cost tracking in the Kubeadapt agent, add the GPU monitoring flag:

bash
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true

Or in values.yaml:

yaml
agent:
  enabled: true
  config:
    token: "YOUR_TOKEN"
    enableGpuMonitoring: true

Verify Agent Configuration

Check agent logs to confirm GPU monitoring is enabled:

bash
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu

Expected output:

text
INFO: GPU monitoring enabled
INFO: Discovered 4 GPU nodes in cluster
INFO: Collecting metrics from DCGM exporter

Viewing GPU Costs in Dashboard

Once configured, GPU metrics will appear in your Kubeadapt dashboard:

Dashboard Features

1. GPU Cost Overview

  • Total GPU spend per month
  • GPU utilization percentage
  • GPU-enabled nodes count

2. GPU Utilization by Node

  • Per-node GPU usage graphs
  • Idle GPU identification
  • GPU memory utilization

3. Workload GPU Usage

  • GPU allocation per pod
  • GPU request vs. actual usage
  • Rightsizing recommendations for GPU workloads

GPU Metrics Available

Node-level:

  • GPU count per node
  • GPU model and memory capacity

Workload-level:

  • GPU requests (nvidia.com/gpu); see the cluster-side check after this list
  • GPU compute utilization percentage
  • GPU idle time detection
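
To cross-check GPU requests on the cluster side, the sketch below lists every pod that requests nvidia.com/gpu (jq is assumed to be installed; this is a convenience query, not a Kubeadapt command):

bash
# List pods that request nvidia.com/gpu
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select([.spec.containers[].resources.requests["nvidia.com/gpu"] // empty] | length > 0)
  | "\(.metadata.namespace)/\(.metadata.name)"'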

Troubleshooting

DCGM Exporter pods not running

Symptoms:

bash
kubectl get pods -n kubeadapt | grep dcgm
# No pods or CrashLoopBackOff

Common causes:

  1. No GPU nodes in cluster
  2. NVIDIA device plugin not installed
  3. GPU Operator failed to install

Solution:

bash
# Check for GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu" != null)'

# Check GPU Operator logs
kubectl logs -n kubeadapt deployment/gpu-operator

# Reinstall GPU Operator
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=false
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set gpu-operator.enabled=true

No GPU metrics appearing

Symptoms:

GPU cost data not visible in Kubeadapt dashboard.

Common causes:

  1. DCGM Exporter not scraped by Prometheus
  2. Incorrect namespace or labels in scrape config
  3. DCGM Exporter not exposing metrics

Solution:

bash
# Verify DCGM Exporter is running
# (adjust the namespace and labels to where your DCGM Exporter actually runs)
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Check DCGM Exporter logs
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter

# Verify Kubeadapt agent logs
kubectl logs -n kubeadapt -l app=kubeadapt-agent | grep -i gpu

GPU costs not showing in dashboard

Symptoms:

  • DCGM metrics available in Prometheus
  • GPU costs not visible in Kubeadapt dashboard

Common causes:

  1. Agent GPU monitoring not enabled
  2. Agent not collecting GPU metrics
  3. GPU pricing not configured

Solution:

bash
# Enable GPU monitoring in agent
helm upgrade kubeadapt kubeadapt/kubeadapt \
  --namespace kubeadapt \
  --reuse-values \
  --set agent.config.enableGpuMonitoring=true

# Check agent logs
kubectl logs -n kubeadapt deployment/kubeadapt-agent | grep -i gpu

# Verify GPU pricing is configured in dashboard
# Navigate to Settings → Cloud Providers → GPU Pricing

MIG Mode Limitation

IMPORTANT: DCGM Exporter in Kubernetes mode does NOT support container-level GPU utilization mapping when MIG (Multi-Instance GPU) is enabled.

If using MIG:

  • Node-level GPU metrics: Available
  • Container-level GPU metrics: Not available
  • Future: eBPF-based agent will support MIG mode

Workaround:

  • Use GPU node labels for cost allocation (example after this list)
  • Manual GPU cost distribution based on GPU requests
  • Wait for eBPF agent support (roadmap)
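
For the first workaround, labeling MIG-enabled nodes gives you a coarse handle for cost allocation even without container-level metrics. The label key and value below are illustrative, not a Kubeadapt convention:

bash
# Label a MIG-enabled GPU node for coarse cost attribution
# (label key/value are illustrative; use your own scheme)
kubectl label node <gpu-node> cost-center=ml-research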

Best Practices

1. Right-size GPU Requests

Monitor GPU utilization and adjust requests:

yaml
# Before (over-provisioned)
resources:
  requests:
    nvidia.com/gpu: 1  # GPU utilization: 25%

# After (right-sized)
resources:
  requests:
    nvidia.com/gpu: 0  # Moved to CPU-only node
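
To find over-provisioned GPU workloads, you can query average utilization from Prometheus. This is a sketch: it assumes Prometheus is reachable on localhost:9090 (for example via the port-forward shown in the verification steps) and that your DCGM Exporter attaches a pod label to its samples; label names vary by setup:

bash
# Average GPU utilization per pod over the last 24 hours
# (assumes samples carry a "pod" label; adjust to your label names)
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg by (pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]))' | jq '.data.result'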

2. Use GPU Node Taints

Prevent non-GPU workloads from running on expensive GPU nodes:

bash
# Taint GPU nodes
kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule

yaml
# GPU workloads need toleration
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule

3. Enable GPU Time-Slicing (Optional)

For multiple workloads sharing a single GPU:

yaml
# GPU Operator configuration
gpu-operator:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 4  # 4 containers can share 1 GPU
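
After time-slicing takes effect, each physical GPU is advertised as multiple schedulable nvidia.com/gpu resources. A quick way to confirm is to check the node's reported capacity:

bash
# With replicas: 4, a node with 1 physical GPU should report nvidia.com/gpu: 4
kubectl describe node <gpu-node> | grep nvidia.com/gpu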

GPU Pricing Configuration

Configure GPU Costs in Dashboard

  1. Navigate to Settings → Cloud Providers
  2. Select your cloud provider (AWS, GCP, Azure)
  3. GPU Pricing section:
    • Set hourly cost per GPU type
    • Or enable automatic pricing from cloud provider API

Example GPU pricing:

text
NVIDIA A100 (80GB): $3.67/hour
NVIDIA V100:        $2.48/hour
NVIDIA T4:          $0.95/hour

On-Premises GPU Pricing

For on-prem clusters, calculate GPU cost based on:

text
GPU Hourly Cost = (Hardware Cost / Depreciation Period) / Hours per Year

Example:
- Hardware: $10,000 per GPU
- Depreciation: 3 years
- Hourly cost: $10,000 / (3 × 365 × 24) = $0.38/hour
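
The same arithmetic as a one-liner, if you want to plug in your own numbers:

bash
# $10,000 hardware cost depreciated over 3 years, expressed per hour
awk 'BEGIN { printf "$%.2f/hour\n", 10000 / (3 * 365 * 24) }'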
