NVIDIA GPU Operator

Overview

The NVIDIA GPU Operator is a Kubernetes operator that automates the deployment, configuration, and management of the NVIDIA software required to run GPU-accelerated workloads.

Getting Started

1. Setting Up Your Shoot Cluster with GPU Workers

To use NVIDIA GPUs in your shoot cluster, configure at least one worker group with a GPU-enabled machine type.

Steps:

  1. In your project, navigate to the cluster creation page
  2. Fill in the necessary information
    1. Select a Machine Type with GPU capabilities
    2. Set the Machine Image to the supported GPU-enabled image - NVGPU Garden Linux
    3. Select a Zone where GPU nodes are available
  3. Fill in the rest of the cluster creation form and hit 'Create'
  4. Wait for the cluster to be created and reconciled

Warning

If only a limited number of GPUs is available, set Autoscaler Max. equal to Autoscaler Min., and set maxSurge: 0 and maxUnavailable: 1 so that rolling updates do not fail due to insufficient GPU capacity (see the worker-group sketch below).
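
For reference, the corresponding worker group in the Shoot spec looks roughly like the sketch below; the machine type, image name/version, and zone are illustrative placeholders, so use the values offered by your cloud provider and landscape:

spec:
  provider:
    workers:
    - name: gpu-worker                 # worker group with GPU machines
      machine:
        type: <gpu-machine-type>       # a GPU-enabled machine type
        image:
          name: <nvgpu-garden-linux>   # the NVGPU Garden Linux image as listed in your landscape
          version: <image-version>
      minimum: 1                       # Autoscaler Min.
      maximum: 1                       # Autoscaler Max. (equal to Min., per the warning above)
      maxSurge: 0
      maxUnavailable: 1
      zones:
      - <zone-with-gpu-capacity>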

2. Verifying GPU Availability

Check Node Labels

The Node Feature Discovery component automatically labels nodes with GPU capabilities. You can verify this by checking the node labels:

kubectl get nodes -o json | jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia.com/")))'
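
Example output (illustrative; the exact labels depend on the GPU model and operator version):

{
  "nvidia.com/gpu.count": "1",
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "NVIDIA-A30",
  "nvidia.com/mig.capable": "true"
}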

Check GPU Resources

Verify that GPU resources are advertised on your nodes:

kubectl get nodes -o json | jq -r '.items[] | .metadata.name, .status.allocatable'
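
Example output (illustrative; node name and quantities will differ). The important part is the nvidia.com/gpu entry under allocatable:

shoot--project--cluster-gpu-worker-xxxxx
{
  "cpu": "3920m",
  "memory": "12Gi",
  "nvidia.com/gpu": "1",
  "pods": "110"
}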

Check NVIDIA GPU Operator Status

Verify that the NVIDIA GPU Operator components are running:

kubectl get pods -n kube-system -l app.kubernetes.io/managed-by=nvgpu-operator

You should see various NVIDIA operator pods running:

kube-system   nvidia-container-toolkit-daemonset-mkdkx         1/1     Running   0          129m
kube-system   nvidia-dcgm-exporter-m69rd                       1/1     Running   0          129m
kube-system   nvidia-device-plugin-daemonset-lq5jd             1/1     Running   0          129m
kube-system   nvidia-mig-manager-lf9xz                         1/1     Running   0          126m
kube-system   nvidia-operator-validator-8r4q8                  1/1     Running   0          129m

Check NVIDIA GPU Operator Validator

The NVIDIA GPU Operator includes a built-in validator that sanity-checks driver, device plugin, and CUDA toolkit readiness. You can verify it by checking the logs of the nvidia-operator-validator DaemonSet pods:

kubectl logs -n kube-system -l app=nvidia-operator-validator -c nvidia-operator-validator

Expected output:

all validations are successful
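
If that message does not appear, inspecting the validator pod shows which validation step failed; in upstream GPU Operator releases each check runs as an init container, so (assuming this distribution behaves the same) the failing stage is visible in the init container statuses:

kubectl describe pod -n kube-system -l app=nvidia-operator-validator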

Verification: Running Sample GPU Applications

To verify that GPU operations are working correctly, you can run a sample CUDA application.

CUDA VectorAdd Example

This example demonstrates a simple CUDA vector addition operation to verify GPU functionality.

Create cuda-vectoradd pod

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0"
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
EOF

The pod starts, runs the vectorAdd command, and then exits.

View the logs from the container

kubectl logs pod/cuda-vectoradd

Expected Output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

If you see "Test PASSED", your GPU is working correctly!

Remove the pod

kubectl delete pods cuda-vectoradd

Alternative: Simple nvidia-smi test

For a quicker verification, you can run a simple pod that executes nvidia-smi:

Apply and check the logs:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs nvidia-smi

Expected output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     Off |   00000000:06:00.0 Off |                    0 |
| N/A   28C    P0             30W /  165W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Using GPUs in Your Applications

Resource Requests

To use a GPU in your pod, specify the GPU resource and runtimeClassName in the pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-app
spec:
  runtimeClassName: nvidia
  containers:
  - name: my-container
    image: my-cuda-app:latest
    resources:
      requests:
        nvidia.com/gpu: 1

Note

Make sure the cluster has enough allocatable GPU capacity to satisfy the request; otherwise the pod will remain Pending.
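
If the request cannot be satisfied, you can inspect the pod's events and look for a scheduler message such as "Insufficient nvidia.com/gpu":

kubectl describe pod my-gpu-app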

Multi-Instance GPU

MIG (Multi-Instance GPU) is an NVIDIA technology that slices a single physical GPU into multiple isolated GPU instances. Each instance has dedicated compute, memory, and cache, so different workloads can run securely without resource contention. In Kubernetes, the GPU Operator’s MIG Manager applies the desired layout on nodes, and GPU Feature Discovery advertises the resulting MIG resources for scheduling.

Applying MIG config

Warning

Applying a different MIG config is disruptive: pods running on the node will be restarted and may even be evicted.

You must select a MIG profile supported by your specific GPU model; see the NVIDIA Supported MIG Profiles page.

Set MIG layout via label

This example uses the MIG config all-1g.6gb, which on the NVIDIA A30 used here (24 GB of GPU memory) partitions the physical GPU into four isolated 1g.6gb instances.

kubectl label nodes <node> nvidia.com/mig.config=all-1g.6gb --overwrite

Wait for GPU Operator to reconcile

The nvidia-mig-manager DaemonSet reconfigures MIG on each node whose nvidia.com/mig.config label has changed.
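
In upstream GPU Operator releases, the mig-manager reports progress through the nvidia.com/mig.config.state node label (values such as pending, success, or failed). Assuming this distribution behaves the same, you can watch the reconfiguration with:

kubectl get node <node> -L nvidia.com/mig.config,nvidia.com/mig.config.state -w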

MIG instances visible in cluster

kubectl get node <node> -o json | jq '.status.allocatable | with_entries(select(.key|test("nvidia.com")))'

Expected output:

{
  "nvidia.com/gpu": "4"
}
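
This output corresponds to the single MIG strategy, where MIG instances are still advertised under the generic nvidia.com/gpu resource. If the operator were configured with the mixed strategy instead (check your operator configuration), the instances would appear as a profile-specific resource, for example:

{
  "nvidia.com/mig-1g.6gb": "4"
}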

Alternatively, you can inspect the MIG instances with nvidia-smi, as sketched below.
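
For example, reusing the nvidia-smi pod pattern from the verification section with the -L flag lists the MIG devices visible to the container (a sketch; the pod name is arbitrary, and since the pod requests a single resource it sees only the MIG instance allocated to it):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-mig
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs nvidia-smi-mig

Expected output shape (UUIDs elided):

GPU 0: NVIDIA A30 (UUID: GPU-...)
  MIG 1g.6gb      Device  0: (UUID: MIG-...)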

Disable MIG instances

To disable MIG and revert to using the full physical GPU, change the node label to mig.config=all-disabled.

kubectl label nodes <node> nvidia.com/mig.config=all-disabled --overwrite
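
After the mig-manager finishes reconfiguring, the node should advertise the full GPU again; re-running the allocatable check from above should show:

{
  "nvidia.com/gpu": "1"
}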