NVIDIA GPU Operator
Overview
The NVIDIA GPU Operator is a Kubernetes operator that automates the deployment, configuration, and management of the NVIDIA software required to run GPU-accelerated workloads.
Getting Started
1. Setting Up Your Shoot Cluster with GPU Workers
To use NVIDIA GPUs in your shoot cluster, at least one of the worker groups must be configured with a GPU-enabled machine type.
Steps:
- In your project, navigate to the creation of a new cluster
- Fill in the necessary information
- Select a Machine Type with GPU capabilities
- Set the Machine Image to the supported one with GPU capabilities - NVGPU Garden Linux
- Select a Zone where GPU nodes are available
- Fill in the rest of the cluster creation form and hit 'Create'
- Wait for the cluster to be created and reconciled
Warning
If only a limited number of GPUs is available, set Autoscaler Max. equal to Autoscaler Min., and set maxSurge: 0 and maxUnavailable: 1 so that a rolling update does not try to create an additional GPU node that cannot be provisioned.
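For reference, a GPU worker group in the shoot specification could look roughly like the sketch below. The machine type, image name, and zone are illustrative placeholders; take the exact values from the cloud profile of your infrastructure:

```yaml
spec:
  provider:
    workers:
      - name: gpu-worker
        machine:
          type: g6                   # placeholder: a GPU-enabled machine type from your cloud profile
          image:
            name: nvgpu-gardenlinux  # placeholder: the NVGPU Garden Linux image name from your cloud profile
        minimum: 1
        maximum: 1                   # equal to minimum when GPU capacity is scarce
        maxSurge: 0
        maxUnavailable: 1
        zones:
          - <zone-with-gpus>         # placeholder: a zone where GPU nodes are available
```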
2. Verifying GPU Availability
Check Node Labels
The Node Feature Discovery component automatically labels nodes with GPU capabilities. You can verify this by checking the node labels:
kubectl get nodes -o json | jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia.com/")))'
Check GPU Resources
Verify that GPU resources are advertised on your nodes:
kubectl get nodes -o json | jq -r '.items[] | .metadata.name, .status.allocatable'
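If you only care about the GPU count per node, the same output can be filtered further with jq. The snippet below illustrates the filter on a hypothetical one-node JSON sample; in a real cluster, pipe `kubectl get nodes -o json` into the same filter instead of the `echo`:

```shell
# Illustration on a hypothetical sample of the kubectl output:
echo '{"items":[{"metadata":{"name":"worker-gpu-1"},"status":{"allocatable":{"cpu":"8","nvidia.com/gpu":"1"}}}]}' \
  | jq -r '.items[] | "\(.metadata.name): \(.status.allocatable["nvidia.com/gpu"] // "0") GPU(s)"'
# prints: worker-gpu-1: 1 GPU(s)
```

The `// "0"` fallback makes non-GPU nodes report 0 instead of null.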
Check NVIDIA GPU Operator Status
Verify that the NVIDIA GPU Operator components are running:
kubectl get pods -n kube-system -l app.kubernetes.io/managed-by=nvgpu-operator
You should see various NVIDIA operator pods running:
kube-system nvidia-container-toolkit-daemonset-mkdkx 1/1 Running 0 129m
kube-system nvidia-dcgm-exporter-m69rd 1/1 Running 0 129m
kube-system nvidia-device-plugin-daemonset-lq5jd 1/1 Running 0 129m
kube-system nvidia-mig-manager-lf9xz 1/1 Running 0 126m
kube-system nvidia-operator-validator-8r4q8 1/1 Running 0 129m
Check NVIDIA GPU Operator Validator
The NVIDIA GPU Operator includes a built-in validator that sanity-checks driver, device plugin, and CUDA toolkit readiness. Verify it by checking the logs of the nvidia-operator-validator pods:
kubectl logs -n kube-system -l app=nvidia-operator-validator -c nvidia-operator-validator
Expected output:
all validations are successful
Verification: Running Sample GPU Applications
To verify that GPU operations are working correctly, you can run a sample CUDA application.
CUDA VectorAdd Example
This example demonstrates a simple CUDA vector addition operation to verify GPU functionality.
Create cuda-vectoradd pod
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
name: cuda-vectoradd
spec:
runtimeClassName: nvidia
restartPolicy: Never
containers:
- name: cuda-vectoradd
image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0"
resources:
requests:
nvidia.com/gpu: 1
limits:
nvidia.com/gpu: 1
EOF
The pod starts, runs the vectorAdd command, and then exits.
View the logs from the container
kubectl logs pod/cuda-vectoradd
Expected Output:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
If you see "Test PASSED", your GPU is working correctly!
Remove the pod
kubectl delete pods cuda-vectoradd
Alternative: Simple nvidia-smi test
For a quicker verification, you can run a simple pod that executes nvidia-smi:
Apply and check the logs:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
name: nvidia-smi
spec:
runtimeClassName: nvidia
restartPolicy: Never
containers:
- name: nvidia-smi
image: nvidia/cuda:12.2.0-base-ubuntu22.04
command: ["nvidia-smi"]
resources:
requests:
nvidia.com/gpu: 1
limits:
nvidia.com/gpu: 1
EOF
kubectl logs nvidia-smi
Expected output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20 Driver Version: 570.133.20 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A30 Off | 00000000:06:00.0 Off | 0 |
| N/A 28C P0 30W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Using GPUs in Your Applications
Resource Requests
To use a GPU in your pod, specify the GPU resource and runtimeClassName in the pod specification:
apiVersion: v1
kind: Pod
metadata:
name: my-gpu-app
spec:
runtimeClassName: nvidia
containers:
- name: my-container
image: my-cuda-app:latest
resources:
requests:
nvidia.com/gpu: 1
Note
Make sure the cluster has enough free GPU resources to satisfy the request; otherwise the pod remains in Pending.
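GPU requests work the same way in higher-level workloads. A sketch of a Deployment requesting one GPU per replica (the image name is a placeholder):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-gpu-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-gpu-app
  template:
    metadata:
      labels:
        app: my-gpu-app
    spec:
      runtimeClassName: nvidia
      containers:
        - name: my-container
          image: my-cuda-app:latest    # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1        # extended resources are set in limits; requests default to the same value
```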
Multi-Instance GPU
MIG (Multi-Instance GPU) is an NVIDIA technology that slices a single physical GPU into multiple isolated GPU instances. Each instance has dedicated compute, memory, and cache, so different workloads can run securely without resource contention. In Kubernetes, the GPU Operator’s MIG Manager applies the desired layout on nodes, and GPU Feature Discovery advertises the resulting MIG resources for scheduling.
Applying MIG config
Warning
Using a different MIG config is disruptive and causes pods on the node to be restarted or even removed.
You must select a MIG profile supported by the specific GPU model; see the NVIDIA Supported MIG Profiles page for the list.
Set MIG layout via label
In our example we will use the MIG config all-1g.6gb. On the NVIDIA A30 shown earlier, this partitions the physical GPU into four isolated MIG instances of the 1g.6gb profile.
kubectl label nodes <node> nvidia.com/mig.config=all-1g.6gb --overwrite
Wait for GPU Operator to reconcile
The nvidia-mig-manager DaemonSet will reconfigure MIG on each labeled node where the config is applicable.
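The MIG manager reports reconfiguration progress through a node label; assuming the standard nvidia.com/mig.config.state label is used, you can check it with:

```shell
# Read the MIG reconfiguration state reported by the MIG manager
# (expected values include pending, rebooting, success, failed):
kubectl get node <node> -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
```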
MIG instances visible in cluster
kubectl get node <node> -o json | jq '.status.allocatable | with_entries(select(.key|test("nvidia.com")))'
Expected output:
{
"nvidia.com/gpu": "4"
}
Alternatively, the MIG instances can be inspected with nvidia-smi.
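For example, the nvidia-smi pod from the verification section can be reused with the -L flag, which lists GPUs together with their MIG devices. The exact output depends on the MIG strategy and on which devices are granted to the pod:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-mig
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl logs nvidia-smi-mig
```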
Disable MIG instances
To disable MIG and revert to using the full physical GPU, change the node label to mig.config=all-disabled.
kubectl label nodes <node> nvidia.com/mig.config=all-disabled --overwrite