From 646eba67efc2b6815ab18999eee1368928e090a5 Mon Sep 17 00:00:00 2001 From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> Date: Thu, 30 Apr 2026 15:58:44 -0400 Subject: [PATCH] Add prereq, install, and upgrade page Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> --- docs/install.md | 430 ++++++++++++++++++++++++++++++++++++++++++ docs/prerequisites.md | 85 +++++++++ docs/upgrade.md | 124 ++++++++++++ 3 files changed, 639 insertions(+) create mode 100644 docs/install.md create mode 100644 docs/prerequisites.md create mode 100644 docs/upgrade.md diff --git a/docs/install.md b/docs/install.md new file mode 100644 index 000000000..827538835 --- /dev/null +++ b/docs/install.md @@ -0,0 +1,430 @@ +# Install + +Install the DRA Driver for NVIDIA GPUs and validate that GPU or ComputeDomain allocation is working on your cluster. + +Before you begin: + +- Confirm all [prerequisites](prerequisites.md) are met. +- If you have the NVIDIA GPU Operator installed, follow the [GPU Operator install guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-intro-install.html) instead. + +--- + +## Install the chart + +The following command installs the DRA Driver with both GPU allocation and ComputeDomain support enabled. + +By default, both resource plugins are enabled. If you only need one, the other can be left enabled with no impact on the cluster, or you can disable it explicitly: + +- To disable ComputeDomain support, add `--set resources.computeDomains.enabled=false`. +- To disable GPU allocation, add `--set resources.gpus.enabled=false`. + +> **Note:** On GKE, include `--set nvidiaDriverRoot=/home/kubernetes/bin/nvidia` so the driver uses the default NVIDIA driver install path on GKE. 
+ +```bash +helm install dra-driver-nvidia-gpu oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \ + --version 0.4.0 \ + --create-namespace \ + --namespace dra-driver-nvidia-gpu \ + --set gpuResourcesEnabledOverride=true +``` + +Example output: + +``` +NAME: dra-driver-nvidia-gpu +LAST DEPLOYED: Wed Apr 29 02:21:24 2026 +NAMESPACE: dra-driver-nvidia-gpu +STATUS: deployed +REVISION: 1 +DESCRIPTION: Install complete +TEST SUITE: None +``` + +For additional configuration options, see [Optional: Configure Helm values](#optional-configure-helm-values). + +## Verify installation + +After install, confirm all components are running and the expected DeviceClasses are registered. + +1. Check that all pods are `Running` and `Ready`: + +```bash +kubectl get pod -n dra-driver-nvidia-gpu +``` + +Example output (with GPU allocation and ComputeDomains enabled): + +``` +NAME READY STATUS RESTARTS AGE +dra-driver-nvidia-gpu-controller-7fb7956988-4kv59 1/1 Running 0 1m +dra-driver-nvidia-gpu-kubelet-plugin-5qhc7 2/2 Running 0 1m +``` + +The `controller` pod runs the ComputeDomain controller (1 container). The `kubelet-plugin` pod runs two containers, one for GPU resources (`gpus`) and one for ComputeDomain resources (`compute-domains`), so it shows `2/2` when both are enabled. One `kubelet-plugin` pod appears per GPU node. + +If you installed with `--set resources.computeDomains.enabled=false`, the `controller` pod will not be present and the `kubelet-plugin` pod will show `1/1`. The same is true if you disabled GPU allocation during install. + +2. Confirm the DeviceClasses were registered: + +```bash +kubectl get deviceclass +``` + +Example output: + +``` +NAME AGE +compute-domain-daemon.nvidia.com 1m +compute-domain-default-channel.nvidia.com 1m +gpu.nvidia.com 1m +mig.nvidia.com 1m +vfio.gpu.nvidia.com 1m +``` + +`gpu.nvidia.com` is used for standard GPU allocation. 
`mig.nvidia.com` and `vfio.gpu.nvidia.com` are registered but only usable with the appropriate hardware and configuration. The `compute-domain-*` classes are used by the ComputeDomain controller. + +If you installed with only ComputeDomain support, `gpu.nvidia.com`, `mig.nvidia.com`, and `vfio.gpu.nvidia.com` are not registered. + +If you installed with only GPU allocation support, `compute-domain-daemon.nvidia.com` and `compute-domain-default-channel.nvidia.com` are not registered. + +3. Confirm GPU nodes have advertised their ResourceSlices: + +```bash +kubectl get resourceslice -o wide +``` + +Example output: + +``` +NAME NODE DRIVER POOL AGE +00-gpu.nvidia.com-worker-gpu-01-kx9f2 worker-gpu-01 gpu.nvidia.com worker-gpu-01 3m +00-compute-domain.nvidia.com-worker-gpu-01-ab3d7 worker-gpu-01 compute-domain.nvidia.com worker-gpu-01 3m +``` + +The ResourceSlice name is auto-generated from the driver name, node name, and a random suffix. +The pool name matches the node name, since each node gets its own pool. + +When GPU allocation support is enabled, each GPU node should appear with `gpu.nvidia.com` slices listing its available devices. + +When ComputeDomain support is enabled, each GPU node should also appear with `compute-domain.nvidia.com` slices listing +its available IMEX daemon and channel devices. + +If no slices appear, the kubelet plugin is probably not communicating with the API server. +Check that the driver pods are running and that your GPUs are healthy, then inspect the kubelet plugin logs: + +```bash +kubectl logs dra-driver-nvidia-gpu-kubelet-plugin- -n dra-driver-nvidia-gpu +``` + +For additional help, consider filing an [issue in the DRA Driver repository](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/issues). + +## Optional: Configure Helm values + +The following parameters are most commonly set at install time. + +| Parameter | Default | Description | +|---|---|---| +| `nvidiaDriverRoot` | `/` | Path to the GPU driver root on the host. 
If the NVIDIA GPU Operator manages the NVIDIA GPU driver on your nodes, set this to `/run/nvidia/driver`, the default location for Operator-managed drivers. For GKE, use `/home/kubernetes/bin/nvidia`. An incorrect value is a common cause of install failures. | +| `resources.gpus.enabled` | `true` | Enable the GPU kubelet plugin. Requires `gpuResourcesEnabledOverride=true`. | +| `resources.computeDomains.enabled` | `true` | Enable the ComputeDomain controller and kubelet plugin. | +| `gpuResourcesEnabledOverride` | `false` | Required to enable GPU allocation resources. | +| `featureGates` | `{}` | Map of feature gate name to boolean. See [Feature gates](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/blob/main/deployments/helm/dra-driver-nvidia-gpu/values.yaml#L84) in the repository for details on the available feature gates. | +| `logVerbosity` | `4` | Log verbosity level (0–7). Higher values produce more output. | + +To list all available parameters: + +```bash +helm show values oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu --version 0.4.0 +``` + +## Optional: Admission webhook + +The admission webhook validates opaque configuration in `ResourceClaim` and `ResourceClaimTemplate` specs, providing early feedback on invalid values. It is disabled by default. +Refer to the [API reference](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/tree/main/api/nvidia.com/resource/v1beta1) for more details on this configuration. + +Prerequisite: [cert-manager](https://cert-manager.io/) must be installed in your cluster. + +1. Install cert-manager: + +```bash +helm install \ + cert-manager oci://quay.io/jetstack/charts/cert-manager \ + --namespace cert-manager \ + --create-namespace \ + --set crds.enabled=true \ + --wait +``` + +2. Enable the webhook by including `--set webhook.enabled=true` in the Helm install command. 
To use a pre-existing TLS secret instead of cert-manager, set `webhook.tls.mode=secret` and provide `webhook.tls.secret.name` and `webhook.tls.secret.caBundle`. + +```bash +helm install dra-driver-nvidia-gpu oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \ + --version 0.4.0 \ + --create-namespace \ + --namespace dra-driver-nvidia-gpu \ + --set gpuResourcesEnabledOverride=true \ + --set webhook.enabled=true +``` + +Example output: + +``` +NAME: dra-driver-nvidia-gpu +LAST DEPLOYED: Wed Apr 29 02:21:24 2026 +NAMESPACE: dra-driver-nvidia-gpu +STATUS: deployed +REVISION: 1 +DESCRIPTION: Install complete +TEST SUITE: None +``` + +3. Optional: Verify that the webhook pod is `Running` and `Ready`: + +```bash +kubectl get pod -n dra-driver-nvidia-gpu +``` + +Example output: + +``` +NAME READY STATUS RESTARTS AGE +dra-driver-nvidia-gpu-controller-7fb7956988-4kv59 1/1 Running 0 25s +dra-driver-nvidia-gpu-kubelet-plugin-5qhc7 2/2 Running 0 25s +dra-driver-nvidia-gpu-webhook-6c9dd4956d-r4r7z 1/1 Running 0 25s +``` + +## Run a sample GPU allocation workload + +Deploy a sample workload that allocates a GPU through the DRA Driver and verifies it is shared correctly between containers. For additional examples, see the [`demo/` folder](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/tree/main/demo) in the repository. + +> **Note:** GPU resource allocation must be enabled at install time (`--set gpuResourcesEnabledOverride=true`). If you installed with `--set resources.gpus.enabled=false`, skip this section. + +1. Create a namespace for the test workload: + +```bash +kubectl create namespace dra-gpu-share-test +``` + +Example output: + +``` +namespace/dra-gpu-share-test created +``` + +2. Create a `ResourceClaimTemplate` in `dra-gpu-share-claim-template.yaml` like the following example. It requests a single device from the `gpu.nvidia.com` device class. 
When a pod references this template, Kubernetes creates a per-pod `ResourceClaim` from it: + +```yaml +apiVersion: resource.k8s.io/v1 # Kubernetes 1.34+ +# apiVersion: resource.k8s.io/v1beta2 # Kubernetes 1.33 only; 1.32 does not serve v1beta2 +kind: ResourceClaimTemplate +metadata: + namespace: dra-gpu-share-test + name: single-gpu +spec: + spec: + devices: + requests: + - name: gpu + exactly: + deviceClassName: gpu.nvidia.com +``` + +3. Apply the manifest: + +```bash +kubectl apply -f dra-gpu-share-claim-template.yaml +``` + +Example output: + +``` +resourceclaimtemplate.resource.k8s.io/single-gpu created +``` + +4. Create the test pod in `dra-gpu-share-pod.yaml`. Both containers (`ctr0` and `ctr1`) reference the same claim (`shared-gpu`), demonstrating that DRA allows multiple containers within a pod to share a single GPU: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + namespace: dra-gpu-share-test + name: pod + labels: + app: pod +spec: + containers: + - name: ctr0 + image: ubuntu:22.04 + command: ["bash", "-c"] + args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"] + resources: + claims: + - name: shared-gpu + - name: ctr1 + image: ubuntu:22.04 + command: ["bash", "-c"] + args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"] + resources: + claims: + - name: shared-gpu + resourceClaims: + - name: shared-gpu + resourceClaimTemplateName: single-gpu + tolerations: + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" +``` + +5. Apply the manifest: + +```bash +kubectl apply -f dra-gpu-share-pod.yaml +``` + +Example output: + +``` +pod/pod created +``` + +6. Verify both containers use the same GPU: + +```bash +kubectl logs pod -n dra-gpu-share-test --all-containers --prefix +``` + +Example output shows the same GPU UUID from both containers: + +``` +[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c) +[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c) +``` + +7. 
Clean up: + +```bash +kubectl delete -f dra-gpu-share-pod.yaml -f dra-gpu-share-claim-template.yaml +kubectl delete namespace dra-gpu-share-test +``` + +Example output: + +``` +pod "pod" deleted +resourceclaimtemplate.resource.k8s.io "single-gpu" deleted +namespace "dra-gpu-share-test" deleted +``` + +--- + +## Run a sample ComputeDomain workload + +Deploy a sample workload that provisions an IMEX channel across NVLink-connected nodes and verifies the channel device is injected into the pod. + +> **Note:** This section requires Multi-Node NVLink (MNNVL) hardware. + +1. Validate clique node labels. GPU Feature Discovery labels each MNNVL-capable node with `nvidia.com/gpu.clique`. Confirm all expected nodes have this label: + +```bash +(echo -e "NODE\tLABEL\tCLIQUE"; kubectl get nodes -o json | \ + jq -r '.items[] | [.metadata.name, "nvidia.com/gpu.clique", .metadata.labels["nvidia.com/gpu.clique"]] | @tsv') | \ + column -t +``` + +Example output: + +``` +NODE LABEL CLIQUE +gpu-node-001 nvidia.com/gpu.clique a1b2c3d4-e5f6-7890-abcd-ef1234567890.0 +gpu-node-002 nvidia.com/gpu.clique a1b2c3d4-e5f6-7890-abcd-ef1234567890.0 +``` + +Each value should have the shape `.`. If any nodes are missing the label, confirm that GPU Feature Discovery is deployed and running on the affected nodes. + +2. Create a `ComputeDomain`. This groups nodes connected via NVLink fabric and provisions the IMEX channels needed for cross-node GPU communication. The `channel.resourceClaimTemplate` field names a `ResourceClaimTemplate` that the controller creates automatically, which pods then use to claim a channel: + +```bash +cat <<EOF > imex-compute-domain.yaml +apiVersion: resource.nvidia.com/v1beta1 +kind: ComputeDomain +metadata: + name: imex-channel-injection +spec: + numNodes: 0 + channel: + resourceClaimTemplate: + name: imex-channel-0 +EOF +kubectl apply -f imex-compute-domain.yaml +``` + +Example output: + +``` +computedomain.resource.nvidia.com/imex-channel-injection created +``` + +3. 
Create the test pod. `nodeAffinity` restricts scheduling to nodes labeled `nvidia.com/gpu.clique`, and the pod claims the IMEX channel provisioned by the `ComputeDomain`: + +```bash +cat <<EOF > imex-test-pod.yaml +apiVersion: v1 +kind: Pod +metadata: + name: imex-channel-injection +spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: nvidia.com/gpu.clique + operator: Exists + containers: + - name: ctr + image: ubuntu:22.04 + command: ["bash", "-c"] + args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"] + resources: + claims: + - name: imex-channel-0 + resourceClaims: + - name: imex-channel-0 + resourceClaimTemplateName: imex-channel-0 +EOF +kubectl apply -f imex-test-pod.yaml +``` + +Example output: + +``` +pod/imex-channel-injection created +``` + +4. Verify IMEX channel injection: + +```bash +kubectl logs imex-channel-injection +``` + +The output should list one or more channel device files under `/dev/nvidia-caps-imex-channels`: + +``` +total 0 +drwxr-xr-x 2 root root 60 ... +crw-rw-rw- 1 root root 507, 0 ... channel0 +``` + +5. Clean up: + +```bash +kubectl delete -f imex-test-pod.yaml -f imex-compute-domain.yaml +``` + +Example output: + +``` +pod "imex-channel-injection" deleted +computedomain.resource.nvidia.com "imex-channel-injection" deleted +``` diff --git a/docs/prerequisites.md b/docs/prerequisites.md new file mode 100644 index 000000000..e545c7fbd --- /dev/null +++ b/docs/prerequisites.md @@ -0,0 +1,85 @@ +# Prerequisites + +Cluster, software, and hardware requirements for the DRA Driver for NVIDIA GPUs. + +> Tip: Most of these prerequisites can be installed and managed for you by the [NVIDIA GPU Operator](#install-prerequisites-with-nvidia-gpu-operator). + + +| Requirement | Version / Notes | +|---|---| +| Kubernetes | v1.34.2 or later, with at least one node that has one or more NVIDIA GPUs. 
| +| `DynamicResourceAllocation` feature gate | Enabled by default in Kubernetes v1.34+. On v1.32 and v1.33, [enable it manually](#enable-dra-on-kubernetes-v132-and-v133). | +| Helm | v3.8 or later. | +| NVIDIA Driver | v565 or later for GPU allocation. v570.158.01 or later if using [ComputeDomains](#computedomains-additional-prerequisites). | +| CDI | Enabled in your container runtime. CDI is enabled by default in containerd 2.0+ and CRI-O v1.27+. The DRA Driver uses CDI to expose GPUs to containers. | +| Node Feature Discovery (NFD) | Labels GPU nodes in the cluster. The DRA Driver uses these labels to target the GPU kubelet plugin to the correct nodes. | + +## ComputeDomains additional prerequisites + +If you plan to use ComputeDomains, you also need: + +- NVIDIA Driver v570.158.01 or later. The `IMEXDaemonsWithDNSNames` feature gate is enabled by default and requires this driver version. The ComputeDomain plugin will fail to start on older drivers unless `IMEXDaemonsWithDNSNames` is explicitly disabled. +- Multi-Node NVLink (MNNVL) hardware. Nodes must be connected via an NVLink fabric, as in GB200 NVL72 and similar systems. +- GPU Feature Discovery (GFD) deployed via the [GPU Operator](#install-prerequisites-with-nvidia-gpu-operator). GFD generates the `nvidia.com/gpu.clique` node labels required by ComputeDomains. +- On all GPU nodes where the `nvidia-imex-*` packages are installed, the `nvidia-imex.service` systemd unit must be disabled and masked: + +```bash +systemctl disable --now nvidia-imex.service && systemctl mask nvidia-imex.service +``` + +## Install prerequisites with NVIDIA GPU Operator + +The [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) is a Kubernetes operator that automates the deployment and lifecycle management of all NVIDIA software components needed to provision and monitor GPUs in a cluster. 
+ +It can manage the following DRA Driver for NVIDIA GPUs prerequisites for you: + +- NVIDIA Driver (v565+ for GPU allocation, v570.158.01+ for ComputeDomains). The GPU Operator installs a [default driver](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#gpu-operator-component-matrix) that meets the DRA Driver's prerequisites. To use a specific version, see [Common chart customization options](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options) in the GPU Operator documentation. +- CDI enabled through the NVIDIA Container Toolkit. +- Node Feature Discovery (NFD). +- GPU Feature Discovery (GFD), required for ComputeDomains. + +If you choose to install the GPU Operator, follow the [DRA Driver for NVIDIA GPUs install guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-intro-install.html) in the GPU Operator documentation. It covers installing the GPU Operator with the NVIDIA Kubernetes Device Plugin disabled and installing the DRA Driver for NVIDIA GPUs. + +## Enable DRA on Kubernetes v1.32 and v1.33 + +On Kubernetes v1.34 and later, `DynamicResourceAllocation` is enabled by default and no additional configuration is required. + +On Kubernetes v1.32 and v1.33, enable the following on each component: + +| Component | Requirement | +|---|---| +| kube-apiserver | Enable the `DynamicResourceAllocation` feature gate and the `resource.k8s.io/v1beta1` API group (available on v1.32 and v1.33). On v1.33, also enable `resource.k8s.io/v1beta2`. | +| kube-controller-manager | Enable the `DynamicResourceAllocation` feature gate | +| kube-scheduler | Enable the `DynamicResourceAllocation` feature gate | +| kubelet | Enable the `DynamicResourceAllocation` feature gate | + +How you apply these depends on your cluster setup. For managed Kubernetes distributions (EKS, GKE, AKS, and others), refer to your provider's documentation. 
Not all providers support enabling `DynamicResourceAllocation` on v1.32 or v1.33 clusters. + +### Example: kubeadm + +The following `kubeadm-init.yaml` enables DRA for a new cluster using [kubeadm](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/control-plane-flags/): + +```yaml +apiVersion: kubeadm.k8s.io/v1beta4 +kind: ClusterConfiguration +apiServer: + extraArgs: + - name: "feature-gates" + value: "DynamicResourceAllocation=true" + - name: "runtime-config" + # On v1.32, omit "resource.k8s.io/v1beta2=true" + value: "resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true" +controllerManager: + extraArgs: + - name: "feature-gates" + value: "DynamicResourceAllocation=true" +scheduler: + extraArgs: + - name: "feature-gates" + value: "DynamicResourceAllocation=true" +--- +apiVersion: kubelet.config.k8s.io/v1beta1 +kind: KubeletConfiguration +featureGates: + DynamicResourceAllocation: true +``` diff --git a/docs/upgrade.md b/docs/upgrade.md new file mode 100644 index 000000000..b680cafe7 --- /dev/null +++ b/docs/upgrade.md @@ -0,0 +1,124 @@ +# Upgrade from v25.12.0 to v0.4.0 + +Upgrade the DRA Driver for NVIDIA GPUs from `v25.12.0` to `v0.4.0` without disrupting existing workloads. For the full release summary, see the [v0.4.0 release notes](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/releases/tag/v0.4.0). + + +--- + +## What changed + +The v0.4.0 release introduces several changes that affect how this upgrade is performed: + +- The project moved from `NVIDIA/k8s-dra-driver-gpu` to `kubernetes-sigs/dra-driver-nvidia-gpu`. The Go module is now `sigs.k8s.io/nvidia-dra-driver-gpu` and container images are published to `registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu` in addition to NVIDIA NGC Catalog (NGC). +- The Helm chart name changed from `nvidia-dra-driver-gpu` to `dra-driver-nvidia-gpu`. 
To keep existing Kubernetes resource names (DaemonSets, Deployments, ServiceAccounts, RBAC) stable, `--set nameOverride=nvidia-dra-driver-gpu` is required on the first upgrade. See [Upgrade procedure](#upgrade-procedure) below. +- In addition to NGC (`nvidia/dra-driver-nvidia-gpu`), the DRA Driver Helm chart is now also published to `oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu`. You can continue to use the NGC chart or switch to the Kubernetes registry. +- Starting in v0.4.0, the chart follows SemVer and `--version 0.4.0` is required on `helm install` and `helm upgrade`. +- Once the cluster is on v0.4.0, downgrading to v25.12.0 is not supported. Two changes prevent downgrade: the kubelet plugin checkpoint format added a `BootID` field, and the `ComputeDomain` API now allows `numNodes` to be omitted. Plan this upgrade as forward-only. See the [v0.4.0 release notes](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/releases/tag/v0.4.0) for more details. + +## Before you begin + +- Collect the `--set` flags you used at install time (for example, `gpuResourcesEnabledOverride`, `nvidiaDriverRoot`, `webhook.enabled`). You will pass the same flags on `helm upgrade`. +- If any node hit the "device cannot be reprepared after host reboot" issue ([#951](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/issues/951)) prior to v0.4.0, remove the kubelet plugin checkpoint file on that node before upgrading. The new BootID-aware checkpoint format in v0.4.0 only invalidates checkpoints that already carry a recorded BootID; legacy checkpoints written by v25.12.0 are otherwise assumed valid. + +## Upgrade procedure + +Perform the following steps in order. + +1. Update custom resource definitions. + +Apply the v0.4.0 CRDs before upgrading the Helm chart. 
Helm only installs CRDs on first install and does not update them on `helm upgrade`, so applying them explicitly ensures the API schema is ready before the new controller and kubelet plugin start. + +Update the `ComputeDomain` CRD: + +```bash +kubectl apply \ + -f https://raw.githubusercontent.com/kubernetes-sigs/dra-driver-nvidia-gpu/refs/tags/v0.4.0/deployments/helm/dra-driver-nvidia-gpu/crds/resource.nvidia.com_computedomains.yaml +``` + +Update the `ComputeDomainClique` CRD: + +```bash +kubectl apply \ + -f https://raw.githubusercontent.com/kubernetes-sigs/dra-driver-nvidia-gpu/refs/tags/v0.4.0/deployments/helm/dra-driver-nvidia-gpu/crds/resource.nvidia.com_computedomaincliques.yaml +``` + +2. Upgrade the Helm chart in place with `helm upgrade -i`. + +Two flags are required to upgrade to v0.4.0: + +- `--version 0.4.0`, because v0.4.0 introduces SemVer. +- `--set nameOverride=nvidia-dra-driver-gpu`, because the chart was renamed. Without this override, the new chart creates duplicate Kubernetes objects (kubelet plugin DaemonSet, controller Deployment, RBAC, and so on) under the new name instead of upgrading the existing ones. This is only required on the first upgrade to v0.4.0 or later. + +The following command upgrades the chart and switches to using the Kubernetes registry source: + +```bash +helm upgrade -i nvidia-dra-driver-gpu oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \ + --version 0.4.0 \ + --namespace nvidia-dra-driver-gpu \ + --set gpuResourcesEnabledOverride=true \ + --set nameOverride=nvidia-dra-driver-gpu +``` + +Append any additional `--set` flags you used at install time. For example, add `--set nvidiaDriverRoot=/run/nvidia/driver` if the NVIDIA GPU Operator manages your drivers. + +Example output: + +``` +Release "nvidia-dra-driver-gpu" has been upgraded. Happy Helming! 
+NAME: nvidia-dra-driver-gpu +LAST DEPLOYED: Wed May 13 05:30:39 2026 +NAMESPACE: nvidia-dra-driver-gpu +STATUS: deployed +REVISION: 2 +TEST SUITE: None +``` + +> Note: Subsequent `helm upgrade` calls do not need `--set nameOverride=nvidia-dra-driver-gpu`. It is only required on the first upgrade to v0.4.0 or later. + +## Verify the upgrade + +After `helm upgrade` completes, confirm the new pods are running and pre-existing workloads are still healthy. + +1. Check that all driver pods are `Running` and `Ready`: + +```bash +kubectl get pods -n nvidia-dra-driver-gpu +``` + +Example output: + +``` +NAME READY STATUS RESTARTS AGE +nvidia-dra-driver-gpu-controller-5c968c745f-s8n2m 1/1 Running 0 13s +nvidia-dra-driver-gpu-kubelet-plugin-6fmmd 2/2 Running 0 112s +``` + +The `controller` pod runs the ComputeDomain controller. The `kubelet-plugin` pod runs two containers (`gpus` and `compute-domains`) when both resource plugins are enabled, so it reports `2/2`. + +2. Confirm every pre-existing `ResourceClaim` is still allocated and reserved for its pod: + +```bash +kubectl get resourceclaims -A +``` + +Example output, showing a ComputeDomain workload and two GPU workloads that were running before the upgrade: + +``` +NAMESPACE NAME STATE AGE +default imex-channel-injection-imex-channel-0-c5pnt allocated,reserved 13s +nvidia-dra-driver-gpu computedomain-daemon-dc84d905-2336-45fa-9-compute-domaifb5sg allocated,reserved 12s +gpu-test1 pod1-gpu-7r7zn allocated,reserved 119s +gpu-test1 pod2-gpu-stmv8 allocated,reserved 119s +``` + +Every claim that was bound before the upgrade should still report `allocated,reserved`. 
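
On clusters with many claims, scanning the `kubectl get resourceclaims -A` table by eye is error-prone. The following is a minimal sketch, not part of the driver tooling, of a shell helper that prints only the claims whose state is something other than `allocated,reserved`. The function name `unhealthy_claims` is hypothetical, and the sketch assumes the four-column layout (NAMESPACE, NAME, STATE, AGE) shown in the example output above.

```bash
# Hypothetical helper (not shipped with the DRA Driver).
# Reads rows of `kubectl get resourceclaims -A --no-headers` on stdin and
# prints "namespace/name state" for every claim whose STATE column is not
# exactly "allocated,reserved".
# Assumption: columns are NAMESPACE NAME STATE AGE, separated by whitespace.
unhealthy_claims() {
  # NF >= 3 skips blank or truncated rows instead of reporting them.
  awk 'NF >= 3 && $3 != "allocated,reserved" { print $1 "/" $2, $3 }'
}
```

For example, `kubectl get resourceclaims -A --no-headers | unhealthy_claims` prints nothing when every listed claim is still allocated and reserved. Claims created after the upgrade may legitimately be in another state while they are being scheduled, so compare any output against the claims you expected to be bound.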
+ +## Troubleshooting + +If a workload pod is in a non-`Running` state after the upgrade, capture the kubelet plugin logs from the node where the pod was scheduled: + +```bash +kubectl logs -n nvidia-dra-driver-gpu +``` + +If you see `checkpoint is corrupted` errors, the v0.4.0 kubelet plugin now logs a diff between the on-disk and re-marshaled checkpoint contents to make this easier to debug. Include that log output when filing an [issue](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/issues).
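
For the log-capture step above, one way to find the kubelet plugin pod on a particular node is to list pod names alongside node names and filter the result. The sketch below assumes the release was upgraded with `nameOverride=nvidia-dra-driver-gpu`, so that pod names contain `kubelet-plugin` as in the example output in "Verify the upgrade". The function name `plugin_pod_on_node` is hypothetical and is not part of the driver.

```bash
# Hypothetical helper (not shipped with the DRA Driver).
# Reads "POD_NAME NODE_NAME" rows on stdin and prints the names of the
# kubelet-plugin pods scheduled on the node passed as the first argument.
# Assumption: kubelet plugin pod names contain the string "kubelet-plugin".
plugin_pod_on_node() {
  awk -v node="$1" '$2 == node && $1 ~ /kubelet-plugin/ { print $1 }'
}
```

The input rows can be produced with `kubectl get pods -n nvidia-dra-driver-gpu --no-headers -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName` and piped into `plugin_pod_on_node` with the node name as its argument. Pass the resulting pod name to `kubectl logs -n nvidia-dra-driver-gpu`, adding `--all-containers` to capture both the `gpus` and `compute-domains` containers.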