---
weight: 20
---

# Installation

## Prerequisites

- **NVIDIA Driver v565+**
- **Kubernetes v1.32+**
- **ACP v4.1+**
- **Cluster administrator access to your ACP cluster**
- **CDI must be enabled in the underlying container runtime, such as containerd (see [Enable CDI](how_to/cdi_enable_containerd.mdx))**
- **DRA and the corresponding API groups must be enabled (see [Enable DRA](how_to/k8s_dra_enable.mdx))**

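Before proceeding, you can quickly confirm that the DRA API group is actually served by the cluster. This is an optional sanity check, assuming `kubectl` is already configured against the target cluster:

```shell
# List the DRA resource types; an empty result means the
# resource.k8s.io API group is not enabled on this cluster.
kubectl api-resources --api-group=resource.k8s.io
```

If nothing is listed, revisit the [Enable DRA](how_to/k8s_dra_enable.mdx) steps before continuing.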
## Procedure

### Installing the NVIDIA driver on your GPU node

Refer to the [installation guide on the NVIDIA official website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).

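Once the driver is installed, a quick sanity check is to run `nvidia-smi` directly on the GPU node; it should report the driver and CUDA versions and list every GPU:

```shell
# Prints the driver/CUDA versions and one row per detected GPU.
nvidia-smi

# List only the GPU models and their UUIDs.
nvidia-smi -L
```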
### Installing the NVIDIA Container Toolkit

Refer to the [installation guide of the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

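Because CDI is a prerequisite, after installing the toolkit you can generate a CDI specification for the node's GPUs with `nvidia-ctk` and confirm which devices it exposes. The output path below is the tool's conventional default; adjust it to match your runtime's CDI configuration:

```shell
# Generate the CDI specification describing the GPUs on this node.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the CDI device names the runtime can now resolve.
nvidia-ctk cdi list
```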
### Downloading the cluster plugin

:::info

The `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

:::

### Uploading the cluster plugin

For more information on uploading the cluster plugin, please refer to <ExternalSiteLink name="acp" href="ui/cli_tools/index.html#uploading-cluster-plugins" children="Uploading Cluster Plugins" />.

### Installing Alauda Build of NVIDIA DRA Driver for GPUs

1. Add the label `nvidia-device-enable=pgpu-dra` to your GPU node so that the `nvidia-dra-driver-gpu-kubelet-plugin` can be scheduled onto it:

   ```bash
   kubectl label nodes {nodeid} nvidia-device-enable=pgpu-dra
   ```

   :::info
   **Note: On the same node, you can only set one of the following labels: `gpu=on`, `nvidia-device-enable=pgpu`, or `nvidia-device-enable=pgpu-dra`.**
   :::

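   To confirm the label was applied, you can filter nodes by it; only the labeled GPU nodes should be listed:

   ```shell
   # Nodes carrying the DRA label are eligible for the kubelet plugin.
   kubectl get nodes -l nvidia-device-enable=pgpu-dra
   ```
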
2. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and then deploy the `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin.

### Verifying the DRA setup

1. Check the DRA driver and DRA controller pods:

   ```bash
   kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"
   ```

   You should see output similar to:

   ```
   nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1   Running   0   18h
   nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2   Running   0   18h
   ```

2. Verify the ResourceSlice objects:

   ```bash
   kubectl get resourceslices -o yaml
   ```

   For GPU nodes, you should see output similar to:

   ```yaml
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceSlice
   metadata:
     generateName: 192.168.140.59-gpu.nvidia.com-
     name: 192.168.140.59-gpu.nvidia.com-gbl46
     ownerReferences:
     - apiVersion: v1
       controller: true
       kind: Node
       name: 192.168.140.59
       uid: 4ab2c24c-fc35-4c75-bcaf-db038356575c
   spec:
     devices:
     - basic:
         attributes:
           architecture:
             string: Pascal
           brand:
             string: Tesla
           cudaComputeCapability:
             version: 6.0.0
           cudaDriverVersion:
             version: 12.8.0
           driverVersion:
             version: 570.124.6
           pcieBusID:
             string: 0000:00:0b.0
           productName:
             string: Tesla P100-PCIE-16GB
           resource.kubernetes.io/pcieRoot:
             string: pci0000:00
           type:
             string: gpu
           uuid:
             string: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66
         capacity:
           memory:
             value: 16Gi
       name: gpu-0
     driver: gpu.nvidia.com
     nodeName: 192.168.140.59
     pool:
       generation: 1
       name: 192.168.140.59
       resourceSliceCount: 1
   ```
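   You can also confirm that the driver registered its DeviceClass, which the claim template in the next step references by name:

   ```shell
   # The gpu.nvidia.com DeviceClass should appear in the list.
   kubectl get deviceclasses
   ```
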
3. Deploy a workload with DRA.

   :::info
   **Note: Fill in the `selectors` field of the following `ResourceClaimTemplate` resource according to your specific GPU model. You can use the [Common Expression Language (CEL)](https://cel.dev) to select devices based on specific attributes.**
   :::

   Create the spec file:

   ```bash
   cat <<EOF > dra-gpu-test.yaml
   ---
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     name: gpu-template
   spec:
     spec:
       devices:
         requests:
         - name: gpu
           deviceClassName: gpu.nvidia.com
           selectors:
           - cel:
               expression: "device.attributes['gpu.nvidia.com'].productName == 'Tesla P100-PCIE-16GB'" # [!code callout]
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: dra-gpu-workload
   spec:
     tolerations:
     - key: "nvidia.com/gpu"
       operator: "Exists"
       effect: "NoSchedule"
     runtimeClassName: nvidia
     restartPolicy: OnFailure
     resourceClaims:
     - name: gpu-claim
       resourceClaimTemplateName: gpu-template
     containers:
     - name: cuda-container
       image: "ubuntu:22.04"
       command: ["bash", "-c"]
       args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
       resources:
         claims:
         - name: gpu-claim
   EOF
   ```
   Apply the spec:

   ```bash
   kubectl apply -f dra-gpu-test.yaml
   ```

   Obtain the output of the container in the pod:

   ```bash
   kubectl logs pod/dra-gpu-workload -f
   ```

   The output is expected to show the GPU model and UUID from inside the container. Example:

   ```text
   GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66)
   ```
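   When you have confirmed the output, you can remove the test workload together with its claim template:

   ```shell
   # Deletes the Pod and the ResourceClaimTemplate created above,
   # which releases the allocated GPU back to the pool.
   kubectl delete -f dra-gpu-test.yaml
   ```
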