Commit fac01c6

committed
add dra support
1 parent 0fe5762 commit fac01c6

8 files changed

Lines changed: 236 additions & 71 deletions

File tree

docs/en/pgpu_dra/how_to/cdi_enable_containerd.mdx renamed to docs/en/infrastructure_management/device_management/pgpu_dra/how_to/cdi_enable_containerd.mdx

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ CDI (Container Device Interface) provides a standard mechanism for device vendor

CDI support is enabled by default in containerd version 2.0 and later. In earlier versions, starting from 1.7.0, this feature requires manual activation.

-## Steps to Enable CDI in Containerd (1.7.0 <= version < 2.0.0)
+## Steps to Enable CDI in containerd v1.7.x

1. Update containerd configuration.

   Edit the configuration file:
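As a sketch of what this step typically involves (assuming containerd's default configuration path `/etc/containerd/config.toml`, which is not shown in this hunk), the CDI options live under the CRI plugin section:

```toml
# Sketch for containerd 1.7.x; assumes the default config path
# /etc/containerd/config.toml with config version 2.
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Enable resolution of CDI device specifications by the CRI plugin
  enable_cdi = true
  # Directories scanned for CDI spec files (upstream defaults shown)
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
```

After editing, restart containerd (for example, `sudo systemctl restart containerd`) for the change to take effect.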

docs/en/pgpu_dra/how_to/index.mdx renamed to docs/en/infrastructure_management/device_management/pgpu_dra/how_to/index.mdx

File renamed without changes.
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
---
weight: 30
---

# Enable DRA (Dynamic Resource Allocation) and corresponding API groups in Kubernetes

DRA support is enabled by default in Kubernetes 1.34 and later. In earlier versions, starting from 1.32, this feature requires manual activation.

## Steps to Enable DRA in Kubernetes 1.32–1.33

On all master (control plane) nodes:

1. Edit the `kube-apiserver` manifest in `/etc/kubernetes/manifests/kube-apiserver.yaml`:

   ```yaml
   spec:
     containers:
     - command:
       - kube-apiserver
       - --feature-gates=DynamicResourceAllocation=true # required
       - --runtime-config=resource.k8s.io/v1beta1=true # required
       - --runtime-config=resource.k8s.io/v1beta2=true # required
       # ... other flags
   ```

   Note that the `resource.k8s.io/v1beta2` API version is only available starting with Kubernetes 1.33; omit that flag on 1.32.

2. Edit the `kube-controller-manager` manifest in `/etc/kubernetes/manifests/kube-controller-manager.yaml`:

   ```yaml
   spec:
     containers:
     - command:
       - kube-controller-manager
       - --feature-gates=DynamicResourceAllocation=true # required
       # ... other flags
   ```

3. Edit the `kube-scheduler` manifest in `/etc/kubernetes/manifests/kube-scheduler.yaml`:

   ```yaml
   spec:
     containers:
     - command:
       - kube-scheduler
       - --feature-gates=DynamicResourceAllocation=true
       # ... other flags
   ```

4. For the kubelet, edit `/var/lib/kubelet/config.yaml` on all nodes:

   ```yaml
   apiVersion: kubelet.config.k8s.io/v1beta1
   kind: KubeletConfiguration
   featureGates:
     DynamicResourceAllocation: true
   ```

   Restart the kubelet:

   ```bash
   sudo systemctl restart kubelet
   ```
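After updating every component, one quick sanity check is to confirm the feature gate is actually present in the kubelet configuration. The following sketch runs against a sample file it creates itself; on a real node, point it at `/var/lib/kubelet/config.yaml` instead:

```shell
# Write a sample kubelet configuration matching step 4 above;
# on a real node, inspect /var/lib/kubelet/config.yaml instead.
cfg=/tmp/kubelet-config-sample.yaml
cat <<'EOF' > "$cfg"
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
EOF

# Fail loudly if the DRA feature gate is missing or set to false.
if grep -Eq '^[[:space:]]*DynamicResourceAllocation:[[:space:]]*true' "$cfg"; then
  echo "DynamicResourceAllocation: enabled"
else
  echo "DynamicResourceAllocation: NOT enabled" >&2
  exit 1
fi
```

Once the control plane is back up, `kubectl api-versions | grep resource.k8s.io` should also list the API versions you enabled.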

docs/en/pgpu_dra/index.mdx renamed to docs/en/infrastructure_management/device_management/pgpu_dra/index.mdx

File renamed without changes.
Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
---
weight: 20
---

# Installation

## Prerequisites

- **NVIDIA driver v565+**
- **Kubernetes v1.32+**
- **ACP v4.1+**
- **Cluster administrator access to your ACP cluster**
- **CDI must be enabled in the underlying container runtime (such as containerd; see [Enable CDI](how_to/cdi_enable_containerd.mdx))**
- **DRA and the corresponding API groups must be enabled (see [Enable DRA](how_to/k8s_dra_enable.mdx))**

## Procedure

### Installing the NVIDIA driver on your GPU node

Refer to the [installation guide on the NVIDIA official website](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).

### Installing the NVIDIA Container Runtime

Refer to the [installation guide for the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

### Downloading the Cluster plugin

:::info

The `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

:::

### Uploading the Cluster plugin

For more information on uploading the cluster plugin, please refer to <ExternalSiteLink name="acp" href="ui/cli_tools/index.html#uploading-cluster-plugins" children="Uploading Cluster Plugins" />

### Installing Alauda Build of NVIDIA DRA Driver for GPUs

1. Add the label `nvidia-device-enable=pgpu-dra` to your GPU node so that the `nvidia-dra-driver-gpu-kubelet-plugin` pod can be scheduled onto it.

   ```bash
   kubectl label nodes {nodeid} nvidia-device-enable=pgpu-dra
   ```

   :::info
   **Note: On the same node, you can only set one of the following labels: `gpu=on`, `nvidia-device-enable=pgpu`, or `nvidia-device-enable=pgpu-dra`.**
   :::

2. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and deploy the `Alauda Build of NVIDIA DRA Driver for GPUs` cluster plugin.

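Since `gpu=on`, `nvidia-device-enable=pgpu`, and `nvidia-device-enable=pgpu-dra` are mutually exclusive, a small pre-flight check can catch a conflict before the plugin is deployed. This is only a sketch over sample data; in practice, feed it the node's real labels, e.g. via `kubectl get node {nodeid} -o jsonpath='{.metadata.labels}'`:

```shell
# Sample label set for a node (deliberately conflicting, for illustration);
# in practice obtain it with:
#   labels=$(kubectl get node {nodeid} -o jsonpath='{.metadata.labels}')
labels='{"gpu":"on","nvidia-device-enable":"pgpu-dra"}'

# The three GPU-mode labels are mutually exclusive; count how many are set.
count=0
case "$labels" in *'"gpu":"on"'*) count=$((count+1)) ;; esac
case "$labels" in *'"nvidia-device-enable":"pgpu"'*) count=$((count+1)) ;; esac
case "$labels" in *'"nvidia-device-enable":"pgpu-dra"'*) count=$((count+1)) ;; esac

if [ "$count" -gt 1 ]; then
  echo "conflict: more than one GPU mode label is set"
else
  echo "ok"
fi
```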
### Verify the DRA setup

1. Check the DRA driver and DRA controller pods:

   ```bash
   kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"
   ```

   You should get results similar to:

   ```
   nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1   Running   0   18h
   nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2   Running   0   18h
   ```

2. Verify the ResourceSlice objects:

   ```bash
   kubectl get resourceslices -o yaml
   ```

   For GPU nodes, you should see output similar to:

   ```yaml
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceSlice
   metadata:
     generateName: 192.168.140.59-gpu.nvidia.com-
     name: 192.168.140.59-gpu.nvidia.com-gbl46
     ownerReferences:
     - apiVersion: v1
       controller: true
       kind: Node
       name: 192.168.140.59
       uid: 4ab2c24c-fc35-4c75-bcaf-db038356575c
   spec:
     devices:
     - basic:
         attributes:
           architecture:
             string: Pascal
           brand:
             string: Tesla
           cudaComputeCapability:
             version: 6.0.0
           cudaDriverVersion:
             version: 12.8.0
           driverVersion:
             version: 570.124.6
           pcieBusID:
             string: 0000:00:0b.0
           productName:
             string: Tesla P100-PCIE-16GB
           resource.kubernetes.io/pcieRoot:
             string: pci0000:00
           type:
             string: gpu
           uuid:
             string: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66
         capacity:
           memory:
             value: 16Gi
       name: gpu-0
     driver: gpu.nvidia.com
     nodeName: 192.168.140.59
     pool:
       generation: 1
       name: 192.168.140.59
       resourceSliceCount: 1
   ```

3. Deploy workloads with DRA.

   :::info
   **Note: Fill in the `selector` field of the following `ResourceClaimTemplate` resource according to your specific GPU model. You can use the [Common Expression Language (CEL)](https://cel.dev) to select devices based on specific attributes.**
   :::

   Create the spec file:

   ```bash
   cat <<EOF > dra-gpu-test.yaml
   ---
   apiVersion: resource.k8s.io/v1beta1
   kind: ResourceClaimTemplate
   metadata:
     name: gpu-template
   spec:
     spec:
       devices:
         requests:
         - name: gpu
           deviceClassName: gpu.nvidia.com
           selectors:
           - cel:
               expression: "device.attributes['gpu.nvidia.com'].productName == 'Tesla P100-PCIE-16GB'" # [!code callout]
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: dra-gpu-workload
   spec:
     tolerations:
     - key: "nvidia.com/gpu"
       operator: "Exists"
       effect: "NoSchedule"
     runtimeClassName: nvidia
     restartPolicy: OnFailure
     resourceClaims:
     - name: gpu-claim
       resourceClaimTemplateName: gpu-template
     containers:
     - name: cuda-container
       image: "ubuntu:22.04"
       command: ["bash", "-c"]
       args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
       resources:
         claims:
         - name: gpu-claim
   EOF
   ```

   Apply the spec:

   ```bash
   kubectl apply -f dra-gpu-test.yaml
   ```

   Check the output of the container in the pod:

   ```bash
   kubectl logs dra-gpu-workload -f
   ```

   The output is expected to show the GPU UUID from the container. Example:

   ```text
   GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66)
   ```
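If you want to wire this check into a script, the UUID can be extracted from the `nvidia-smi -L` line with `sed`. A self-contained sketch using the sample output above (in practice, capture the line with `kubectl logs dra-gpu-workload`):

```shell
# Sample output line from the container (from the example above);
# in practice: out=$(kubectl logs dra-gpu-workload)
out='GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66)'

# Extract the UUID between "UUID: " and the closing parenthesis.
uuid=$(printf '%s\n' "$out" | sed -n 's/.*UUID: \([^)]*\)).*/\1/p')
echo "$uuid"
# prints: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66
```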

docs/en/pgpu_dra/intro.mdx renamed to docs/en/infrastructure_management/device_management/pgpu_dra/intro.mdx

File renamed without changes.

docs/en/pgpu_dra/how_to/k8s_dra_enable.mdx

Lines changed: 0 additions & 7 deletions
This file was deleted.

docs/en/pgpu_dra/install.mdx

Lines changed: 0 additions & 63 deletions
This file was deleted.
