Commit 45af5c0

committed
add ray
1 parent 8702e8c commit 45af5c0

7 files changed

Lines changed: 679 additions & 110 deletions


Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
---
weight: 10
---

# Working with distributed workloads

Distributed workloads enable data scientists to use multiple cluster nodes in parallel for faster and more efficient data processing and model training. The Ray and Kubeflow frameworks simplify task orchestration and monitoring, and offer seamless integration for automated resource scaling and optimal node utilization with advanced GPU support.

<Overview />

docs/en/distributed_workloads/running-ray-based-distributed-workloads.mdx

Lines changed: 252 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
---
weight: 100
---

# Troubleshooting common problems with distributed workloads for users

If you are experiencing errors in Alauda AI relating to distributed workloads, read this section to understand what could be causing the problem and how to resolve it.

If the problem is not documented here or in the release notes, contact Alauda Support.
## My Ray cluster is in a suspended state

### Problem

The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.

### Diagnosis

The Ray cluster head pod or worker pods remain in a suspended state.

### Resolution

1. In **Administrator** view, navigate to **Clusters** -> **Resources**.
2. Check the workload resource:
   i. Click **Search**, and from the **Resources** list, select **Workload**.
   ii. Select the workload resource that is created with the Ray cluster resource, and click the **YAML** tab.
   iii. Check the text in the `status.conditions.message` field, which provides the reason for the suspended state, as shown in the following example:

   ```yaml
   status:
     conditions:
       - lastTransitionTime: '2024-05-29T13:05:09Z'
         message: "couldn't assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue"
   ```

3. Check the Ray cluster resource:
   i. Click **Search**, and from the **Resources** list, select **RayCluster**.
   ii. Select the Ray cluster resource, and click the **YAML** tab.
   iii. Check the text in the `status.conditions.message` field.

4. Check the cluster queue resource:
   i. Click **Search**, and from the **Resources** list, select **ClusterQueue**.
   ii. Check your cluster queue configuration to ensure that the resources that you requested are within the limits defined for the project (see the sketch after this list for where these limits appear).
   iii. Either reduce your requested resources, or contact your administrator to request more resources.
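For reference, the quota that a suspended workload failed against is defined in the ClusterQueue spec. The following is a minimal sketch of such a definition, assuming the `default-flavor` resource flavor from the example message above; the queue name and quota values are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue            # illustrative name
spec:
  namespaceSelector: {}          # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 9
            - name: "memory"
              nominalQuota: 36Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 1    # requests above this quota leave workloads suspended
```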
## My Ray cluster is in a failed state

### Problem

You might have insufficient resources.

### Diagnosis

The Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a `failed` state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.

### Resolution

If the failed state persists, complete the following steps:

1. In **Administrator** view, navigate to **Clusters** -> **Resources**.
2. Click **Search**, and from the **Resources** list, select **Pod**.
3. Click your pod name to open the pod details page.
4. Click the **Events** tab, and review the pod events to identify the cause of the problem.
5. If you cannot resolve the problem, contact your administrator to request assistance.
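If you have `kubectl` access to the cluster, you can review the same pod events from the command line; the pod name and namespace below are placeholders:

```bash
# List the Ray cluster pods in your project namespace
kubectl get pods -n <project_namespace> -l ray.io/cluster=<ray_cluster_name>

# Review the events of a pod that is not running
kubectl describe pod <ray_head_pod_name> -n <project_namespace>
```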
## I see a "failed to call webhook" error message for Kueue

### Problem

After you run the `cluster.apply()` command, the following error is shown:

```
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
```

### Diagnosis

The Kueue pod might not be running.

### Resolution

Contact your administrator to request assistance.
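Before escalating, if you have `kubectl` access you can confirm the diagnosis yourself. The sketch below assumes that Kueue runs in the namespace shown in the error message (`redhat-ods-applications` in this example); substitute the namespace used by your installation:

```bash
# "no endpoints available" means the webhook service has no ready backing pods
kubectl get endpoints kueue-webhook-service -n redhat-ods-applications

# Check whether the Kueue controller pod is running
kubectl get pods -n redhat-ods-applications | grep kueue
```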
## My Ray cluster does not start

### Problem

After you run the `cluster.apply()` command, when you run either the `cluster.details()` command or the `cluster.status()` command, the Ray cluster remains in the `Starting` status instead of changing to the `Ready` status. No pods are created.
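For context, the commands above come from the Python SDK used to manage the Ray cluster. The following is a minimal sketch, assuming the CodeFlare SDK (which provides `cluster.apply()`, `cluster.details()`, and `cluster.status()`); `cluster` is the object you constructed from your cluster configuration:

```python
# Apply the Ray cluster resource, then poll its status
cluster.apply()

cluster.details()                  # prints the cluster specification and current state
status, ready = cluster.status()   # in this failure mode, the status stays "Starting"

# Optionally block until the cluster is ready, failing after a timeout (in seconds)
cluster.wait_ready(timeout=300)
```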
### Diagnosis

1. In **Administrator** view, navigate to **Clusters** -> **Resources**.
2. Check the workload resource:
   i. Click **Search**, and from the **Resources** list, select **Workload**.
   ii. Select the workload resource that is created with the Ray cluster resource, and click the **YAML** tab.
   iii. Check the text in the `status.conditions.message` field, which provides the reason for remaining in the `Starting` state.

3. Check the Ray cluster resource:
   i. Click **Search**, and from the **Resources** list, select **RayCluster**.
   ii. Select the Ray cluster resource, and click the **YAML** tab.
   iii. Check the text in the `status.conditions.message` field.

### Resolution

If you cannot resolve the problem, contact your administrator to request assistance.
## I see a "Default Local Queue not found" error message
110+
111+
### Problem
112+
113+
After you run the `cluster.apply()` command, the following error is shown:
114+
115+
```
116+
Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
117+
```
118+
119+
### Diagnosis
120+
121+
No default local queue is defined, and a local queue is not specified in the cluster configuration.
122+
123+
### Resolution
124+
125+
1. In **Administrator** view, navigate to **Clusters** -> **Resources**.
126+
2. Click **Search**, and from the **Resources** list, select **LocalQueue**.
127+
3. Resolve the problem in one of the following ways:
128+
- If a local queue exists, add it to your cluster configuration as follows:
129+
```
130+
local_queue="<local_queue_name>"
131+
```
132+
- If no local queue exists, contact your administrator to request assistance.
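As an illustration, this is where the `local_queue` value fits in the cluster configuration. The sketch assumes the CodeFlare SDK; the cluster name, namespace, and worker count are placeholders:

```python
from codeflare_sdk import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name="raytest",                    # illustrative cluster name
    namespace="<project_name>",        # must match the namespace of the local queue
    num_workers=2,
    local_queue="<local_queue_name>",  # the LocalQueue found in the previous step
))

cluster.apply()
```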
## I see a "local_queue provided does not exist" error message

### Problem

After you run the `cluster.apply()` command, the following error is shown:

```
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
```

### Diagnosis

An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.

### Resolution

1. In **Administrator** view, navigate to **Clusters** -> **Resources**.
2. Click **Search**, and from the **Resources** list, select **LocalQueue**.
3. Resolve the problem in one of the following ways:
   - If a local queue exists, ensure that you spelled the local queue name correctly in your cluster configuration, and that the `namespace` value in the cluster configuration matches your project name. If you do not specify a `namespace` value in the cluster configuration, the Ray cluster is created in the current project.
   - If no local queue exists, contact your administrator to request assistance.
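If you have `kubectl` access, you can also list the local queues that exist in your project namespace directly; the namespace below is a placeholder:

```bash
# LocalQueue resources are namespaced; confirm the queue exists in *your* project
kubectl get localqueues -n <project_namespace>
```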
## My pod provisioned by Kueue is terminated before my image is pulled

### Problem

Before marking a workload as ready, Kueue waits for a period of time for all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.

### Diagnosis

1. In **Administrator** view, navigate to **Clusters** -> **Resources**.
2. Click **Search**, and from the **Resources** list, select **Pod**.
3. Click the Ray head pod name to open the pod details page.
4. Click the **Events** tab, and review the pod events to check whether the image pull completed successfully.

### Resolution

If the pod takes more than 5 minutes to pull the image, contact your administrator to request assistance.
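For administrators: the waiting period described above corresponds to Kueue's `waitForPodsReady` setting. The fragment below is a sketch of the relevant part of the Kueue configuration; where this configuration lives (for example, in a ConfigMap) depends on how Kueue is installed in your cluster:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 10m   # raise the default of 5m if large images need longer to pull
```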

docs/en/kuberay/index.mdx

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
weight: 83
---

# Alauda Build of KubeRay Operator

<Overview />

docs/en/kuberay/install.mdx

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
---
weight: 20
---

# Installation

## Prerequisites

- **ACP version: v4.0 or later**

### Downloading the Cluster Plugin

:::info

The `Alauda Build of KubeRay Operator` cluster plugin can be retrieved from the Customer Portal.

Please contact Customer Support for more information.

:::

### Uploading the Cluster Plugin

The platform provides the **`violet`** command-line tool for uploading packages downloaded from the Customer Portal Marketplace.

### Installing Alauda Build of KubeRay Operator

1. Go to the `Administrator` -> `Marketplace` -> `Cluster Plugin` page, switch to the target cluster, and then deploy the `Alauda Build of KubeRay Operator` cluster plugin.

2. Verify the installation by checking the pod status:

   ```bash
   kubectl get pods -n cpaas-system | grep "kuberay-operator"
   ```

   You should see the KubeRay operator pods running.

docs/en/kuberay/intro.mdx

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
---
weight: 10
---

# Introduction

Alauda Build of KubeRay Operator is a Kubernetes-native system that provides a comprehensive solution for running [Ray](https://github.com/ray-project/ray) applications on Kubernetes. Built on the open-source [KubeRay](https://github.com/ray-project/kuberay) project, it simplifies the deployment and management of Ray clusters, jobs, and services using Kubernetes Custom Resource Definitions (CRDs).

## Overview

Alauda Build of KubeRay Operator provides three core CRDs (see the example manifest after this list):

- **RayCluster**: Fully manages the lifecycle of Ray clusters, including cluster creation/deletion, autoscaling, and fault tolerance.
- **RayJob**: Automatically creates a RayCluster and submits jobs when the cluster is ready. Supports automatic cleanup after job completion.
- **RayService**: Manages Ray Serve deployments with zero-downtime upgrades and high availability for production ML model serving.
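To make the RayCluster CRD concrete, the following is a minimal sketch of a RayCluster manifest; the name, Ray version, image tag, and resource values are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample      # illustrative name
spec:
  rayVersion: "2.9.0"          # illustrative Ray version
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
  workerGroupSpecs:
    - groupName: small-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 3           # the operator can scale workers within these bounds
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
```

Applying such a manifest (for example, with `kubectl apply -f raycluster.yaml`) creates a head pod and one worker pod that the operator then manages.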
## Key Features

- **Autoscaling**: Automatically adjusts the number of worker nodes based on workload requirements.
- **Heterogeneous Compute**: Supports GPU and other accelerator resources for distributed training and inference.
- **Multiple Ray Versions**: Run different Ray versions in the same Kubernetes cluster.
- **Fault Tolerance**: Provides built-in mechanisms for handling node failures and job retries.
- **Kubernetes Integration**: Seamlessly integrates with existing Kubernetes tools and workflows.
- **Ecosystem Support**: Works with observability tools (Prometheus, Grafana), queuing systems (Kueue, Volcano), and ingress controllers.

## Use Cases

- **Distributed Machine Learning**: Scale ML training workloads across multiple nodes.
- **Model Serving**: Deploy and serve ML models at scale using Ray Serve.
- **Batch Inference**: Process large datasets with parallel inference workloads.
- **Hyperparameter Tuning**: Run distributed hyperparameter optimization with Ray Tune.
- **LLM Inference**: Deploy large language models for online inference.

For more details, refer to [Ray on Kubernetes](https://docs.ray.io/en/latest/cluster/kubernetes/index.html).
