diff --git a/.gitignore b/.gitignore index fe07d48..cef94a0 100644 --- a/.gitignore +++ b/.gitignore @@ -13,3 +13,4 @@ .claude CLAUDE.md +.vscode/extensions.json diff --git a/docs/en/distributed_workloads/running-ray-based-distributed-workloads.mdx b/docs/en/distributed_workloads/running-ray-based-distributed-workloads.mdx index c1e24ee..962337a 100644 --- a/docs/en/distributed_workloads/running-ray-based-distributed-workloads.mdx +++ b/docs/en/distributed_workloads/running-ray-based-distributed-workloads.mdx @@ -20,7 +20,7 @@ The demo Jupyter notebooks from the CodeFlare SDK provide guidelines on how to u #### Prerequisites -- You can access a data science cluster that is configured to run distributed workloads as described in [Managing distributed workloads](../kueue/how_to/config_quotas.mdx). +- You can access a data science cluster that is configured to run distributed workloads as described in [Managing distributed workloads](../managing/managing-distributed-workloads.mdx). - You can access a namespace in Alauda AI, create a workbench, and run a default workbench image that contains the CodeFlare SDK, for example, the **Standard Data Science** notebook. For information about creating workbenches, see [Create Workbench](../workbench/how_to/create_workbench.mdx). - You have logged in to Alauda AI, started your workbench, and logged in to JupyterLab. @@ -64,7 +64,7 @@ In the examples in this procedure, you edit the demo Jupyter notebooks in Jupyte #### Prerequisites -- You can access a data science cluster that is configured to run distributed workloads as described in [Managing distributed workloads](../kueue/how_to/config_quotas.mdx). +- You can access a data science cluster that is configured to run distributed workloads as described in [Managing distributed workloads](../managing/managing-distributed-workloads.mdx). 
- You have installed the `Alauda Build of KubeRay Operator` cluster plugin in your data science cluster, see [Install Alauda Build of KubeRay Operator](../kuberay/install.mdx). - You can access the following software from your data science cluster: - A Ray cluster image that is compatible with your hardware architecture @@ -175,7 +175,7 @@ The `3_widget_example.ipynb` demo Jupyter notebook shows all of the available in #### Prerequisites -- You can access a data science cluster that is configured to run distributed workloads as described in [Managing distributed workloads](../kueue/how_to/config_quotas.mdx). +- You can access a data science cluster that is configured to run distributed workloads as described in [Managing distributed workloads](../managing/managing-distributed-workloads.mdx). - You have installed the `Alauda Build of KubeRay Operator` cluster plugin in your data science cluster, see [Install Alauda Build of KubeRay Operator](../kuberay/install.mdx). - You can access the following software from your data science cluster: - A Ray cluster image that is compatible with your hardware architecture diff --git a/docs/en/distributed_workloads/troubleshooting.mdx b/docs/en/distributed_workloads/troubleshooting.mdx index 932fd30..a3a3df6 100644 --- a/docs/en/distributed_workloads/troubleshooting.mdx +++ b/docs/en/distributed_workloads/troubleshooting.mdx @@ -72,7 +72,7 @@ After you run the `cluster.apply()` command, the following error is shown: ``` ApiException: (500) Reason: Internal Server Error -HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook 
\"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500} +HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.cpaas-system.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.cpaas-system.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500} ``` ### Diagnosis diff --git a/docs/en/managing/index.mdx b/docs/en/managing/index.mdx new file mode 100644 index 0000000..6845cac --- /dev/null +++ b/docs/en/managing/index.mdx @@ -0,0 +1,9 @@ +--- +weight: 75 +--- + +# Managing Alauda AI + +As an Alauda Container Platform cluster administrator, you can manage Alauda AI users and groups, the dashboard interface and applications, deployment resources, accelerators, distributed workloads, and data backup. + + diff --git a/docs/en/managing/managing-distributed-workloads.mdx b/docs/en/managing/managing-distributed-workloads.mdx new file mode 100644 index 0000000..606b421 --- /dev/null +++ b/docs/en/managing/managing-distributed-workloads.mdx @@ -0,0 +1,410 @@ +--- +weight: 20 +--- + +# Managing distributed workloads + +In Alauda AI, distributed workloads like `PyTorchJob`, `RayJob`, and `RayCluster` are created and managed by their respective workload operators. Kueue provides queueing and admission control and integrates with these operators to decide when workloads can run based on cluster-wide quotas. 
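As a minimal sketch of how this interaction looks in practice, a workload opts in to Kueue admission through the `kueue.x-k8s.io/queue-name` label; the workload operator keeps the job suspended until Kueue admits it against the available quota. All names in this example are hypothetical:

```yaml
# Hypothetical example: a PyTorchJob queued through Kueue.
# The queue-name label value must match a LocalQueue in this namespace.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: demo-training-job # hypothetical name
  namespace: demo-project # hypothetical namespace
  labels:
    kueue.x-k8s.io/queue-name: demo-local-queue # hypothetical LocalQueue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/pytorch-training:latest # hypothetical image
              resources:
                requests:
                  cpu: '2'
                  memory: 4Gi
```

Until the referenced `LocalQueue` has quota available, the job's pods are not created; after admission, the training operator runs the job as usual.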
+ +You can perform advanced configuration for your distributed workloads environment, such as configuring quota management. + +## Configuring quota management for distributed workloads + +Configure quotas for distributed workloads by creating Kueue resources. Quotas ensure that you can share resources between several namespaces. + +### Prerequisites + +- You have logged in to Alauda Container Platform with the `cluster-admin` role. +- You have installed the kubectl CLI. +- You have installed the Alauda Build of Kueue cluster plugin as described in [Install Alauda Build of Kueue](../kueue/install.mdx). +- You have installed the required distributed workloads components. This includes installing [Alauda Build of KubeRay](../kuberay/install.mdx) and [Alauda Build of Kubeflow Training Operator](../kubeflow/index.mdx). +- You have created a project that contains a workbench, and the workbench is running a default workbench image that contains the CodeFlare SDK, for example, the **Standard Data Science** workbench. For information about how to create a project, see [Working with projects](../workbench/how_to/create_workbench.mdx). +- You have sufficient resources. In addition to the base Alauda AI resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure. +- The resources are physically available in the cluster. For more information about Kueue resources, see [Alauda Build of Kueue documentation](../kueue/index.mdx). +- If you want to use graphics processing units (GPUs), you have enabled GPU support in Alauda AI. See the GPU device management documentation for details. + +:::note +In Alauda AI, NVIDIA GPU accelerators are supported for distributed workloads. +::: + +### Procedure + +1. Verify that a resource flavor exists or create a custom one, as follows: + 1. Check whether a `ResourceFlavor` already exists: + + ```bash + kubectl get resourceflavors + ``` + + 2. 
If a `ResourceFlavor` already exists and you need to modify it (for example, to change the resources), edit it in place: + + ```bash + kubectl edit resourceflavor <flavor_name> + ``` + + 3. If a `ResourceFlavor` does not exist or you want a custom one, create a file called `default_flavor.yaml` and populate it with the following content: + + **Empty Kueue resource flavor** + + ```yaml + apiVersion: kueue.x-k8s.io/v1beta1 + kind: ResourceFlavor + metadata: + name: <flavor_name> + ``` + + For more examples, see [Example Kueue resource configurations](#example-kueue-resource-configurations-for-distributed-workloads). + + 4. Perform one of the following actions: + - If you are modifying the existing resource flavor, save the changes. + - If you are creating a new resource flavor, apply the configuration to create the `ResourceFlavor` object: + + ```bash + kubectl apply -f default_flavor.yaml + ``` + +2. Verify that a cluster queue exists or create a custom one. + + Check whether a `ClusterQueue` already exists: + + ```bash + kubectl get clusterqueues + ``` + + If a `ClusterQueue` already exists and you need to modify it, edit it in place: + + ```bash + kubectl edit clusterqueue <cluster_queue_name> + ``` + + If a `ClusterQueue` does not exist or you want a custom one, create a file called `cluster_queue.yaml` and populate it with the following content: + + **Example cluster queue** + + ```yaml + apiVersion: kueue.x-k8s.io/v1beta1 + kind: ClusterQueue + metadata: + name: <cluster_queue_name> + spec: + namespaceSelector: {} + resourceGroups: + - coveredResources: ['cpu', 'memory', 'nvidia.com/gpu'] + flavors: + - name: '<flavor_name>' + resources: + - name: 'cpu' + nominalQuota: 9 + - name: 'memory' + nominalQuota: 36Gi + - name: 'nvidia.com/gpu' + nominalQuota: 5 + ``` + + Where: + - `namespaceSelector`: Defines which namespaces can use the resources governed by this cluster queue. An empty `namespaceSelector` as shown in the example means that all namespaces can use these resources.
- `coveredResources`: Defines the resource types governed by the cluster queue. This example `ClusterQueue` object governs CPU, memory, and GPU resources. + - `flavors.name`: Defines the resource flavor that is applied to the resource types listed. + - `resources`: Defines the resource requirements for admitting jobs. The cluster queue will start a distributed workload only if the total required resources are within these quota limits. + + Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. For more examples, see [Example Kueue resource configurations](#example-kueue-resource-configurations-for-distributed-workloads). + + You must specify a quota for each resource that the user can request, even if the requested value is 0, by updating the `spec.resourceGroups` section as follows: + - Include the resource name in the `coveredResources` list. + - Specify the resource `name` and `nominalQuota` in the `flavors.resources` section, even if the `nominalQuota` value is 0. + + Apply the configuration to create the `ClusterQueue` object: + + ```bash + kubectl apply -f cluster_queue.yaml + ``` + +3. Verify that a local queue that points to your cluster queue exists for your namespace, or create a custom one. + + Check whether a `LocalQueue` already exists for your namespace: + + ```bash + kubectl get localqueues -n <namespace> + ``` + + If a `LocalQueue` already exists and you need to modify it (for example, to point to a different `ClusterQueue`), edit it in place: + + ```bash + kubectl edit localqueue <local_queue_name> -n <namespace> + ``` + + If a `LocalQueue` does not exist or you want a custom one, create a file called `local_queue.yaml` and populate it with the following content: + + **Example local queue** + + ```yaml + apiVersion: kueue.x-k8s.io/v1beta1 + kind: LocalQueue + metadata: + name: <local_queue_name> + namespace: <namespace> + spec: + clusterQueue: <cluster_queue_name> + ``` + + Replace the `name`, `namespace`, and `clusterQueue` placeholder values accordingly.
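For illustration, a filled-in `LocalQueue` might look like the following; the `team-a` names are hypothetical and stand in for your own namespace and queue names:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue # hypothetical local queue name
  namespace: team-a # hypothetical user namespace
spec:
  clusterQueue: cluster-queue # must reference an existing ClusterQueue
```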
+ + Apply the configuration to create the `LocalQueue` object: + + ```bash + kubectl apply -f local_queue.yaml + ``` + +### Verification + +Check the status of the local queue in a namespace, as follows: + +```bash +kubectl get localqueues -n <namespace> +``` + +### Additional resources + +- [Alauda Build of Kueue documentation](../kueue/index.mdx) +- [Kueue documentation](https://kueue.sigs.k8s.io/docs/concepts/) + +## Example Kueue resource configurations for distributed workloads + +You can use these example configurations as a starting point for creating Kueue resources to manage your distributed training workloads. + +These examples show how to configure Kueue resource flavors and cluster queues for common distributed training scenarios. + +### NVIDIA GPUs without shared cohort + +#### NVIDIA RTX A400 GPU resource flavor + +```yaml +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ResourceFlavor +metadata: + name: 'a400node' +spec: + nodeLabels: + instance-type: nvidia-a400-node + tolerations: + - key: 'HasGPU' + operator: 'Exists' + effect: 'NoSchedule' +``` + +#### NVIDIA RTX A1000 GPU resource flavor + +```yaml +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ResourceFlavor +metadata: + name: 'a1000node' +spec: + nodeLabels: + instance-type: nvidia-a1000-node + tolerations: + - key: 'HasGPU' + operator: 'Exists' + effect: 'NoSchedule' +``` + +#### NVIDIA RTX A400 GPU cluster queue + +```yaml +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ClusterQueue +metadata: + name: 'a400queue' +spec: + namespaceSelector: {} # match all. + resourceGroups: + - coveredResources: ['cpu', 'memory', 'nvidia.com/gpu'] + flavors: + - name: 'a400node' + resources: + - name: 'cpu' + nominalQuota: 16 + - name: 'memory' + nominalQuota: 64Gi + - name: 'nvidia.com/gpu' + nominalQuota: 2 +``` + +#### NVIDIA RTX A1000 GPU cluster queue + +```yaml +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ClusterQueue +metadata: + name: 'a1000queue' +spec: + namespaceSelector: {} # match all.
+ resourceGroups: + - coveredResources: ['cpu', 'memory', 'nvidia.com/gpu'] + flavors: + - name: 'a1000node' + resources: + - name: 'cpu' + nominalQuota: 16 + - name: 'memory' + nominalQuota: 64Gi + - name: 'nvidia.com/gpu' + nominalQuota: 2 +``` + +### Hami vGPU + +If you use Alauda Build of Hami for vGPU support, you can configure Kueue resources with Hami-specific resource types. + +#### Hami vGPU resource flavor + +```yaml +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ResourceFlavor +metadata: + name: 't4-flavor' +spec: + nodeLabels: + nvidia.com/gpu.product: Tesla-T4 +``` + +#### Hami vGPU cluster queue + +This example shows a cluster queue that manages both standard resources (CPU, memory, pods) and Hami vGPU resources. + +```yaml +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ClusterQueue +metadata: + name: cluster-queue +spec: + namespaceSelector: {} + resourceGroups: + - coveredResources: ['cpu', 'memory', 'pods'] + flavors: + - name: 'default-flavor' + resources: + - name: 'cpu' + nominalQuota: 9 + - name: 'memory' + nominalQuota: 36Gi + - name: 'pods' + nominalQuota: 5 + - coveredResources: + [ + 'nvidia.com/gpualloc', + 'nvidia.com/total-gpucores', + 'nvidia.com/total-gpumem', + ] + flavors: + - name: 't4-flavor' + resources: + - name: 'nvidia.com/gpualloc' + nominalQuota: '20' + - name: 'nvidia.com/total-gpucores' + nominalQuota: '300' + - name: 'nvidia.com/total-gpumem' + nominalQuota: '20480' +``` + +Where: + +- `nvidia.com/gpualloc`: The maximum number of vGPU allocations. +- `nvidia.com/total-gpucores`: The maximum total GPU cores that can be allocated. +- `nvidia.com/total-gpumem`: The maximum total GPU memory (in MiB) that can be allocated. 
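The queue-level names above (`nvidia.com/gpualloc`, `nvidia.com/total-gpucores`, `nvidia.com/total-gpumem`) are aggregate quota counters. For a rough sketch of the consuming side, a workload pod requests vGPU slices with pod-level resource names; the names below follow upstream HAMi conventions (`nvidia.com/gpu`, `nvidia.com/gpucores`, `nvidia.com/gpumem`) and are assumptions here, as your build may use different names:

```yaml
# Hypothetical pod requesting one vGPU slice; resource names follow
# upstream HAMi conventions and are assumptions, not verified for this build.
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue # hypothetical LocalQueue
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-runtime-ubuntu22.04
      command: ['sleep', 'infinity']
      resources:
        limits:
          nvidia.com/gpu: 1 # number of vGPUs
          nvidia.com/gpucores: 30 # percent of one physical GPU's cores
          nvidia.com/gpumem: 2048 # GPU memory in MiB
```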
+ +### Additional resources + +- [Alauda Build of Kueue documentation](../kueue/index.mdx) +- [Resource Flavor](https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/) in the Kueue documentation +- [Cluster Queue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/) in the Kueue documentation + +## Troubleshooting common problems with distributed workloads for administrators + +If your users are experiencing errors in Alauda AI relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve the problem. + +If the problem is not documented here or in the release notes, contact Alauda Support. + +### A user's Ray cluster is in a suspended state + +**Problem** + +The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created. + +**Diagnosis** + +The user's Ray cluster head pod or worker pods remain in a suspended state. Check the status of the `Workload` resource that is created with the `RayCluster` resource. The `status.conditions.message` field provides the reason for the suspended state, as shown in the following example: + +```yaml +status: + conditions: + - lastTransitionTime: '2024-05-29T13:05:09Z' + message: "couldn't assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue" +``` + +**Resolution** + +1. In the Alauda Container Platform console, switch to **Administrator** perspective and select the target cluster from the **Cluster** dropdown. +2. Navigate to **Clusters > Resources**, search for `Workload` in the resource list, and filter by the user's **Namespace**. +3. Select the workload resource, and verify the reason for the suspended status. +4. Search for `ClusterQueue` in the resource list. +5. Select the cluster queue that is referenced in the workload, and verify that the resource flavor exists and has sufficient quota. +6. 
If necessary, increase the quota in the cluster queue, or create the missing resource flavor. + +### A user's Ray cluster is in a failed state + +**Problem** + +You might have insufficient resources. + +**Diagnosis** + +The user's Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a `failed` state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running. + +**Resolution** + +If the failed state persists, complete the following steps: + +1. In the Alauda Container Platform console, switch to **Administrator** perspective and select the target cluster from the **Cluster** dropdown. +2. Navigate to **Clusters > Resources**, search for `Pod` in the resource list, and filter by the user's **Namespace**. +3. Click the pod name to open the pod details page. +4. Click the **Events** tab, and review the pod events to identify the cause of the problem. +5. If you cannot resolve the problem, contact Alauda Support. + +### A user's Ray cluster does not start + +**Problem** + +After the user creates a Ray cluster, the cluster remains in the `Starting` status instead of changing to the `Ready` status. + +**Diagnosis** + +1. In the Alauda Container Platform console, switch to **Administrator** perspective and select the target cluster from the **Cluster** dropdown. +2. Navigate to **Clusters > Resources**, search for `Workload` in the resource list, and filter by the user's **Namespace**. +3. Select the workload resource that is created with the Ray cluster resource, and click the **YAML** tab. +4. Check the text in the `status.conditions.message` field, which provides the reason for remaining in the `Starting` state. + +**Resolution** + +1. Check whether the local queue exists and is correctly configured. +2. Verify that the cluster queue has sufficient quota for the requested resources. +3.
If you cannot resolve the problem, contact Alauda Support. + +### A user cannot create a Ray cluster or submit jobs + +**Problem** + +The user does not have the required permissions to create a Ray cluster or submit jobs. + +**Diagnosis** + +The user receives an authorization error when trying to create a Ray cluster or submit jobs. + +**Resolution** + +1. Verify that the user has the required RBAC permissions to create Ray clusters in their namespace. +2. Ensure the user has at least the following permissions: + - Create, get, list, watch `rayclusters.ray.io` + - Create, get, list, watch `rayjobs.ray.io` + - Create, get, list, watch `workloads.kueue.x-k8s.io` +3. If necessary, grant the user the appropriate role or cluster role. diff --git a/docs/en/managing/managing-workloads-with-kueue.mdx b/docs/en/managing/managing-workloads-with-kueue.mdx new file mode 100644 index 0000000..6f26af5 --- /dev/null +++ b/docs/en/managing/managing-workloads-with-kueue.mdx @@ -0,0 +1,217 @@ +--- +weight: 10 +--- + +# Managing workloads with Kueue + +As a cluster administrator, you can manage AI and machine learning workloads at scale by integrating Alauda Build of Kueue with Alauda AI. This integration provides capabilities for quota management, resource allocation, and prioritized job scheduling. + +## Overview of managing workloads with Kueue + +You can use Kueue in Alauda AI to manage AI and machine learning workloads at scale. Kueue controls how cluster resources are allocated and shared through hierarchical quota management, dynamic resource allocation, and prioritized job scheduling. These capabilities help prevent cluster contention, ensure fair access across teams, and optimize the use of heterogeneous compute resources, such as hardware accelerators. + +Kueue lets you schedule diverse workloads, including distributed training jobs (`RayJob`, `RayCluster`, `PyTorchJob`), workbenches (`Notebook`), and model serving (`InferenceService`).
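For example, a `RayJob` is queued like any other supported kind, by labeling it with the target `LocalQueue` through the `kueue.x-k8s.io/queue-name` label. This is a hedged sketch; all names are hypothetical:

```yaml
# Hypothetical RayJob managed by Kueue; Kueue delays creation of the
# underlying RayCluster until quota in the referenced LocalQueue is available.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: demo-rayjob # hypothetical name
  namespace: demo-project # hypothetical namespace
  labels:
    kueue.x-k8s.io/queue-name: demo-local-queue # hypothetical LocalQueue
spec:
  entrypoint: python -c "import ray; ray.init(); print('ready')"
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0 # hypothetical image tag
              resources:
                requests:
                  cpu: '1'
                  memory: 2Gi
```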
+ +Using Kueue in Alauda AI provides these benefits: + +- Prevents resource conflicts and prioritizes workload processing +- Manages quotas across teams and namespaces +- Ensures consistent scheduling for all workload types +- Maximizes GPU and other specialized hardware utilization + +### How Kueue manages workloads + +Kueue manages workloads through the `kueue.x-k8s.io/queue-name` label. When a workload has this label set to a valid LocalQueue name, Kueue will manage its admission based on the quota defined in the associated ClusterQueue. + +You can create cluster-scoped `ClusterQueue` and namespace-scoped `LocalQueue` resources to manage workloads. Cluster administrators can configure these resources to define quotas and resource allocation policies. + +Kueue supports managing the following workload types: + +| API Group | Kind | +| ------------------------ | -------------------------------------------------------- | +| batch | Job | +| kubeflow.org | MPIJob, PyTorchJob, TFJob, XGBoostJob, PaddleJob, JAXJob | +| trainer.kubeflow.org | TrainJob | +| ray.io | RayJob, RayCluster | +| jobset.x-k8s.io | JobSet | +| leaderworkerset.x-k8s.io | LeaderWorkerSet | +| workload.codeflare.dev | AppWrapper | +| core | Pod | +| apps | Deployment, StatefulSet | + +### Default LocalQueue + +Kueue supports automatic queue assignment through the `LocalQueueDefaulting` feature. When a LocalQueue named `default` exists in a namespace, any new workload created in that namespace without the `kueue.x-k8s.io/queue-name` label will automatically have the label set to `default`. + +This feature allows batch administrators to enforce quota management for all workloads in a namespace without requiring users to explicitly specify a queue name. + +:::note + +- If a namespace has a LocalQueue named `default`, workloads without a queue-name label will automatically use it. 
+- If a namespace does not have a LocalQueue named `default`, workloads without the `kueue.x-k8s.io/queue-name` label will not be managed by Kueue. + ::: + +**Additional resources** + +- [Alauda Build of Kueue documentation](../kueue/index.mdx) + +### Kueue workflow + +Managing workloads with Kueue in Alauda AI involves tasks for Alauda Container Platform cluster administrators, Alauda AI administrators, and machine learning (ML) engineers or data scientists: + +**Cluster administrator** + +Installs and configures Kueue: + +1. Installs the Alauda Build of Kueue cluster plugin on the cluster, as described in [Install Alauda Build of Kueue](../kueue/install.mdx). +2. Configures quotas to optimize resource allocation for user workloads, as described in [Configuring quotas](../kueue/how_to/config_quotas.mdx). + +:::note +For workloads to be managed by Kueue, either add the `kueue.x-k8s.io/queue-name` label to the workload, or create a LocalQueue named `default` in the namespace to enable automatic queue assignment. +::: + +**Alauda AI administrator** + +Prepares the Alauda AI environment: + +1. Creates Kueue-enabled hardware profiles so that users can submit workloads from the Alauda AI dashboard. + +**ML Engineer or data scientist** + +Submits workloads to the queuing system: + +1. For workloads created from the Alauda AI dashboard, such as workbenches and model servers, selects a Kueue-enabled hardware profile during creation. +2. For workloads created by using a command-line interface or an SDK, such as distributed training jobs, adds the `kueue.x-k8s.io/queue-name` label to the workload's YAML manifest and sets its value to the target `LocalQueue` name. + +## Configuring workload management with Kueue + +To use workload queuing in Alauda AI, install the Alauda Build of Kueue cluster plugin and configure the Kueue resources. + +### Prerequisites + +- You have cluster administrator privileges for your Alauda Container Platform cluster. 
+- You have installed the kubectl CLI. + +### Procedure + +1. Install the Alauda Build of Kueue cluster plugin on your Alauda Container Platform cluster as described in [Install Alauda Build of Kueue](../kueue/install.mdx). + +2. Verify that the Alauda Build of Kueue pods are running: + + ```bash + kubectl get pods -n cpaas-system | grep kueue + ``` + + You should see output similar to the following example: + + ``` + kueue-controller-manager-d9fc745df-ph77w 1/1 Running + ``` + +3. Configure quotas by creating `ResourceFlavor`, `ClusterQueue`, and `LocalQueue` objects. For details, see [Configuring quotas](../kueue/how_to/config_quotas.mdx). + +### Next steps + +- For advanced quota configuration examples, see [Example Kueue resource configurations](./managing-distributed-workloads.mdx#example-kueue-resource-configurations-for-distributed-workloads). + +## Troubleshooting common problems with Kueue + +If your users are experiencing errors in Alauda AI relating to Kueue workloads, read this section to understand what could be causing the problem, and how to resolve the problem. + +If the problem is not documented here or in the release notes, contact Alauda Support. 
+ +### A user receives a "failed to call webhook" error message for Kueue + +**Problem** + +After the user runs the `cluster.apply()` command, the following error is shown: + +``` +ApiException: (500) +Reason: Internal Server Error +HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.cpaas-system.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.cpaas-system.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500} +``` + +**Diagnosis** + +The Kueue pod might not be running. + +**Resolution** + +1. In the Alauda Container Platform console, switch to **Administrator** perspective and select the target cluster from the **Cluster** dropdown. +2. Navigate to **Clusters > Resources**, search for `Pod` in the resource list. +3. Filter by **Namespace** `cpaas-system` and search for `kueue` to find the Kueue pod. Verify that the Kueue pod is running. If necessary, restart the Kueue pod. +4. Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example: + + ``` + {"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443} + ``` + +### A user receives a "Default Local Queue ... 
not found" error message + +**Problem** + +After the user runs the `cluster.apply()` command, the following error is shown: + +``` +Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration. +``` + +**Diagnosis** + +No default local queue is defined, and a local queue is not specified in the cluster configuration. + +**Resolution** + +1. Check whether a local queue exists in the user's namespace, as follows: + 1. In the Alauda Container Platform console, switch to **Administrator** perspective and select the target cluster from the **Cluster** dropdown. + 2. Navigate to **Clusters > Resources**, search for `LocalQueue` in the resource list, and filter by the user's **Namespace**. + 3. If no local queues are found, create a local queue. + 4. Provide the user with the details of the local queues in their namespace, and advise them to add a local queue to their cluster configuration. + +2. Define a default local queue. + + For information about creating a local queue and defining a default local queue, see [Configuring quotas](../kueue/how_to/config_quotas.mdx). + +### A user receives a "local_queue provided does not exist" error message + +**Problem** + +After the user runs the `cluster.apply()` command, the following error is shown: + +``` +local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration. +``` + +**Diagnosis** + +An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace. + +**Resolution** + +1. In the Alauda Container Platform console, switch to **Administrator** perspective and select the target cluster from the **Cluster** dropdown. + 1. 
Navigate to **Clusters > Resources**, search for `LocalQueue` in the resource list, and filter by the user's **Namespace**. + 2. Resolve the problem in one of the following ways: + - If no local queues are found, create a local queue. + - If one or more local queues are found, provide the user with the details of the local queues in their namespace. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the `namespace` value in the cluster configuration matches their namespace name. + + 3. Define a default local queue. + + For information about creating a local queue and defining a default local queue, see [Configuring quotas](../kueue/how_to/config_quotas.mdx). + +### The pod provisioned by Kueue is terminated before the image is pulled + +**Problem** + +Kueue waits for a set period of time for all of the workload pods to become provisioned and running before marking the workload as ready. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods. + +**Diagnosis** + +1. In the Alauda Container Platform console, switch to **Administrator** perspective and select the target cluster from the **Cluster** dropdown. +2. Navigate to **Clusters > Resources**, search for `Pod` in the resource list, and filter by the user's **Namespace**. +3. Click the Ray head pod name to open the pod details page. +4. Click the **Events** tab, and review the pod events to check whether the image pull completed successfully. + +**Resolution** + +If the pod takes more than 5 minutes to pull the image, contact Alauda Support.
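In upstream Kueue, the 5-minute window comes from the `waitForPodsReady` setting in the controller's `Configuration` (typically held in the `kueue-manager-config` ConfigMap). Whether and how the Alauda Build of Kueue cluster plugin exposes this setting is an assumption here, but the relevant fragment looks like this:

```yaml
# Fragment of the upstream Kueue controller Configuration; exposure of this
# setting in the Alauda Build of Kueue is an assumption, not verified.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 15m # extend the default 5m to accommodate large image pulls
```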