Add kueue gap#165

Open
luohua13 wants to merge 1 commit into add-kuberay from fix/kueue-gap

luohua13 (Contributor) commented Mar 24, 2026

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guides for distributed workloads including Ray/KubeRay setup, configuration, and troubleshooting.
    • Added Kueue quota management documentation with configuration procedures, GPU examples, and troubleshooting guides.
    • Added integration guides for multiple workload types (JobSet, LeaderWorkerSet, PyTorchJob, RayJob) with advanced features like preemption, borrowing/lending, and label-based queue enforcement.

Add comprehensive Kueue documentation benchmarked against Red Hat OpenShift AI,
including new sections for managing workloads with Kueue, managing distributed
workloads, and multiple new how-to guides covering preemption, borrowing/lending,
label policies, job submission, JobSet and LeaderWorkerSet integrations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coderabbitai bot commented Mar 24, 2026

Caution

Review failed

Failed to post review comments

Walkthrough

This pull request adds comprehensive documentation covering distributed workloads orchestration with Ray/KubeRay and workload management using Kueue within Alauda AI. The documentation spans setup guides, procedural walkthroughs, troubleshooting references, and configuration examples across four main documentation sections: distributed workloads, KubeRay operator, Kueue workload management, and managing distributed workloads.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Distributed Workloads Overview & Guide: docs/en/distributed_workloads/index.mdx, docs/en/distributed_workloads/running-ray-based-distributed-workloads.mdx, docs/en/distributed_workloads/troubleshooting.mdx | New documentation section introducing distributed workloads, providing step-by-step procedures for running Ray-based workloads from Jupyter notebooks using CodeFlare SDK, managing Ray clusters interactively, and troubleshooting common failures related to RayCluster resources, Kueue webhooks, and image pull timing issues. |
| KubeRay Operator Documentation: docs/en/kuberay/index.mdx, docs/en/kuberay/intro.mdx, docs/en/kuberay/install.mdx | New documentation section for Alauda Build of KubeRay Operator covering introduction to KubeRay CRDs (RayCluster, RayJob, RayService), feature overview (autoscaling, heterogeneous compute, fault tolerance), installation prerequisites and procedures using the violet CLI tool and marketplace UI. |
| Kueue How-To Guides: docs/en/kueue/how_to/borrowing_lending.mdx, docs/en/kueue/how_to/jobset.mdx, docs/en/kueue/how_to/label_policy.mdx, docs/en/kueue/how_to/leader_worker_set.mdx, docs/en/kueue/how_to/preemption.mdx, docs/en/kueue/how_to/submit_jobs.mdx | New how-to guides for Kueue covering quota borrowing/lending between ClusterQueues, integration with JobSet and LeaderWorkerSet, label-based queue policy configuration, preemption policy settings, and job submission procedures for Job, RayJob, RayCluster, and PyTorchJob workloads. |
| Kueue Core Documentation Updates: docs/en/kueue/install.mdx, docs/en/kueue/intro.mdx | Minor additions to existing Kueue documentation including a "Next steps" link to the configuration guide and a cross-reference to the Kueue overview guide for AI/ML workload management. |
| Managing Distributed Workloads Documentation: docs/en/managing_distributed_workloads/index.mdx, docs/en/managing_distributed_workloads/configuring_quota_management.mdx, docs/en/managing_distributed_workloads/example_gpu_configurations.mdx, docs/en/managing_distributed_workloads/troubleshooting.mdx | New administrator-focused documentation section covering quota management setup with ResourceFlavor and ClusterQueue resources, GPU configuration examples (NVIDIA GPUs, HAMi virtual GPUs, mixed configurations), and troubleshooting for suspended clusters, pending admission, resource coverage, and cohort borrowing issues. |
| Managing Workloads with Kueue Section: docs/en/managing_workloads_with_kueue/index.mdx, docs/en/managing_workloads_with_kueue/overview.mdx, docs/en/managing_workloads_with_kueue/configuring.mdx, docs/en/managing_workloads_with_kueue/troubleshooting.mdx | New comprehensive section introducing Alauda Build of Kueue capabilities, providing overview of supported workload types and workflows, end-to-end configuration procedures for ResourceFlavor/ClusterQueue/LocalQueue setup, RBAC configuration, and troubleshooting common admission and scheduling issues. |

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly Related PRs

  • Add Alauda Build of Kueue #71: Adds overlapping Kueue documentation covering intro, installation, and quota management procedures that complement this PR's more comprehensive Kueue how-to guides.
  • Rls 1.5/fix upgrade version #92: Modifies the same Kueue documentation area (docs/en/kueue/install.mdx), creating potential content overlap or interdependencies with this PR's additions.

Suggested Reviewers

  • typhoonzero

Poem

🐰 Ray clusters hop through the Kueue,
With borrowed quotas, fresh and new,
KubeRay orchestrates with grace,
Distributing workloads across the space. 🚀✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'Add kueue gap' is vague and does not clearly describe the main changes in the pull request, which add comprehensive documentation for Kueue, distributed workloads, KubeRay, and related configuration guidance. | Use a more descriptive title that reflects the primary changes, such as 'Add comprehensive Kueue and distributed workloads documentation' or 'Document Kueue quota management and KubeRay integration'. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

luohua13 changed the base branch from master to add-kuberay on March 24, 2026, 10:38
@cloudflare-workers-and-pages

Deploying alauda-ai with Cloudflare Pages

Latest commit: dfa2817
Status: ✅  Deploy successful!
Preview URL: https://6a1c6adc.alauda-ai.pages.dev
Branch Preview URL: https://fix-kueue-gap.alauda-ai.pages.dev
