🎯 A Kubernetes API for orchestrating distributed, stateful AI inference workloads with multi-role collaboration and built-in service discovery.
🌐 Official Website: rolebasedgroup.github.io
| Date | Release | Highlights |
|---|---|---|
| 2026-04-22 | v0.7.0-alpha.3 | v1alpha2 conversion webhooks, CLI multi-node LLM serving |
| 2026-03-31 | v0.7.0-alpha.2 | Pod port allocator, CLI foundations |
| 2026-03-18 | v0.7.0-alpha.1 | v1alpha2 API, coordinated policies, gang scheduling |
| 2026-02-18 | v0.6.0 | Coordinated scaling, stateful InstanceSet |
| 2025-12-03 | v0.5.0 | Native InstanceSet, in-place updates, Mooncake integration |
| 2025-09-23 | v0.4.0 | RBGS scaling, Volcano podgroup support |
Traditional Kubernetes primitives (StatefulSets, Deployments) struggle with LLM inference services, which raise the following challenges:
| Challenge | Description |
|---|---|
| Multi-role topologies | gateway → router → prefill → decode |
| Performance-sensitive | GPU/network topology matters |
| Atomic operations | deploy, upgrade, scale, failover across roles |
RBG treats an inference service as a role-based group: a topology-aware, stateful, coordinated multi-role organism managed as a single unit.
| Concept | Description |
|---|---|
| Role | Basic scheduling and rollout unit. Each role (e.g., prefill, decode) has its own spec, lifecycle, and policies. |
| RoleBasedGroup | A group of roles forming one logical service (e.g., one LLM inference deployment). |
| RoleInstance | A collection of Pods with tightly bound lifecycle. Supports in-place updates and controls upgrades/status for the Pod group. |
| CoordinatedPolicy | A separate CRD for coordinating operations across roles. Controls maxSkew and progression during rolling updates and scaling. |
| Capability | Description |
|---|---|
| Stable | Topology-aware deterministic operations with unique RoleID injection |
| Coordination | Cross-role policy engine: deployment pairing, coordinated upgrades, linked recovery |
| Orchestration | Role dependencies, precise startup sequences, topology self-aware service discovery |
| Performance | Hardware affinity scheduling: GPU-NVLink → PCIe → RDMA → VPC |
| Extensible | Declarative APIs and plugin mechanisms for future architectures |
```shell
helm install rbg-controller oci://registry-1.docker.io/sglproject/rbg-controller-chart --version v0.7.0-alpha.3
```

For detailed instructions, see Installation Guide.
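Once the chart is installed, a check along the following lines can confirm that the controller pieces are visible. This is a sketch rather than an official procedure: the function name `verify_rbg_install` is hypothetical, the release name `rbg-controller` is taken from the install command above, and the CRD group `workloads.x-k8s.io` is taken from the `apiVersion` used in this README's manifests.

```shell
# Hypothetical helper: returns 0 when the Helm release and the RBG CRDs are both visible.
verify_rbg_install() {
  # Look for the release in every namespace; --filter takes a regular expression.
  helm list --all-namespaces --filter 'rbg-controller' | grep -q 'rbg-controller' || {
    echo "Helm release rbg-controller not found" >&2
    return 1
  }
  # The RBG CRDs are assumed to be registered under the workloads.x-k8s.io group.
  kubectl get crd | grep -q 'workloads.x-k8s.io' || {
    echo "no CRDs found in group workloads.x-k8s.io" >&2
    return 1
  }
  echo "RBG controller chart and CRDs look installed"
}
```

Call `verify_rbg_install` after `helm install` returns; a non-zero exit status means one of the two checks failed.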
Deploy a basic RoleBasedGroup with two roles and startup dependencies:
```yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: nginx-cluster
spec:
  roles:
    - name: frontend
      replicas: 1
      standalonePattern:
        template:
          spec:
            containers:
              - name: nginx
                image: nginx:1.14.1
                ports:
                  - containerPort: 80
    - name: backend
      replicas: 3
      dependencies: ["frontend"]  # backend starts after frontend is ready
      standalonePattern:
        template:
          spec:
            containers:
              - name: nginx
                image: nginx:1.14.1
                ports:
                  - containerPort: 8080
```

| Pattern | Used For | Description |
|---|---|---|
| standalonePattern | Single-node deployment | Single pod per instance |
| leaderWorkerPattern | Multi-node distributed deployment | Leader + workers for tensor parallelism |
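To try the two-role manifest above, it can be saved to a file and applied with standard `kubectl` commands. The helper below is a minimal sketch: the function name `apply_rbg` and the file name in the usage note are hypothetical, and `rolebasedgroup` is assumed to be the resource name that `kubectl` resolves from the `RoleBasedGroup` kind.

```shell
# Hypothetical helper: apply an RBG manifest, then list the group and its Pods.
# $1 = path to the manifest file (for example, a file holding the YAML above).
apply_rbg() {
  manifest="$1"
  [ -f "$manifest" ] || { echo "manifest not found: $manifest" >&2; return 1; }
  kubectl apply -f "$manifest" || return 1
  # Assumption: kubectl resolves `rolebasedgroup` from the RoleBasedGroup kind.
  kubectl get rolebasedgroup
  # backend declares a dependency on frontend, so backend Pods are expected
  # to show up only after the frontend Pod is ready.
  kubectl get pods
}
```

For example, `apply_rbg nginx-cluster.yaml`; running `kubectl get pods` again a little later should show the backend Pods once the frontend is ready.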
Reduce configuration duplication with reusable templates:
```yaml
spec:
  roleTemplates:
    - name: base-template
      template:
        spec:
          containers:
            - name: nginx
              image: nginx:1.14.1
  roles:
    - name: frontend
      replicas: 2
      standalonePattern:
        templateRef:
          name: base-template
    - name: backend
      replicas: 3
      standalonePattern:
        templateRef:
          name: base-template
          patch:  # role-specific overrides
            spec:
              containers:
                - name: nginx
                  resources:
                    requests:
                      memory: "128Mi"
```

kubectl-rbg is a CLI tool for managing RBG resources and LLM deployments.
```shell
# Build from source
make build-cli
chmod +x bin/kubectl-rbg
sudo mv bin/kubectl-rbg /usr/local/bin/
```

```shell
# Initialize configuration
kubectl rbg llm config init
# Pull a model
kubectl rbg llm model pull Qwen/Qwen3.5-0.8B
# Deploy as inference service
kubectl rbg llm svc run my-qwen Qwen/Qwen3.5-0.8B
# Chat with the service
kubectl rbg llm svc chat my-qwen
```

For detailed CLI documentation, see kubectl-rbg.
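If `kubectl rbg` reports an unknown command after the build step, the binary is probably not on `PATH`. The sketch below relies on the standard `kubectl plugin list` command, which reports executables named `kubectl-<name>` found on `PATH`; the function name `check_rbg_plugin` is hypothetical.

```shell
# Hypothetical helper: succeed only if kubectl can discover the kubectl-rbg plugin.
check_rbg_plugin() {
  # kubectl treats any executable named kubectl-<name> on PATH as a plugin.
  if kubectl plugin list 2>/dev/null | grep -q 'kubectl-rbg'; then
    echo "kubectl-rbg plugin found"
  else
    echo "kubectl-rbg not found on PATH; re-check the install step" >&2
    return 1
  fi
}
```

Run `check_rbg_plugin` before the `kubectl rbg llm` commands to catch a missing or misplaced binary early.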
SGLang PD-disaggregated examples in examples/inference/:
| Example | Pattern | Description |
|---|---|---|
| pd-disagg-standalone.yaml | standalonePattern | Single pod per role, suitable for single-GPU instances |
| pd-disagg-leader-worker.yaml | leaderWorkerPattern | Multi-GPU tensor parallelism for decode role |
SGLang aggregated examples:
| Example | Pattern | Description |
|---|---|---|
| agg-standalone.yaml | standalonePattern | Single-GPU aggregated inference |
| agg-leader-worker.yaml | leaderWorkerPattern | Multi-GPU tensor parallelism |
RBG integrates with ecosystem components for production LLM inference:
NVIDIA Dynamo is an open-source, datacenter-scale inference stack that orchestrates multi-node AI workloads above inference engines like vLLM and SGLang:
| Example | Description |
|---|---|
| dynamo/pd-disagg.yaml | PD-disaggregated with Dynamo SGLang runtime |
| dynamo/pd-disagg-multi-nodes.yaml | Multi-node PD-disaggregated |
| dynamo/agg.yaml | Aggregated inference with Dynamo |
| dynamo/agg-multi-nodes.yaml | Multi-node aggregated |
Mooncake is a disaggregated architecture for LLM serving, providing KV cache transfer and reuse across distributed inference:
| Example | Description |
|---|---|
| mooncake-store/pd-disagg-kvcache-reuse-with-mooncake.yaml | PD-disaggregated with KV cache reuse |
| mooncake-store/agg-kvcache-reuse-with-mooncake.yaml | Aggregated with KV cache reuse |
| mooncake-transfer-engine/sgl-pd-disagg-with-mooncake-te.yaml | SGLang PD-disaggregated with transfer engine |
| mooncake-transfer-engine/vllm-pd-disagg-with-mooncake-te.yaml | vLLM PD-disaggregated with transfer engine |
| Path | Description |
|---|---|
| rbg/base.yaml | Basic RoleBasedGroup with role dependencies |
| rbg/dependency/ | Role dependency configurations |
| rbg/patterns/ | Deployment patterns: standalone, leader-worker, custom-components |
| rbg/scheduling/ | Gang scheduling: Volcano, scheduler-plugins |
| rbg/update-strategy/ | Rolling update with partition support |
| rbg/restart-policy/ | Restart policy configurations |
| rbg/scaling/ | Scaling adapter with HPA integration |
| rbg/role-template/ | RoleTemplates for reducing duplication |
| coordinated-policy/ | Coordinated rollout and scaling policies |
| engine-runtime/ | Engine runtime profile configurations |
| Path | Description |
|---|---|
| agg-standalone.yaml | Aggregated SGLang (standalone pattern) |
| agg-leader-worker.yaml | Aggregated (leader-worker pattern) |
| pd-disagg-standalone.yaml | Prefill/Decode disaggregated (standalone) |
| pd-disagg-leader-worker.yaml | Prefill/Decode disaggregated (leader-worker) |
| ecosystem/ | NATS, etcd, Dynamo, Mooncake integration |
| ecosystem/dynamo/ | NVIDIA Dynamo runtime examples |
| ecosystem/mooncake/ | Mooncake KV cache transfer engine |
| Source | Link |
|---|---|
| Official Docs | rolebasedgroup.github.io |
| Local Docs | doc/TOC.md |
| RBG Version | Kubernetes | LeaderWorkerSet |
|---|---|---|
| main / v0.7.0-alpha.x | >=v1.22.x | Not Required |
| v0.6.0 | >=v1.28.x | >=v0.7.0 |
| v0.5.0 | >=v1.28.x | >=v0.6.0 |
| v0.4.0 | >=v1.28.x | >=v0.7.0 |
We welcome contributions! See CONTRIBUTING.md for guidelines.
```shell
# Verify copyright headers
make copyright-check
# Add missing headers
make copyright-fix
```

| Channel | Link |
|---|---|
| Slack | #rbg channel |
| Issues | GitHub Issues |
| Discussions | Community Discussions |
This project follows the Kubernetes Code of Conduct.
RBG is inspired by and reuses code from LeaderWorkerSet (LWS).
