Commit f9ef40e: Update docs with sglang deployment
1 parent 7f9c878

4 files changed: 111 additions & 15 deletions

Lines changed: 18 additions & 1 deletion
@@ -1 +1,18 @@
-# Sample http route for GKE Gateway to route traffic to sglang InferencePool
+apiVersion: gateway.networking.k8s.io/v1
+kind: HTTPRoute
+metadata:
+  name: llm-route
+spec:
+  parentRefs:
+  - group: gateway.networking.k8s.io
+    kind: Gateway
+    name: inference-gateway
+  rules:
+  - backendRefs:
+    - group: inference.networking.k8s.io
+      kind: InferencePool
+      name: sgl-llama3-8b-instruct
+    matches:
+    - path:
+        type: PathPrefix
+        value: /
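
A quick way to sanity-check the route once it is applied (a minimal sketch; it assumes the HTTPRoute and Gateway live in the current namespace and that your gateway controller populates the standard status conditions):

```bash
# The route should report Accepted=True and ResolvedRefs=True once the
# Gateway accepts it and the InferencePool backendRef resolves.
kubectl get httproute llm-route -o yaml
kubectl describe gateway inference-gateway
```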

config/manifests/sglang/gpu-deployment.yaml

Lines changed: 29 additions & 12 deletions
@@ -1,33 +1,47 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: sgl-llama3-8b-instruct
+  name: sgl-deepseek-v3
   labels:
-    app: sgl-llama3-8b-instruct
+    app: sgl-deepseek-v3
 spec:
-  replicas: 3
+  replicas: 1
   selector:
     matchLabels:
-      app: sgl-llama3-8b-instruct
+      app: sgl-deepseek-v3
   template:
     metadata:
       labels:
-        app: sgl-llama3-8b-instruct
+        app: sgl-deepseek-v3
     spec:
       containers:
       - name: sglang
-        image: lmsysorg/sglang:latest
+        image: lmsysorg/sglang:v0.5.6.post2-cu129-arm64
         command: ["python3", "-m", "sglang.launch_server"]
         args:
-        - "--model-path=meta-llama/Llama-3.1-8B-Instruct"
+        - "--model-path=nvidia/DeepSeek-V3.1-NVFP4"
         - "--host=0.0.0.0"
         - "--port=8000"
        - "--dtype=bfloat16"
         - "--kv-cache-dtype=auto"
-        - "--tp=1"
-        - "--mem-fraction-static=0.90" # Equivalent to vllm's gpu-memory-utilization
+        - "--mem-fraction-static=0.90"
         - "--trust-remote-code"
         - "--enable-metrics"
+        # Hardware & Backend Optimization
+        - "--tp=4"
+        - "--quantization=modelopt_fp4"
+        - "--attention-backend=trtllm_mla"
+        - "--moe-runner-backend=flashinfer_trtllm"
+        - "--enable-flashinfer-allreduce-fusion"
+        # Scheduling & Traffic Control
+        - "--max-running-requests=256"
+        - "--schedule-low-priority-values-first"
+        - "--enable-priority-scheduling"
+        - "--priority-scheduling-preemption-threshold=1000"
+        # Logging
+        - "--log-requests"
+        - "--log-requests-level=1"
+        - "--enable-request-time-stats-logging"
         env:
         - name: HF_TOKEN
           valueFrom:
@@ -40,7 +54,7 @@ spec:
           name: http
         resources:
           limits:
-            nvidia.com/gpu: 1
+            nvidia.com/gpu: 4
         volumeMounts:
         - name: model-cache
           mountPath: /root/.cache/huggingface
@@ -56,8 +70,7 @@ spec:
           httpGet:
             path: /health_generate
             port: 8000
-          # Give the container 10 minutes (30 * 20s) to download and load weights
-          failureThreshold: 30
+          failureThreshold: 90
           periodSeconds: 20
       volumes:
       - name: model-cache
@@ -68,4 +81,8 @@ spec:
       tolerations:
      - key: "nvidia.com/gpu"
         operator: "Exists"
+        effect: "NoSchedule"
+      - key: "kubernetes.io/arch"
+        operator: "Equal"
+        value: "arm64"
         effect: "NoSchedule"

site-src/_includes/epp-sglang.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+=== "GKE"
+
+    ```bash
+    export GATEWAY_PROVIDER=gke
+    helm install sgl-llama3-8b-instruct \
+    --set inferencePool.modelServers.matchLabels.app=sgl-llama3-8b-instruct \
+    --set provider.name=$GATEWAY_PROVIDER \
+    --version $IGW_CHART_VERSION \
+    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
+    ```
+
+=== "Istio"
+
+    ```bash
+    export GATEWAY_PROVIDER=istio
+    helm install sgl-llama3-8b-instruct \
+    --set inferencePool.modelServers.matchLabels.app=sgl-llama3-8b-instruct \
+    --set provider.name=$GATEWAY_PROVIDER \
+    --version $IGW_CHART_VERSION \
+    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
+    ```
+
+=== "Kgateway"
+
+    ```bash
+    export GATEWAY_PROVIDER=none
+    helm install sgl-llama3-8b-instruct \
+    --set inferencePool.modelServers.matchLabels.app=sgl-llama3-8b-instruct \
+    --set provider.name=$GATEWAY_PROVIDER \
+    --version $IGW_CHART_VERSION \
+    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
+    ```
+
+=== "NGINX Gateway Fabric"
+
+    ```bash
+    export GATEWAY_PROVIDER=none
+    helm install sgl-llama3-8b-instruct \
+    --set inferencePool.modelServers.matchLabels.app=sgl-llama3-8b-instruct \
+    --set provider.name=$GATEWAY_PROVIDER \
+    --version $IGW_CHART_VERSION \
+    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
+    ```
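
After running the tab that matches your provider, you can confirm the release and the resulting InferencePool (a minimal sketch; resource names follow from the release name used above). Note that inferencePool.modelServers.matchLabels.app must match the app label on the model server pods, so keep it in sync with the Deployment manifest.

```bash
# The chart creates an InferencePool and its endpoint picker (EPP).
helm status sgl-llama3-8b-instruct
kubectl get inferencepools.inference.networking.k8s.io sgl-llama3-8b-instruct
kubectl get pods   # the EPP pod should appear alongside the model server pods
```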

site-src/guides/index.md

Lines changed: 21 additions & 2 deletions
@@ -37,13 +37,13 @@ IGW_LATEST_RELEASE=$(curl -s https://api.github.com/repos/kubernetes-sigs/gatewa
 ```
 
 --8<-- "site-src/_includes/model-server-sim.md"
-
+
 ```bash
 kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/${IGW_LATEST_RELEASE}/config/manifests/vllm/sim-deployment.yaml
 ```
 
 --8<-- "site-src/_includes/sglang-gpu.md"
-
+
 ```bash
 kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/${IGW_LATEST_RELEASE}/config/manifests/sglang/gpu-deployment.yaml
 ```
@@ -135,6 +135,11 @@ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extens
 
 --8<-- "site-src/_includes/epp.md"
 
+For sglang deployment:
+
+--8<-- "site-src/_includes/epp-sglang.md"
+
+
 ### Deploy an Inference Gateway
 
 Choose one of the following options to deploy an Inference Gateway.
@@ -280,6 +285,12 @@ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extens
 kubectl describe inferencepools.inference.networking.k8s.io vllm-llama3-8b-instruct
 ```
 
+For sglang deployment:
+
+```bash
+kubectl describe inferencepools.inference.networking.k8s.io sgl-llama3-8b-instruct
+```
+
 Check that the status shows Accepted=True and ResolvedRefs=True. This confirms the InferencePool is ready to handle traffic.
 
 For more information, see the [NGINX Gateway Fabric - Inference Gateway Setup guide](https://docs.nginx.com/nginx-gateway-fabric/how-to/gateway-api-inference-extension/#overview)
@@ -319,6 +330,14 @@ You have now deployed a basic Inference Gateway with a simple routing strategy.
     kubectl delete secret hf-token --ignore-not-found
     ```
 
+    For sglang deployment:
+
+    ```bash
+    helm uninstall sgl-llama3-8b-instruct
+    kubectl delete -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/${IGW_LATEST_RELEASE}/config/manifests/sglang/gpu-deployment.yaml --ignore-not-found
+    kubectl delete secret hf-token --ignore-not-found
+    ```
+
 1. Uninstall the Gateway API Inference Extension CRDs:
 
     ```bash
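
With the model server, InferencePool, HTTPRoute, and Gateway in place, an end-to-end smoke test looks roughly like this (a sketch; it assumes the Gateway publishes an address in its status, that the listener uses port 80, and that the model name matches what the server was launched with):

```bash
# Send a completion request through the Inference Gateway.
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80

curl -i ${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/DeepSeek-V3.1-NVFP4",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }'
```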
