
Commit 9256b6b

feat: add LLM batch cost calculator as interactive tool (#30)
Introduces a Vite multi-page app framework under tools/ for hosting interactive React pages on the Hugo site. The cost calculator is the first tool. Future tools can be added by creating a subfolder and registering it in vite.config.js.
1 parent 61cb683 commit 9256b6b

11 files changed

Lines changed: 2747 additions & 0 deletions
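The `vite.config.js` registration mentioned in the commit message is not among the diffs shown below. For orientation, a multi-page config of this shape might look like the following sketch (entry names and the output path are assumptions inferred from the `.gitignore` and `hugo.yaml` changes, not taken from this commit):

```js
// tools/vite.config.js: hypothetical sketch of the multi-page setup.
import { resolve } from 'node:path';
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  base: '/tools/',               // pages are served under /tools/ on the Hugo site
  build: {
    outDir: '../static/tools',   // Hugo publishes static/ into the built site
    emptyOutDir: true,
    rollupOptions: {
      input: {
        // one entry per tool subfolder; register future tools here
        'batch-cost-calculator': resolve(__dirname, 'batch-cost-calculator/index.html'),
      },
    },
  },
});
```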


.github/workflows/hugo.yaml

Lines changed: 11 additions & 0 deletions

@@ -45,6 +45,17 @@ jobs:
         with:
           submodules: recursive
           fetch-depth: 0
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: 20
+          cache: npm
+          cache-dependency-path: tools/package-lock.json
+      - name: Build interactive tools
+        run: |
+          cd tools
+          npm ci
+          npm run build
       - name: Setup Pages
         id: pages
         uses: actions/configure-pages@v5

.gitignore

Lines changed: 3 additions & 0 deletions

@@ -11,3 +11,6 @@ public/
 
 # Ignore hugo build lock file
 .hugo_build.lock
+
+# Vite build output (generated by tools/ build)
+static/tools/

hugo.yaml

Lines changed: 4 additions & 0 deletions

@@ -119,6 +119,10 @@ menu:
   #   name: Tags
   #   url: /tags/
   #   weight: 20
+  - identifier: tools
+    name: Tools
+    url: /tools/batch-cost-calculator/
+    weight: 20
   - identifier: github
     name: Github
     url: https://github.com/vllm-project/aibrix/

tools/.gitignore

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+node_modules/
Lines changed: 256 additions & 0 deletions
# LLM Batch Cost Calculator

A self-hosted vs cloud API cost comparison tool for LLM batch inference. Helps you answer: **"Is it cheaper to run this workload on my own GPUs or pay a cloud API?"**

---

## How to Use

The dashboard has two sides: **Infrastructure** (left panel, configure once) and **Job** (right panel, change per workload).

### Left Panel — Infrastructure Setup

Work through the steps top to bottom, once per cluster configuration.
**① GPU Setup**

Select your GPU type from the grid. This sets the default market rate (Lambda/RunPod/CoreWeave on-demand, Feb–Mar 2026). Fields to override:

- **GPU Count** — how many GPUs in your cluster
- **$/GPU/hr override** — enter your actual contract rate if different from market defaults

The total rental cost (with overhead) is shown below the fields.

Use the sliders to account for cost overhead (a sketch of how they might combine follows the list):
- **Startup overhead** — percentage of GPU time spent loading the model. Set to 0% if you're running a persistent serving process (always-on). Set to 5–10% for batch jobs that cold-start each time.
- **Interruption rate** — percentage of GPU-hours lost to spot instance preemptions or hardware failures. Near 0% on stable on-prem, 5–15% on spot cloud instances.
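A minimal sketch of how the two sliders could combine into the `overheadMult` used in step ③ (this exact formula is an assumption, not lifted from the tool):

```js
// Startup overhead inflates the GPU time you pay for; interruptions discard a
// fraction of GPU-hours you already paid for.
function overheadMult(startupPct, interruptionPct) {
  return (1 + startupPct / 100) / (1 - interruptionPct / 100);
}

overheadMult(5, 10); // ≈ 1.17 → ~17% more paid GPU-hours than useful work
```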
**② Benchmark Results**

This is the most important step for accuracy. The default throughput numbers are rough estimates for a 70B-class model — replace them with your actual measurements.

Run these two commands on your cluster using [vLLM's benchmark tool](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py):

```bash
# Input tokens/s — measures pure prefill (compute-bound)
python benchmark_throughput.py \
  --model <your-model> \
  --input-len 2048 \
  --output-len 1

# Output tokens/s — measures pure decode (memory-bandwidth-bound)
python benchmark_throughput.py \
  --model <your-model> \
  --input-len 1 \
  --output-len 256
```

Enter the reported `throughput (tokens/s)` values into the dashboard. The two benchmarks represent extremes — real workloads fall between them.

> **Why two separate numbers?** Prefill (processing your input) is compute-bound and runs ~8–15× faster than decode (generating output), which is memory-bandwidth-bound. Cloud providers price these differently (e.g. input tokens are 3–5× cheaper than output tokens). Separating them gives you an accurate apples-to-apples comparison.
**③ Effective Price / M Tokens**

This is the output of your infrastructure setup. Once you have your benchmark numbers, these two figures — **Input $/M** and **Output $/M** — are your self-hosted "rack rates". They tell you exactly what your GPU cluster charges per million tokens processed, equivalent to how cloud APIs publish their pricing.

The math:

```
cost per second = (numGPUs × $/hr × overheadMult) / 3600
Input $/M  = (cost per second / inputTokSec) × 1,000,000
Output $/M = (cost per second / outputTokSec) × 1,000,000
```
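For example, a single H100 SXM at the defaults used in the maintainer guide below ($3.44/hr, 6,000 input tok/s, 1,200 output tok/s), with an assumed combined overhead multiplier of 1.1:

```
cost per second = (1 × 3.44 × 1.1) / 3600 ≈ $0.00105
Input $/M  = (0.00105 / 6000) × 1,000,000 ≈ $0.18
Output $/M = (0.00105 / 1200) × 1,000,000 ≈ $0.88
```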
**④ API Cache Hit Rate**

For cloud APIs that support prompt caching, adjust the slider to match your expected cache hit rate. This only affects API cost calculations — self-hosted cost is unaffected.

Prompt caching applies to input tokens only. Effective savings (a sketch of the arithmetic follows the list):
- Anthropic: 90% off input tokens on cache hit
- Doubao: ~60% off
- Qwen: ~60% off
- Fireworks: 50% off
- OpenAI batch API: no prompt caching
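A minimal sketch of how a hit rate folds into an effective input rate, assuming a simple linear blend (the tool's exact formula isn't shown here):

```js
// Effective input $/M under prompt caching: cache hits are billed at
// cacheCPM, misses at the full inCPM. Output tokens are unaffected.
function effectiveInputCPM(inCPM, cacheCPM, hitRate) {
  if (cacheCPM == null) return inCPM;                // provider has no caching
  return hitRate * cacheCPM + (1 - hitRate) * inCPM;
}

// Example: $3.00/M input, 90%-off cache rate ($0.30/M), 50% hit rate → $1.65/M
effectiveInputCPM(3.0, 0.3, 0.5);
```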
---

### Right Panel — Job Definition & Comparison

**⑤ Define Your Job**

Three inputs define any job:

| Field | What to enter |
|---|---|
| Input tokens / request | Token count of your prompt (system prompt + user message). Use a tokenizer or estimate: ~1 token per 0.75 words. |
| Output tokens / request | Token count of the model's response. Check your actual logs or run a sample. |
| Number of requests | Use 1 for a quick cost check on a single call, or your daily/weekly/monthly volume for scale estimates. |

The summary bar below shows per-request GPU time, total token volume, and self-hosted cost per request at a glance.
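The per-request GPU time follows directly from your benchmark numbers; a minimal sketch (hypothetical helper, not the tool's actual code):

```js
// Per-request GPU seconds: prefill and decode each run at their benchmarked rate.
function perRequestSeconds(inputTok, outputTok, inputTokSec, outputTokSec) {
  return inputTok / inputTokSec + outputTok / outputTokSec;
}

// Example: 2,048 input + 256 output tokens at 6,000 / 1,200 tok/s
perRequestSeconds(2048, 256, 6000, 1200); // ≈ 0.55 s of GPU time per request
```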
**⑥ APIs to Compare**

Toggle which cloud APIs appear in the comparison. Active APIs are highlighted. The ⬡ symbol marks open-source models — the same weights you could run yourself, making them the cleanest cost comparison (no model quality variable).

**⑦ Cost Comparison**

The main output. Self-hosted is shown prominently at the top, with a breakdown of input cost, output cost, and GPU rental. Each API is shown below with:
- Total job cost
- Cost relative to self-hosted (green = API is cheaper, red = self-host wins)
- Per-request cost and $/M rates
- A relative cost bar for visual comparison
**⑧ Cost at Scale**

Set **Batches / day** to see monthly projections. The table shows per-request, per-job, and per-month costs across all providers. Use it to see where the crossover point lies between self-hosting (fixed GPU cost) and APIs (pure variable cost).
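The projection itself is plain multiplication; a sketch with hypothetical names:

```js
// Monthly cost = per-request cost × requests per batch × batches per day × 30.
const monthlyCost = (costPerRequest, requestsPerBatch, batchesPerDay) =>
  costPerRequest * requestsPerBatch * batchesPerDay * 30;

monthlyCost(0.002, 10_000, 4); // $0.002/req × 10k reqs × 4 batches/day → $2,400/mo
```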
---

## Maintainer Guide — Updating Prices

All prices and defaults live in the first ~50 lines of `llm-batch-cost-calculator.jsx`. No other part of the file needs to change for price updates.

### GPU Prices (`GPUS` array, lines 5–18)

```js
const GPUS = [
  { id:"h100-sxm", label:"H100 SXM 80GB", memGB:80,
    pricePerHr:3.44,            // ← update this
    defaultInputTokSec:6000,    // ← rough estimate only, users override
    defaultOutputTokSec:1200 }, // ← rough estimate only, users override
  ...
];
```
**Fields:**
- `pricePerHr` — on-demand market rate in USD. This is a starting point; users override with their contract rate.
- `defaultInputTokSec` / `defaultOutputTokSec` — rough estimates for a 70B-class model on this GPU. These are replaced by the user's vLLM benchmark. Update if the estimates are egregiously wrong for a new GPU generation.

**Where to find current GPU prices:**

| Provider | URL | Notes |
|---|---|---|
| Lambda Labs | https://lambdalabs.com/service/gpu-cloud | Most reliable reference for H100/A100/L40S |
| RunPod | https://www.runpod.io/gpu-instance/pricing | "Secure Cloud" tab for stable pricing |
| CoreWeave | https://www.coreweave.com/pricing | Contact for contract rates; list prices available |
| Vast.ai | https://vast.ai/pricing | Spot market — lowest but variable |
| TensorDock | https://tensordock.com/pricing | Good for RTX consumer cards |

Prices fluctuate monthly. Check 2–3 providers and use a representative average. Note the date in a comment when updating.
---

### Cloud API Prices (`APIS` array, lines 26–47)

```js
const APIS = [
  { id:"gpt4o",
    name:"GPT-4o",
    vendor:"OpenAI",
    inCPM:1.25,     // ← input $/M tokens (batch API rate)
    outCPM:5.00,    // ← output $/M tokens (batch API rate)
    cacheCPM:null,  // ← cached input $/M tokens (null if not supported)
    hasCache:false  // ← true if prompt caching is available
  },
  ...
];
```
**Fields:**
- `inCPM` — input price in **$/M tokens**. Use the batch API rate where available (typically 50% off standard).
- `outCPM` — output price in **$/M tokens**.
- `cacheCPM` — cached input price. Set to `null` if caching is not supported.
- `hasCache` — set to `true` if the provider supports prompt caching.
- `oss:true` — open-source weights flag (no functional effect, shown as ⬡ in UI).
- `flat:true` — flat rate flag (input and output same price, shown as "flat" badge).

**Where to find current API prices:**

| Provider | Pricing URL | Notes |
|---|---|---|
| OpenAI | https://openai.com/api/pricing | Use "Batch" column (50% off). No prompt caching in batch mode. |
| Anthropic | https://www.anthropic.com/pricing | Use "Batch" column. Prompt caching rate is `cacheCPM`. |
| Doubao (ByteDance) | https://www.volcengine.com/product/doubao | Prices in CNY — convert using `CNY` constant at top of file |
| Qwen (Alibaba CN) | https://bailian.console.aliyun.com/ → Model Marketplace (模型广场) → Billing docs (计费说明) | Prices in CNY. Batch prices are ~50% off standard. |
| Qwen (International) | https://www.alibabacloud.com/en/product/modelstudio/pricing | USD prices, Singapore endpoint |
| Fireworks AI | https://fireworks.ai/pricing | Flat $/M rate (input = output). Check "Open Source Models" section. |
**Qwen CNY conversion:**

Qwen-CN prices are stored in CNY and converted at runtime using the `CNY` constant at line 3:

```js
const CNY = 7.28; // ← update this if CNY/USD rate shifts significantly
```

Qwen-CN entries use inline conversion:
```js
{ id:"qwen3max", inCPM:+(1.25/CNY).toFixed(4), outCPM:+(5.00/CNY).toFixed(4), ... }
//                       ↑ ¥1.25 per M input           ↑ ¥5.00 per M output
```

To update: change the CNY numerator (the yuan price), not the formula.
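Sanity check with the current constant: `+(1.25/7.28).toFixed(4)` evaluates to `0.1717`, i.e. ¥1.25 per M input ≈ $0.17 per M.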
---

### Adding a New API Provider

Copy an existing entry and fill in the fields:

```js
{ id:"my-provider",          // unique string, no spaces
  name:"My Provider Model",  // display name in UI
  vendor:"MyVendor",         // must match a key in VENDOR_COLOR
  inCPM:0.50,                // $/M input tokens
  outCPM:2.00,               // $/M output tokens
  cacheCPM:0.05,             // $/M cached input, or null
  hasCache:true,             // true if caching available
  note:"optional note",      // shown in comparison table
  oss:true,                  // optional: open-source weights
  flat:true,                 // optional: flat rate (in = out price)
},
```

Then add the vendor to `VENDOR_COLOR` (if new) and the `VENDORS` array:

```js
const VENDOR_COLOR = {
  ...
  "MyVendor": "#AABBCC", // hex color for this vendor's UI elements
};

const VENDORS = [..., "MyVendor"]; // controls display order
```
---

### Adding a New GPU

```js
{ id:"h200",                // unique string
  label:"H200 SXM 141GB",   // display name
  memGB:141,                // VRAM in GB (shown in GPU picker)
  pricePerHr:4.50,          // on-demand market rate
  defaultInputTokSec:8000,  // rough prefill estimate for 70B model
  defaultOutputTokSec:1600, // rough decode estimate for 70B model
},
```

Throughput defaults are order-of-magnitude estimates — they're clearly marked "estimated" in the UI and are always meant to be overridden by real benchmarks. A reasonable starting point: scale linearly from a known GPU's numbers based on memory bandwidth ratio, as sketched below.
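For example, deriving a decode estimate for the H200 from the H100 defaults. The scaling rule is the approximation named above; the bandwidth figures are public specs (H100 SXM ≈ 3.35 TB/s, H200 ≈ 4.8 TB/s):

```js
// Decode is memory-bandwidth-bound, so scale a known GPU's decode rate by the
// bandwidth ratio. (The example entry above rounds more conservatively.)
const h100OutputTokSec = 1200;
const h200OutputTokSec = Math.round(h100OutputTokSec * (4.8 / 3.35)); // ≈ 1719
```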
---

## Price Verification Checklist

When doing a price review, check these sources in order:

- [ ] GPU rates: Lambda Labs + RunPod (update `GPUS[*].pricePerHr`)
- [ ] OpenAI batch pricing: https://openai.com/api/pricing
- [ ] Anthropic batch + cache pricing: https://www.anthropic.com/pricing
- [ ] Doubao pricing: Volcengine console (convert CNY)
- [ ] Qwen-CN batch pricing: Alibaba Cloud Bailian console (convert CNY)
- [ ] Qwen-Intl pricing: Alibaba Cloud international site
- [ ] Fireworks flat rates: https://fireworks.ai/pricing
- [ ] CNY/USD rate: update `const CNY` if it has moved more than ~3%
- [ ] Update the date comment in the footer (bottom of file, search "Feb–Mar 2026")

Prices in this space change frequently — monthly reviews are recommended.
Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <title>AIBrix Batch Cost Calculator</title>
+    <style>
+      * { margin: 0; padding: 0; box-sizing: border-box; }
+    </style>
+  </head>
+  <body>
+    <div id="root"></div>
+    <script type="module" src="./src/main.jsx"></script>
+  </body>
+</html>
