# LLM Batch Cost Calculator

A self-hosted vs. cloud API cost comparison tool for LLM batch inference. Helps you answer: **"Is it cheaper to run this workload on my own GPUs or to pay a cloud API?"**

---

## How to Use

The dashboard has two sides: **Infrastructure** (left panel, configured once) and **Job** (right panel, changed per workload).

### Left Panel — Infrastructure Setup

Work through the steps top to bottom, once per cluster configuration.

**① GPU Setup**

Select your GPU type from the grid. This sets the default market rate (Lambda/RunPod/CoreWeave on-demand, Feb–Mar 2026). Fields to override:

- **GPU Count** — how many GPUs are in your cluster
- **$/GPU/hr override** — enter your actual contract rate if it differs from the market default

The total rental cost (with overhead) is shown below the fields.

Use the sliders to account for cost overhead:
- **Startup overhead** — percentage of GPU time spent loading the model. Set to 0% if you're running a persistent serving process (always-on); set to 5–10% for batch jobs that cold-start each time.
- **Interruption rate** — percentage of GPU-hours lost to spot-instance preemptions or hardware failures. Near 0% on stable on-prem hardware, 5–15% on spot cloud instances.
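As a rough sketch of how the two sliders could inflate effective cost, assuming they combine multiplicatively into the single `overheadMult` that the step ③ formula consumes (the calculator's actual combination logic lives in the source and may differ):

```js
// Hypothetical sketch: fold startup overhead and interruption rate
// into one effective-cost multiplier. Placeholder logic, not the
// calculator's verified internals.
function overheadMult(startupPct, interruptPct) {
  // 5% startup overhead bills 5% extra GPU time;
  // a 10% interruption rate means only 90% of billed hours do useful work.
  return (1 + startupPct / 100) / (1 - interruptPct / 100);
}

console.log(overheadMult(5, 10).toFixed(3)); // "1.167" (1.05 / 0.90)
console.log(overheadMult(0, 0).toFixed(3));  // "1.000" (no overhead)
```

With both sliders at zero the multiplier is 1.0 and rental cost equals the raw hourly rate.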

**② Benchmark Results**

This is the most important step for accuracy. The default throughput numbers are rough estimates for a 70B-class model — replace them with your actual measurements.

Run these two commands on your cluster using [vLLM's benchmark tool](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py):

```bash
# Input tokens/s — measures pure prefill (compute-bound)
python benchmark_throughput.py \
  --model <your-model> \
  --input-len 2048 \
  --output-len 1

# Output tokens/s — measures pure decode (memory-bandwidth-bound)
python benchmark_throughput.py \
  --model <your-model> \
  --input-len 1 \
  --output-len 256
```

Enter the reported `throughput (tokens/s)` values into the dashboard. The two benchmarks represent extremes — real workloads fall between them.

> **Why two separate numbers?** Prefill (processing your input) is compute-bound and runs ~8–15× faster than decode (generating output), which is memory-bandwidth-bound. Cloud providers price these differently (e.g. input tokens are 3–5× cheaper than output tokens). Separating them gives you an accurate apples-to-apples comparison.

**③ Effective Price / M Tokens**

This is the output of your infrastructure setup. Once you have your benchmark numbers, these two figures — **Input $/M** and **Output $/M** — are your self-hosted "rack rates". They tell you exactly what your GPU cluster charges per million tokens processed, equivalent to how cloud APIs publish their pricing.

The math:
```
cost per second = (numGPUs × $/hr × overheadMult) / 3600
Input $/M  = (cost per second / inputTokSec)  × 1,000,000
Output $/M = (cost per second / outputTokSec) × 1,000,000
```
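Plugging in illustrative numbers (8×H100 at $3.44/hr, a 1.10 overhead multiplier, and the default 70B-class throughputs; all of these are placeholders, substitute your own):

```js
// Worked example of the step ③ formulas with placeholder values.
const numGPUs = 8, pricePerHr = 3.44, overheadMult = 1.10;
const inputTokSec = 6000, outputTokSec = 1200;

const costPerSec = (numGPUs * pricePerHr * overheadMult) / 3600;
const inputPerM  = (costPerSec / inputTokSec)  * 1e6;
const outputPerM = (costPerSec / outputTokSec) * 1e6;

console.log(inputPerM.toFixed(2));  // "1.40"  (Input $/M)
console.log(outputPerM.toFixed(2)); // "7.01"  (Output $/M)
```

Note how the output rate is 5× the input rate here, purely because decode throughput is 5× lower; the same GPU-seconds cost is just spread over fewer tokens.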

**④ API Cache Hit Rate**

For cloud APIs that support prompt caching, adjust the slider to match your expected cache hit rate. This only affects API cost calculations — self-hosted cost is unaffected.

Prompt caching applies to input tokens only. Effective savings:
- Anthropic: 90% off input tokens on a cache hit
- Doubao: ~60% off
- Qwen: ~60% off
- Fireworks: 50% off
- OpenAI Batch API: no prompt caching
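A plausible sketch of how a cache hit rate discounts the effective input price, assuming a simple blended rate (the calculator's exact formula is in the source; `effectiveInputCPM` is a hypothetical helper, not its API):

```js
// Blended input $/M under prompt caching.
// inCPM: normal input $/M; cacheCPM: cached input $/M; hitRate: 0..1.
function effectiveInputCPM(inCPM, cacheCPM, hitRate) {
  if (cacheCPM == null) return inCPM;               // provider has no caching
  return hitRate * cacheCPM + (1 - hitRate) * inCPM;
}

// e.g. a 90%-off cached rate ($0.15 vs $1.50) at a 50% hit rate:
console.log(effectiveInputCPM(1.50, 0.15, 0.5).toFixed(3)); // "0.825"
```

Output tokens are unaffected, which is why caching matters most for long-prompt, short-answer workloads.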

---

### Right Panel — Job Definition & Comparison

**⑤ Define Your Job**

Three inputs define any job:

| Field | What to enter |
|---|---|
| Input tokens / request | Token count of your prompt (system prompt + user message). Use a tokenizer, or estimate ~1 token per 0.75 words. |
| Output tokens / request | Token count of the model's response. Check your actual logs or run a sample. |
| Number of requests | 1 for a quick cost check on a single call; your daily/weekly/monthly volume for scale estimates. |

The summary bar below shows per-request GPU time, total token volume, and self-hosted cost per request at a glance.
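Those summary figures can be reproduced by hand. A sketch using the placeholder throughputs and step ③ rates from earlier (all numbers illustrative, not the tool's internals):

```js
// Per-request GPU time and self-hosted cost, placeholder values throughout.
const inputTokSec = 6000, outputTokSec = 1200; // from your vLLM benchmarks
const inPerM = 1.40, outPerM = 7.01;           // step ③ outputs, $/M tokens

const inTok = 2048, outTok = 256;              // one request of the job
const gpuSeconds = inTok / inputTokSec + outTok / outputTokSec;
const costPerReq = (inTok / 1e6) * inPerM + (outTok / 1e6) * outPerM;

console.log(gpuSeconds.toFixed(3)); // "0.555" seconds of cluster time
console.log(costPerReq.toFixed(5)); // self-hosted $ per request
```

Even though this request has 8× more input than output tokens, decode still dominates both time and cost, which is typical.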

**⑥ APIs to Compare**

Toggle which cloud APIs appear in the comparison. Active APIs are highlighted. The ⬡ symbol marks open-source models — the same weights you could run yourself, making them the cleanest cost comparison (no model-quality variable).

**⑦ Cost Comparison**

The main output. Self-hosted is shown prominently at the top, with a breakdown of input cost, output cost, and GPU rental. Each API is shown below with:
- Total job cost
- Cost relative to self-hosted (green = API is cheaper, red = self-hosting wins)
- Per-request cost and $/M rates
- A relative cost bar for visual comparison

**⑧ Cost at Scale**

Set **Batches / day** to see monthly projections. The table shows per-request, per-job, and per-month costs across all providers. Use it to find where the crossover point lies between self-hosting (fixed GPU cost) and APIs (pure variable cost).
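The crossover intuition: a dedicated cluster costs the same per month regardless of volume, while API cost scales with requests, so the break-even volume is simply monthly cluster cost divided by API cost per request. A hypothetical example (every rate here is a placeholder):

```js
// Hypothetical break-even: fixed monthly cluster cost vs. variable API cost.
const clusterPerMonth = 8 * 3.44 * 24 * 30; // 8 GPUs, $3.44/hr, always on
const apiPerRequest   = 0.004;              // placeholder API $/request

const breakEvenRequests = clusterPerMonth / apiPerRequest;
console.log(Math.round(breakEvenRequests)); // 4953600 requests/month
```

Below that volume the API is cheaper; above it, self-hosting wins (assuming the cluster can actually absorb the throughput).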

---

## Maintainer Guide — Updating Prices

All prices and defaults live in the first ~50 lines of `llm-batch-cost-calculator.jsx`. No other part of the file needs to change for a price update.

### GPU Prices (`GPUS` array, lines 5–18)

```js
const GPUS = [
  { id:"h100-sxm", label:"H100 SXM 80GB", memGB:80,
    pricePerHr:3.44,            // ← update this
    defaultInputTokSec:6000,    // ← rough estimate only, users override
    defaultOutputTokSec:1200 }, // ← rough estimate only, users override
  ...
];
```

**Fields:**
- `pricePerHr` — on-demand market rate in USD. This is a starting point; users override it with their contract rate.
- `defaultInputTokSec` / `defaultOutputTokSec` — rough estimates for a 70B-class model on this GPU. They are replaced by the user's vLLM benchmark numbers; update them only if they are egregiously wrong for a new GPU generation.

**Where to find current GPU prices:**

| Provider | URL | Notes |
|---|---|---|
| Lambda Labs | https://lambdalabs.com/service/gpu-cloud | Most reliable reference for H100/A100/L40S |
| RunPod | https://www.runpod.io/gpu-instance/pricing | "Secure Cloud" tab for stable pricing |
| CoreWeave | https://www.coreweave.com/pricing | Contact for contract rates; list prices available |
| Vast.ai | https://vast.ai/pricing | Spot market — lowest but variable |
| TensorDock | https://tensordock.com/pricing | Good for RTX consumer cards |

Prices fluctuate monthly. Check 2–3 providers and use a representative average. Note the date in a comment when updating.

---

### Cloud API Prices (`APIS` array, lines 26–47)

```js
const APIS = [
  { id:"gpt4o",
    name:"GPT-4o",
    vendor:"OpenAI",
    inCPM:1.25,    // ← input $/M tokens (batch API rate)
    outCPM:5.00,   // ← output $/M tokens (batch API rate)
    cacheCPM:null, // ← cached input $/M tokens (null if not supported)
    hasCache:false // ← true if prompt caching is available
  },
  ...
];
```

**Fields:**
- `inCPM` — input price in **$/M tokens**. Use the batch API rate where available (typically 50% off standard).
- `outCPM` — output price in **$/M tokens**.
- `cacheCPM` — cached input price. Set to `null` if caching is not supported.
- `hasCache` — set to `true` if the provider supports prompt caching.
- `oss:true` — open-source weights flag (no functional effect; shown as ⬡ in the UI).
- `flat:true` — flat-rate flag (input and output share one price; shown as a "flat" badge).

**Where to find current API prices:**

| Provider | Pricing URL | Notes |
|---|---|---|
| OpenAI | https://openai.com/api/pricing | Use the "Batch" column (50% off). No prompt caching in batch mode. |
| Anthropic | https://www.anthropic.com/pricing | Use the "Batch" column. The prompt-caching rate is `cacheCPM`. |
| Doubao (ByteDance) | https://www.volcengine.com/product/doubao | Prices in CNY — convert using the `CNY` constant at the top of the file |
| Qwen (Alibaba CN) | https://bailian.console.aliyun.com/ → Model Gallery (模型广场) → Billing Notes (计费说明) | Prices in CNY. Batch prices are ~50% off standard. |
| Qwen (International) | https://www.alibabacloud.com/en/product/modelstudio/pricing | USD prices, Singapore endpoint |
| Fireworks AI | https://fireworks.ai/pricing | Flat $/M rate (input = output). Check the "Open Source Models" section. |

**Qwen CNY conversion:**

Qwen-CN prices are stored in CNY and converted at runtime using the `CNY` constant at line 3:

```js
const CNY = 7.28; // ← update this if the CNY/USD rate shifts significantly
```

Qwen-CN entries use inline conversion:
```js
{ id:"qwen3max", inCPM:+(1.25/CNY).toFixed(4), outCPM:+(5.00/CNY).toFixed(4), ... }
//                       ↑ ¥1.25 per M input          ↑ ¥5.00 per M output
```

To update, change the yuan numerator (e.g. `1.25`), not the formula.
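A quick sanity check of that conversion, evaluated standalone (same expression as above; `toFixed(4)` rounds to four decimals and the unary `+` turns the string back into a number):

```js
// Standalone check of the inline CNY → USD conversion pattern.
const CNY = 7.28;                       // CNY per USD
const inCPM = +(1.25 / CNY).toFixed(4); // ¥1.25 per M input tokens → USD/M
console.log(inCPM);                     // 0.1717
```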

---

### Adding a New API Provider

Copy an existing entry and fill in the fields:

```js
{ id:"my-provider",         // unique string, no spaces
  name:"My Provider Model", // display name in UI
  vendor:"MyVendor",        // must match a key in VENDOR_COLOR
  inCPM:0.50,               // $/M input tokens
  outCPM:2.00,              // $/M output tokens
  cacheCPM:0.05,            // $/M cached input, or null
  hasCache:true,            // true if caching available
  note:"optional note",     // shown in comparison table
  oss:true,                 // optional: open-source weights
  flat:true,                // optional: flat rate (in = out price)
},
```

Then add the vendor to `VENDOR_COLOR` (if new) and to the `VENDORS` array:

```js
const VENDOR_COLOR = {
  ...
  "MyVendor": "#AABBCC", // hex color for this vendor's UI elements
};

const VENDORS = [..., "MyVendor"]; // controls display order
```

---

### Adding a New GPU

```js
{ id:"h200",                // unique string
  label:"H200 SXM 141GB",   // display name
  memGB:141,                // VRAM in GB (shown in GPU picker)
  pricePerHr:4.50,          // on-demand market rate
  defaultInputTokSec:8000,  // rough prefill estimate for 70B model
  defaultOutputTokSec:1600, // rough decode estimate for 70B model
},
```

Throughput defaults are order-of-magnitude estimates — they're clearly marked "estimated" in the UI and are always meant to be overridden by real benchmarks. A reasonable starting point: scale linearly from a known GPU's numbers by the memory-bandwidth ratio.
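For example, seeding an H200 decode default from the H100's. The bandwidth figures (~3.35 TB/s for H100 SXM, ~4.8 TB/s for H200) are approximate spec-sheet values, not from this repo; treat them as assumptions to verify:

```js
// Seed a new GPU's decode default by memory-bandwidth ratio.
// Decode is bandwidth-bound, so this scaling is a reasonable first guess;
// prefill is compute-bound and scales differently, so seed it more crudely.
function scaleByBandwidth(knownTokSec, knownBwTBs, newBwTBs) {
  return Math.round(knownTokSec * (newBwTBs / knownBwTBs));
}

// H100 SXM decode default 1200 tok/s at ~3.35 TB/s; H200 at ~4.8 TB/s:
console.log(scaleByBandwidth(1200, 3.35, 4.8)); // 1719
```

Round the result to a clean figure before committing; false precision in a default that users are told to override is just noise.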

---

## Price Verification Checklist

When doing a price review, check these sources in order:

- [ ] GPU rates: Lambda Labs + RunPod (update `GPUS[*].pricePerHr`)
- [ ] OpenAI batch pricing: https://openai.com/api/pricing
- [ ] Anthropic batch + cache pricing: https://www.anthropic.com/pricing
- [ ] Doubao pricing: Volcengine console (convert CNY)
- [ ] Qwen-CN batch pricing: Alibaba Cloud Bailian console (convert CNY)
- [ ] Qwen-Intl pricing: Alibaba Cloud international site
- [ ] Fireworks flat rates: https://fireworks.ai/pricing
- [ ] CNY/USD rate: update `const CNY` if it has moved more than ~3%
- [ ] Update the date comment in the footer (bottom of file, search "Feb–Mar 2026")

Prices in this space change frequently — monthly reviews are recommended.