diff --git a/content/posts/2026-06-18-batch-cost-study.md b/content/posts/2026-06-18-batch-cost-study.md
new file mode 100644
index 0000000..ed71380
--- /dev/null
+++ b/content/posts/2026-06-18-batch-cost-study.md
@@ -0,0 +1,205 @@
+---
+date: '2026-06-18T12:00:00-00:00'
+draft: false
+title: 'Should You Self-Host Batch Inference? An Honest Cost Breakdown'
+author: ["The AIBrix Team"]
+
+disableShare: true
+hideSummary: true
+searchHidden: false
+ShowReadingTime: false
+ShowWordCount: false
+ShowBreadCrumbs: true
+ShowPostNavLinks: true
+ShowRssButtonInSectionTermList: false
+UseHugoToc: true
+ShowToc: true
+tocopen: true
+---
+
+*Capability-matched pricing across OpenAI, Anthropic, the cheap open-model APIs, and self-hosting open models (benchmarked on Lambda Cloud & RunPod) — with a decision tree for when each one wins.*
+
+> **TL;DR — the honest answer is a decision, not a number.**
+> "Cheaper than OpenAI" is a useless test — nearly everything passes it, and self-hosting *poorly* can cost more than the API anyway. The real question is whether self-hosting beats the **dirt-cheap open-model APIs**, and the answer turns on **modality** first. For **text**: you won't undercut the Chinese first-party APIs (DeepSeek at $0.43/1M), but you *do* beat the **Western open-model hosts** (Fireworks / DeepInfra, ~3–4× pricier for the *same* weights) and win on data control and volume — while conceding the absolute frontier to closed models. For **multimodal** (doc/image understanding, image generation, video), open models are already *at* the frontier, so self-hosting wins on quality *and* cost. The rest is the math: **capability first, then $/unit, then a decision tree.**
+
+## Start with the workload
+
+Batch inference is the unglamorous half of serving LLMs — no user waiting on a token, just a queue of work to grind through offline: tag a few hundred million images, pull fields from a document archive, score a dataset, caption a video library.
The jobs are huge and latency-tolerant, and the bill is almost all raw token volume. Which is why someone always asks: *can we just self-host this instead of paying an API?*
+
+It's tempting to assume the answer is yes — self-hosting must be cheaper. It often isn't. Run an open model at low utilization, with GPUs sitting idle and an engineer babysitting the deployment, and the *same* model on your own hardware can cost *more* than the API you were trying to escape. And when self-hosting *does* win, what it usually beats is the **expensive US labs** — the cheap open-model APIs already undercut those by **5–57×** for free: DeepSeek V4-Pro serves near-frontier quality at **$0.435 in / $0.87 out per 1M tokens**, about **35× cheaper on output than GPT-5.5** ($5 / $30) — and **57× cheaper than the priciest frontier model, Claude Fable 5** ($10 / $50). You don't need AIBrix, a GPU, or this blog post to beat OpenAI on price.
+
+So the question worth answering is harder:
+
+> **Given how cheap the open-model APIs already are, when does self-hosting actually pay off — and what does AIBrix add?**
+
+It comes down to three things — matching capability, pricing the managed options honestly, and working out what self-hosting really costs — then a decision tree for when self-hosting wins. (Cold start gets one paragraph; for hour-long batch jobs it barely matters.)
+
+## Match capability, not model names
+
+You can't compare prices across models of different quality. So we tier models by a benchmark basket — the **[Artificial Analysis Intelligence Index](https://artificialanalysis.ai/leaderboards/models)** (**AA-II**) as the spine, cross-checked against GPQA, SWE-bench Verified, LiveCodeBench, and LMArena — and compare **price within a tier**. AA-II is a single ~0–100 score that rolls reasoning, math, coding, and knowledge benchmarks into one "how capable" number (higher = smarter); the tier bands below are AA-II ranges.
As of 2026-06-09:
+
+| Tier | Closed (API-only) | Best **self-hostable** open peer | Verdict |
+|---|---|---|---|
+| **Frontier** (AA-II ~57–65) | Claude Fable 5, Opus 4.8, GPT-5.5, Gemini 3.1 Pro | **None reaches it** | **Concede — pay the API** |
+| **Workhorse** (AA-II ~50–55) | Gemini 3.5 Flash, Claude Sonnet 4.6 | MiniMax-M3, DeepSeek V4-Pro, GLM-5.1 | **Parity** |
+| **Efficient** (AA-II ~31–47) | Claude Haiku 4.5, Gemini 3.1 Flash-Lite | **Qwen3.6-27B** | **Open wins** |
+
+![Artificial Analysis Intelligence Index (11 Jun 2026), bars colored by license: proprietary models (black) lead the frontier, while open-weight models (blue) win the efficient tier](/images/batch-cost-study/artificial-analysis-intelligence-index.png)
+*Artificial Analysis Intelligence Index (v4.0, 11 Jun 2026), colored by license — **black = proprietary, blue = open-weight** (dark blue = open but commercial-use-restricted). The coloring *is* the thesis. The **frontier is all black**: Claude Fable 5 (64.9), Opus 4.8 (61.4), GPT-5.5 (60.2), Gemini 3.1 Pro (57.2), plus the closed Qwen3.7-Max (56.6) — no open model reaches it; open (blue) tops out at **MiniMax-M3 (54.7)**, ~10 points lower. But at the **efficient** end the colors flip: **Qwen3.6-27B (45.8) sits above Claude Haiku 4.5 (37.1)** — open wins the tier most batch jobs live in. Source: [Artificial Analysis](https://artificialanalysis.ai/leaderboards/models).*
+
+Three honest caveats, because the landscape moved and the details matter:
+
+- **The frontier is closed-only.** No *self-hostable* open model matches the closed frontier — **Claude Fable 5, Opus 4.8, GPT-5.5** — today. The one model from an open-weight family that benchmarks at the edge of the frontier band, **Qwen3.7-Max** (56.6), is itself **API-only / closed-weight** — you can't host it.
The best deployable open models (MiniMax-M3, DeepSeek V4-Pro) trail by ~6–10 AA-II points, and the gap is *larger* on independently-verified agentic coding (Opus 4.8 posts a third-party-verified 88.6% on SWE-bench Verified; the open models' competing 80%+ figures are vendor self-reported). If your batch job needs frontier quality, **pay for the frontier** — there's no open or budget substitute.
+- **Workhorse is genuine parity.** MiniMax-M3 matches Gemini 3.5 Flash; DeepSeek V4-Pro matches Claude Sonnet 4.6; GLM-5.1 actually *leads* on agentic coding. These are real, aggregate-verified ties — this is where self-hosting can credibly replace a mid-tier API.
+- **Efficient is where open pulls ahead.** **Qwen3.6-27B** (a 27B *dense* model, trivial to self-host) beats Claude Haiku 4.5 on the aggregate index — **45.8 vs 37.1 AA-II**, both in reasoning mode for an apples-to-apples read — in a package small enough to run one-per-GPU. *Most batch workloads — classification, extraction, summarization, tagging — live in this tier.* This is the strongest self-hosting pitch: **better quality than the closed efficient tier, at a fraction of the cost.**
+
+> **Licensing:** every open model featured here — Qwen3.6, DeepSeek V4, GLM-5.1 — is MIT-licensed, so self-hosting it commercially is unencumbered.
+
+**Takeaway:** the cost story is honest and strong in **Workhorse** and **Efficient**, and we concede **Frontier**. The rest of this post is about those two tiers.
+
+## Beyond text — multimodal makes the case *stronger*
+
+Everything so far is about **text**, where the frontier really is closed-only. That rule is text-specific.
For the multimodal batch workloads teams actually run — bulk captioning, document/chart extraction, programmatic image generation, video indexing — the frontier gap shrinks or disappears:
+
+- **Image understanding (VLM).** The best open model, **Qwen3-VL-235B**, *beats* the prior closed generation (GPT-5, Gemini 2.5 Pro, Opus 4.1) on the task benchmarks that matter for batch — DocVQA, ChartQA, MMStar, MathVista, MMBench — and trails only the newest closed models on the hardest expert-reasoning aggregate (MMMU-Pro, by ~13–15 pts). For bulk captioning and doc/chart extraction, open is at the frontier.
+- **Text-to-image.** On compositional prompt-adherence (GenEval, DPG-Bench, T2I-CompBench), the **Apache-2.0 Qwen-Image** *leads* GPT Image 1, Imagen 4, and DALL·E 3. Closed keeps an edge only on subjective aesthetic-preference arenas. For high-volume programmatic generation — where prompt adherence and per-image cost dominate — open is at the practical frontier.
+- **Video understanding.** Qwen3-VL leads MVBench and ties Gemini 2.5 Pro on Video-MME; it trails only on the newest robustness set (Video-MME-v2). For bulk video QA / indexing it's competitive.
+
+Reach for a closed API in just two places: top-end *expert multimodal reasoning* (MMMU-Pro) and *aesthetic image / video generation* (GPT Image 2, Veo, Sora).
+
+**And on cost the gap is even wider than text.** Image generation is the clearest case — a self-hosted open model produces an image for a fraction of a cent of GPU time, while managed APIs bill per image:
+
+| 1024×1024 image | $/image | Note / self-host advantage |
+|---|--:|---|
+| **Self-host Qwen-Image** (Apache-2.0) · 1× H100 | **~$0.006–0.009** `[est]` | the quality leader, self-hosted |
+| **Self-host FLUX.1-schnell** (Apache-2.0) · 1× H100 | **~$0.0005** `[est]` | — |
+| Google Imagen 4 Fast — *cheapest managed* | $0.02 | self-host **~3–36× cheaper** |
+| Nano Banana / Imagen 4 Std | $0.04 | **~5–24× cheaper** |
+| OpenAI GPT Image, high quality | $0.13–0.21 | **~20–240× cheaper** |
+
+The standout is **Qwen-Image**: **Apache-2.0** (commercial self-hosting unencumbered) *and* the frontier open T2I model (GenEval **0.91**, first open model past 0.9) — you self-host at a fraction of the cost **without** giving up quality. Mind the license, though: FLUX.1-schnell is Apache-2.0 too, but **FLUX.1-dev is non-commercial** — fine for eval, a blocker for a production pipeline.
+
+For **image understanding (VLM)**, a small open model — Qwen3-VL-8B on a single GPU — understands ~1,000 images for **~$0.1–0.5** `[est]` vs **~$1.2** for the cheapest managed VLM (Gemini 3 Flash) and **~$9–14** for GPT-5.5: roughly **3–13× cheaper**. The honest caveat: the *frontier* VLM (GPT-5.5, Gemini 3 Pro) is still stronger than an 8B open model, and the 235B open flagship needs 8 GPUs — so VLM is "cheaper at good-enough quality," not "frontier *and* cheaper" the way image generation is.
+
+*(Per-image figures are raw GPU-second cost at full utilization — apply a ~1.5–3× real-world multiplier for utilization and ops; the order-of-magnitude gap still holds.
Self-host throughput is measured in the benchmark, which now includes image-generation and VLM cells.)*
+
+## What the managed APIs cost (USD per 1M tokens)
+
+Within a tier, managed options span two orders of magnitude:
+
+| Tier | Model | In | Out | Note |
+|---|---|--:|--:|---|
+| Frontier | Claude Fable 5 | 10.00 | 50.00 | US frontier — newest *and* priciest |
+| Frontier | Claude Opus 4.8 | 5.00 | 25.00 | US frontier |
+| Frontier | OpenAI GPT-5.5 | 5.00 | 30.00 | US frontier |
+| Frontier | Gemini 3.1 Pro | 2.00 | 12.00 | Google — cheaper, still frontier |
+| Workhorse | Claude Sonnet 4.6 | 3.00 | 15.00 | US workhorse |
+| Workhorse | **DeepSeek V4-Pro** (first-party) | **0.435** | **0.87** | open weights, Chinese API |
+| Workhorse | DeepSeek V4-Pro **on Fireworks/DeepInfra** | **1.74** | **3.48** | **same weights, ~4× DeepSeek's own price** |
+| Workhorse | Qwen-Plus | 0.40 | 1.20 | |
+| Workhorse | **DeepSeek V4-Flash** | **0.14** | **0.28** | ~54× cheaper output than Sonnet |
+| Efficient | Claude Haiku 4.5 | 1.00 | 5.00 | US efficient |
+| Efficient | Qwen-Turbo | 0.05 | 0.20 | ~25× cheaper output than Haiku — cheapest credible row |
+
+*Batch endpoints (OpenAI, Anthropic, Alibaba, Fireworks) take another **−50%** on both input and output; prompt caching takes up to **−90%** on cached input.*
+
+Two findings reshape the whole comparison:
+
+1. **The Chinese first-party APIs are the real price floor**, not the US labs — 5–35× cheaper, biggest gap on output tokens (which dominate real bills).
+2. **Western open-model hosts charge ~3–4× the first-party price for *identical* weights** (DeepSeek V4-Pro is $1.74/$3.48 on Fireworks vs $0.435/$0.87 from DeepSeek itself). You're paying for Western infra, SLA, and data residency.
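The pricing table and the batch discount can be turned into a per-job bill with a few lines of arithmetic. The sketch below is illustrative only: the function name `api_job_cost` and the 512M-in / 256M-out token mix are assumptions made for the example, prompt caching is ignored, and the prices are the per-1M figures quoted in the table.

```python
def api_job_cost(in_tokens, out_tokens, in_price, out_price, batch_discount=0.5):
    """USD for one job, given per-1M-token prices.

    batch_discount=0.5 models the -50% batch-endpoint pricing quoted above;
    pass 0.0 for on-demand pricing. Prompt caching is not modelled.
    """
    multiplier = 1.0 - batch_discount
    return (in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price) * multiplier


# ASSUMPTION: a hypothetical job of 512M input tokens and 256M output tokens.
IN_TOK, OUT_TOK = 512_000_000, 256_000_000

gpt55_batch = api_job_cost(IN_TOK, OUT_TOK, 5.00, 30.00)      # GPT-5.5 row, batch
qwen_turbo_batch = api_job_cost(IN_TOK, OUT_TOK, 0.05, 0.20)  # Qwen-Turbo row, batch
host_markup = 3.48 / 0.87  # DeepSeek V4-Pro output price: Western host vs first-party

print(f"GPT-5.5 batch:    ${gpt55_batch:,.0f}")
print(f"Qwen-Turbo batch: ${qwen_turbo_batch:,.2f}")
print(f"Western-host markup on DeepSeek V4-Pro output: {host_markup:.1f}x")
```

Because output tokens carry the higher price in every row, the output share of the token mix is the input that moves these totals most.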
+
+This is the crux for self-hosting: **you're not really competing with DeepSeek's $0.43 floor — you're competing with the $1.74 Western-host markup, and with the constraint that some workloads simply cannot send data to a Chinese API at all.** That's the gap AIBrix self-hosting fills.
+
+## What it costs to run it yourself
+
+Self-hosting cost is first-principles:
+
+```
+$/1M tokens = GPU_$/hr × GPU_count × 1e6 / (throughput_tok_s × utilization × 3600)
+```
+
+GPU prices we'll use (on-demand, 2026-06-09; RunPod bills per-second, Lambda per-minute, **neither offers true spot** — RunPod Community Cloud is the closest discount tier):
+
+| GPU | RunPod (Secure / Community) | Lambda (on-demand) |
+|---|--:|--:|
+| H100 SXM 80GB | $3.29 / $2.69 | $4.29 (single) |
+| H200 141GB | $4.39 / ~$4.24 | not self-serve |
+| A100 80GB | $1.39 / $1.19 | n/a (40GB @ $1.99) |
+| L40S 48GB | $0.86 | n/a |
+| RTX 4090 24GB | $0.69 | n/a |
+
+Throughput swings ±30% with engine version, quant, and sequence mix — so every figure below is **measured** in the benchmark, not estimated:
+
+| Model | Config | Throughput (tok/s) | Self-host $/1M |
+|---|---|--:|--:|
+| Qwen3.6-27B | 1× H100, FP8 | `[MEASURE]` | **`[MEASURE]`** |
+| Qwen3.6-27B | 1× L40S, FP8 | `[MEASURE]` | **`[MEASURE]`** ← cheapest |
+| Qwen3.6-27B | 1× A100-80G, FP8 | `[MEASURE]` | **`[MEASURE]`** |
+| Qwen3.6-35B-A3B (MoE) | 1× H100, FP8 | `[MEASURE]` | **`[MEASURE]`** |
+| DeepSeek-V4-Flash (MoE) | 8× H100, FP8 | `[MEASURE]` | **`[MEASURE]`** |
+
+At **high utilization**, these floors are expected to land in the low-cents-to-~$1/1M range (the measured numbers fill in above) — below the Western open-model hosts, and for the efficient tier potentially below even the cheap first-party APIs. **"At high utilization" is the catch** — and that, not cold start, is what really decides self-host vs. managed. We lay it out as a decision tree below.
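The first-principles formula above is easy to turn into a small calculator, which also makes the utilization sensitivity visible. This is a minimal sketch, not a benchmark result: the helper names are hypothetical, the 2,500 tok/s throughput is a placeholder standing in for a measured value, and `api_usd_per_1m` is assumed to be a single blended price rather than separate input and output prices.

```python
def self_host_cost_per_1m(gpu_usd_per_hr, gpu_count, throughput_tok_s, utilization):
    """USD per 1M tokens, following the formula in the text."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = throughput_tok_s * utilization * 3600
    return gpu_usd_per_hr * gpu_count * 1e6 / tokens_per_hour


def break_even_utilization(gpu_usd_per_hr, gpu_count, throughput_tok_s, api_usd_per_1m):
    """Utilization at which self-hosting costs the same as a blended API price.

    A result above 1.0 means self-hosting cannot match that price on this hardware.
    """
    return gpu_usd_per_hr * gpu_count * 1e6 / (throughput_tok_s * 3600 * api_usd_per_1m)


# ASSUMPTION: 2,500 tok/s is a placeholder, not a measured throughput.
PLACEHOLDER_TOK_S = 2500
LAMBDA_H100_USD_HR = 4.29  # Lambda on-demand H100 price from the table

for util in (0.25, 0.5, 0.9):
    cost = self_host_cost_per_1m(LAMBDA_H100_USD_HR, 1, PLACEHOLDER_TOK_S, util)
    print(f"utilization {util:.0%}: ${cost:.3f} per 1M tokens")
```

Cost scales with 1/utilization, so halving utilization doubles the price per token; that inverse relationship is why utilization dominates the comparison.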
+
+## A quick note on cold start
+
+When you self-host, the GPU meter runs during model load + engine init before the first token — about a minute for a 27B, a few minutes for a 70B-class model on several GPUs. A managed API hides this cost; you pay it explicitly.
+
+For **realistic batch — jobs that run an hour or more — it's a rounding error**: a 3-minute warm-up on a 60-minute job is ~5%, and on a fleet that keeps engines warm it amortizes away entirely. Cold start only dominates if you boot a fresh cluster for one tiny job, which isn't how batch runs — so **we assume long-running, high-utilization execution and don't dwell on it.** (A genuinely tiny, sporadic workload is itself a signal to use a managed API — see the decision tree.)
+
+## Putting it together — the decision tree
+
+![Decision tree — Should you self-host batch inference? Q1: is it a multimodal job? → yes, self-host (open is at the frontier). Q2 (text): does it need frontier quality? → yes, pay a closed frontier API (no open substitute). Q3: low volume and data can leave? → yes, use the cheap first-party API; otherwise (high volume, in-house data, or a private model) → self-host on AIBrix](/images/batch-cost-study/aibrix-batch-decision-tree.svg)
+
+The crossover point — **the single number that answers "should I self-host?"** — is `[MEASURE]`, and it's specific to your model, GPU, and volume. The benchmark is how you measure it.
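The three questions in the decision tree can be written down as a short function, which is a convenient way to sanity-check a workload against it. This is a sketch of the tree as described in the figure, not AIBrix code: `BatchJob`, `recommend`, and the field names are hypothetical, and treating a private model as automatically falling into the "otherwise" branch is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BatchJob:
    multimodal: bool             # Q1: doc/image understanding, image generation, video
    needs_frontier: bool         # Q2: text job that needs closed-frontier quality
    low_volume: bool             # Q3: small or sporadic workload
    data_can_leave: bool         # Q3: data may be sent to a third-party API
    private_model: bool = False  # ASSUMPTION: a private model cannot use a public API


def recommend(job: BatchJob) -> str:
    """Walk the three questions in order and return the first matching branch."""
    if job.multimodal:
        return "self-host"
    if job.needs_frontier:
        return "closed frontier API"
    if job.low_volume and job.data_can_leave and not job.private_model:
        return "cheap first-party API"
    return "self-host on AIBrix"


print(recommend(BatchJob(multimodal=False, needs_frontier=False,
                         low_volume=False, data_can_leave=True)))
```

The order of the checks matters: modality is tested first, so a multimodal job is routed to self-hosting before the frontier question is ever asked.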
+
+## What AIBrix Batch actually is
+
+![AIBrix Batch architecture — an OpenAI-compatible client talks to the Batch API; a persisted state machine backed by a pluggable metastore (Redis / S3 / TOS / local) is the source of truth; a scheduler dispatches jobs to pluggable execution runtimes (KubernetesJob and a Deployment dispatcher are GA, RunPod/Lambda over SSH are preview); vLLM runs the inference and writes results back to object storage](/images/batch-cost-study/aibrix-batch-architecture.svg)
+
+The *cost economics* above aren't unique to AIBrix — they're the economics of self-hosting in general. What AIBrix provides is the system that makes capturing them practical and safe at scale. Concretely, it's more than a loop that reads a JSONL and calls vLLM:
+
+- **A persisted, event-sourced job state machine — the datastore is the source of truth, not Kubernetes annotations.** Each batch is a JSON document in a pluggable metastore (Redis / S3 / TOS / local) with a whitelisted transition table (`created → scheduling → validating → in_progress → finalizing → finalized`, plus `completed / failed / cancelled / expired` conditions). On restart the manager rehydrates every job from the store — a crash doesn't lose your batch.
+- **Per-request, resumable execution.** A batch can carry up to 50,000 requests; each is independently locked (Redis `NX`), checkpointed, and streamed to object storage as a multipart upload keyed by line index. A worker that dies mid-batch resumes from the next un-done request instead of re-running the job — at-least-once with dedup, memory-bounded, not buffered in RAM.
+- **Storage is pluggable, and the same backends serve two roles.** The input JSONL plus the output / error files live in **object storage — S3, TOS, or local**; the job state machine and per-request progress live in the **metastore — Redis, S3, TOS, or local** (Redis when low-latency state matters; object storage for durability without standing up a database).
Results are written incrementally as a multipart upload and served back through `GET /v1/files/{output_file_id}/content` — no external database required.
+- **One driver, a registry of pluggable runtimes.** Every backend runs the same lifecycle (`validate → provision → wait_ready → connect → prepare → run → finalize → teardown`); a new backend is a new registered Runtime, not a fork of the driver. Two are GA, both on Kubernetes: *self-hosting* (a K8s Job with the vLLM engine and the batch worker in one pod — the worker waits for `/health`, then dispatches and aggregates) and *control-plane dispatch* (a per-job Deployment the control plane drives over HTTP). Two more are **preview** — **RunPod** and **Lambda Cloud**: the Resource Manager leases a GPU box and the runtime brings up vLLM on it over SSH (on Lambda behind an SSH tunnel, so the engine is never publicly exposed). The production-honored path today is in-cluster Kubernetes.
+- **OpenAI-faithful surface, including the accounting.** `/v1/files` + `/v1/batches`, the `validating → in_progress → finalizing → completed` lifecycle, the 24-hour window, `custom_id` echo, and a real `usage` object — `input/output/total_tokens` plus `cached_tokens` (prefix-cache hits) and `reasoning_tokens` — accumulated per request and deduped on retry. Job endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`.
+- **An admin/user split that is the security and cost boundary.** Users stay on the stock OpenAI SDK and name a template via `extra_body.aibrix`; the only overridable field is the allowlisted `engine_args` — image, GPU SKU, and model source are admin-only, and any other override key is rejected with a `400`, never silently dropped. Templates are versioned and hot-reload from a ConfigMap with per-item error isolation.
+- **Trough-filling raises utilization — the lever that moves $/token.** Batch is latency-tolerant, so AIBrix packs it into the idle capacity your online serving leaves behind; higher effective utilization is what turns that self-hosting floor from theoretical into real, with no extra hardware.
+
+**Honest boundaries** (a cost article shouldn't oversell the system either): the **vLLM** engine adapter is GA; SGLang / TensorRT-LLM / LMDeploy are in the schema but raise an explicit *unsupported-engine* error until their adapters land. And the self-host throughput numbers above are a **single-machine, vanilla vLLM baseline** — we deliberately didn't chase peak performance. AIBrix stacks three more cost levers on top, all **excluded** here pending their own benchmark: **gateway prefix-cache-aware routing**, **StormService** disaggregated (prefill/decode-split) serving, and **L2 KV offloading / reuse**. Those are a **follow-up post** — so treat every number here as a floor real deployments beat, not a ceiling.
+
+## A worked example — 1M requests, end to end
+
+Make it concrete. You have **1M chat-completion requests** to run offline — classify / extract / caption a dataset, ~512 in / 256 out tokens each (≈ **0.77B tokens**). It's a workhorse/efficient job, so run **Qwen3.6-27B on a single H100**.
Submission is the stock OpenAI Batch flow — only the `base_url` changes:
+
+```python
+from openai import OpenAI
+client = OpenAI(base_url="http:///v1", api_key="...")
+
+f = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
+batch = client.batches.create(
+    input_file_id=f.id, endpoint="/v1/chat/completions", completion_window="24h",
+    extra_body={"aibrix": {"model_template": {"name": "qwen3.6-27b"}}},  # admin-registered template
+)
+# poll batch.status → "completed"; then download client.files.content(batch.output_file_id)
+```
+
+One H100 stays busy for the whole job — high utilization, the regime where the self-host floor is real — and it's a single K8s Job, not a cluster. The cost, against the alternatives you'd realistically reach for (self-host figure from the benchmark):
+
+| This job (~0.77B tokens) | Cost |
+|---|--:|
+| **Self-host Qwen3.6-27B · 1× H100 (AIBrix)** | **$`[MEASURE]`** (~$270 `[est]`) |
+| Same weights on a Western open-host (Fireworks tier) | ~$690 |
+| A frontier US API — GPT-5.5 batch (a higher tier) | ~$5,100 |
+
+Self-hosting beats the Western open-model host (~2.5×) and the frontier US APIs (~19×, if that's the quality you'd otherwise pay for) — and your data never leaves. *Honest counterpoint:* if your data **can** go to the cheapest first-party API (Qwen-Turbo batch ≈ **$38** for this job, at the $0.05 / $0.20 rate in the pricing table with the −50% batch discount), that's cheaper still — self-hosting here is the call for data control, a private model, or sustained scale, exactly as the decision tree says.
+
+**And the self-host floor keeps dropping.** The $/token isn't static — vLLM gets faster at batch every release: continuous batching, **chunked prefill**, **prefix caching** (a big win when a batch shares a system prompt), FP8 weights + FP8 KV cache, paged attention, and bigger `max_num_seqs` / `max_num_batched_tokens` all raise tokens-per-GPU-second.
On top of that, AIBrix's **gateway routing + StormService disaggregation + KV offloading** stack further still (all *excluded* from the conservative single-machine numbers here — a follow-up post measures them). So the gap versus a managed API widens over time, not narrows.
+
+## Reproducing the numbers
+
+Every self-hosted figure here comes from a small, scripted benchmark on RunPod/Lambda (about half a day, well under $100). The full protocol — matrix, commands, logging schema, and post-processing — is kept separate so this post stays about *what the numbers mean*, not how to run them.
+
+## Caveats & sources
+
+GPU and API prices are real quotes as of **2026-06-09** but drift weekly and disagree across aggregators by 5–20% (RunPod Community most of all) — re-verify in-console. Every self-host throughput number is **measured** via the benchmark (not estimated), with the model, GPU, and config logged alongside. Capability tiers use Artificial Analysis as the spine; several open "frontier-adjacent" coding scores are vendor self-reported and not independently reproduced — we do not let them close the frontier gap. *(Full source list: OpenAI / Anthropic / DeepSeek / Alibaba / Fireworks pricing pages; Lambda & RunPod pricing; gpustack, Baseten, Spheron, databasemart throughput labs.)*
diff --git a/static/images/batch-cost-study/artificial-analysis-intelligence-index.png b/static/images/batch-cost-study/artificial-analysis-intelligence-index.png
new file mode 100644
index 0000000..f47a8af
Binary files /dev/null and b/static/images/batch-cost-study/artificial-analysis-intelligence-index.png differ