diff --git a/content/posts/2026-06-18-batch-cost-study.md b/content/posts/2026-06-18-batch-cost-study.md
new file mode 100644
index 0000000..ed71380
--- /dev/null
+++ b/content/posts/2026-06-18-batch-cost-study.md
@@ -0,0 +1,205 @@
+---
+date: '2026-06-18T12:00:00-00:00'
+draft: false
+title: 'Should You Self-Host Batch Inference? An Honest Cost Breakdown'
+author: ["The AIBrix Team"]
+
+disableShare: true
+hideSummary: true
+searchHidden: false
+ShowReadingTime: false
+ShowWordCount: false
+ShowBreadCrumbs: true
+ShowPostNavLinks: true
+ShowRssButtonInSectionTermList: false
+UseHugoToc: true
+ShowToc: true
+tocopen: true
+---
+
+*Capability-matched pricing across OpenAI, Anthropic, the cheap open-model APIs, and self-hosting open models (benchmarked on Lambda Cloud & RunPod) — with a decision tree for when each one wins.*
+
+> **TL;DR — the honest answer is a decision, not a number.**
+> "Cheaper than OpenAI" is a useless test — nearly everything passes it, and self-hosting *poorly* can cost more than the API anyway. The real question is whether self-hosting beats the **dirt-cheap open-model APIs**, and the answer turns on **modality** first. For **text**: you won't undercut the Chinese first-party APIs (DeepSeek at $0.43/1M), but you *do* beat the **Western open-model hosts** (Fireworks / DeepInfra, ~3–4× pricier for the *same* weights) and win on data control and volume — while conceding the absolute frontier to closed models. For **multimodal** (doc/image understanding, image generation, video), open models are already *at* the frontier, so self-hosting wins on quality *and* cost. The rest is the math: **capability first, then $/unit, then a decision tree.**
+
+## Start with the workload
+
+Batch inference is the unglamorous half of serving LLMs — no user waiting on a token, just a queue of work to grind through offline: tag a few hundred million images, pull fields from a document archive, score a dataset, caption a video library. The jobs are huge, latency-tolerant, and the bill is almost all raw token volume. Which is why someone always asks: *can we just self-host this instead of paying an API?*
+
+It's tempting to assume the answer is yes: self-hosting must be cheaper. It often isn't. Run an open model at low utilization, with GPUs sitting idle and an engineer babysitting the deployment, and the *same* model on your own hardware can cost *more* than the API you were trying to escape. And when self-hosting *does* win, what it usually beats is the **expensive US labs**, which the cheap open-model APIs already undercut by **5–57×** with no infrastructure work at all: DeepSeek V4-Pro serves near-frontier quality at **$0.435 in / $0.87 out per 1M tokens**, about **35× cheaper on output than GPT-5.5** ($5 / $30) and **57× cheaper on output than the priciest frontier model, Claude Fable 5** ($10 / $50). You don't need AIBrix, a GPU, or this blog post to beat OpenAI on price.
+
+So the question worth answering is harder:
+
+> **Given how cheap the open-model APIs already are, when does self-hosting actually pay off — and what does AIBrix add?**
+
+It comes down to three things — matching capability, pricing the managed options honestly, and working out what self-hosting really costs — then a decision tree for when self-hosting wins. (Cold start gets one paragraph; for hour-long batch jobs it barely matters.)
+
+## Match capability, not model names
+
+You can't compare prices across models of different quality. So we tier models by a benchmark basket — the **[Artificial Analysis Intelligence Index](https://artificialanalysis.ai/leaderboards/models)** (**AA-II**) as the spine, cross-checked against GPQA, SWE-bench Verified, LiveCodeBench, and LMArena — and compare **price within a tier**. AA-II is a single ~0–100 score that rolls reasoning, math, coding, and knowledge benchmarks into one "how capable" number (higher = smarter); the tier bands below are AA-II ranges. As of 2026-06-09:
+
+| Tier | Closed (API-only) | Best **self-hostable** open peer | Verdict |
+|---|---|---|---|
+| **Frontier** (AA-II ~57–65) | Claude Fable 5, Opus 4.8, GPT-5.5, Gemini 3.1 Pro | **None reaches it** | **Concede — pay the API** |
+| **Workhorse** (AA-II ~50–55) | Gemini 3.5 Flash, Claude Sonnet 4.6 | MiniMax-M3, DeepSeek V4-Pro, GLM-5.1 | **Parity** |
+| **Efficient** (AA-II ~31–47) | Claude Haiku 4.5, Gemini 3.1 Flash-Lite | **Qwen3.6-27B** | **Open wins** |
+
+![Artificial Analysis Intelligence Index (11 Jun 2026), bars colored by license: proprietary models (black) lead the frontier, while open-weight models (blue) win the efficient tier](/images/batch-cost-study/artificial-analysis-intelligence-index.png)
+*Artificial Analysis Intelligence Index (v4.0, 11 Jun 2026), colored by license — **black = proprietary, blue = open-weight** (dark blue = open but commercial-use-restricted). The coloring *is* the thesis. The **frontier is all black**: Claude Fable 5 (64.9), Opus 4.8 (61.4), GPT-5.5 (60.2), Gemini 3.1 Pro (57.2), plus the closed Qwen3.7-Max (56.6) — no open model reaches it; open (blue) tops out at **MiniMax-M3 (54.7)**, ~10 points lower. But at the **efficient** end the colors flip: **Qwen3.6-27B (45.8) sits above Claude Haiku 4.5 (37.1)** — open wins the tier most batch jobs live in. Source: [Artificial Analysis](https://artificialanalysis.ai/leaderboards/models).*
+
+Three honest caveats, because the landscape moved and the details matter:
+
+- **The frontier is closed-only.** No *self-hostable* open model matches the closed frontier — **Claude Fable 5, Opus 4.8, GPT-5.5** — today. The one open model that benchmarks into the frontier band, **Qwen3.7-Max**, is **API-only / closed-weight** — you can't host it. The best deployable open models (MiniMax-M3, DeepSeek V4-Pro) trail by ~6–10 AA-II points, and the gap is *larger* on independently-verified agentic coding (Opus 4.8 posts a third-party-verified 88.6% on SWE-bench Verified; the open models' competing 80%+ figures are vendor self-reported). If your batch job needs frontier quality, **pay for the frontier** — there's no open or budget substitute.
+- **Workhorse is genuine parity.** MiniMax-M3 matches Gemini 3.5 Flash; DeepSeek V4-Pro matches Claude Sonnet 4.6; GLM-5.1 actually *leads* on agentic coding. These are real, aggregate-verified ties — this is where self-hosting can credibly replace a mid-tier API.
+- **Efficient is where open pulls ahead.** **Qwen3.6-27B** (a 27B *dense* model, trivial to self-host) beats Claude Haiku 4.5 on the aggregate index — **45.8 vs 37.1 AA-II**, both in reasoning mode for an apples-to-apples read — in a package small enough to run one-per-GPU. *Most batch workloads — classification, extraction, summarization, tagging — live in this tier.* This is the strongest self-hosting pitch: **better quality than the closed efficient tier, at a fraction of the cost.**
+
+> **Licensing:** every open model featured here — Qwen3.6, DeepSeek V4, GLM-5.1 — is MIT-licensed, so self-hosting it commercially is unencumbered.
+
+**Takeaway:** the cost story is honest and strong in **Workhorse** and **Efficient**, and we concede **Frontier**. The rest of this post is about those two tiers.
+
+## Beyond text — multimodal makes the case *stronger*
+
+Everything so far is about **text**, where the frontier really is closed-only. That rule is text-specific. For the multimodal batch workloads teams actually run — bulk captioning, document/chart extraction, programmatic image generation, video indexing — the frontier gap shrinks or disappears:
+
+- **Image understanding (VLM).** The best open model, **Qwen3-VL-235B**, *beats* the prior closed generation (GPT-5, Gemini 2.5 Pro, Opus 4.1) on the task benchmarks that matter for batch — DocVQA, ChartQA, MMStar, MathVista, MMBench — and trails only the newest closed models on the hardest expert-reasoning aggregate (MMMU-Pro, by ~13–15 pts). For bulk captioning and doc/chart extraction, open is at the frontier.
+- **Text-to-image.** On compositional prompt-adherence (GenEval, DPG-Bench, T2I-CompBench), the **Apache-2.0 Qwen-Image** *leads* GPT Image 1, Imagen 4, and DALL·E 3. Closed keeps an edge only on subjective aesthetic-preference arenas. For high-volume programmatic generation — where prompt adherence and per-image cost dominate — open is at the practical frontier.
+- **Video understanding.** Qwen3-VL leads MVBench and ties Gemini 2.5 Pro on Video-MME; it trails only on the newest robustness set (Video-MME-v2). For bulk video QA / indexing it's competitive.
+
+Reach for a closed API in just two places: top-end *expert multimodal reasoning* (MMMU-Pro) and *aesthetic image / video generation* (GPT Image 2, Veo, Sora).
+
+**And on cost the gap is even wider than text.** Image generation is the clearest case — a self-hosted open model produces an image for a fraction of a cent of GPU time, while managed APIs bill per image:
+
+| 1024×1024 image | $/image | vs cheapest managed |
+|---|--:|---|
+| **Self-host Qwen-Image** (Apache-2.0) · 1× H100 | **~$0.006–0.009** `[est]` | the quality leader, self-hosted |
+| **Self-host FLUX.1-schnell** (Apache-2.0) · 1× H100 | **~$0.0005** `[est]` | — |
+| Google Imagen 4 Fast (*cheapest managed*) | $0.02 | self-host **~2–40× cheaper** |
+| Nano Banana / Imagen 4 Std | $0.04 | **~4–80× cheaper** |
+| OpenAI GPT Image, high quality | $0.13–0.21 | **~14–420× cheaper** |
+
+The standout is **Qwen-Image**: **Apache-2.0** (commercial self-hosting unencumbered) *and* the frontier open T2I model (GenEval **0.91**, first open model past 0.9) — you self-host at a fraction of the cost **without** giving up quality. Mind the license, though: FLUX.1-schnell is Apache-2.0 too, but **FLUX.1-dev is non-commercial** — fine for eval, a blocker for a production pipeline.
+
+For **image understanding (VLM)**, a small open model (Qwen3-VL-8B on a single GPU) processes ~1,000 images for **~$0.1–0.5** `[est]`, vs **~$1.2** for the cheapest managed VLM (Gemini 3 Flash) and **~$9–14** for GPT-5.5. That is roughly **3–13× cheaper** than the cheapest managed option, and well over an order of magnitude cheaper than GPT-5.5. The honest caveat: the *frontier* VLM (GPT-5.5, Gemini 3 Pro) is still stronger than an 8B open model, and the 235B open flagship needs 8 GPUs, so VLM is "cheaper at good-enough quality," not "frontier *and* cheaper" the way image generation is.
+
+*(Per-image figures are raw GPU-second cost at full utilization — apply a ~1.5–3× real-world multiplier for utilization and ops; the order-of-magnitude gap still holds. Self-host throughput is measured in the benchmark, which now includes image-generation and VLM cells.)*
+
+## What the managed APIs cost (USD per 1M tokens)
+
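+Within the Workhorse and Efficient tiers, managed output prices differ by 25–54×; across the whole table the spread is 250× ($0.20 to $50.00 per 1M output tokens):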
+
+| Tier | Model | In | Out | Note |
+|---|---|--:|--:|---|
+| Frontier | Claude Fable 5 | 10.00 | 50.00 | US frontier — newest *and* priciest |
+| Frontier | Claude Opus 4.8 | 5.00 | 25.00 | US frontier |
+| Frontier | OpenAI GPT-5.5 | 5.00 | 30.00 | US frontier |
+| Frontier | Gemini 3.1 Pro | 2.00 | 12.00 | Google — cheaper, still frontier |
+| Workhorse | Claude Sonnet 4.6 | 3.00 | 15.00 | US workhorse |
+| Workhorse | **DeepSeek V4-Pro** (first-party) | **0.435** | **0.87** | open weights, Chinese API |
+| Workhorse | DeepSeek V4-Pro **on Fireworks/DeepInfra** | **1.74** | **3.48** | **same weights, ~4× DeepSeek's own price** |
+| Workhorse | Qwen-Plus | 0.40 | 1.20 | |
+| Workhorse | **DeepSeek V4-Flash** | **0.14** | **0.28** | ~54× cheaper output than Sonnet |
+| Efficient | Claude Haiku 4.5 | 1.00 | 5.00 | US efficient |
+| Efficient | Qwen-Turbo | 0.05 | 0.20 | ~25× cheaper output than Haiku — cheapest credible row |
+
+*Batch endpoints (OpenAI, Anthropic, Alibaba, Fireworks) take another **−50%** on both input and output; prompt caching takes up to **−90%** on cached input.*
+
+Two findings reshape the whole comparison:
+
+1. **The Chinese first-party APIs are the real price floor**, not the US labs: 5–57× cheaper, with the biggest gap on output tokens (which dominate real bills).
+2. **Western open-model hosts charge ~3–4× the first-party price for *identical* weights** (DeepSeek V4-Pro is $1.74/$3.48 on Fireworks vs $0.435/$0.87 from DeepSeek itself). You're paying for Western infra, SLA, and data residency.
+
+This is the crux for self-hosting: **you're not really competing with DeepSeek's $0.43 floor — you're competing with the $1.74 Western-host markup, and with the constraint that some workloads simply cannot send data to a Chinese API at all.** That's the gap AIBrix self-hosting fills.
+
+## What it costs to run it yourself
+
+Self-hosting cost is first-principles:
+
+```
+$/1M tokens = GPU_$/hr × GPU_count × 1e6 / (throughput_tok_s × utilization × 3600)
+```
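+
+A minimal Python sketch of the same formula, for plugging in your own measurements. The function name is ours, and the 2,500 tok/s throughput and 0.8 utilization in the usage line are illustrative assumptions, not benchmark results; $3.29/hr is one of the H100 prices quoted in this section.
+
+```python
+def cost_per_million_tokens(gpu_usd_per_hour, gpu_count, throughput_tok_s, utilization):
+    """USD per 1M tokens for a self-hosted deployment.
+
+    throughput_tok_s is the aggregate tokens/second of the whole deployment
+    at full load; utilization is the fraction of paid GPU time doing work.
+    """
+    if not 0 < utilization <= 1:
+        raise ValueError("utilization must be in (0, 1]")
+    if throughput_tok_s <= 0:
+        raise ValueError("throughput_tok_s must be positive")
+    usd_per_hour = gpu_usd_per_hour * gpu_count
+    tokens_per_hour = throughput_tok_s * utilization * 3600
+    return usd_per_hour * 1e6 / tokens_per_hour
+
+
+# Hypothetical inputs: 1x H100 at $3.29/hr, 2,500 tok/s, 80% utilization.
+example = cost_per_million_tokens(3.29, 1, 2500, 0.8)
+print(f"${example:.3f} per 1M tokens")
+```
+
+Because utilization sits in the denominator, halving it doubles the cost per token.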
+
+GPU prices we'll use (on-demand, 2026-06-09; RunPod bills per-second, Lambda per-minute, **neither offers true spot** — RunPod Community Cloud is the closest discount tier):
+
+| GPU | RunPod (Secure / Community) | Lambda (on-demand) |
+|---|--:|--:|
+| H100 SXM 80GB | $3.29 / $2.69 | $4.29 (single) |
+| H200 141GB | $4.39 / ~$4.24 | not self-serve |
+| A100 80GB | $1.39 / $1.19 | n/a (40GB @ $1.99) |
+| L40S 48GB | $0.86 | n/a |
+| RTX 4090 24GB | $0.69 | n/a |
+
+Throughput swings ±30% with engine version, quant, and sequence mix — so every figure below is **measured** in the benchmark, not estimated:
+
+| Model | Config | Throughput (tok/s) | Self-host $/1M |
+|---|---|--:|--:|
+| Qwen3.6-27B | 1× H100, FP8 | `[MEASURE]` | **`[MEASURE]`** |
+| Qwen3.6-27B | 1× L40S, FP8 | `[MEASURE]` | **`[MEASURE]`** ← cheapest |
+| Qwen3.6-27B | 1× A100-80G, FP8 | `[MEASURE]` | **`[MEASURE]`** |
+| Qwen3.6-35B-A3B (MoE) | 1× H100, FP8 | `[MEASURE]` | **`[MEASURE]`** |
+| DeepSeek-V4-Flash (MoE) | 8× H100, FP8 | `[MEASURE]` | **`[MEASURE]`** |
+
+At **high utilization**, these floors are expected to land in the low-cents-to-~$1/1M range (the measured numbers fill in above) — below the Western open-model hosts, and for the efficient tier potentially below even the cheap first-party APIs. **"At high utilization" is the catch** — and that, not cold start, is what really decides self-host vs. managed. We lay it out as a decision tree below.
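+
+One way to make "at high utilization" concrete is to solve the cost formula for the utilization at which self-hosting matches a given API price. The sketch below does that; the function name and the 2,500 tok/s throughput are assumptions for illustration, and the API prices are the DeepSeek V4-Pro output prices from the pricing table (a real comparison would use a price blended over your input/output mix).
+
+```python
+def break_even_utilization(gpu_usd_per_hour, gpu_count, throughput_tok_s, api_usd_per_million):
+    """Utilization at which self-hosting costs the same per token as the API.
+
+    A result above 1.0 means self-hosting cannot match the API price even
+    with the GPUs busy 100% of the time.
+    """
+    if throughput_tok_s <= 0 or api_usd_per_million <= 0:
+        raise ValueError("throughput and API price must be positive")
+    # Cost per 1M tokens if the GPUs never idle (utilization = 1).
+    full_load_cost = gpu_usd_per_hour * gpu_count * 1e6 / (throughput_tok_s * 3600)
+    # cost(u) = full_load_cost / u, so cost(u) == api price at u = full_load_cost / api price.
+    return full_load_cost / api_usd_per_million
+
+
+# Hypothetical deployment: 1x H100 at $3.29/hr sustaining 2,500 tok/s.
+for label, api_price in [("first-party, $0.87", 0.87), ("Western host, $3.48", 3.48)]:
+    u = break_even_utilization(3.29, 1, 2500, api_price)
+    print(f"{label}: break-even at {u:.0%} utilization")
+```
+
+Any utilization above the break-even value favors self-hosting on raw GPU cost; ops overhead pushes the real threshold higher.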
+
+## A quick note on cold start
+
+When you self-host, the GPU meter runs during model load + engine init before the first token — about a minute for a 27B, a few minutes for a 70B-class model on several GPUs. A managed API hides this cost; you pay it explicitly.
+
+For **realistic batch — jobs that run an hour or more — it's a rounding error**: a 3-minute warm-up on a 60-minute job is ~5%, and on a fleet that keeps engines warm it amortizes away entirely. Cold start only dominates if you boot a fresh cluster for one tiny job, which isn't how batch runs — so **we assume long-running, high-utilization execution and don't dwell on it.** (A genuinely tiny, sporadic workload is itself a signal to use a managed API — see the decision tree.)
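+
+The arithmetic behind that claim, as a small sketch. The 3-minute warm-up and 60-minute job are the figures from the paragraph above; the 5-minute and 600-minute jobs are added only to show how the share scales.
+
+```python
+def warmup_share(warmup_minutes, job_minutes):
+    """Warm-up time as a fraction of the job's run time (0.05 means 5%)."""
+    if job_minutes <= 0:
+        raise ValueError("job_minutes must be positive")
+    return warmup_minutes / job_minutes
+
+
+for job_minutes in (5, 60, 600):
+    share = warmup_share(3, job_minutes)
+    print(f"{job_minutes:>4}-minute job: {share:.1%} warm-up overhead")
+```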
+
+## Putting it together — the decision tree
+
+![Decision tree — Should you self-host batch inference? Q1: is it a multimodal job? → yes, self-host (open is at the frontier). Q2 (text): does it need frontier quality? → yes, pay a closed frontier API (no open substitute). Q3: low volume and data can leave? → yes, use the cheap first-party API; otherwise (high volume, in-house data, or a private model) → self-host on AIBrix](/images/batch-cost-study/aibrix-batch-decision-tree.svg)
+
+The crossover point — **the single number that answers "should I self-host?"** — is `[MEASURE]`, and it's specific to your model, GPU, and volume. The benchmark is how you measure it.
+
+## What AIBrix Batch actually is
+
+![AIBrix Batch architecture — an OpenAI-compatible client talks to the Batch API; a persisted state machine backed by a pluggable metastore (Redis / S3 / TOS / local) is the source of truth; a scheduler dispatches jobs to pluggable execution runtimes (KubernetesJob and a Deployment dispatcher are GA, RunPod/Lambda over SSH are preview); vLLM runs the inference and writes results back to object storage](/images/batch-cost-study/aibrix-batch-architecture.svg)
+
+The *cost economics* above aren't unique to AIBrix — they're the economics of self-hosting in general. What AIBrix provides is the system that makes capturing them practical and safe at scale. Concretely, it's more than a loop that reads a JSONL and calls vLLM:
+
+- **A persisted, event-sourced job state machine — the datastore is the source of truth, not Kubernetes annotations.** Each batch is a JSON document in a pluggable metastore (Redis / S3 / TOS / local) with a whitelisted transition table (`created → scheduling → validating → in_progress → finalizing → finalized`, plus `completed / failed / cancelled / expired` conditions). On restart the manager rehydrates every job from the store — a crash doesn't lose your batch.
+- **Per-request, resumable execution.** A batch can carry up to 50,000 requests; each is independently locked (Redis `NX`), checkpointed, and streamed to object storage as a multipart upload keyed by line index. A worker that dies mid-batch resumes from the next un-done request instead of re-running the job — at-least-once with dedup, memory-bounded, not buffered in RAM.
+- **Storage is pluggable, and the same backends serve two roles.** The input JSONL plus the output / error files live in **object storage — S3, TOS, or local**; the job state machine and per-request progress live in the **metastore — Redis, S3, TOS, or local** (Redis when low-latency state matters; object storage for durability without standing up a database). Results are written incrementally as a multipart upload and served back through `GET /v1/files/{output_file_id}/content` — no external database required.
+- **One driver, a registry of pluggable runtimes.** Every backend runs the same lifecycle (`validate → provision → wait_ready → connect → prepare → run → finalize → teardown`); a new backend is a new registered Runtime, not a fork of the driver. Two are GA, both on Kubernetes: *self-hosting* (a K8s Job with the vLLM engine and the batch worker in one pod; the worker waits for `/health`, then dispatches and aggregates) and *control-plane dispatch* (a per-job Deployment the control plane drives over HTTP). Two more are **preview**, **RunPod** and **Lambda Cloud**: the Resource Manager leases a GPU box and the runtime brings up vLLM on it over SSH (on Lambda behind an SSH tunnel, so the engine is never publicly exposed). The only production-supported path today is in-cluster Kubernetes.
+- **OpenAI-faithful surface, including the accounting.** `/v1/files` + `/v1/batches`, the `validating → in_progress → finalizing → completed` lifecycle, the 24-hour window, `custom_id` echo, and a real `usage` object — `input/output/total_tokens` plus `cached_tokens` (prefix-cache hits) and `reasoning_tokens` — accumulated per request and deduped on retry. Job endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`.
+- **An admin/user split that is the security and cost boundary.** Users stay on the stock OpenAI SDK and name a template via `extra_body.aibrix`; the only overridable field is the allowlisted `engine_args` — image, GPU SKU, and model source are admin-only, and any other override key is rejected with a `400`, never silently dropped. Templates are versioned and hot-reload from a ConfigMap with per-item error isolation.
+- **Trough-filling raises utilization — the lever that moves $/token.** Batch is latency-tolerant, so AIBrix packs it into the idle capacity your online serving leaves behind; higher effective utilization is what turns that self-hosting floor from theoretical into real, with no extra hardware.
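+
+To illustrate what a "whitelisted transition table" means in the first bullet, here is a minimal sketch, not AIBrix's actual code. It models only the happy-path states listed above; how the `completed / failed / cancelled / expired` conditions attach is omitted, and the names `ALLOWED` and `advance` are hypothetical.
+
+```python
+# Each state maps to the set of states it may move to; anything else is rejected.
+ALLOWED = {
+    "created": {"scheduling"},
+    "scheduling": {"validating"},
+    "validating": {"in_progress"},
+    "in_progress": {"finalizing"},
+    "finalizing": {"finalized"},
+    "finalized": set(),  # terminal: no outgoing transitions
+}
+
+
+def advance(current, target):
+    """Return `target` if current -> target is whitelisted, else raise ValueError."""
+    if target not in ALLOWED.get(current, set()):
+        raise ValueError(f"illegal transition: {current} -> {target}")
+    return target
+
+
+state = "created"
+for nxt in ("scheduling", "validating", "in_progress", "finalizing", "finalized"):
+    state = advance(state, nxt)
+print(state)
+```
+
+Rejecting every edge that is not explicitly listed is what lets a restarted manager trust the persisted state: a job can only be in a state it was allowed to reach.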
+
+**Honest boundaries** (a cost article shouldn't oversell the system either): the **vLLM** engine adapter is GA; SGLang / TensorRT-LLM / LMDeploy are in the schema but raise an explicit *unsupported-engine* error until their adapters land. And the self-host throughput numbers above are a **single-machine, vanilla vLLM baseline** — we deliberately didn't chase peak performance. AIBrix stacks three more cost levers on top, all **excluded** here pending their own benchmark: **gateway prefix-cache-aware routing**, **StormService** disaggregated (prefill/decode-split) serving, and **L2 KV offloading / reuse**. Those are a **follow-up post** — so treat every number here as a floor real deployments beat, not a ceiling.
+
+## A worked example — 1M requests, end to end
+
+Make it concrete. You have **1M chat-completion requests** to run offline — classify / extract / caption a dataset, ~512 in / 256 out tokens each (≈ **0.77B tokens**). It's a workhorse/efficient job, so run **Qwen3.6-27B on a single H100**. Submission is the stock OpenAI Batch flow — only the `base_url` changes:
+
+```python
+from openai import OpenAI
+client = OpenAI(base_url="http://<aibrix>/v1", api_key="...")
+
+f = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
+batch = client.batches.create(
+    input_file_id=f.id, endpoint="/v1/chat/completions", completion_window="24h",
+    extra_body={"aibrix": {"model_template": {"name": "qwen3.6-27b"}}},   # admin-registered template
+)
+# poll batch.status → "completed"; then download client.files.content(batch.output_file_id)
+```
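+
+The polling step in the final comment could look like the sketch below. It assumes AIBrix reports the same terminal status strings as the OpenAI Batch API (`completed`, `failed`, `cancelled`, `expired`); the helper names and the injectable `sleep` parameter are ours, the latter only so the loop can be exercised without waiting.
+
+```python
+import time
+
+TERMINAL_STATUSES = {"completed", "failed", "cancelled", "expired"}
+
+
+def wait_for_batch(client, batch_id, poll_seconds=30.0, sleep=time.sleep):
+    """Poll the batch until it reaches a terminal status, then return it."""
+    while True:
+        batch = client.batches.retrieve(batch_id)
+        if batch.status in TERMINAL_STATUSES:
+            return batch
+        sleep(poll_seconds)
+
+
+def download_output(client, batch, path):
+    """Write the output JSONL of a completed batch to `path`."""
+    if batch.status != "completed":
+        raise RuntimeError(f"batch {batch.id} ended with status {batch.status!r}")
+    content = client.files.content(batch.output_file_id)
+    with open(path, "wb") as fh:
+        fh.write(content.read())
+```
+
+With the `client` and `batch` from the snippet above, usage would be `done = wait_for_batch(client, batch.id)` followed by `download_output(client, done, "results.jsonl")`.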
+
+One H100 stays busy for the whole job — high utilization, the regime where the self-host floor is real — and it's a single K8s Job, not a cluster. The cost, against the alternatives you'd realistically reach for (self-host figure from the benchmark):
+
+| This job (~0.77B tokens) | Cost |
+|---|--:|
+| **Self-host Qwen3.6-27B · 1× H100 (AIBrix)** | **$`[MEASURE]`** (~$270 `[est]`) |
+| Same weights on a Western open-host (Fireworks tier) | ~$690 |
+| A frontier US API — GPT-5.5 batch (a higher tier) | ~$5,100 |
+
+Self-hosting beats the Western open-model host (~2.5×) and the frontier US APIs (~19×, if that's the quality you'd otherwise pay for), and your data never leaves. *Honest counterpoint:* if your data **can** go to the cheapest first-party API (Qwen-Turbo batch ≈ **$38** for this job, at the $0.05 / $0.20 list price with the −50% batch discount), that's cheaper still. Self-hosting here is the call for data control, a private model, or sustained scale, exactly as the decision tree says.
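+
+The managed-API figures in this example can be reproduced from the pricing table. The sketch below is one way to do it: the function name is ours, the prices are the GPT-5.5 and Qwen-Turbo rows of that table, and the 50% batch discount is the one quoted beneath it.
+
+```python
+def api_job_cost(requests, in_tokens, out_tokens, in_price, out_price, batch_discount=0.0):
+    """USD for a whole job, given per-request token counts and USD-per-1M-token prices."""
+    in_millions = requests * in_tokens / 1e6
+    out_millions = requests * out_tokens / 1e6
+    list_cost = in_millions * in_price + out_millions * out_price
+    return list_cost * (1 - batch_discount)
+
+
+# 1M requests at ~512 input / 256 output tokens each, billed at batch rates.
+gpt55_batch = api_job_cost(1_000_000, 512, 256, 5.00, 30.00, batch_discount=0.5)
+qwen_turbo_batch = api_job_cost(1_000_000, 512, 256, 0.05, 0.20, batch_discount=0.5)
+print(f"GPT-5.5 batch:    ${gpt55_batch:,.2f}")
+print(f"Qwen-Turbo batch: ${qwen_turbo_batch:,.2f}")
+```
+
+Output tokens are a third of the volume here but three quarters of the GPT-5.5 bill, which is why the output price is the one to watch.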
+
+**And the self-host floor keeps dropping.** The $/token isn't static. vLLM keeps getting faster at batch: continuous batching, **chunked prefill**, **prefix caching** (a big win when a batch shares a system prompt), FP8 weights + FP8 KV cache, paged attention, and bigger `max_num_seqs` / `max_num_batched_tokens` all raise tokens-per-GPU-second. On top of that, AIBrix's **gateway routing + StormService disaggregation + KV offloading** add further gains (all *excluded* from the conservative single-machine numbers here; a follow-up post measures them). So against a managed price that stays fixed, the gap widens over time rather than narrows.
+
+## Reproducing the numbers
+
+Every self-hosted figure here comes from a small, scripted benchmark on RunPod/Lambda (about half a day, well under $100). The full protocol — matrix, commands, logging schema, and post-processing — is kept separate so this post stays about *what the numbers mean*, not how to run them.
+
+## Caveats & sources
+
+GPU and API prices are real quotes as of **2026-06-09** but drift weekly and disagree across aggregators by 5–20% (RunPod Community most of all) — re-verify in-console. Every self-host throughput number is **measured** via the benchmark (not estimated), with the model, GPU, and config logged alongside. Capability tiers use Artificial Analysis as the spine; several open "frontier-adjacent" coding scores are vendor self-reported and not independently reproduced — we do not let them close the frontier gap. *(Full source list: OpenAI / Anthropic / DeepSeek / Alibaba / Fireworks pricing pages; Lambda & RunPod pricing; gpustack, Baseten, Spheron, databasemart throughput labs.)*
diff --git a/static/images/batch-cost-study/artificial-analysis-intelligence-index.png b/static/images/batch-cost-study/artificial-analysis-intelligence-index.png
new file mode 100644
index 0000000..f47a8af
Binary files /dev/null and b/static/images/batch-cost-study/artificial-analysis-intelligence-index.png differ