From 1359515c8d3fff83d21e36e8d6218ad00ac72b60 Mon Sep 17 00:00:00 2001
From: Benjamin Knofe-Vider <benjamin.k@posthog.com>
Date: Tue, 16 Jun 2026 15:45:00 +0200
Subject: [PATCH] docs(design): add managed-warehouse compute-seconds billing
 design

Design doc for per-org compute billing of managed-warehouse worker pods
(remote k8s backend). Covers the billable unit (compute_seconds = worker
size x connection wall-clock), connection-end metering, config-store
aggregation buffer + leader drain with at-most/least-once delivery, and
emit via public ingestion capture events that ride the existing
usage_report -> billing pipeline.

Draft / iterating; one open item (billing-side usage_key + token registration).
---
 docs/design/billing-compute-seconds-plan.md | 505 ++++++++++++++++++++
 1 file changed, 505 insertions(+)
 create mode 100644 docs/design/billing-compute-seconds-plan.md

diff --git a/docs/design/billing-compute-seconds-plan.md b/docs/design/billing-compute-seconds-plan.md
new file mode 100644
index 00000000..601ecc85
--- /dev/null
+++ b/docs/design/billing-compute-seconds-plan.md
@@ -0,0 +1,505 @@
+# Compute-Seconds Billing — Design / Implementation Plan
+
+Status: **DRAFT — iterating**. Last updated 2026-06-16.
+
+This is a living design doc. Decisions marked ✅ are locked; ❓ are open forks.
+
+> **Scope: remote Kubernetes backend only** (`--mode control-plane
+> --worker-backend remote`, `-tags kubernetes`). This is where a client session
+> holds a dedicated per-org worker **pod** (one-session-per-worker contract) with
+> a known CPU/mem size, which is the thing we bill. Standalone and process
+> backends are out of scope — no per-org worker pod, no `WorkerProfile`, metering
+> is simply not wired there.
+
+---
+
+## 1. Goal
+
+Meter and bill per-org compute usage of duckgres worker pods, reusing PostHog's
+existing usage→billing plumbing (the pattern the feature flag service uses). No
+custom side-channel store; no per-query rows in the config store; no early
+landing path. The metric is a single scalar shipped like any other PostHog
+billable resource.
+
+The DuckLake `system.query_log` is a **separate, independent** concern
+(debugging/analytics) and is NOT the billing path. This plan does not depend on
+it.
+
+✅ Billing usage_key = **`managed_warehouse_compute_seconds`**.
+
+---
+
+## 2. Billable unit — `managed_warehouse_compute_seconds` (single scalar)
+
+✅ One billable resource, one number per org/period. Worker size folded into the
+scalar (pre-weighted, like FF `local_eval ×10` / `ai_credits ×100×1.2`).
+
+```
+conn_secs       = connection wall-clock (connect → disconnect/reap/error/shutdown-close)
+compute_seconds = cu × ceil(conn_secs)           # round UP to full second
+cu              = max(cores, gib / R)            # compute-unit = dominant resource
+```
+
+- `cores` = worker pod vCPU (WorkerProfile.CPU, millicores → cores)
+- `gib`   = worker pod RAM (WorkerProfile.Memory, bytes → GiB)
+- `R = 2` ✅ — GiB-per-vCPU of the densest mainstream AWS shape (c-family 1:2).
+  Floor: any worker bills ≥ its cpu-seconds; mem-heavier shapes (m/r/x) bill more
+  on RAM. `R` is a tunable pricing constant (config/env), not a schema field.
+- `max` not sum: a pod reserves cpu+mem together → bill the binding dimension,
+  never double-count. Balanced worker → equals cpu-seconds.
+
+✅ **Rounding = ceil per connection** (one ceil per connection, at its end).
+
+Worked examples (R=2):
+
+| worker pod    | conn    | ceil | cu = max(cores, gib/2) | compute_seconds |
+|---------------|---------|------|------------------------|-----------------|
+| 8cpu / 16Gi   | 9.2s    | 10   | max(8, 8) = 8          | 80              |
+| 8cpu / 64Gi   | 10.0s   | 10   | max(8, 32) = 32        | 320             |
+| 2cpu / 4Gi    | 0.3s    | 1    | max(2, 2) = 2          | 2               |
+
+A connection holds exactly one worker pod for its whole life (one-session-per-
+worker). Warm-idle pool time *between* connections is held by no connection →
+billed to no one (our infra cost). Only live-connection time bills.
+
+---
+
+## 3. What is billed vs not (locked)
+
+✅ **Billed** — the whole connection's worker-held wall-clock, because the
+connection holds a dedicated pod the entire time:
+- query execution, transpilation, Parse/Bind/Execute (all protocols)
+- result rows streamed to the client — including slow-client transfer (worker
+  pinned sending → fair to bill)
+- inbound `COPY FROM` upload
+- **idle-but-connected time** — an open connection holds the pod, unavailable to
+  others → billed
+- a long query that **dies on a user error** — pod held the whole time, and the
+  connection still reaches its end report → billed (automatic)
+
+❌ **Not billed / lost (for now):**
+- Catalog ATTACH / worker activation — per-session setup before the client gets
+  the session, attributed to no connection. Bigger catalog → slower attach, left
+  amortized-free for v1.
+- Warm-idle worker pool time (between connections) — our infra cost, not a user's.
+- **Anything lost to a hard crash** — we report only at connection end (§4), so a
+  CP-pod or worker-pod crash mid-connection loses that connection's entire bill.
+  Accepted ("a 3h query that then crashes is lost for billing, for now").
+
+---
+
+## 4. Timing model — report once at connection end ✅
+
+**We do NOT meter per query.** We bill the connection's wall-clock and **emit
+once, when the connection ends.** A connection lives entirely on one CP pod (k8s
+replaces CP pods on deploy; client connections do not migrate between pods), so
+there is exactly one start and one end — no segments.
+
+Why end-only:
+- At connection end we know the **end reason** (clean disconnect, idle reap, user
+  error, fatal error) — error handling is trivial and reads naturally.
+- No per-query accumulator, no heartbeat, no drip. Simplest possible.
+- **Loss model:** a connection contributes nothing until it ends. A hard crash
+  before the end report loses that connection's entire bill. Accepted for v1.
+
+```
+conn start: client connect (CP assigns/spawns the org's worker pod)
+conn end:   client disconnect | idle reap | fatal error
+            | CP graceful shutdown closing the connection
+on end →  compute_seconds = cu × ceil(now - connStart)
+          → add to in-process per-org counter (§6), best-effort
+```
+
+### End reason (recorded, not yet acted on)
+Capture the end reason at the report point for visibility (and future infra-error
+exclusion). **v1 bills regardless of reason** — the only thing that escapes
+billing is a crash that never reaches the report. No classification logic now.
+
+### Plant point
+The connection teardown path in `server/conn.go` (where the `clientConn` serve
+loop exits / `Close`). `cu` is on the `clientConn` (§5); `connStart` recorded at
+connection setup.
+
+---
+
+## 5. Code changes — duckgres (emit side, remote backend only)
+
+### 5.1 Plumb worker pod size onto the connection
+`WorkerProfile.CPU/Memory` is a local var in the CP connect handler
+(`controlplane/control.go:1046`), NOT reachable at connection time. Thread it
+through and precompute `cu`:
+
+1. Add fields to `clientConn` (`server/conn.go:154`): `cu float64` (or fixed-point)
+   and `connStart time.Time`. Store precomputed `cu = max(cores, gib/R)` (constant
+   for the session's worker). `cu == 0` (e.g. non-remote / unknown profile) →
+   metering skipped.
+2. Add params to `NewClientConn` (`server/exports.go:52`).
+3. Compute + pass at the call site (`controlplane/control.go:1303`) —
+   `workerProfile` in scope from `:1046`; reuse `parseK8sCPU`/`parseK8sMemory`
+   from `workerDuckDBLimits` (`control.go:1345`) to normalize cpu→cores, mem→GiB.
+
+### 5.2 Report at connection end
+- Record `connStart` at connection setup.
+- On teardown (serve-loop exit / Close): if `cu > 0`, compute
+  `cu × ceil(now - connStart)`, record end reason, add to the in-process per-org
+  counter (§6.1). **Best-effort** — wrap so a metering error never blocks or fails
+  connection teardown.
+- No per-query hooks. (Optional: also set a `compute_seconds` column on the
+  DuckLake query_log if/when enabled, for reconciliation — independent of billing.)
+
+### 5.3 Graceful shutdown flush (CP churn correctness) ✅
+Verified shutdown behavior (remote/k8s), with the facts that make report-at-end
+safe:
+
+- On SIGTERM the CP pod marks not-ready, stops accepting **new** connections, then
+  blocks on `cp.wg.Wait()` for every existing connection goroutine to return
+  (`controlplane/control.go:585-610`, `:1556-1631`, `:685/687`).
+- It **waits for the whole CONNECTION to end (client disconnect/EOF)**, NOT just
+  the current query — `messageLoop` loops over queries and returns only on client
+  disconnect, idle-timeout, or fatal error (`server/conn.go:1047-1064`). A finished
+  query does not end the connection.
+- The drain wait is **UNBOUNDED in remote mode**: `HandoverDrainTimeout = 0`
+  (`control.go:256-268`) — the old 15m default was removed after it cut in-flight
+  customer queries (CLAUDE.md/help/README still say "15m" — stale). duckgres never
+  force-closes a live connection; `activeConns` is just a counter.
+- The only hard cutoff is the CP pod's k8s `terminationGracePeriodSeconds` →
+  SIGKILL. ✅ **Prod = 24h.** With unbounded drain + a 24h grace, essentially every
+  connection reaches its **natural end on the terminating pod** → its end report
+  (§5.2) fires → billed in full.
+
+So on shutdown:
+1. Stop accepting new connections.
+2. Existing connections run to their natural end (client disconnect) on this pod;
+   each fires one complete end report (§5.2).
+3. **Final flush** of the in-process counter to the config store before
+   `os.Exit(0)`.
+
+Correct across CP pods coming/going because the buffer is the shared config store
+with UPSERT-increment (§6.3): a departing pod adds its counts before exit; the
+leader drainer + survivors carry on.
+
+> Side note: the unbounded remote drain (idle connections can hold a terminating
+> CP pod open for up to the 24h grace wall) is a separate problem tracked in
+> PostHog/duckgres#782 — not a billing blocker.
+
+**Residual loss (A, accepted):** a connection still open at the 24h grace wall is
+SIGKILLed → no end report → its whole bill lost; likewise a hard crash. Both are
+the accepted crash case (§3/§4); with 24h grace this is rare. (If it ever matters
+— short grace, very long idle sessions — the fix is per-connection periodic
+checkpointing, see §10; NOT doing it now.)
+
+e2e must assert: roll the CP deployment while a session is connected → the pod
+stays `Terminating` until the connection ends and the session's compute_seconds
+still lands.
+
+---
+
+## 6. Emit transport & reliability — duckgres → PostHog
+
+### 6.1 Reliability principle (LOAD-BEARING) ✅
+**Losing billing data is acceptable; failing a query/connection is not.** Metering
+must never block or fail the request/teardown path.
+- Connection end writes `compute_seconds` to an **in-process per-org counter only**
+  (map + mutex, microseconds, no I/O). Even that is best-effort — swallow errors.
+- Everything downstream (flush to buffer, ship to PostHog) is **async, off-path,
+  retried**. Any downstream outage loses at most some counts; never delays/fails a
+  client connection.
+- duckgres takes **no new hard dependency** on PostHog ingestion availability.
+
+### 6.2 Pipeline shape
+```
+conn end → in-proc per-org counter (mem)        [best-effort; never fails teardown]
+        │  periodic flush (~15s), async
+        ▼
+   config store: duckgres_org_compute_usage (org × time-bucket → compute_seconds)
+        │  leader drainer (one CP pod), ~60s, retried
+        ▼
+   PostHog capture("managed warehouse compute usage",
+                   {distinct_id: org_id, count: compute_seconds})
+        │
+        ▼
+   ClickHouse → usage_report CH sum per org/period → POST billing service
+```
+Bucket keyed by **connection-end time**: `bucket_start = floor(end_time / width)`.
+
+### 6.3 Durable buffer — config store (Postgres) ✅
+No new infra (already a hard dependency of the remote control plane); aggregated
+rows only (org × time-bucket), NOT per-query rows; survives CP restarts.
+
+```
+duckgres_org_compute_usage(
+    org_id          TEXT,
+    bucket_start    TIMESTAMPTZ,
+    compute_seconds BIGINT NOT NULL,
+    PRIMARY KEY (org_id, bucket_start)
+)
+duckgres_org_compute_drain_state(
+    org_id              TEXT PRIMARY KEY,
+    last_drained_bucket TIMESTAMPTZ NOT NULL      -- high-water mark
+)
+```
+Flush = UPSERT-increment (sums across CP pods):
+```
+INSERT INTO duckgres_org_compute_usage(org_id, bucket_start, compute_seconds)
+VALUES (...)
+ON CONFLICT (org_id, bucket_start)
+DO UPDATE SET compute_seconds = duckgres_org_compute_usage.compute_seconds
+                              + EXCLUDED.compute_seconds;
+```
+(Migrations go in the config store's Goose set — see `controlplane/configstore/`.)
+
+### 6.4 Bucket close + drain (delivery = ship-then-delete ✅)
+
+**Bucket** = aligned window, key `bucket_start = floor(end_time / width)` (width
+60s). **Closed** (time-based, no coordination):
+```
+closed  ⇔  now ≥ bucket_start + width + grace
+grace   ≥  in-proc flush_interval + clock_skew_margin   (60s / 15s / 15s → grace 30s)
+```
+Grace waits out CP-pod flush lag so every contribution has landed before drain.
+
+**Delivery contract ✅ (ship-then-delete, at-least-once):** delete a bucket ONLY
+after ingestion confirms success; a ship failure keeps the row and retries next
+tick — never lose a bucket to a transient outage.
+
+**Leader drain loop** (leader-only goroutine, ~60s; NOT a k8s CronJob — runs
+alongside `leader_loop.go`/`janitor.go`):
+```
+for each org's closed, not-yet-drained buckets
+    (bucket_start ≤ now - width - grace  AND  bucket_start > last_drained_bucket):
+  read compute_seconds
+  ship capture("managed warehouse compute usage",
+               {distinct_id: org_id, count: compute_seconds},
+               event_uuid = hash(org_id, bucket_start),   -- deterministic, stable across retries
+               timestamp  = bucket_start)                 -- stable: same toDate on every retry
+  on SUCCESS: TXN { advance last_drained_bucket = bucket_start; DELETE the row }
+  on FAILURE: leave row; retry next tick                  -- no data loss
+```
+
+**Idempotency (effectively exactly-once for billing).** Only double-ship window =
+crash after ingestion ack but before the delete commits → re-ship next tick. The
+deterministic `event_uuid = hash(org_id, bucket_start)` + stable
+`timestamp = bucket_start` + fixed event/distinct_id make the re-ship **fold into
+one** in PostHog's billable query: `usage_report` counts billable events with
+`count(DISTINCT toDate(timestamp), event, cityHash64(distinct_id),
+cityHash64(uuid))` (`usage_report.py:557/584`), independent of CH merge timing.
+(Verified: CH inserts the dup row; the billable `count(distinct)` collapses it.
+Only non-billable plain `count()` queries would see both — irrelevant.)
+
+**High-water mark** also drops the late-write corner (a CP pod stalled longer than
+`grace` re-INSERTing an already-drained bucket): `bucket_start >
+last_drained_bucket` skips it, and the cleanup sweep removes it.
+
+**Cleanup is free:** the drain DELETE *is* the cleanup — `duckgres_org_compute_usage`
+only ever holds open/recent + retrying buckets; inactive orgs drain to empty.
+`drain_state` is one tiny row per org. Safety sweep: hard-delete any lingering row
+`≤ last_drained_bucket`.
+
+### 6.5 PostHog ship — public ingestion (capture, like any SDK) ✅
+The leader drainer ships each closed bucket as a **`capture()` event to PostHog's
+public ingestion endpoint over HTTPS**, exactly like any customer SDK — authed by a
+project API token. No private path, no new AWS infra.
+
+- **Why public:** the managed-warehouse EKS cluster has **no private network path**
+  to posthog (its VPC is not peered to posthog, and VPC peering is non-transitive) —
+  but it **does** have **NAT egress to the internet**. So the public ingestion host
+  is reachable today. PrivateLink was evaluated and rejected on cost: at our volume
+  (one small event per active org per ~60s bucket = single-digit GB/month) the
+  standing per-region cost of a PrivateLink (an NLB + interface endpoint per region)
+  is not worth it versus the near-zero marginal cost over the already-present NAT.
+  Data carries only `org_id` + `compute_seconds` — no PII — so public TLS is fine.
+- **Requirement:** mw EKS egress must reach the ingestion host. NAT provides
+  general internet egress; if a cluster egress NetworkPolicy/firewall is in place,
+  allow the ingestion domain. No AWS-account-level infra.
+- **Config (duckgres, remote backend):** ingestion base URL (e.g.
+  `https://us.i.posthog.com`) + a project API token, env/CLI. Unset → metering
+  ships nowhere (logs only), never fails a query.
+
+This unblocks the FF-style path end to end:
+`capture() → ClickHouse → usage_report gather → billing` (§7 now applies as-is).
+
+The duckgres side is unchanged from §6.4: config-store buffer + leader
+ship-then-delete with deterministic `event_uuid = hash(org_id, bucket_start)` +
+`timestamp = bucket_start`; "ack" = the ingestion HTTP 2xx. (The buffer + leader +
+idempotency still earn their keep: cross-pod aggregation, retry across ingestion
+blips, and exactly-once-for-billing via the deterministic uuid.)
+
+### Org identity ✅
+`org_id` is available in the control plane per connection — the natural key
+end-to-end (counter → buffer → capture `distinct_id`). (Token + any org_id→team
+mapping = billing-team config, owner will handle.)
+
+---
+
+## 7. PostHog side — billing report (no enforcement in v1)
+
+(Repo: `~/code/posthog/posthog`. Report path only.)
+
+### How posthog usage→billing works (verified) — and where we plug in
+Usage→billing = the Temporal `usage_report` workflow (`posthog/temporal/usage_report/`):
+```
+gather queries (ClickHouse + Postgres)  →  write per-org JSONL + manifest to S3
+   →  SQS pointer  →  external billing service READS the S3 files
+```
+- **All usage is sourced from CH/PG gather queries** (`usage_report/queries.py`);
+  the S3/SQS part is the outbound handoff to billing. There is no inbound
+  "read usage from a bucket" path — and we don't need one.
+- **Our plug-in point = ClickHouse**, exactly like feature flags: duckgres ships
+  `"managed warehouse compute usage"` **capture events** (§6.5) which land in CH;
+  we add a **gather query** that sums them (§7.2), and it rides the existing
+  S3→SQS→billing handoff. No cross-account data plumbing.
+- `org_id == posthog team_id` (`products/data_warehouse/backend/api/data_warehouse.py:846`)
+  — identity is trivial. ✅
+- Managed warehouse today = provisioning-proxy only, **zero metering**; no
+  `compute_seconds`, no managed-warehouse quota resource, no SKU. All net-new.
+- `usage_report/activities.py`, `storage.py:31`, `settings/object_storage.py:45-48`;
+  FF reference (capture-events-into-CH + gather): `flag_analytics.py:69/161`,
+  `usage_report.py:557/584/968`.
+
+### Integration seam ✅ (resolved — capture events land it in CH)
+duckgres ships compute_seconds as **`capture()` events to public ingestion**
+(§6.5). Those events land in posthog's **ClickHouse** like any other event, so the
+existing usage→billing pipeline picks them up via a gather query — **no cross-account
+data plumbing, no new inbound consumer.** This is exactly the feature-flag
+`"decide usage"` pattern: ship usage events, then a gather query sums them.
+
+Concretely: add a CH gather query for the `"managed warehouse compute usage"`
+events (§7.2). It rides the existing `usage_report` → S3 → SQS → billing handoff
+for free. The earlier "how does mw data reach posthog CH/PG" blocker is gone —
+public capture *is* the landing.
+
+### No enforcement in v1
+
+✅ **No enforcement / quota-limiting in v1.** Meter + report only. No
+`is_team_limited`, no admission gate, no shutdown in duckgres. (Enforcement
+plumbing exists and can be wired later. Out of scope now.)
+
+### 7.1 Register the billable resource
+- Add `MANAGED_WAREHOUSE_COMPUTE_SECONDS = "managed_warehouse_compute_seconds"` to
+  `QuotaResource` (`ee/billing/quota_limiting.py:73`) + `UsageCounters` (`:119`).
+- Coordinate the usage_key with the billing service `default_plans_config.yml`.
+
+### 7.2 Usage report field
+- Add `managed_warehouse_compute_seconds_in_period` (flat int) to
+  `UsageReportCounters` (`posthog/tasks/usage_report.py:120`).
+- Add a CH counter `get_teams_with_managed_warehouse_compute_seconds_in_period`
+  over the `"managed warehouse compute usage"` events, token-gated, using
+  `count(distinct ...)` (mirror `:557/:584/:968`).
+- Assemble in `_get_team_report` (`:2402`); flows to billing via existing
+  `send_report_to_billing_service` (`:425`). No new transport.
+
+---
+
+## 8. Testing
+
+### duckgres (unit)
+- `compute_seconds = cu × ceil(conn_secs)` math (R, dominant-resource `max`, ceil,
+  cpu/mem normalization, `cu==0` skip). End-reason capture. Best-effort property (a
+  metering error never fails teardown). Flush UPSERT-increment + drain
+  ship-then-delete + high-water + idempotent uuid.
+
+### duckgres (e2e — `tests/e2e-mw-dev/harness.sh`, required per CLAUDE.md)
+Provision an org with a known worker profile, then assert emitted compute_seconds:
+- run a query, disconnect → ≈ cu × ceil(connection wall-clock).
+- connect, sit **idle**, disconnect → idle time billed.
+- long query that errors with a **user error**, then disconnect → still billed.
+- large-result / throttled SELECT + a COPY FROM → their time inside the connection.
+- **CP churn:** roll the control-plane deployment mid-session → the session's
+  compute_seconds still lands (graceful drain + shutdown flush work).
+- **reliability:** make the durable buffer unreachable → queries/connections all
+  still succeed; counts silently dropped.
+- both metadata backends (cnpg + ext) where it touches metadata.
+
+### posthog (unit)
+New CH counter + usage_report field + quota_limiting resource registration,
+following FF test patterns.
+
+---
+
+## 9. Phasing
+
+1. **Meter + plumbing (duckgres, no emit):** plumb worker pod size + `cu` onto
+   `clientConn`, record `connStart`, compute `compute_seconds` at connection end,
+   write to in-proc counter, log to slog. Unit tests. Reliability property in
+   place. Verifiable in isolation.
+2. **Durable buffer + flush + shutdown (duckgres):** the two config-store tables
+   (Goose migration), periodic async flush, cross-pod aggregation,
+   graceful-shutdown flush (§5.3). Assert flush failures never touch the request
+   path.
+3. **Drain → PostHog (duckgres):** leader drainer + posthog-go capture client +
+   token, ship-then-delete with deterministic uuid/timestamp. e2e: event lands AND
+   queries survive PostHog being down AND survive a CP roll.
+4. **Billing report (posthog):** register resource + usage_report CH counter +
+   send-to-billing. Coordinate usage_key with billing service.
+
+Enforcement = possible later follow-up, NOT in this plan. Each phase
+independently shippable/testable.
+
+---
+
+## 10. Open decisions (tracker)
+
+- ❓ (owner: billing team, not code) provision a billing-analytics validity token
+   for the `"managed warehouse compute usage"` event; register usage_key
+   `managed_warehouse_compute_seconds` + any org_id→team mapping.
+
+Resolved / out of scope:
+- ✅ **Intervals = bucket width 60s / in-proc flush 15s / grace 30s / drain tick 60s.**
+- ✅ **Emit transport = public ingestion** (capture events over HTTPS via NAT,
+   like any SDK). PrivateLink evaluated + rejected on cost (standing per-region
+   NLB+endpoint cost not worth it at single-digit-GB/mo volume). Posthog
+   integration = capture events land in CH + a gather query in
+   `usage_report/queries.py` → rides existing S3→SQS→billing. No cross-account
+   plumbing, no new posthog inbound consumer.
+- ✅ Shutdown loss model = **A (report-at-end)**. Verified: remote drain is
+   unbounded (`HandoverDrainTimeout=0`), waits for the whole connection to end
+   (client disconnect, not query), no force-close; only cutoff is pod
+   `terminationGracePeriodSeconds` = **24h in prod** → connections finish
+   naturally and bill in full. Lose only connections >24h or hard crashes
+   (accepted). Per-connection checkpointing (B) considered, NOT done now.
+- ✅ **Scope = remote Kubernetes backend only** (per-org worker pod with a known
+   size). Standalone/process not wired.
+- ✅ usage_key = `managed_warehouse_compute_seconds`.
+- ✅ Bill **connection wall-clock** (incl. idle); one whole worker pod per session.
+- ✅ Report **once at connection end** — end reason known then; no segments (k8s
+   CP pods don't hand connections over), no drip. A hard crash mid-connection
+   loses that bill (accepted, incl. a 3h query).
+- ✅ Unit `compute_seconds = max(cores, gib/2) × ceil(conn_secs)`, R=2; ceil
+   per connection.
+- ✅ User-error query billed (automatic); infra-failure handled only by
+   crash-loses-it for v1 (no classifier).
+- ✅ Graceful CP shutdown flushes connection-end reports + in-proc counter;
+   correct across CP-pod churn via shared config-store UPSERT buffer.
+- ✅ Durable buffer = config store, aggregated `duckgres_org_compute_usage`.
+- ✅ Drain = leader goroutine (not CronJob); time-based bucket close; delivery
+   ship-then-delete (at-least-once), deterministic uuid+timestamp → billable
+   count(distinct) folds duplicates; per-org high-water + DELETE = cleanup.
+- ✅ No enforcement / quota-limiting in v1.
+- ✅ Reliability: request/teardown path never fails on metering; lose over fail.
+- ✅ org_id is the identity key, available in the control plane.
+- (deferred) ATTACH / activation cost recovery — out of scope v1.
+
+---
+
+## 11. Reference — key code locations
+
+duckgres (remote backend):
+- connection teardown / serve-loop exit (plant point) — `server/conn.go`
+  (`clientConn` lifecycle; exact line TBD)
+- graceful shutdown / drain — `--handover-drain-timeout`; CP stop path
+  `controlplane/control.go` (`stopQueryLogger` neighborhood, `:1657/:1842`)
+- worker profile origin — `controlplane/control.go:1046`; sizing
+  `workerDuckDBLimits` `:1345` (`parseK8sCPU`/`parseK8sMemory`)
+- clientConn struct — `server/conn.go:154`; `NewClientConn` `server/exports.go:52`;
+  call site `control.go:1303`
+- leader loop (drain home) — `controlplane/leader_loop.go`, `janitor.go`
+- config store + migrations — `controlplane/configstore/` (Goose)
+- ATTACH (not billed) — `server/server.go:1204` `ActivateDBConnection`
+- query_log chokepoint (optional reconciliation column) — `server/querylog.go:487`
+
+posthog:
+- `QuotaResource` / billable `count(distinct)` — `ee/billing/quota_limiting.py:73`;
+  `posthog/tasks/usage_report.py:557/584`
+- usage report dataclass + assembly + send —
+  `posthog/tasks/usage_report.py:120 / :2402 / :425`
+- FF emit reference — `products/feature_flags/backend/flag_analytics.py:69/161`