Skip to content

feat(observability): per-request performance timing → Server-Timing (perf-tuning mode) #2408

Description

@os-zhuang

Problem

We currently cannot answer "where did a request's time go?" from inside the platform. When the cloud control-plane data API feels slow (e.g. a /api/v1/data/sys_environment console request measured at ~360ms of row-count-independent server time on top of network), external black-box probing cannot break it down. The cost is split across auth/session resolution, org-scope beforeFind hooks, the find + count SQL, and serialization — all invisible from outside.

Performance analysis should be a basic platform capability: a "perf-tuning mode" where an operator can see, per request, exactly how the time was spent — without standing up an external tracing backend.

What already exists (reuse, don't reinvent)

  • MetricsRegistry (counter/histogram/gauge) — packages/observability/src/contracts.ts:40
  • instrumentRouteHandler already times every HTTP request and assigns X-Request-Idpackages/runtime/src/observability/instrument.ts:52 (emits http_request_duration_ms)
  • HookMetricsRecorder already times every hook execution — packages/objectql/src/hook-metrics.ts:41
  • ObservabilityServicePlugin registers the metrics service — packages/runtime/src/observability/observability-service-plugin.ts:71

What's missing (this card)

  1. No per-request Server-Timing response header. Metrics only flow to aggregate OTLP; nothing comes back to the browser, so a developer can't open DevTools → Network and see the breakdown for the request they just made.
  2. The SQL driver does not time individual queries. packages/plugins/driver-sql/src/sql-driver.ts never wires knex .on('query'|'query-response'), so we have no per-query duration or per-request query count — the single most useful number for diagnosing "N sequential round-trips".
  3. No request-scoped collector. No AsyncLocalStorage threads a "current request" through find/count/hook/driver, so per-query timings can't be attributed back to the originating HTTP request.

Proposed design

A request-scoped timing collector, gated, that emits a standard Server-Timing header.

  • Collector: an AsyncLocalStorage<TimingCollector> in @objectstack/runtime (or @objectstack/observability). When inactive, record() is a no-op → zero overhead when the mode is off.
  • Entry/exit: instrumentRouteHandler (instrument.ts:52) opens the scope at request start; at response it serializes the collected spans into a Server-Timing header. The dispatcher result already supports headershttp-dispatcher.ts (HttpDispatcherResult.response.headers).
  • SQL spans: wire knex .on('query') + .on('query-response') in SqlDriver (constructor, sql-driver.ts:523), keyed by __knexQueryUid, recording { sql, durationMs } into the active collector. Surfaces total DB time + query count + slowest query.
  • Hook spans: feed the existing HookMetricsRecorder durations (hook-metrics.ts) into the collector too — splits "hook time" from "DB time".
  • Auth/session span: time the identity/session resolution in the dispatcher (http-dispatcher.ts executionContext resolution) — the prime suspect for the unexplained overhead.
  • Header shape (RFC: Server-Timing):
    Server-Timing: auth;dur=42, db;dur=210;desc="6 queries", hooks;dur=18, serialize;dur=7, total;dur=355
    

Gating

  • Off by default. Enabled per-request via a debug header (X-OS-Debug-Timing: 1) requiring admin/service identity, or globally via an env flag (OS_PERF_TIMING=1) for an environment under investigation.
  • Optional richer JSON (per-query SQL list, slowest query) only for admin/service — never leak SQL text to ordinary users.

Open-core boundary

This is a mechanism → lives in the framework (the request path, driver, engine all do). Policy (who may see it, whether it's exposed in the cloud console as a "perf panel") is cloud-side. Mirrors the existing observability split (mechanism in framework, exporter/gating in cloud — see docs/design/observability.md).

Acceptance criteria

  • With the mode on, a /api/v1/data/<object> response carries a Server-Timing header decomposing total into at least: auth, db (with query count), hooks, serialize.
  • Per-query SQL timing attributed to the originating request (works under concurrency — ALS, not globals).
  • Zero measurable overhead when the mode is off (no ALS scope, no-op record()).
  • Gated to admin/service identity (or env flag); SQL text never exposed to non-admins.
  • Unit test: collector accumulates + serializes; integration test: header present on a real find and reflects query count.

Why now

This is the tool that finally breaks down the ~360ms control-plane data-path overhead we've been chasing by external probing (see cloud perf investigation). It's both the diagnostic we need and a permanent platform capability.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions