Skip to content

Add production startup and TTFT telemetry#22198

Open
mzeng-openai wants to merge 2 commits into
mainfrom
dev/mzeng/startup_latency_telemetry
Open

Add production startup and TTFT telemetry#22198
mzeng-openai wants to merge 2 commits into
mainfrom
dev/mzeng/startup_latency_telemetry

Conversation

@mzeng-openai
Copy link
Copy Markdown
Collaborator

@mzeng-openai mzeng-openai commented May 11, 2026

Why

While investigating codex exec hi startup latency, the useful questions were not "is startup slow?" but "which durable bucket is slow in production?"

The path we observed has a few distinct stages:

  1. thread/start creates the session
  2. startup prewarm builds the turn context, tools, and prompt
  3. startup prewarm warms the websocket
  4. the first real turn resolves the prewarm
  5. the model produces the first token

Before this PR, production telemetry had some of the raw measurements already:

  • aggregate startup-prewarm duration / age-at-first-turn metrics
  • TTFT as a metric
  • websocket request telemetry

But there was no coherent production event stream for the startup breakdown itself, and TTFT was metric-only. That made it hard to answer the same latency questions from OpenTelemetry-backed logs without adding one-off local instrumentation.

What changed

Add durable production telemetry on the existing SessionTelemetry path:

  • new codex.startup_phase OTel log/trace events plus codex.startup.phase.duration_ms
  • new codex.turn_ttft OTel log/trace events while preserving the existing TTFT metric

The startup phase event is emitted for the coarse buckets we actually observed while running exec hi:

  • thread_start_create_thread
  • startup_prewarm_total
  • startup_prewarm_create_turn_context
  • startup_prewarm_build_tools
  • startup_prewarm_build_prompt
  • startup_prewarm_websocket_warmup
  • startup_prewarm_resolve

These phases are intentionally low-cardinality so they remain safe as production telemetry tags.

Why this shape

This keeps the instrumentation on the same production path as the rest of the session telemetry instead of adding a local debug-only trace mode. It also avoids changing startup behavior:

  • prewarm still runs
  • no control flow changes
  • no extra remote calls
  • no user-visible behavior changes

One boundary is intentional: very early process bootstrap that happens before a session exists is not included here, because this PR uses session-scoped production telemetry. The expensive buckets we were trying to understand after thread/start are now covered durably.

Verification

  • cargo test -p codex-otel
  • cargo test -p codex-core turn_timing
  • cargo test -p codex-core regular_turn_emits_turn_started_without_waiting_for_startup_prewarm
  • cargo test -p codex-core interrupting_regular_turn_waiting_on_startup_prewarm_emits_turn_aborted
  • cargo test -p codex-app-server thread_start
  • just fix -p codex-otel -p codex-core -p codex-app-server

I also ran cargo test -p codex-core; it built successfully and then hit an existing unrelated stack overflow in tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed.

@mzeng-openai mzeng-openai changed the title Add opt-in startup latency tracing Add production startup and TTFT telemetry May 11, 2026
@mzeng-openai mzeng-openai marked this pull request as ready for review May 11, 2026 20:41
@mzeng-openai mzeng-openai requested a review from a team as a code owner May 11, 2026 20:41
@mzeng-openai mzeng-openai requested review from alexsong-oai, owenlin0 and pakrym-oai and removed request for pakrym-oai May 11, 2026 20:41
"thread_start_create_thread",
create_thread_started_at.elapsed(),
Some("ready"),
);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why call it done here instead of at the bottom of the method?

}

#[test]
fn otel_export_routing_policy_routes_startup_and_ttft_observability() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this test useful?

Copy link
Copy Markdown
Collaborator

@owenlin0 owenlin0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments but preapproving

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants