Add production startup and TTFT telemetry by mzeng-openai · Pull Request #22198 · openai/codex

mzeng-openai · 2026-05-11T19:41:49Z

Why

While investigating codex exec hi startup latency, the useful question was not "is startup slow?" but "which durable bucket is slow in production?"

The path we observed has a few distinct stages:

thread/start creates the session
startup prewarm builds the turn context, tools, and prompt
startup prewarm warms the websocket
the first real turn resolves the prewarm
the model produces the first token

Before this PR, production telemetry had some of the raw measurements already:

aggregate startup-prewarm duration / age-at-first-turn metrics
TTFT as a metric
websocket request telemetry

But there was no coherent production event stream for the startup breakdown itself, and TTFT was metric-only. That made it hard to answer the same latency questions from OpenTelemetry-backed logs without adding one-off local instrumentation.

What changed

Add durable production telemetry on the existing SessionTelemetry path:

new codex.startup_phase OTel log/trace events plus codex.startup.phase.duration_ms
new codex.turn_ttft OTel log/trace events while preserving the existing TTFT metric
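A minimal sketch of what recording one startup phase could look like, assuming an in-memory stand-in for the telemetry sink. RecordingTelemetry and the attribute keys (phase, duration_ms, status) are hypothetical; only the event name codex.startup_phase, the metric name codex.startup.phase.duration_ms, and the (phase, elapsed, status) argument shape come from this PR. The real SessionTelemetry API is not shown here.

```rust
use std::time::Duration;

/// Hypothetical in-memory stand-in for an OTel-backed telemetry sink.
/// It only collects what would be exported, so the shape is easy to inspect.
#[derive(Default)]
pub struct RecordingTelemetry {
    /// (event name, attributes) pairs that would become log/trace events.
    pub events: Vec<(String, Vec<(String, String)>)>,
    /// (metric name, value in milliseconds) pairs.
    pub metrics: Vec<(String, u64)>,
}

impl RecordingTelemetry {
    /// Records one startup phase as both an event and a duration metric.
    /// The attribute keys used here are assumptions, not the PR's schema.
    pub fn record_startup_phase(
        &mut self,
        phase: &'static str,
        elapsed: Duration,
        status: Option<&'static str>,
    ) {
        // Saturate instead of wrapping if the duration is absurdly large.
        let duration_ms = u64::try_from(elapsed.as_millis()).unwrap_or(u64::MAX);

        let mut attrs = vec![
            ("phase".to_string(), phase.to_string()),
            ("duration_ms".to_string(), duration_ms.to_string()),
        ];
        if let Some(status) = status {
            attrs.push(("status".to_string(), status.to_string()));
        }

        self.events.push(("codex.startup_phase".to_string(), attrs));
        self.metrics
            .push(("codex.startup.phase.duration_ms".to_string(), duration_ms));
    }
}
```

Emitting the event and the metric from one call keeps the two views of the same measurement from drifting apart, which is the property the log-based breakdown relies on.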

The startup phase event is emitted for the coarse buckets we actually observed while running exec hi:

thread_start_create_thread
startup_prewarm_total
startup_prewarm_create_turn_context
startup_prewarm_build_tools
startup_prewarm_build_prompt
startup_prewarm_websocket_warmup
startup_prewarm_resolve

These phases are intentionally low-cardinality so they remain safe as production telemetry tags.
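One way to keep the tag set closed is to model the phases as an enum that maps to static strings, so no caller can introduce a free-form tag value. This is an illustrative sketch only: StartupPhase and its methods are hypothetical names, and the PR itself may pass plain string literals instead.

```rust
/// Hypothetical closed set of startup phases; the tag strings mirror the
/// list above. A fixed enum bounds tag cardinality at compile time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StartupPhase {
    ThreadStartCreateThread,
    StartupPrewarmTotal,
    StartupPrewarmCreateTurnContext,
    StartupPrewarmBuildTools,
    StartupPrewarmBuildPrompt,
    StartupPrewarmWebsocketWarmup,
    StartupPrewarmResolve,
}

impl StartupPhase {
    /// Every phase, in the order the PR description lists them.
    pub const ALL: [StartupPhase; 7] = [
        StartupPhase::ThreadStartCreateThread,
        StartupPhase::StartupPrewarmTotal,
        StartupPhase::StartupPrewarmCreateTurnContext,
        StartupPhase::StartupPrewarmBuildTools,
        StartupPhase::StartupPrewarmBuildPrompt,
        StartupPhase::StartupPrewarmWebsocketWarmup,
        StartupPhase::StartupPrewarmResolve,
    ];

    /// Returns the tag value used in telemetry for this phase.
    pub fn as_str(self) -> &'static str {
        match self {
            StartupPhase::ThreadStartCreateThread => "thread_start_create_thread",
            StartupPhase::StartupPrewarmTotal => "startup_prewarm_total",
            StartupPhase::StartupPrewarmCreateTurnContext => {
                "startup_prewarm_create_turn_context"
            }
            StartupPhase::StartupPrewarmBuildTools => "startup_prewarm_build_tools",
            StartupPhase::StartupPrewarmBuildPrompt => "startup_prewarm_build_prompt",
            StartupPhase::StartupPrewarmWebsocketWarmup => {
                "startup_prewarm_websocket_warmup"
            }
            StartupPhase::StartupPrewarmResolve => "startup_prewarm_resolve",
        }
    }
}
```

Because the match is exhaustive, adding a phase forces a deliberate choice of its tag string rather than letting an ad hoc value reach production.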

Why this shape

This keeps the instrumentation on the same production path as the rest of the session telemetry instead of adding a local debug-only trace mode. It also avoids changing startup behavior:

prewarm still runs
no control flow changes
no extra remote calls
no user-visible behavior changes
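The "no control flow changes" property can be illustrated with a timing wrapper that returns the wrapped work's result untouched and records the duration only as a side effect. This is a sketch under assumptions: time_phase and its record callback are hypothetical helpers, not functions from this PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `work`, reports how long it took, and hands
/// back exactly what `work` returned (including any `Err`), so the caller's
/// control flow is the same with or without the instrumentation.
pub fn time_phase<T>(
    phase: &'static str,
    mut record: impl FnMut(&'static str, Duration),
    work: impl FnOnce() -> T,
) -> T {
    let started_at = Instant::now();
    let output = work();
    // Recording happens after the work completes and cannot alter `output`.
    record(phase, started_at.elapsed());
    output
}
```

A caller would wrap an existing step, for example the tool-building stage, and keep using the returned value as before; the only new effect is the callback invocation.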

One boundary is intentional: very early process bootstrap that happens before a session exists is not included here, because this PR uses session-scoped production telemetry. The expensive buckets we were trying to understand after thread/start are now covered durably.

Verification

cargo test -p codex-otel
cargo test -p codex-core turn_timing
cargo test -p codex-core regular_turn_emits_turn_started_without_waiting_for_startup_prewarm
cargo test -p codex-core interrupting_regular_turn_waiting_on_startup_prewarm_emits_turn_aborted
cargo test -p codex-app-server thread_start
just fix -p codex-otel -p codex-core -p codex-app-server

I also ran cargo test -p codex-core; it built successfully and then hit an existing unrelated stack overflow in tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed.

owenlin0 · 2026-05-11T22:53:46Z

+            "thread_start_create_thread",
+            create_thread_started_at.elapsed(),
+            Some("ready"),
+        );


why call it done here instead of at the bottom of the method?

owenlin0 · 2026-05-11T22:54:55Z

 }
+
+#[test]
+fn otel_export_routing_policy_routes_startup_and_ttft_observability() {


is this test useful?

owenlin0

small comments but preapproving

mzeng-openai added 2 commits May 11, 2026 12:40

Add opt-in startup latency tracing

7f013f1

Replace local startup tracing with production telemetry

18937d8

mzeng-openai changed the title from "Add opt-in startup latency tracing" to "Add production startup and TTFT telemetry" May 11, 2026

mzeng-openai marked this pull request as ready for review May 11, 2026 20:41

mzeng-openai requested a review from a team as a code owner May 11, 2026 20:41

mzeng-openai requested review from alexsong-oai, owenlin0 and pakrym-oai and removed request for pakrym-oai May 11, 2026 20:41

owenlin0 reviewed May 11, 2026

owenlin0 approved these changes May 11, 2026

mzeng-openai wants to merge 2 commits into main from dev/mzeng/startup_latency_telemetry
