Skip to content

fix(producer): restore calibration timeout ceiling + add pipeline observability#1233

Merged
miguel-heygen merged 3 commits into
mainfrom
fix/calibration-timeout-regression
Jun 6, 2026
Merged

fix(producer): restore calibration timeout ceiling + add pipeline observability#1233
miguel-heygen merged 3 commits into
mainfrom
fix/calibration-timeout-regression

Conversation

@miguel-heygen
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen commented Jun 6, 2026

Summary

Fixes #1231 — CLI render hangs at "Initializing calibration session..." on v0.6.74.

Root cause: PR #1221 changed Math.minMath.max in createCaptureCalibrationConfig, raising the calibration protocol timeout from 30s to the default 300s. When any CDP call stalls during session init (page.goto, pollHfReady, page.evaluate), the 300s timeout makes the render appear to hang indefinitely. Note: BeginFrame is Linux-only — the reporter is on macOS where screenshot mode is always used, so this is a stalled CDP call in the screenshot init path, not a BeginFrame wedge.

Commit 1: Restore calibration timeout ceiling

  • Math.maxMath.min in createCaptureCalibrationConfig — calibration stays capped at 30s so a stuck CDP call fails fast and the fallback path recovers
  • Users who need longer timeouts for the main render can still use --protocol-timeout
  • Per-phase timing logs in initializeSession ([initSession:<mode>] <phase> (<ms>ms)) to pinpoint which step stalls on affected machines

Commit 2: Render pipeline observability

Added structured logging at four key points so future issues contain enough context for triage:

Log line What it captures
[Render] Pipeline started Platform, arch, node version, all timeout values, GPU mode
[Render] Calibration config Calibration timeout vs parent timeout, forceScreenshot, frame count
[BrowserManager] Browser launched Chrome version, capture mode, headless shell path, platform
[Render] Failure summary Failed stage, error, elapsed time, stage timings, last 5 console errors

Test plan

  • renderOrchestrator.test.ts — 59/59 pass (calibration safeguard tests updated)
  • TypeScript compiles cleanly (engine + producer)
  • Lint + format clean (oxlint + oxfmt)
  • Local render verified: all new log lines appear, healthy init completes in ~120ms
  • Pre-commit hooks pass (lefthook: lint, format, typecheck, fallow)

…der hang

The v0.6.74 change (Math.min → Math.max in createCaptureCalibrationConfig)
raised the calibration protocol timeout from 30s to the default 300s. When
a CDP call stalls during session init — page.goto, pollHfReady, or any
page.evaluate — the 300s timeout makes the render appear to hang
indefinitely at "Initializing calibration session...".

Restore Math.min so calibration stays capped at 30s: if Chrome is stuck,
fail fast and let the fallback path recover. Also add phase-level timing
logs to initializeSession so the next report pinpoints which step stalls.

Closes #1231
Log the resolved environment at pipeline start (platform, arch, node
version, all timeout values, GPU mode), the calibration config showing
the actual timeout being used vs the parent, Chrome version and capture
mode at browser launch, and a structured failure summary on error with
stage timings and console errors. These four log categories give agents
and users enough context to file actionable issues without needing to
reproduce the problem.
@miguel-heygen miguel-heygen changed the title fix(producer): restore 30s calibration timeout ceiling to prevent render hang fix(producer): restore calibration timeout ceiling + add pipeline observability Jun 6, 2026
Copy link
Copy Markdown

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Miguel — two-part PR, both clean: a real production regression fix from #1221 + observability instrumentation that's well-shaped for triage. No blockers, one small instrumentation asymmetry, a couple of nits.

The fix itself

  • Math.maxMath.min in createCaptureCalibrationConfig (captureCost.ts:163). One character, exactly the right semantic — calibration's protocolTimeout is a ceiling (cap a wedged BeginFrame fast so screenshot fallback kicks in), not a floor. The flip in #1221 made the default 300s cfg.protocolTimeout get picked, which is the user-visible "hangs at Initializing calibration session..." that #1231 reports. Confirmed: Math.min(300_000, 30_000) = 30_000 (capped) and Math.min(5_000, 30_000) = 5_000 (preserved). The docstring rewrite ("ceiling, not a floor") locks the intent for the next reader.
  • Test rename matches the restored semantic. The previous tests were renamed to "respects user protocol timeout when higher" / "uses calibration floor when user timeout is lower" in #1221 to match the (wrong) Math.max behavior — so CI stayed green through the regression. Worth a small process callout: when a test name flips its premise (floorceiling, respectscaps) the underlying contract has changed, and that's the moment for an extra sanity check on the behavior shift. Not actionable for this PR; the new test names ("caps protocol timeout at calibration ceiling for fast fallback" / "preserves user timeout when already below calibration ceiling") self-document the restored semantic cleanly.

Observability — concerns

  • Asymmetric pollVideosReady instrumentation. The BeginFrame branch logs logInitPhase("pollVideosReady complete") at frameCapture.ts:936. The screenshot branch calls pollVideosReady at frameCapture.ts:784 (the exact same gate, same pageReadyTimeout) but doesn't log the phase. The reporter on #1231 is on macOS = screenshot mode — and the per-phase logs are explicitly there to pinpoint which CDP call wedges. A future stall in pollVideosReady on the screenshot path would show up as a gap between applyVideoMetadataHints complete and images ready + decoded, which doesn't name the wedged step. One extra logInitPhase("pollVideosReady complete") line after frameCapture.ts:802 closes the gap.

Nits

  • Per-phase logs use console.log; the orchestrator-level logs use log.info. Defensible because frameCapture.ts is a lower-level module that already uses console.warn for the video/image timeout warnings — internally consistent. Worth noting only because the on-call diagnosing #1231-shaped issues will be grep'ing for both prefixes; a single structured logger pass eventually would make field aggregation simpler. Future cleanup, not blocking. (nit)
  • Log volume on the producer's hot path. ~9 logInitPhase lines per render, fires on every render. For a one-off CLI render, fine. For a producer pumping 100s/hour, that's a measurable log stream. Worth either a process.env.HF_RENDER_DEBUG gate, or pushing them to log.debug (assuming the producer's logger has level filtering) so they only stream in verbose mode. The triage value is highest when something is wrong; the 80% happy-path case doesn't need the timing trail. (nit)
  • [Render] Failure summary includes a nice browserConsoleErrors slice but filters by string-match on "ERROR"/"PAGEERROR". The actual buffer format isn't visible from the diff; if those tokens are part of a formatter prefix you control, you'll catch what you want — if they're upstream Chrome console substrings, edge-case error categories (e.g. WARNING-level entries that still indicate the wedge) get dropped. Worth a quick check that the regex matches what the buffer actually emits. (nit)
  • pollSubCompositionTimelines complete phase log fires unconditionally in both branches. If a composition has zero sub-comp timelines, the call returns ~instantly and the log line is still 100% noise. Could elide when the inner work was a no-op, but the cost is one console.log so not worth the conditional. (nit, lean toward leave as-is)
  • Browser launch log uses await browser.version().catch(() => "unknown") — adds a CDP round-trip to the launch path. browser.version() is typically cheap (cached after first call), but if you ever launch many short-lived browsers (Lambda burst case from the #1225 conversation) it's worth knowing this is per-browser, not per-render. (nit, almost certainly fine in practice)

Questions

  • Long-term, do you want these phase timings going to whatever metrics pipeline the producer uses (Datadog, Honeycomb), or is structured stdout sufficient for the on-call use case? The shape you've chosen is great for grep + tail diagnostics; metric ingestion would want the same data with a slightly different envelope (e.g. tag = phase, value = ms). Mention because the next iteration of this work might want both.
  • Were the [initSession] phase logs sized against any specific real wedged-render trace from #1231, or are they the staff-engineer-intuition pick of "the obvious split points"? If you have a wedge trace from the reporter, worth a comment in the PR describing which phase it actually stalled at — closes the feedback loop and proves the instrumentation answers the question that drove it.

What I didn't verify

  • The executeRenderJob's catch block scope — confirmed pipelineStart (renderOrchestrator.ts:1454) and lastBrowserConsole (renderOrchestrator.ts:1458) are live inside the new failure summary; the perfStages reference is presumed in scope from the surrounding context.
  • The exact format of lastBrowserConsole entries — only the filter is visible in the diff.
  • pollHfReady / pollSubCompositionTimelines internals — assumed they each have their own internal timeouts and don't depend on protocolTimeout in a way the calibration cap affects.
  • Local reproducer of #1231 (no Chromium-on-stalling-macOS rig).
  • That the asymmetry in pollVideosReady instrumentation isn't intentional (e.g., screenshot mode's call is somehow lower-cost / not worth instrumenting) — read it as oversight, happy to be told otherwise.

Review by Rames D Jusso

The BeginFrame path logged this phase but the screenshot path didn't,
creating an instrumentation gap when diagnosing hangs on macOS where
screenshot mode is always used.
@miguel-heygen miguel-heygen merged commit 1d16216 into main Jun 6, 2026
29 checks passed
@miguel-heygen miguel-heygen deleted the fix/calibration-timeout-regression branch June 6, 2026 02:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v0.6.74 regression: CLI render hangs on "Initializing calibration session..." (macOS M4)

2 participants