Add Initializing state and optimize speed to init #136

Merged
sjmiller609 merged 33 commits into main from codex/max-speed-initializing-concurrency on Mar 10, 2026

sjmiller609 commented on Mar 9, 2026

Summary

Add Initializing state

  • add Initializing to the public instance lifecycle state model and OpenAPI
  • gate Running on explicit guest boot markers (HYPEMAN-PROGRAM-START, HYPEMAN-AGENT-READY), with skip_guest_agent=true bypassing agent readiness only
  • enforce guest-agent readiness startup gate in exec mode before launching guest program
  • persist and hydrate boot markers from serial logs (including rotated logs)
  • treat Initializing as VMM-active for liveness/cleanup/resource logic
  • allow stop/delete/get/list while Initializing; keep running-only ops blocked
  • update docs in lib/instances/README.md, lib/system/README.md, and lib/network/README.md
  • update tests for the new immediate lifecycle semantics (Create/Start/Restore may return Initializing)
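
The gating rule in the bullets above can be sketched as a small pure function. This is a simplified illustration, not the PR's code: the `State` type, the `BootMarkers` struct, and `deriveState` are hypothetical names, and the real lifecycle model has more states than the three shown here.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the instance lifecycle enum.
type State string

const (
	StateStopped      State = "Stopped"
	StateInitializing State = "Initializing"
	StateRunning      State = "Running"
)

// BootMarkers records which guest sentinels have been seen (assumed shape).
type BootMarkers struct {
	ProgramStarted bool // HYPEMAN-PROGRAM-START observed
	AgentReady     bool // HYPEMAN-AGENT-READY observed
}

// deriveState reports Running only once the required markers are present.
// skipGuestAgent bypasses the agent-ready requirement only; the
// program-start marker is always required.
func deriveState(vmmActive bool, m BootMarkers, skipGuestAgent bool) State {
	if !vmmActive {
		return StateStopped
	}
	if !m.ProgramStarted {
		return StateInitializing
	}
	if !skipGuestAgent && !m.AgentReady {
		return StateInitializing
	}
	return StateRunning
}

func main() {
	// Program started, agent not yet ready: still Initializing.
	fmt.Println(deriveState(true, BootMarkers{ProgramStarted: true}, false))
	// Same markers with skip_guest_agent=true: Running.
	fmt.Println(deriveState(true, BootMarkers{ProgramStarted: true}, true))
}
```

Keeping the derivation free of I/O makes each marker combination easy to test in isolation.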

Speed up Initializing

  • replace exec-mode guest-agent readiness polling with an event-driven FD handshake
  • move kernel-headers installation off the critical startup path
    • exec mode: async worker process
    • systemd mode: injected oneshot service
  • add kernel-headers status tracking and serial sentinels (START/READY/FAILED)
  • add fast-path skip when matching headers and build symlink are already installed
  • parallelize independent init work (configureNetwork + mountVolumes) with staged barriers
  • remove unnecessary network.target dependency from hypeman-agent.service
  • update README docs for new startup semantics and active-state notes
  • harden TestRegistryPushAndCreateInstance to use a long-running command (sleep infinity) under stricter Running semantics
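
A minimal sketch of "parallel work with staged barriers", using only the Go standard library. The helper `runStage` and the stubbed steps are hypothetical; only the step names `configureNetwork` and `mountVolumes` come from the list above, and the actual init code may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStepFailed is a placeholder error for demonstrating failure handling.
var errStepFailed = errors.New("step failed")

// runStage runs every step concurrently and returns only after all of them
// have finished (the barrier). Errors from all steps are joined; the result
// is nil when every step succeeds.
func runStage(steps ...func() error) error {
	errs := make([]error, len(steps))
	var wg sync.WaitGroup
	for i, step := range steps {
		wg.Add(1)
		go func(i int, step func() error) {
			defer wg.Done()
			errs[i] = step() // each goroutine writes its own slot
		}(i, step)
	}
	wg.Wait()
	return errors.Join(errs...)
}

func main() {
	// Stubs standing in for the real init steps.
	configureNetwork := func() error { fmt.Println("network configured"); return nil }
	mountVolumes := func() error { fmt.Println("volumes mounted"); return nil }
	launchProgram := func() error { fmt.Println("program launched"); return nil }

	// Stage 1: independent steps run in parallel.
	if err := runStage(configureNetwork, mountVolumes); err != nil {
		fmt.Println("stage 1 failed:", err)
		return
	}
	// Stage 2: starts only after the stage 1 barrier has been passed.
	if err := runStage(launchProgram); err != nil {
		fmt.Println("stage 2 failed:", err)
	}
}
```

The two stage 1 lines may print in either order, but "program launched" always prints last because stage 2 cannot begin until `wg.Wait()` returns.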

Initializing Performance

Measured on deft-kernel-dev with the same 5-run harness (TestMeasureCreateToRunning5Runs, minimal workload: alpine, cmd=["sleep","infinity"], networking disabled), comparing this branch to codex/initializing-state-readiness-gate:

  • Baseline median (codex/initializing-state-readiness-gate): 2077ms
  • This branch median (codex/max-speed-initializing-concurrency): 614ms
  • Improvement: 1463ms faster (70.4% reduction, 3.38x faster)

Run samples:

  • Baseline runs: [1958, 1965, 2077, 2116, 2158]
  • This branch runs: [596, 601, 614, 634, 728]
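
The reported figures can be re-derived from the run samples. The following check is only arithmetic on the numbers listed above; `median` is a small helper written for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle element of an odd-length sample set.
func median(samples []int) int {
	s := append([]int(nil), samples...) // copy so the input is not reordered
	sort.Ints(s)
	return s[len(s)/2]
}

func main() {
	baseline := []int{1958, 1965, 2077, 2116, 2158}
	branch := []int{596, 601, 614, 634, 728}

	b, n := median(baseline), median(branch)
	saved := b - n
	fmt.Printf("baseline=%dms branch=%dms saved=%dms\n", b, n, saved)
	fmt.Printf("reduction=%.1f%% speedup=%.2fx\n",
		100*float64(saved)/float64(b), float64(b)/float64(n))
	// Prints:
	// baseline=2077ms branch=614ms saved=1463ms
	// reduction=70.4% speedup=3.38x
}
```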

Note

High Risk
Changes the core VM lifecycle/state model and guest boot orchestration (readiness gating, marker parsing/persistence, async work), which can affect API semantics and operational flows if markers or timing behave unexpectedly across environments.

Overview
Adds a new public instance lifecycle state, Initializing, and changes state derivation so Running is only reported once the guest emits boot progress sentinels (program start and, unless skip_guest_agent=true, guest-agent ready). Instance CRUD/liveness/resource/network behaviors are updated to treat Initializing as VMM-active, allow stop/delete during Initializing, preserve TAPs for initializing VMs, and propagate the new enum through OpenAPI/SDK generation.
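
One way to picture how the boot progress sentinels could be recovered from serial output, including a rotated log, is a line scan that accumulates markers across files. This is a sketch under assumptions: the marker strings come from the PR description, but the substring matching, the `bootMarkers` type, and the function names are hypothetical, and the real parser is described as throttled and time-bounded, which is not modeled here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

const (
	markerProgramStart = "HYPEMAN-PROGRAM-START"
	markerAgentReady   = "HYPEMAN-AGENT-READY"
)

// bootMarkers is an assumed in-memory shape for the hydrated markers.
type bootMarkers struct {
	programStarted bool
	agentReady     bool
}

// scanBootMarkers reads r line by line and sets any marker it finds.
// It stops early once both markers have been seen. Note that bufio.Scanner
// rejects lines longer than its default buffer limit.
func scanBootMarkers(r io.Reader, m *bootMarkers) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, markerProgramStart) {
			m.programStarted = true
		}
		if strings.Contains(line, markerAgentReady) {
			m.agentReady = true
		}
		if m.programStarted && m.agentReady {
			return nil
		}
	}
	return sc.Err()
}

// hydrateFromLogs scans log contents oldest first (rotated, then current).
func hydrateFromLogs(logs ...string) (bootMarkers, error) {
	var m bootMarkers
	for _, log := range logs {
		if err := scanBootMarkers(strings.NewReader(log), &m); err != nil {
			return m, err
		}
	}
	return m, nil
}

func main() {
	rotated := "kernel booting\nHYPEMAN-PROGRAM-START\n"
	current := "agent listening\nHYPEMAN-AGENT-READY\n"
	m, err := hydrateFromLogs(rotated, current)
	fmt.Printf("%+v err=%v\n", m, err)
}
```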

Reworks guest startup to reduce time-to-Running: exec mode now uses an event-driven guest-agent readiness FD handshake (no polling) before launching the entrypoint, kernel-headers setup is moved off the critical path (async worker in exec mode, injected oneshot in systemd mode) with status tracking and serial sentinels, and independent init steps (network + volume mounts) run in parallel. Tests and CI are adjusted for the new immediate semantics (Create/Start/Restore may return Initializing), including stronger waiting helpers, QEMU availability checks, and more robust apt/QEMU tooling setup.
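
The difference between polling and an event-driven FD handshake can be illustrated with a pipe: the waiter blocks in a single read and wakes as soon as the other side writes, with no sleep loop. This is a self-contained simulation, not the PR's implementation; `waitReady` and `simulateHandshake` are hypothetical names, and the "agent" is a goroutine here, whereas a real parent process might hand the write end to a child (for example via `exec.Cmd.ExtraFiles`).

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"time"
)

// waitReady blocks until one byte arrives on r, the writer closes without
// signalling (io.EOF), or the timeout expires. On timeout the reader
// goroutine stays blocked until the caller closes r.
func waitReady(r *os.File, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		_, err := io.ReadFull(r, buf)
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errors.New("timed out waiting for readiness signal")
	}
}

// simulateHandshake plays both sides: a goroutine acts as the agent and
// either writes a readiness byte or exits without signalling.
func simulateHandshake(agentBecomesReady bool) error {
	r, w, err := os.Pipe()
	if err != nil {
		return err
	}
	defer r.Close()
	go func() {
		defer w.Close()
		if agentBecomesReady {
			_, _ = w.Write([]byte{1})
		}
	}()
	return waitReady(r, 2*time.Second)
}

func main() {
	fmt.Println("agent ready:", simulateHandshake(true))
	fmt.Println("agent exited early:", simulateHandshake(false))
}
```

Treating a closed pipe as a failure lets the waiter distinguish "agent died before becoming ready" from "agent is slow", which a fixed-interval poll cannot do as promptly.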

Written by Cursor Bugbot for commit 192f095. This will update automatically on new commits.

github-actions bot commented Mar 9, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: init: speed up Initializing with staged concurrency and async headers

hypeman-openapi
Your SDK build had at least one "note" diagnostic.
generate ✅

⚠️ hypeman-typescript
Your SDK build had at least one "error" diagnostic.
generate ❗ build ✅ lint ✅ test ✅
npm install https://pkg.stainless.com/s/hypeman-typescript/abd4d6d38b8acf6d153a286242a58661086519b0/dist.tar.gz

hypeman-go
Your SDK build had at least one "note" diagnostic.
generate ✅ build ⏭️ lint ✅ test ✅
go get github.com/stainless-sdks/hypeman-go@0b3751a37068dca3771ad04fc1af876c5e52384f

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-03-10 14:43:28 UTC

sjmiller609 changed the title from "init: speed up Initializing with staged concurrency and async headers" to "Add Initializing state and optimize speed to init" on Mar 9, 2026
sjmiller609 marked this pull request as ready for review on March 9, 2026 03:22
cursor bot requested review from hiroTamada and rgarcia on March 9, 2026 03:25

@cursor cursor bot left a comment

Automated risk assessment for this PR: Medium-High risk.

Why this risk level (from code diff evidence):

  • Large behavioral change in production lifecycle logic across lib/instances/* (new Initializing state and state-derivation path changes).
  • Infrastructure/startup-path modifications in lib/system/init/* (async kernel-header worker, exec/systemd boot gating, service injection).
  • Cross-surface impact (openapi.yaml, generated lib/oapi/oapi.go, API/runtime behavior) with broad blast radius.

Decision:

  • Code review is required.
  • Per policy for Medium-High risk PRs, I did not self-approve.
  • Requested reviewers: @hiroTamada and @rgarcia.

@sjmiller609 sjmiller609 enabled auto-merge (squash) March 9, 2026 15:01
…alizing-concurrency

# Conflicts:
#	lib/instances/qemu_test.go
#	lib/network/allocate.go
#	lib/oapi/oapi.go

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 5 potential issues.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

hiroTamada (Contributor) left a comment

Thorough and well-structured PR. The Initializing state model is clean, boot marker detection is robust (throttled, time-bounded, rotation-aware), and the latency optimizations are meaningful — async kernel headers, parallel init stages, and event-driven agent readiness gate.

Two inline comments flagged:

  1. Boot marker filtering in kernel/kernel: The new sentinels (HYPEMAN-PROGRAM-START, HYPEMAN-AGENT-READY, HYPEMAN-HEADERS-*) will flow through app.log → s2 stream → user-facing logs in builder VMs. shouldEmitBuildLogLine in kernel/kernel needs to filter these out — tracked as a follow-up.
  2. No metric for time spent in Initializing: Could be a useful operational signal for detecting slow boots — optional follow-up.
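
As an illustration of the first follow-up, a log filter for the boot sentinels might look like the sketch below. It is hypothetical: `isBootSentinel` and `filterUserLog` are invented names, not the existing `shouldEmitBuildLogLine`, and it assumes each sentinel appears at the start of a line (after optional whitespace), which the PR does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// bootSentinels lists the marker prefixes named in this PR.
var bootSentinels = []string{
	"HYPEMAN-PROGRAM-START",
	"HYPEMAN-AGENT-READY",
	"HYPEMAN-HEADERS-", // prefix covering the START/READY/FAILED sentinels
}

// isBootSentinel reports whether a log line begins with a boot sentinel.
func isBootSentinel(line string) bool {
	trimmed := strings.TrimSpace(line)
	for _, s := range bootSentinels {
		if strings.HasPrefix(trimmed, s) {
			return true
		}
	}
	return false
}

// filterUserLog drops sentinel lines and keeps everything else in order.
func filterUserLog(lines []string) []string {
	var out []string
	for _, l := range lines {
		if !isBootSentinel(l) {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	lines := []string{
		"HYPEMAN-PROGRAM-START",
		"building layer 1",
		"HYPEMAN-HEADERS-READY",
		"build complete",
	}
	fmt.Println(filterUserLog(lines))
}
```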

LGTM.

sjmiller609 merged commit 3ee0c14 into main on Mar 10, 2026
4 of 6 checks passed
sjmiller609 deleted the codex/max-speed-initializing-concurrency branch on March 10, 2026 14:42

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 5 potential issues.

sjmiller609 (Collaborator, Author) left a comment

Submitting stale pending review so follow-up thread replies can be posted.
