cloudhypervisor: avoid sparse serial logs after rotation#211

Closed
alityb wants to merge 1 commit into kernel:main from
alityb:cloudhypervisor-serial-append
Conversation

@alityb

@alityb alityb commented May 8, 2026

Summary

Cloud Hypervisor's serial.mode=File opens the log path without O_APPEND, so after copytruncate rotation the next write lands at the stale fd offset and creates a sparse NUL hole. This is the same class of bug fixed for QEMU in #209.

  • Switch serial config to Socket mode so CH binds a Unix listener instead of opening the file directly
  • Hypeman connects to this socket after CreateVM (or Restore) and copies serial output into app.log through an O_APPEND fd it owns
  • Post-truncate writes correctly resume at byte 0
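
The offset behaviour described above can be reproduced without a hypervisor. The following is a minimal sketch, not code from this PR: the helper name sizeAfterTruncate and the temp-file layout are made up for illustration, and os.Truncate stands in for the truncation step of copytruncate rotation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sizeAfterTruncate (hypothetical helper) writes first through a long-lived
// fd, truncates the file by path while the fd stays open, writes second
// through the same fd, and returns the final file size.
func sizeAfterTruncate(appendMode bool, first, second string) (int64, error) {
	dir, err := os.MkdirTemp("", "serial-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	flags := os.O_CREATE | os.O_WRONLY
	if appendMode {
		flags |= os.O_APPEND
	}
	path := filepath.Join(dir, "app.log")
	f, err := os.OpenFile(path, flags, 0o644)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	if _, err := f.WriteString(first); err != nil {
		return 0, err
	}
	// Rotation truncates in place; the writer's fd keeps its old offset.
	if err := os.Truncate(path, 0); err != nil {
		return 0, err
	}
	if _, err := f.WriteString(second); err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	for _, appendMode := range []bool{false, true} {
		size, err := sizeAfterTruncate(appendMode, "0123456789", "abc")
		if err != nil {
			panic(err)
		}
		fmt.Printf("O_APPEND=%v: %d bytes after rotation\n", appendMode, size)
	}
}
```

On Linux the plain fd leaves a 13-byte file (10 NUL bytes followed by "abc"), while the O_APPEND fd leaves 3 bytes, because every append-mode write is positioned at the current end of file.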

Design

Cloud Hypervisor's serial socket mode (ConsoleOutputMode::Socket) creates a UnixListener. Hypeman dials this socket after the VM is configured (before boot) and runs a background io.Copy into the append-mode log file. When the VM shuts down or is killed, the socket closes and the copy goroutine exits naturally.
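
The dial-then-copy flow can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: serialLogger, dialWithRetry, startSerialLogger and runDemo are hypothetical names, the retry interval is arbitrary, and runDemo plays the hypervisor side with a local listener.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// serialLogger (hypothetical) copies socket bytes into an append-mode file.
type serialLogger struct {
	conn net.Conn
	file *os.File
	done chan struct{}
	once sync.Once
}

// dialWithRetry polls until the listener accepts or the timeout elapses.
func dialWithRetry(sockPath string, timeout time.Duration) (net.Conn, error) {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.Dial("unix", sockPath)
		if err == nil {
			return conn, nil
		}
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("dial %s: %w", sockPath, err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func startSerialLogger(sockPath, logPath string, timeout time.Duration) (*serialLogger, error) {
	conn, err := dialWithRetry(sockPath, timeout)
	if err != nil {
		return nil, err
	}
	f, err := os.OpenFile(logPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		conn.Close()
		return nil, err
	}
	l := &serialLogger{conn: conn, file: f, done: make(chan struct{})}
	go func() {
		defer close(l.done)
		io.Copy(f, conn) // returns once the peer closes the socket
	}()
	return l, nil
}

// Close is idempotent; it waits for the copy goroutine before closing the file.
func (l *serialLogger) Close() {
	l.once.Do(func() {
		l.conn.Close()
		<-l.done
		l.file.Close()
	})
}

// runDemo plays the hypervisor side: listen, send one line, close.
func runDemo() (string, error) {
	dir, err := os.MkdirTemp("", "serial")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	sock, logPath := filepath.Join(dir, "serial.sock"), filepath.Join(dir, "app.log")
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go func() {
		if c, err := ln.Accept(); err == nil {
			c.Write([]byte("hello from guest\n"))
			c.Close()
		}
	}()
	l, err := startSerialLogger(sock, logPath, time.Second)
	if err != nil {
		return "", err
	}
	<-l.done // socket closed, so the copy goroutine has exited
	l.Close()
	data, err := os.ReadFile(logPath)
	return string(data), err
}

func main() {
	out, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

The sync.Once in Close mirrors the idempotent-close behaviour mentioned in the review summary, so the shutdown path and the boot-failure path could both call it safely.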

The serial socket is placed at {instance_dir}/serial.sock (not inside logs/) to keep the Unix socket path well within the 108-byte sun_path limit.
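
A path-length guard along these lines could catch an over-long instance directory before Cloud Hypervisor tries to bind. It is only a sketch: serialSocketPath and maxSunPath are hypothetical names, the example directory is invented, and 108 is the Linux size of sun_path including the terminating NUL byte.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// maxSunPath is the size of sockaddr_un.sun_path on Linux, including the
// terminating NUL byte, so a usable path must be strictly shorter.
const maxSunPath = 108

// serialSocketPath (hypothetical helper) derives {instance_dir}/serial.sock
// and rejects paths that would not fit in sun_path.
func serialSocketPath(instanceDir string) (string, error) {
	p := filepath.Join(instanceDir, "serial.sock")
	if len(p) >= maxSunPath {
		return "", fmt.Errorf("socket path is %d bytes, limit is %d", len(p), maxSunPath-1)
	}
	return p, nil
}

func main() {
	// The directory below is an invented example, not a path from the PR.
	p, err := serialSocketPath("/var/lib/hypeman/instances/abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```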

Backward compatibility

  • Old snapshots with File-mode serial restore gracefully — serialLogPathsFromSnapshot returns empty strings, no logger is started, and CH uses its existing file path
  • Fork/snapshot config rewriting upgrades File mode to Socket mode, so forked instances get the fix automatically
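
The snapshot rewrite could look roughly like the sketch below. The helper name rewriteSerialToSocket is hypothetical, and the JSON keys "serial", "mode", "file" and "socket" are assumptions about the snapshot config.json layout rather than something this PR text spells out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rewriteSerialToSocket (hypothetical helper) upgrades a File-mode serial
// entry to Socket mode and drops the file path. Any other config is
// returned unchanged. The key names are assumptions.
func rewriteSerialToSocket(config []byte, sockPath string) ([]byte, error) {
	var cfg map[string]any
	if err := json.Unmarshal(config, &cfg); err != nil {
		return nil, err
	}
	serial, ok := cfg["serial"].(map[string]any)
	if !ok || serial["mode"] != "File" {
		return config, nil // nothing to upgrade
	}
	serial["mode"] = "Socket"
	serial["socket"] = sockPath
	delete(serial, "file")
	// json.Marshal sorts map keys, so the output is deterministic.
	return json.Marshal(cfg)
}

func main() {
	in := []byte(`{"serial":{"file":"/old/logs/app.log","mode":"File"}}`)
	out, err := rewriteSerialToSocket(in, "/new/serial.sock")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Decoding into a generic map keeps fields the sketch does not know about, which matters when rewriting a config written by a different Cloud Hypervisor version.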

Test plan

  • go test ./lib/hypervisor/cloudhypervisor/... — config generation, snapshot rewriting, socket logger lifecycle
  • go test -race -count=50 ./lib/hypervisor/cloudhypervisor — no races
  • go test ./lib/hypervisor/... — no regressions across all hypervisors
  • go test ./lib/instances -run 'TestAppLog|TestParseBootMarkers|TestHydrateBootMarkers|TestLog'
  • Boot a VM with this build on Linux, trigger rotateLogIfNeeded, verify no sparse NUL hole in post-rotation app.log
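
For the manual verification step, a small check like the following could flag a sparse hole. It is a sketch only: leadingNULs is a hypothetical helper, and the sample buffer stands in for the contents of a rotated app.log.

```go
package main

import "fmt"

// leadingNULs (hypothetical helper) returns how many NUL bytes sit at the
// start of data. A post-rotation log written at a stale offset begins with
// a run of NULs up to that offset.
func leadingNULs(data []byte) int {
	n := 0
	for n < len(data) && data[n] == 0 {
		n++
	}
	return n
}

func main() {
	// Sample buffer standing in for a rotated log with a 4-byte hole.
	sample := []byte("\x00\x00\x00\x00boot ok\n")
	fmt.Printf("leading NUL bytes: %d\n", leadingNULs(sample))
}
```

In practice the buffer would come from reading the post-rotation app.log, for example with os.ReadFile, and a nonzero count would indicate the bug is still present.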

Note

Medium Risk
Changes Cloud Hypervisor VM start/restore and snapshot-rewrite behavior by switching serial output from direct file writes to a managed Unix-socket logger; failures or path issues could impact VM boot/restore or log capture. Scope is localized to Cloud Hypervisor but touches lifecycle-critical code paths.

Overview
Switches Cloud Hypervisor serial logging from File to Socket mode to avoid sparse/holed app.log output after log rotation/truncation.

Hypeman now dials the CH-provided serial.sock after CreateVM/Restore, streams serial output into logs/app.log via an O_APPEND file descriptor, removes any stale socket path before boot/restore, and closes the logger on shutdown/boot failure.

Forked snapshot config.json rewriting is updated to convert serial config to socket mode (and drop file), with new helpers and tests covering path derivation, snapshot parsing/back-compat, retry/timeout dialing, and logger lifecycle/idempotent close.

Reviewed by Cursor Bugbot for commit 8b24d2c. Bugbot is set up for automated code reviews on this repo.

Cloud Hypervisor's serial.mode=File opens the log path without O_APPEND,
so after copytruncate rotation the next write lands at the stale fd offset
and creates a sparse NUL hole. This is the same class of bug fixed for
QEMU in kernel#209.

Switch serial config to Socket mode so Cloud Hypervisor binds a Unix
listener instead of opening the file directly. Hypeman connects to this
socket after CreateVM (or Restore) and copies serial output into app.log
through an O_APPEND file descriptor it owns. Post-truncate writes
correctly resume at byte 0.

Also updates fork/snapshot config rewriting to upgrade existing File-mode
serial entries to Socket mode, ensuring forked instances get the fix.

Old snapshots with File-mode serial restore gracefully (no logger is
started; serial output goes to the existing file path as before).
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR modifies Cloud Hypervisor hypervisor integration in packages/api/lib/hypervisor, not the kernel API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) specified in the filter.

To monitor this PR anyway, reply with @firetiger monitor this.

@sjmiller609
Collaborator

Thanks but already got this going over here #210

@sjmiller609 sjmiller609 closed this May 8, 2026

2 participants