Skip to content

Remove output listener read timeout for large image imports#263

Open
ejc3 wants to merge 2 commits intomainfrom
fix-output-timeout
Open

Remove output listener read timeout for large image imports#263
ejc3 wants to merge 2 commits intomainfrom
fix-output-timeout

Conversation

@ejc3
Copy link
Owner

@ejc3 ejc3 commented Feb 6, 2026

Summary

The 5-minute read timeout on the container output vsock caused the listener to exit during long image imports (10+ min for 26GB images). Container stdout/stderr was lost.

Fix

Remove the read timeout. The listener stays alive until EOF (connection closed by fc-agent) or VM exit. 6 lines added, 15 removed.

Test plan

make test-root FILTER=sanity_rootless
make test-root FILTER=localhost

The 5-minute read timeout on the container output vsock caused the
listener to exit during long image imports (10+ min). When the container
finally started, its stdout/stderr had nowhere to go.

Remove the timeout — the listener stays alive until EOF (connection
closed) or the VM exits. The VM exit handler already cleans up.
@claude-claude
Copy link
Contributor

claude-claude bot commented Feb 6, 2026

🔍 Claude Review

SEVERITY: none

Findings

No issues found. The PR correctly addresses the timeout problem for large image imports.

Analysis:

  • ✅ The 5-minute timeout was too restrictive for large image imports (10+ minutes)
  • ✅ Removing the timeout allows the listener to run until the natural EOF when fc-agent closes the connection
  • ✅ Proper cleanup mechanisms are in place:
    • Socket file cleanup at line 885
    • Task will be aborted if still running when output_handle drops (line 1433)
    • VM exit triggers vsock connection close, causing EOF
  • ✅ Code is simplified and clearer (15 deletions, 6 additions)
  • ✅ SIGTERM/SIGINT handling ensures users can still interrupt long operations

The change is well-justified and follows good async patterns.

Summary

Clean fix for a real problem. No issues identified.


Review by Claude | Run

@claude-claude
Copy link
Contributor

claude-claude bot commented Feb 6, 2026

🔧 CI Auto-Fix

Created fix PR: #264

Issue: The test was failing because the output listener task was hanging when VMs were killed abruptly during stress tests (100 concurrent VMs). The vsock connection wouldn't cleanly close, causing to block indefinitely.

Fix: Added explicit task abortion during VM cleanup by storing the output listener task handle and aborting it in .

View Claude Run

The removal of the 5-minute read timeout in run_output_listener allowed
long-running image imports to complete, but introduced a hang in stress
tests where 100 VMs are spawned/killed rapidly.

When VMs are killed abruptly, the vsock connection may not cleanly close,
causing read_line() to block indefinitely. This prevents the task from
terminating naturally.

Fix by:
1. Storing the output listener task handle (removing _ prefix)
2. Adding output_listener_handle parameter to cleanup_vm()
3. Explicitly aborting the task when cleanup_vm() is called

This ensures the listener is forcefully terminated during cleanup while
still allowing indefinite waits for legitimate long-running operations.
@claude-claude
Copy link
Contributor

claude-claude bot commented Feb 6, 2026

⚠️ CI Check Failed

The Host-Root-arm64-SnapshotDisabled test has failed. Please fix the failing test before I can complete the review.

Failed Check:

  • Host-Root-arm64-SnapshotDisabled - View logs

Review by Claude | Run

@claude-claude
Copy link
Contributor

claude-claude bot commented Feb 6, 2026

CI Failure Analysis

I analyzed CI #21755094976 but couldn't create an automatic fix.

Diagnosis

The test test_podman_healthcheck_unhealthy is timing out after 81 seconds. The test is waiting for the VM's health status to become Unhealthy, but the status remains None (VM state not found/reported).

Test expectations:

  • Container image: localhost/test-unhealthy (nginx with HEALTHCHECK CMD exit 1)
  • Healthcheck config: --interval=1s --retries=1
  • Expected: Status should become Unhealthy within seconds
  • Actual: Status stays None for 60+ seconds, test times out

What Changed in PR:

  1. Removed 5-minute read timeout on output listener
  2. Added aborting of output listener task during VM cleanup

Observed in logs:

  • Multiple Error: no such object: "fcvm-container" errors (expected during startup)
  • VM appears to start successfully
  • Health monitor should be running, but status isn't being reported

Why Not Fixable

This appears to be either:

  1. A race condition introduced by the output listener changes that prevents health status from being saved/read
  2. A flaky test that was already intermittently failing
  3. An environmental issue specific to the CI runner's timing/load

The code compiles successfully (cargo check --release passes), so there are no type errors or missing arguments. The issue is a subtle runtime behavior that requires:

  • Local reproduction with debug logging
  • Deeper investigation into health monitor task lifecycle
  • Possibly adding instrumentation to understand why VM state isn't being reported

Recommendation

  1. Check if this test passes locally on the same architecture (x64-SnapshotEnabled, arm64-SnapshotDisabled)
  2. Add debug logging to health monitor to see if it's starting/running
  3. Consider if the test needs adjustment for the new output listener behavior
  4. Check if there's a timing issue with how quickly health status is expected to be available

View Claude Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant