Skip to content

[build-tools] Add max_idle_minutes to agent-device remote session step#3781

Open
szdziedzic wants to merge 1 commit into
mainfrom
szdziedzic/agent-device-idle-timeout
Open

[build-tools] Add max_idle_minutes to agent-device remote session step#3781
szdziedzic wants to merge 1 commit into
mainfrom
szdziedzic/agent-device-idle-timeout

Conversation

@szdziedzic
Copy link
Copy Markdown
Contributor

@szdziedzic szdziedzic commented May 22, 2026

Why

The eas/start_agent_device_remote_session step keeps the turtle job alive indefinitely (new Promise<never>(() => {})) until stopDeviceRunSession cancels it. If a client connects, disappears, and never sends stopDeviceRunSession, the worker burns until the orchestrator's max_run_time_seconds ceiling fires. We want a tighter, activity-aware ceiling so abandoned sessions release their worker quickly.

How

Adds an optional max_idle_minutes build function input. When set, the step:

  1. Generates a one-file ESM auth-hook module at ${tmpdir}/agent-device-idle-hook.mjs that stamps Date.now() into ${tmpdir}/agent-device-last-activity on every authenticated daemon request, then returns true so the daemon's own token check remains the source of truth for authorization.
  2. Launches the daemon with AGENT_DEVICE_HTTP_AUTH_HOOK pointing at that module. This is a documented extension point on agent-device's HTTP server (src/daemon/http-server.ts loadHttpAuthHook); it fires for /rpc, /upload, and /artifacts/* — every meaningful client call — while skipping /health, so cloudflared/health probes don't reset the timer.
  3. Races a watchdog (waitForIdleTimeoutAsync) against the existing await new Promise<never>(() => {}). The watchdog polls the state file every 30s; if Date.now() - lastActivityAt >= maxIdleMinutes * 60_000, it throws SystemError. That fails the step and cascades to the device run session via the existing turtle-job-status trigger.

When max_idle_minutes is absent or non-positive, the step behaves exactly as before (no hook installed, no watchdog, infinite wait).

The state file is pre-seeded with the current time before the daemon starts so the first poll has a baseline; malformed/missing state is treated as fresh activity rather than firing a false-positive timeout.

The complementary www-side change that plumbs this from the GraphQL createDeviceRunSession mutation lives in a follow-up PR; until that ships, no caller passes the new input so the step is a no-op for everyone.

Test Plan

  • yarn typecheck (build-tools): clean
  • yarn lint on the changed file: 0 errors
  • yarn fmt:check: clean
  • yarn jest-unit (build-tools): 522/522 pass
  • Manual verification will follow once the www side is wired up and a real turtle job runs end-to-end.

Copy link
Copy Markdown
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@github-actions
Copy link
Copy Markdown

Subscribed to pull request

File Patterns Mentions
**/* @douglowder

Generated by CodeMention

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 12.19512% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.12%. Comparing base (fce2a77) to head (042d61f).

Files with missing lines Patch % Lines
...c/steps/functions/startAgentDeviceRemoteSession.ts 12.20% 33 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3781      +/-   ##
==========================================
- Coverage   57.16%   57.12%   -0.04%     
==========================================
  Files         905      905              
  Lines       39284    39322      +38     
  Branches     8227     8236       +9     
==========================================
+ Hits        22454    22459       +5     
- Misses      15358    15388      +30     
- Partials     1472     1475       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment on lines +227 to +228
// - return true so the daemon's own token check stays the source of
// truth for authorization,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// When a max-idle window is set, install an auth hook that stamps the
// current time into IDLE_STATE_PATH on every authenticated daemon
// request. The watchdog below polls that file to detect inactivity.
const idleEnv: Record<string, string> = {};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idleEnv is a bit of a weird name

@szdziedzic szdziedzic force-pushed the szdziedzic/agent-device-idle-timeout branch from cf313cf to 042d61f Compare May 22, 2026 12:10
@github-actions
Copy link
Copy Markdown

❌ It looks like a changelog entry is missing for this PR. Add it manually to CHANGELOG.md.
⏩ If this PR doesn't require a changelog entry, such as if it's an internal change that doesn't affect the user experience, you can add the "no changelog" label to the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants