Enable Daytona sandbox-local usage proxy #587

Open
bingran-you wants to merge 9 commits into benchflow-ai:main from bingran-you:bry/daytona-sandbox-usage-proxy

Collaborator

@bingran-you bingran-you commented May 30, 2026

Summary

  • Start a per-Daytona-sandbox usage proxy by default so provider token/cost telemetry works without external tunnel parameters.
  • Remove the external tunnel/fixed-port usage proxy CLI/config path left over from the #568 implementation (Add Daytona usage tracking proxy support).
  • Reuse host-side raw capture parsing for sandbox-local proxy captures and update docs/tests.

Testing

  • uv run --extra dev ruff check .
  • uv run --extra dev ty check src/
  • uv run --extra dev python -m pytest tests/

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Comment thread src/benchflow/providers/sandbox_usage_proxy.py
@bingran-you
Collaborator Author

Addressed the Daytona usage-proxy review follow-up in 741eb4f:

  • Replaced the nohup ... & startup with a Node launcher that spawn()s the proxy with detached: true, ignored stdin, log-file stdout/stderr, and unref(), so the Daytona exec command can return immediately.
  • Restored the remote Bedrock-direct guard: Daytona/OpenHands native Bedrock now skips usage proxy in auto and fails fast in required until native Bedrock metering exists.
  • Removed agent-name CLI args from the long-lived proxy argv, tightened the disconnect pkill -f pattern, and added a sandbox proxy pid liveness check before runtime reuse.
  • Fixed the telemetry smoke test trial_name -> rollout_name, set usage_tracking="required", and added an explicit Daytona smoke variant that asserts provider usage and non-empty llm_trajectory.jsonl.
  • Split sandbox-local coverage into tests/test_sandbox_usage_proxy.py; tests/test_usage_proxy.py is back under 1k lines.
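
The pid liveness check mentioned in the bullets above can be illustrated with a small sketch. This is not the PR's code: the real check presumably runs inside the Daytona sandbox, and the helper names `pid_is_alive` and `can_reuse_runtime` are hypothetical. It only shows the common POSIX idiom of probing a pid with signal 0 before deciding to reuse a recorded runtime.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        # Signal 0 performs existence/permission checks without delivering a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def can_reuse_runtime(pid_file_text: str) -> bool:
    """Hypothetical helper: reuse a runtime only if its recorded proxy pid is live."""
    try:
        pid = int(pid_file_text.strip())
    except ValueError:
        return False  # unreadable pid record: treat the runtime as stale
    return pid_is_alive(pid)
```

A caller would fall back to starting a fresh proxy whenever `can_reuse_runtime` returns False, which avoids pointing an agent at a port that nothing is listening on.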

Local checks:

  • uv run --extra dev ruff format --check src tests
  • uv run --extra dev ruff check .
  • uv run --extra dev ty check src/
  • uv run --extra dev python -m pytest tests/ -> 2482 passed, 12 skipped, 1 deselected

I could not run the live Daytona smoke in this workspace because DAYTONA_API_KEY and provider credentials are not present in the environment.
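
For readers unfamiliar with the `auto` versus `required` distinction in the Bedrock guard above, here is a minimal sketch of that policy. The function name `should_start_proxy`, the `native_bedrock` flag, and the exception type are assumptions made for illustration; only the two mode strings and the skip/fail-fast behavior come from this comment.

```python
class UsageTrackingUnavailable(RuntimeError):
    """Hypothetical error for when usage tracking is required but cannot be provided."""


def should_start_proxy(usage_tracking: str, native_bedrock: bool) -> bool:
    """Decide whether to start the usage proxy for a rollout.

    Only the "auto" and "required" modes described in the comment are modeled.
    """
    if usage_tracking not in ("auto", "required"):
        raise ValueError(f"unsupported usage_tracking mode: {usage_tracking!r}")
    if not native_bedrock:
        return True  # ordinary providers can be routed through the proxy
    if usage_tracking == "required":
        # Fail fast: native Bedrock traffic cannot be metered by the proxy yet.
        raise UsageTrackingUnavailable(
            "usage_tracking='required' is not supported for native Bedrock"
        )
    return False  # auto: skip the proxy and run without usage telemetry
```

The design point is that `auto` never blocks a run, while `required` turns a silent telemetry gap into an immediate, visible error.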

@bingran-you
Collaborator Author

Follow-up addressed in 6b2656a:

  • extract_usage() now only reports provider_response when captured exchanges contain real provider usage/token fields; 200/400 captures without usage stay unavailable.
  • Daytona usage_tracking=auto now degrades to unchanged provider env if proxy startup fails; required still fails fast and cleans partial runtime.
  • Raw capture parsing now prefers response content-type, preserving JSON error bodies even when the request had stream=true.
  • Sandbox proxy shutdown imports best-effort captures but always terminates the proxy and removes its runtime dir.
  • Node proxy capture now redacts sensitive request/response headers plus sensitive query parameters, and a local Node integration test covers forwarding, redaction, SSE, JSON errors, and gzip.

Local verification on latest diff:

  • uv sync --extra dev --extra sandbox-daytona --locked
  • uv run ruff check .
  • uv run ruff format --check src tests
  • uv run ty check src/
  • uv run python -m pytest tests/ -> 2490 passed, 12 skipped, 1 deselected

Live Daytona SkillsBench evidence:

  • lake-warming-attribution with Daytona + OpenHands + DeepSeek-compatible OpenAI endpoint: rollout reached verifier with error=null, usage_tracking.status=enabled, usage_source=provider_response, 27 LLM exchanges, token totals/timing populated.
  • Repeated with Daytona + OpenHands + GLM-compatible OpenAI endpoint: error=null, usage_tracking.status=enabled, usage_source=provider_response, 20 LLM exchanges, token totals/timing populated.

Both live runs had reward 0.0 because the model output did not satisfy the task verifier, but the Daytona sandbox-local usage proxy contract, trajectory capture, token aggregation, redaction, and timing metadata all completed end-to-end.
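
The extract_usage() rule from the 6b2656a follow-up (report provider_response only when a capture carries real token fields) can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the capture shape (`status`, `response`), the helper names, and the exact set of token field names checked are all hypothetical, with the field names borrowed from common OpenAI-style and Anthropic-style usage objects.

```python
from typing import Any

# Assumed token field names; the PR may check a different set.
TOKEN_FIELDS = (
    "prompt_tokens",
    "completion_tokens",
    "total_tokens",
    "input_tokens",
    "output_tokens",
)


def has_provider_usage(response_body: Any) -> bool:
    """True only if the body has a usage object with at least one integer token count."""
    if not isinstance(response_body, dict):
        return False
    usage = response_body.get("usage")
    if not isinstance(usage, dict):
        return False
    for field in TOKEN_FIELDS:
        value = usage.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            return True
    return False


def usage_source(exchanges: list[dict[str, Any]]) -> str:
    """Classify captures: 'provider_response' needs real usage, else 'unavailable'."""
    if any(has_provider_usage(exchange.get("response")) for exchange in exchanges):
        return "provider_response"
    return "unavailable"
```

Under this sketch, a successful 200 response without a usage object and a 400 JSON error body both leave the source as unavailable, which matches the behavior described in the first bullet.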
