
Release v1.12.0 #2302

Merged
xingyaoww merged 7 commits into main from rel-1.12.0
Mar 5, 2026

Conversation

@all-hands-bot
Collaborator

@all-hands-bot all-hands-bot commented Mar 4, 2026

Release v1.12.0

This PR prepares the release for version 1.12.0.

Release Checklist

  • Version set to 1.12.0
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new
    • Select tag: v1.12.0
    • Select branch: rel-1.12.0
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant  Architectures  Base Image                                  Docs / Tags
java     amd64, arm64   eclipse-temurin:17-jdk                      Link
python   amd64, arm64   nikolaik/python-nodejs:python3.12-nodejs22  Link
golang   amd64, arm64   golang:1.21-bookworm                        Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:f03e068-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-f03e068-python \
  ghcr.io/openhands/agent-server:f03e068-python

All tags pushed for this build

ghcr.io/openhands/agent-server:f03e068-golang-amd64
ghcr.io/openhands/agent-server:f03e068-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:f03e068-golang-arm64
ghcr.io/openhands/agent-server:f03e068-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:f03e068-java-amd64
ghcr.io/openhands/agent-server:f03e068-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:f03e068-java-arm64
ghcr.io/openhands/agent-server:f03e068-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:f03e068-python-amd64
ghcr.io/openhands/agent-server:f03e068-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:f03e068-python-arm64
ghcr.io/openhands/agent-server:f03e068-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:f03e068-golang
ghcr.io/openhands/agent-server:f03e068-java
ghcr.io/openhands/agent-server:f03e068-python

About Multi-Architecture Support

  • Each variant tag (e.g., f03e068-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., f03e068-python-amd64) are also available if needed
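The tag list above appears to follow a consistent naming scheme. A small sketch of that scheme as inferred from the pushed tags (the helper names are hypothetical, not part of the build tooling):

```python
# Inferred tag naming: <short-sha>-<variant>[-<arch>], plus a base-image
# alias where '/' becomes '_s_' and ':' becomes '_tag_'.
def variant_tag(sha: str, variant: str, arch: str = "") -> str:
    return f"{sha}-{variant}" + (f"-{arch}" if arch else "")

def base_image_tag(sha: str, image: str, arch: str) -> str:
    return f"{sha}-{image.replace('/', '_s_').replace(':', '_tag_')}-{arch}"

print(variant_tag("f03e068", "python"))           # f03e068-python
print(variant_tag("f03e068", "python", "amd64"))  # f03e068-python-amd64
print(base_image_tag("f03e068", "nikolaik/python-nodejs:python3.12-nodejs22", "amd64"))
# f03e068-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
```

This reproduces every tag in the list above, e.g. `golang:1.21-bookworm` maps to `f03e068-golang_tag_1.21-bookworm-amd64`.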

Co-authored-by: openhands <openhands@all-hands.dev>
@all-hands-bot all-hands-bot added the integration-test, test-examples, and behavior-test labels Mar 4, 2026
@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

API breakage checks (Griffe)

Result: Failed

Log excerpt (first 1000 characters)

============================================================
Checking openhands-sdk (openhands.sdk)
============================================================
Comparing openhands-sdk 1.12.0 against 1.11.5
::warning file=openhands-sdk/openhands/sdk/conversation/conversation.py,line=103,title=Conversation.__new__(delete_on_close)::Parameter default was changed: `False` -> `True`
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): temperature
::warning file=openhands-sdk/openhands/sdk/llm/llm.py,line=196,title=LLM.top_p::Attribute value was changed: `Field(default=1.0, ge=0, le=1)` -> `Field(default=None, ge=0, le=1, description='Nucleus sampling parameter. Defaults to None (uses provider default). Set to a value between 0 and 1 to control diversity of outputs.')`
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): prompt_cache_retention
Breaking changes detected (2) and version bump policy satisfied (1.11.5 -> 1.12.0)
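The `delete_on_close` warning above is the classic changed-default breakage: call sites that omit the argument silently change behavior across versions. A minimal sketch (hypothetical stand-in functions, not the SDK's actual signatures):

```python
# Hypothetical stand-ins for Conversation.__new__ before and after the bump.
# A caller that relied on the old implicit default now gets new behavior.
def new_conversation_v1_11_5(delete_on_close: bool = False) -> bool:
    return delete_on_close

def new_conversation_v1_12_0(delete_on_close: bool = True) -> bool:
    return delete_on_close

# Same call site, different outcome across versions:
print(new_conversation_v1_11_5())  # False
print(new_conversation_v1_12_0())  # True
```

This is why Griffe treats a default change as breaking even though the signature is otherwise unchanged.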

Action log

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Agent server REST API breakage checks (OpenAPI)

Result: Passed

Action log

Collaborator Author

@all-hands-bot all-hands-bot left a comment


🟢 Good Taste - Clean Release Bump

Taste Rating: 🟢 Good taste - Mechanical version bump, exactly what it should be.

Review Summary

The version changes are clean and correct:

  • All 4 packages consistently bumped from 1.11.5 → 1.12.0
  • Workflow default updated to v1.12.0
  • Lock file properly synced
  • ✅ Deprecation check passes (0 deadline violations)

Process Notes

The release checklist has incomplete items:

  • Integration tests
  • Behavior tests
  • Example tests
  • Draft release creation
  • Evaluation on OpenHands Index

These should be completed before merge per the standard release workflow.

Verdict

Version changes are correct - No technical issues with the version bumps themselves.

⏸️ Hold for checklist completion - Follow the release process checklist before merging.

Key Insight: This is a textbook mechanical release bump with zero technical issues. Just complete the process checklist and ship it.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Coverage

Coverage Report
File    Stmts   Miss   Cover   Missing
TOTAL   20534   5504   73%
report-only-changed-files is enabled. No files were changed during this commit :)

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-04 20:52:45 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 24.2s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 18.9s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.6s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 40.7s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 16.3s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 31.3s $0.02
01_standalone_sdk/11_async.py ✅ PASS 33.2s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 10.1s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 20.1s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 23s $0.18
01_standalone_sdk/17_image_input.py ✅ PASS 15.7s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 17.5s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.1s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 16.0s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.8s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 14.6s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 19s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 5s $0.22
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.29
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.3s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 28.8s $0.04
01_standalone_sdk/29_llm_streaming.py ✅ PASS 37.0s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 9.6s $0.00
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 19s $0.30
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 19.2s $0.02
01_standalone_sdk/34_critic_example.py ✅ PASS 3m 11s $0.27
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 12.2s $0.01
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.2s $0.00
01_standalone_sdk/38_browser_session_recording.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.05
01_standalone_sdk/39_llm_fallback.py ✅ PASS 9.9s $0.01
01_standalone_sdk/40_acp_agent_example.py ❌ FAIL (Exit code 1) 11.1s --
01_standalone_sdk/41_task_tool_set.py ❌ FAIL (Exit code 1) 4.4s --
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 59.4s $0.06
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 43.2s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 21s $0.03
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 48s $0.00
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 26s $0.02
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 24.0s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 4.5s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 22.8s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 43s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 13.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 20.5s $0.03

❌ Some tests failed

Total: 43 | Passed: 38 | Failed: 5 | Total Cost: $2.02

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/38_browser_session_recording.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/40_acp_agent_example.py: Exit code 1
  • examples/01_standalone_sdk/41_task_tool_set.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run
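The two failure modes reported above (non-zero exit vs. "Timed out after 600 seconds") match what a simple subprocess-based runner produces. A hedged sketch of how the example harness likely enforces the per-example budget (this is an assumption, not the actual CI code):

```python
import subprocess
import sys

# Hypothetical runner: execute one example in a subprocess, map a non-zero
# exit status to "Exit code N" and a blown deadline to a timeout message.
def run_example(path: str, timeout: float = 600.0) -> str:
    try:
        proc = subprocess.run([sys.executable, path], timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"Timed out after {int(timeout)} seconds"
    return "PASS" if proc.returncode == 0 else f"Exit code {proc.returncode}"
```

Note that a timed-out example still accrues LLM cost for the full 10 minutes, which is why the timeout rows above show non-trivial dollar amounts.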

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-04 20:55:42 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 24.8s $0.02
01_standalone_sdk/03_activate_skill.py ✅ PASS 26.1s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.9s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 36.3s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 16.8s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 28.1s $0.01
01_standalone_sdk/11_async.py ✅ PASS 30.4s $0.02
01_standalone_sdk/12_custom_secrets.py ✅ PASS 10.5s $0.00
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 19.3s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 3m 28s $0.20
01_standalone_sdk/17_image_input.py ✅ PASS 15.6s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.4s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 16.6s $0.01
01_standalone_sdk/20_stuck_detector.py ✅ PASS 15.9s $0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 11.9s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 57.8s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 12s $0.33
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.27
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 15.7s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 33.3s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 38.9s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 9.7s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 5m 44s $0.40
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 15.3s $0.01
01_standalone_sdk/34_critic_example.py ✅ PASS 4m 2s $0.37
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.5s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.3s $0.00
01_standalone_sdk/38_browser_session_recording.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.03
01_standalone_sdk/39_llm_fallback.py ✅ PASS 10.6s $0.01
01_standalone_sdk/40_acp_agent_example.py ❌ FAIL (Exit code 1) 10.9s --
01_standalone_sdk/41_task_tool_set.py ❌ FAIL (Exit code 1) 4.5s --
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 50.5s $0.04
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 1m 14s $0.08
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 27s $0.03
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 57.2s $0.11
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 54.6s $0.02
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 30.3s $0.01
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 7s $0.02
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 27.8s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 17s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.1s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 22.5s $0.04

❌ Some tests failed

Total: 43 | Passed: 39 | Failed: 4 | Total Cost: $2.37

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/38_browser_session_recording.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/40_acp_agent_example.py: Exit code 1
  • examples/01_standalone_sdk/41_task_tool_set.py: Exit code 1

View full workflow run

@xingyaoww xingyaoww removed and re-added the integration-test label Mar 4, 2026
@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.7%
Total Cost: $1.07
Models Tested: 4
Timestamp: 2026-03-04 20:40:03 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.04 620,054
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.52 320,049
litellm_proxy_anthropic_claude_sonnet_4_6 87.5% 7/8 0 8 $0.44 254,549
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.07 241,384
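The 96.7% headline number can be sanity-checked from the table above; skipped tests appear to be excluded from the denominator, giving 29 passes out of 30 attempted across the four models:

```python
# Columns from the summary table: Tests Passed numerators and denominators
# (total minus skipped) per model, in row order.
passed    = [7, 8, 7, 7]
attempted = [7, 8, 8, 7]

rate = 100 * sum(passed) / sum(attempted)  # 29/30
print(f"{rate:.1f}%")  # 96.7%
```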

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.04
  • Token Usage: prompt: 607,147, completion: 12,907, cache_read: 541,952, reasoning: 5,591
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_47bb174_deepseek_v3_2_reasoner_run_N8_20260304_202822
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.52
  • Token Usage: prompt: 312,851, completion: 7,198, cache_read: 104,522, reasoning: 5,222
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_47bb174_gemini_3_pro_run_N8_20260304_202818

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.44
  • Token Usage: prompt: 248,791, completion: 5,758, cache_read: 168,685, cache_write: 79,858, reasoning: 844
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_47bb174_claude_sonnet_4_6_run_N8_20260304_203424

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.07
  • Token Usage: prompt: 236,397, completion: 4,987, cache_read: 179,712
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_47bb174_kimi_k2_thinking_run_N8_20260304_202818
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.7%
Total Cost: $1.11
Models Tested: 4
Timestamp: 2026-03-04 20:46:41 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.04 635,066
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.57 418,384
litellm_proxy_anthropic_claude_sonnet_4_6 87.5% 7/8 0 8 $0.44 254,936
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.08 255,849

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.04
  • Token Usage: prompt: 620,741, completion: 14,325, cache_read: 569,856, reasoning: 6,064
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_47bb174_deepseek_v3_2_reasoner_run_N8_20260304_202829
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.57
  • Token Usage: prompt: 409,439, completion: 8,945, cache_read: 200,531, reasoning: 6,840
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_47bb174_gemini_3_pro_run_N8_20260304_202857

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.44
  • Token Usage: prompt: 249,258, completion: 5,678, cache_read: 169,238, cache_write: 79,772, reasoning: 945
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_47bb174_claude_sonnet_4_6_run_N8_20260304_202822

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.08
  • Token Usage: prompt: 251,029, completion: 4,820, cache_read: 194,816
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_47bb174_kimi_k2_thinking_run_N8_20260304_202829
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@xingyaoww
Collaborator

@OpenHands check the log #2302 (comment) and tell me why Claude Sonnet 4.6 is failing

@openhands-ai

openhands-ai bot commented Mar 4, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@enyst
Collaborator

enyst commented Mar 4, 2026

@OpenHands make a new issue to decide the desired behavior for the public API checks (griffe and oasdiff) workflows. Look at this PR's comments / CI to start with:

  • I see a comment saying the check Failed
  • it's a rel-* PR, so why doesn't it fail the workflow
  • the LLM class has breaking changes; are they the kind we should require deprecation for? why don't we require it now
  • look at all the CI logs for the checks to understand everything that happened
  • read issue 2252 as part of your investigation

Make an issue describing the current execution and compare it with the documented expectations. Give links to the source code on GitHub.

@openhands-ai

openhands-ai bot commented Mar 4, 2026

I'm on it! enyst can track my progress at all-hands.dev

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.7%
Total Cost: $1.32
Models Tested: 4
Timestamp: 2026-03-04 20:50:30 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.03 584,273
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.80 712,620
litellm_proxy_anthropic_claude_sonnet_4_6 87.5% 7/8 0 8 $0.42 236,250
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.07 226,273

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Token Usage: prompt: 571,182, completion: 13,091, cache_read: 520,768, reasoning: 5,377
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_47bb174_deepseek_v3_2_reasoner_run_N8_20260304_203921
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.80
  • Token Usage: prompt: 703,586, completion: 9,034, cache_read: 395,723, reasoning: 6,658
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_47bb174_gemini_3_pro_run_N8_20260304_203908

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.42
  • Token Usage: prompt: 231,306, completion: 4,944, cache_read: 151,797, cache_write: 79,277, reasoning: 794
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_47bb174_claude_sonnet_4_6_run_N8_20260304_203920

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.07
  • Token Usage: prompt: 221,860, completion: 4,413, cache_read: 171,008
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_47bb174_kimi_k2_thinking_run_N8_20260304_203955
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@openhands-ai

openhands-ai bot commented Mar 4, 2026

Summary

You asked me to check the CI log from PR #2302 and explain why Claude Sonnet 4.6 is failing. Here's the answer:

Claude Sonnet 4.6 failed the t02_add_bash_hello integration test because the agent created the shell script but never made it executable (chmod +x).

Specifically, the agent:

  1. ✅ Created the shell/ directory with mkdir -p
  2. ✅ Created shell/hello.sh with correct content (#!/bin/sh + echo 'hello')
  3. ✅ Verified the script runs by invoking sh /tmp/.../shell/hello.sh — which doesn't require the execute bit
  4. Never ran chmod +x on the script

The test's verify_result() checks os.access(self.script_path, os.X_OK), which fails because the file_editor's create command writes files with default 0o644 permissions (not executable). The test returns: "Shell script is not executable".

This is a non-deterministic agent behavior issue, not a code bug. Claude Sonnet 4.6 simply didn't think to set the execute permission. It may pass on a re-run. All other models (DeepSeek, Gemini, Kimi) passed at 100%.
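The failing check described above is easy to reproduce outside the test: a file written with default 0o644 permissions fails `os.access(path, os.X_OK)` until the execute bit is set. A minimal sketch (not the actual test code):

```python
import os
import tempfile

# Create a shell script the way a file editor would: content only,
# default (non-executable) permissions.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("#!/bin/sh\necho 'hello'\n")
    path = f.name

os.chmod(path, 0o644)            # default 0o644 mode: rw-r--r--
print(os.access(path, os.X_OK))  # False -> "Shell script is not executable"

os.chmod(path, 0o755)            # the chmod +x step the agent skipped
print(os.access(path, os.X_OK))  # True
os.unlink(path)
```

Running the script via `sh path` sidesteps the execute bit entirely, which is why the agent's own verification passed while the test's check failed.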

No code changes were made — this was purely an investigative/diagnostic task.

@openhands-ai

openhands-ai bot commented Mar 4, 2026

Summary of work completed:

No repository code changes were made, so no git diff or push is applicable.

@VascoSch92
Contributor

Failed Examples

Example Result Duration Cost
25_agent_delegation.py ❌ Timed out after 600s 10m 0s $0.30
38_browser_session_recording.py ❌ Timed out after 600s 10m 0s $0.04
40_acp_agent_example.py ❌ Exit code 1 18.2s --
41_task_tool_set.py ❌ Exit code 1 7.1s --

Root Causes

1. Browser launch failure (38_browser_session_recording.py)

Chrome/Chromium failed to start on the CI runner. The browser_use library could not connect to Chrome's CDP endpoint:

ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 52511)
TimeoutError: Event handler BrowserSession.on_BrowserStartEvent timed out after 30.0s

The browser watchdog retried for 30s, failed, then the test burned the remaining time until the 600s timeout.
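The failure above boils down to nothing listening on the CDP port. A hedged sketch of the health check involved (a simplified stand-in, not browser_use's actual implementation; the port number is the one from the log):

```python
import socket

# browser_use waits for Chrome's DevTools (CDP) port to accept TCP
# connections; if Chrome never starts, connect() is refused and the
# BrowserStartEvent handler eventually times out.
def cdp_reachable(host: str = "127.0.0.1", port: int = 52511,
                  timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ConnectionRefusedError, timeouts, etc.
        return False
```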

2. Agent delegation too slow (25_agent_delegation.py)

The delegation example spent $0.30 in LLM calls (using claude-haiku-4-5) but did not complete within the 600s budget. The delegated sub-agent may be stuck in a loop or the task is too complex for the timeout.

3. Immediate failures (40_acp_agent_example.py, 41_task_tool_set.py)

Both failed fast (18.2s and 7.1s respectively) with exit code 1, suggesting code errors, missing dependencies, or configuration issues rather than timeouts. Further investigation is needed into their specific error output.

Note: 41_task_tool_set.py is an example that I wrote. It is an interactive example, i.e., the user is expected to provide input. I suppose it is failing for that reason.
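That suspicion is easy to confirm in isolation: with no stdin attached (as in CI), `input()` raises EOFError and the unhandled exception exits with code 1. A small reproduction (the prompt string is illustrative, not from the example):

```python
import subprocess
import sys

# Run a one-liner that waits for user input, but with stdin closed,
# mimicking a non-interactive CI runner.
proc = subprocess.run(
    [sys.executable, "-c", "choice = input('pick a task: ')"],
    stdin=subprocess.DEVNULL,
    capture_output=True,
)
print(proc.returncode)  # 1 (unhandled EOFError)
```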

@enyst
Collaborator

enyst commented Mar 5, 2026

How about this one? Not sure if it's still basically your version, @simonrosenberg, or maybe you rewrote it, @VascoSch92. Just wondering if you guys have an idea what's up here. This used to work, I think; it's number 25:

  1. Agent delegation too slow (25_agent_delegation.py)
    The delegation example spent $0.30 in LLM calls (using claude-haiku-4-5) but did not complete within the 600s budget. The delegated sub-agent may be stuck in a loop or the task is too complex for the timeout.

Xingyao's agent excluded all 4; it just seems like this maybe shouldn't happen.

@xingyaoww
Collaborator

@OpenHands please revert 8a21c3f - most of the issues should've been fixed by the main branch that was just merged in. Can you re-tag test-examples and monitor the result?

@openhands-ai

openhands-ai bot commented Mar 5, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@xingyaoww xingyaoww removed the test-examples label Mar 5, 2026
@xingyaoww xingyaoww added the test-examples label Mar 5, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-05 11:03:18 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 21.6s $0.02
01_standalone_sdk/03_activate_skill.py ✅ PASS 19.5s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.6s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 37.9s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 15.9s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 30.2s $0.02
01_standalone_sdk/11_async.py ✅ PASS 31.3s $0.02
01_standalone_sdk/12_custom_secrets.py ✅ PASS 11.2s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 20.3s $0.02
01_standalone_sdk/14_context_condenser.py ❌ FAIL (Exit code 1) 1m 6s --
01_standalone_sdk/17_image_input.py ✅ PASS 16.7s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 29.1s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.5s $0.00
01_standalone_sdk/20_stuck_detector.py ✅ PASS 14.8s $0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 9.6s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.8s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 19s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 54s $0.38
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 17s $0.09
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.0s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 26.8s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 32.0s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 10.3s $0.00
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 9m 39s $0.72
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 19.4s $0.01
01_standalone_sdk/34_critic_example.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 12.1s $0.01
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.8s $0.00
01_standalone_sdk/38_browser_session_recording.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.04
01_standalone_sdk/39_llm_fallback.py ✅ PASS 10.3s $0.01
01_standalone_sdk/40_acp_agent_example.py ❌ FAIL (Exit code 1) 9.4s --
01_standalone_sdk/41_task_tool_set.py ✅ PASS 28.4s $0.02
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 1m 19s $0.06
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 1m 9s $0.06
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 36s $0.02
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 17s $0.00
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 18s $0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 32.0s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 23s $0.02
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.0s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 49s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.9s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 23.8s $0.04

❌ Some tests failed

Total: 43 | Passed: 39 | Failed: 4 | Total Cost: $1.91

Failed examples:

  • examples/01_standalone_sdk/14_context_condenser.py: Exit code 1
  • examples/01_standalone_sdk/34_critic_example.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/38_browser_session_recording.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/40_acp_agent_example.py: Exit code 1

View full workflow run

@enyst
Collaborator

enyst commented Mar 5, 2026

@xingyaoww About the API breakage job failing:

I think it's not blocking release, but it's there to warn us and make us look (though we could make it mandatory anytime if it behaves as wanted)

These are the current failures:
[screenshot: the failing API breakage checks]

  • the browser export existed only one release. It was my mistake, I exported all tools, and we reverted it because the CLI doesn't support it.
  • I think the delegate agent methods were also very short-lived. Vasco knows more.

IMHO, I think maybe we could "sin" this time and let these pass without deprecation? 🤔

@xingyaoww
Collaborator

@OpenHands please check https://github.com/OpenHands/software-agent-sdk/actions/runs/22713720951/job/65857938260?pr=2302 and #2302 (comment)

Tell me which commit introduced these and which commit removed these? Tell me the release version before and after these commits. I want to assess the impact of ignoring these errors (eg not marking them deprecated since they are very short lived)

@openhands-ai

openhands-ai bot commented Mar 5, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Mar 5, 2026

No code changes were made — this was a pure investigation/analysis task. Here's the summary:

API Breakage Analysis for PR #2302 (Release v1.12.0)

The CI check SDK programmatic API (Griffe) is failing because 3 symbols were removed from openhands.tools.__all__ without prior deprecation:

Timeline

Symbol             Added by                    Removed by                  Lifespan
BrowserToolSet     23e167d6 (#2080), Feb 15    e25a1ef8 (#2197), Feb 25    1 release
get_agent_factory  23e167d6 (#2080), Feb 15    ab8f5f28 (#2183), Feb 25    1 release
register_agent     23e167d6 (#2080), Feb 15    ab8f5f28 (#2183), Feb 25    1 release

Key Findings

  • All 3 were introduced in a single commit (23e167d6, "ci: extend API breakage checks to openhands-tools"), which was the commit that created the __all__ for openhands-tools in the first place.
  • Before v1.11.5: openhands-tools/__init__.py had no __all__ at all — just a bare version-only init. These symbols were never in any tracked public API before.
  • v1.11.5 (Feb 20, 2026): The only released version containing all 3 symbols.
  • v1.12.0 (pending): All 3 removed.
  • Total lifespan: ~5 days, exactly 1 release.
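To make "no __all__ at all" vs. "tracked public API" concrete: __all__ is the explicit export list that `from pkg import *` honors, and it is the surface that API-diff tools such as Griffe compare between releases. A minimal runnable sketch using a hypothetical module (names are illustrative, not the real SDK symbols):

```python
import sys
import types

# Stand-in for a package __init__.py: only ToolA is declared public.
mod = types.ModuleType("demo_tools")
exec(
    "__all__ = ['ToolA']\n"        # the tracked public API
    "def ToolA(): return 'a'\n"
    "def ToolB(): return 'b'\n"    # importable directly, but not star-exported
    "def _private(): return 'p'\n",
    mod.__dict__,
)
sys.modules["demo_tools"] = mod

ns = {}
exec("from demo_tools import *", ns)
public = sorted(k for k in ns if not k.startswith("__"))
print(public)  # ['ToolA'] -- ToolB and _private are excluded by __all__
```

Before 23e167d6 there was no __all__, so nothing was "public" in this tracked sense; adding it (and then shrinking it) is exactly what the Griffe check flags.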

Removal Reasons

  • BrowserToolSet: Forces heavy browser-use dependencies on downstream consumers (CLI bundle bloat).
  • get_agent_factory + register_agent: Removed as part of delegation refactor — file-based agent definitions replaced programmatic registration.

Impact Assessment

Very low. These were essentially an accidental over-export — added as part of setting up the API breakage CI for openhands-tools, then immediately cleaned up in the next development cycle. They existed for exactly 1 release with no prior public API presence. Skipping the deprecation cycle for these seems safe.
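For completeness, had a deprecation cycle been preferred over outright removal, the usual pattern is a PEP 562 module-level __getattr__ shim that keeps the old name importable for one release while warning. A runnable sketch (module name, target version, and messages are hypothetical):

```python
import sys
import types
import warnings

# Stand-in for openhands/tools/__init__.py built dynamically so the
# sketch is self-contained.
mod = types.ModuleType("openhands_tools_demo")

_REMOVED = {
    # old name -> migration hint shown in the warning
    "BrowserToolSet": "install the browser extra and import it from there",
    "register_agent": "use file-based agent definitions instead",
}

def _getattr(name):
    if name in _REMOVED:
        warnings.warn(
            f"{name} is deprecated and will be removed in a future "
            f"release; {_REMOVED[name]}",
            DeprecationWarning,
            stacklevel=2,
        )
        return object()  # real code would return the legacy object
    raise AttributeError(f"module {mod.__name__!r} has no attribute {name!r}")

# PEP 562: a module-level __getattr__ is consulted for missing attributes.
mod.__getattr__ = _getattr
sys.modules[mod.__name__] = mod

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _ = mod.BrowserToolSet  # old access path still works, but warns
print(caught[0].category.__name__)  # DeprecationWarning
```

Given the ~1-release lifespan found above, skipping this shim is a reasonable call; the sketch just shows what the non-skipped path would have looked like.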

@xingyaoww
Collaborator

@enyst Agreed, it's not release-blocking :) We just need to get the example tests fixed and we'd be good to go.

@xingyaoww xingyaoww added test-examples Run all applicable "examples/" files. Expensive operation. integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results test-examples Run all applicable "examples/" files. Expensive operation. labels Mar 5, 2026
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

🧪 Integration Tests Results

Overall Success Rate: 93.3%
Total Cost: $1.13
Models Tested: 4
Timestamp: 2026-03-05 18:57:28 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_deepseek_deepseek_reasoner | 100.0% | 7/7 | 1 | 8 | $0.03 | 560,607 |
| litellm_proxy_gemini_3_pro_preview | 100.0% | 8/8 | 0 | 8 | $0.57 | 328,813 |
| litellm_proxy_anthropic_claude_sonnet_4_6 | 87.5% | 7/8 | 0 | 8 | $0.44 | 254,527 |
| litellm_proxy_moonshot_kimi_k2_thinking | 85.7% | 6/7 | 1 | 8 | $0.09 | 289,602 |

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Token Usage: prompt: 548,427, completion: 12,180, cache_read: 501,888, reasoning: 4,811
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_6e1cedc_deepseek_v3_2_reasoner_run_N8_20260305_185422
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.57
  • Token Usage: prompt: 320,012, completion: 8,801, cache_read: 95,422, reasoning: 5,729
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_6e1cedc_gemini_3_pro_run_N8_20260305_185422

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.44
  • Token Usage: prompt: 248,733, completion: 5,794, cache_read: 168,632, cache_write: 79,853, reasoning: 1,038
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_6e1cedc_claude_sonnet_4_6_run_N8_20260305_185423

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 85.7% (6/7)
  • Total Cost: $0.09
  • Token Usage: prompt: 283,151, completion: 6,451, cache_read: 224,256
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_6e1cedc_kimi_k2_thinking_run_N8_20260305_185422
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.01)

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-05 19:16:05 UTC

| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 26.6s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 20.6s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 13.2s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 34.3s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 17.8s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 29.1s | $0.02 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 36.4s | $0.04 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 10.1s | $0.01 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 22.4s | $0.01 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 3m 2s | $0.21 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 17.9s | $0.01 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 23.7s | $0.01 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 14.5s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 16.9s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 12.3s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 18.1s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 2m 40s | $0.03 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 2m 49s | $0.18 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 55.7s | $0.06 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 18.8s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 31.6s | $0.03 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 50.0s | $0.04 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 10.4s | $0.01 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 5m 24s | $0.36 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 21.1s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 1m 39s | $0.13 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 10.3s | $0.00 |
| 01_standalone_sdk/37_llm_profile_store.py | ✅ PASS | 5.1s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 56.1s | $0.02 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 11.8s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 34.7s | $0.14 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 32.0s | $0.03 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 1m 27s | $0.09 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 53.0s | $0.04 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 37s | $0.04 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 54.9s | $0.00 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 29s | $0.03 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 36.3s | $0.03 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 14s | $0.01 |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 26.0s | $0.03 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 58.7s | $0.05 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 14.1s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 23.6s | $0.03 |

✅ All tests passed!

Total: 43 | Passed: 43 | Failed: 0 | Total Cost: $1.91

View full workflow run

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

Evaluation Triggered

  • Trigger: Release v1.12.0
  • SDK: 6e1cedc
  • Eval limit: 50
  • Models: claude-sonnet-4-5-20250929

@xingyaoww xingyaoww merged commit db9f0a7 into main Mar 5, 2026
71 of 72 checks passed
@xingyaoww xingyaoww deleted the rel-1.12.0 branch March 5, 2026 19:27
zparnold pushed a commit to zparnold/software-agent-sdk that referenced this pull request Mar 5, 2026
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>