
Release v1.12.0 #2302

Merged
xingyaoww merged 7 commits into main from rel-1.12.0
Mar 5, 2026

Conversation

@all-hands-bot
Collaborator

@all-hands-bot all-hands-bot commented Mar 4, 2026

Release v1.12.0

This PR prepares the release for version 1.12.0.

Release Checklist

  • Version set to 1.12.0
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new
    • Select tag: v1.12.0
    • Select branch: rel-1.12.0
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant  Architectures  Base Image                                  Docs / Tags
java     amd64, arm64   eclipse-temurin:17-jdk                      Link
python   amd64, arm64   nikolaik/python-nodejs:python3.12-nodejs22  Link
golang   amd64, arm64   golang:1.21-bookworm                        Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:f03e068-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-f03e068-python \
  ghcr.io/openhands/agent-server:f03e068-python

All tags pushed for this build

ghcr.io/openhands/agent-server:f03e068-golang-amd64
ghcr.io/openhands/agent-server:f03e068-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:f03e068-golang-arm64
ghcr.io/openhands/agent-server:f03e068-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:f03e068-java-amd64
ghcr.io/openhands/agent-server:f03e068-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:f03e068-java-arm64
ghcr.io/openhands/agent-server:f03e068-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:f03e068-python-amd64
ghcr.io/openhands/agent-server:f03e068-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:f03e068-python-arm64
ghcr.io/openhands/agent-server:f03e068-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:f03e068-golang
ghcr.io/openhands/agent-server:f03e068-java
ghcr.io/openhands/agent-server:f03e068-python

About Multi-Architecture Support

  • Each variant tag (e.g., f03e068-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., f03e068-python-amd64) are also available if needed
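The tag list above appears to follow a consistent naming scheme. A small sketch of that scheme as inferred from the pushed tags (the helper names are hypothetical, not part of the build tooling):

```python
# Inferred tag naming: <short-sha>-<variant>[-<arch>], plus a base-image
# alias where '/' becomes '_s_' and ':' becomes '_tag_'.
def variant_tag(sha: str, variant: str, arch: str = "") -> str:
    return f"{sha}-{variant}" + (f"-{arch}" if arch else "")

def base_image_tag(sha: str, image: str, arch: str) -> str:
    return f"{sha}-{image.replace('/', '_s_').replace(':', '_tag_')}-{arch}"

print(variant_tag("f03e068", "python"))           # f03e068-python
print(variant_tag("f03e068", "python", "amd64"))  # f03e068-python-amd64
print(base_image_tag("f03e068", "nikolaik/python-nodejs:python3.12-nodejs22", "amd64"))
# f03e068-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
```

This reproduces every tag in the list above, e.g. `golang:1.21-bookworm` maps to `f03e068-golang_tag_1.21-bookworm-amd64`.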

Co-authored-by: openhands <openhands@all-hands.dev>
@all-hands-bot all-hands-bot added the integration-test, test-examples, and behavior-test labels Mar 4, 2026
@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

API breakage checks (Griffe)

Result: Failed

Log excerpt (first 1000 characters)

============================================================
Checking openhands-sdk (openhands.sdk)
============================================================
Comparing openhands-sdk 1.12.0 against 1.11.5
::warning file=openhands-sdk/openhands/sdk/conversation/conversation.py,line=103,title=Conversation.__new__(delete_on_close)::Parameter default was changed: `False` -> `True`
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): temperature
::warning file=openhands-sdk/openhands/sdk/llm/llm.py,line=196,title=LLM.top_p::Attribute value was changed: `Field(default=1.0, ge=0, le=1)` -> `Field(default=None, ge=0, le=1, description='Nucleus sampling parameter. Defaults to None (uses provider default). Set to a value between 0 and 1 to control diversity of outputs.')`
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): prompt_cache_retention
Breaking changes detected (2) and version bump policy satisfied (1.11.5 -> 1.12.0)
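The `delete_on_close` warning above is the classic changed-default breakage: call sites that omit the argument silently change behavior across versions. A minimal sketch (hypothetical stand-in functions, not the SDK's actual signatures):

```python
# Hypothetical stand-ins for Conversation.__new__ before and after the bump.
# A caller that relied on the old implicit default now gets new behavior.
def new_conversation_v1_11_5(delete_on_close: bool = False) -> bool:
    return delete_on_close

def new_conversation_v1_12_0(delete_on_close: bool = True) -> bool:
    return delete_on_close

# Same call site, different outcome across versions:
print(new_conversation_v1_11_5())  # False
print(new_conversation_v1_12_0())  # True
```

This is why Griffe treats a default change as breaking even though the signature is otherwise unchanged.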

Action log

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Agent server REST API breakage checks (OpenAPI)

Result: Passed

Action log

Collaborator Author

@all-hands-bot all-hands-bot left a comment


🟢 Good Taste - Clean Release Bump

Taste Rating: 🟢 Good taste - Mechanical version bump, exactly what it should be.

Review Summary

The version changes are clean and correct:

  • All 4 packages consistently bumped from 1.11.5 → 1.12.0
  • Workflow default updated to v1.12.0
  • Lock file properly synced
  • ✅ Deprecation check passes (0 deadline violations)

Process Notes

The release checklist has incomplete items:

  • Integration tests
  • Behavior tests
  • Example tests
  • Draft release creation
  • Evaluation on OpenHands Index

These should be completed before merge per the standard release workflow.

Verdict

Version changes are correct - No technical issues with the version bumps themselves.

⏸️ Hold for checklist completion - Follow the release process checklist before merging.

Key Insight: This is a textbook mechanical release bump with zero technical issues. Just complete the process checklist and ship it.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Coverage

Coverage Report
File    Stmts   Miss   Cover   Missing
TOTAL   20534   5504   73%
report-only-changed-files is enabled. No files were changed during this commit :)

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-04 20:52:45 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 24.2s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 18.9s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.6s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 40.7s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 16.3s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 31.3s $0.02
01_standalone_sdk/11_async.py ✅ PASS 33.2s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 10.1s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 20.1s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 2m 23s $0.18
01_standalone_sdk/17_image_input.py ✅ PASS 15.7s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 17.5s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.1s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 16.0s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.8s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 14.6s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 19s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 3m 5s $0.22
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.29
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.3s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 28.8s $0.04
01_standalone_sdk/29_llm_streaming.py ✅ PASS 37.0s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 9.6s $0.00
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 19s $0.30
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 19.2s $0.02
01_standalone_sdk/34_critic_example.py ✅ PASS 3m 11s $0.27
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 12.2s $0.01
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.2s $0.00
01_standalone_sdk/38_browser_session_recording.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.05
01_standalone_sdk/39_llm_fallback.py ✅ PASS 9.9s $0.01
01_standalone_sdk/40_acp_agent_example.py ❌ FAIL (Exit code 1) 11.1s --
01_standalone_sdk/41_task_tool_set.py ❌ FAIL (Exit code 1) 4.4s --
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 59.4s $0.06
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 43.2s $0.04
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 21s $0.03
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 48s $0.00
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 26s $0.02
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 24.0s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 4.5s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 22.8s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 43s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 13.5s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 20.5s $0.03

❌ Some tests failed

Total: 43 | Passed: 38 | Failed: 5 | Total Cost: $2.02

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/38_browser_session_recording.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/40_acp_agent_example.py: Exit code 1
  • examples/01_standalone_sdk/41_task_tool_set.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run
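The two failure modes reported above (non-zero exit vs. "Timed out after 600 seconds") match what a simple subprocess-based runner produces. A hedged sketch of how the example harness likely enforces the per-example budget (this is an assumption, not the actual CI code):

```python
import subprocess
import sys

# Hypothetical runner: execute one example in a subprocess, map a non-zero
# exit status to "Exit code N" and a blown deadline to a timeout message.
def run_example(path: str, timeout: float = 600.0) -> str:
    try:
        proc = subprocess.run([sys.executable, path], timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"Timed out after {int(timeout)} seconds"
    return "PASS" if proc.returncode == 0 else f"Exit code {proc.returncode}"
```

Note that a timed-out example still accrues LLM cost for the full 10 minutes, which is why the timeout rows above show non-trivial dollar amounts.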

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-04 20:55:42 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 24.8s $0.02
01_standalone_sdk/03_activate_skill.py ✅ PASS 26.1s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 11.9s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 36.3s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 16.8s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 28.1s $0.01
01_standalone_sdk/11_async.py ✅ PASS 30.4s $0.02
01_standalone_sdk/12_custom_secrets.py ✅ PASS 10.5s $0.00
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 19.3s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 3m 28s $0.20
01_standalone_sdk/17_image_input.py ✅ PASS 15.6s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.4s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 16.6s $0.01
01_standalone_sdk/20_stuck_detector.py ✅ PASS 15.9s $0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 11.9s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.7s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 57.8s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 12s $0.33
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.27
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 15.7s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 33.3s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 38.9s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 9.7s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 5m 44s $0.40
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 15.3s $0.01
01_standalone_sdk/34_critic_example.py ✅ PASS 4m 2s $0.37
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 9.5s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.3s $0.00
01_standalone_sdk/38_browser_session_recording.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.03
01_standalone_sdk/39_llm_fallback.py ✅ PASS 10.6s $0.01
01_standalone_sdk/40_acp_agent_example.py ❌ FAIL (Exit code 1) 10.9s --
01_standalone_sdk/41_task_tool_set.py ❌ FAIL (Exit code 1) 4.5s --
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 50.5s $0.04
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 1m 14s $0.08
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 27s $0.03
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 57.2s $0.11
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 54.6s $0.02
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 30.3s $0.01
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 7s $0.02
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 27.8s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 17s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.1s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 22.5s $0.04

❌ Some tests failed

Total: 43 | Passed: 39 | Failed: 4 | Total Cost: $2.37

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/38_browser_session_recording.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/40_acp_agent_example.py: Exit code 1
  • examples/01_standalone_sdk/41_task_tool_set.py: Exit code 1

View full workflow run

@xingyaoww xingyaoww removed and re-added the integration-test label Mar 4, 2026
@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.7%
Total Cost: $1.07
Models Tested: 4
Timestamp: 2026-03-04 20:40:03 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.04 620,054
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.52 320,049
litellm_proxy_anthropic_claude_sonnet_4_6 87.5% 7/8 0 8 $0.44 254,549
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.07 241,384
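The 96.7% headline number can be sanity-checked from the table above; skipped tests appear to be excluded from the denominator, giving 29 passes out of 30 attempted across the four models:

```python
# Columns from the summary table: Tests Passed numerators and denominators
# (total minus skipped) per model, in row order.
passed    = [7, 8, 7, 7]
attempted = [7, 8, 8, 7]

rate = 100 * sum(passed) / sum(attempted)  # 29/30
print(f"{rate:.1f}%")  # 96.7%
```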

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.04
  • Token Usage: prompt: 607,147, completion: 12,907, cache_read: 541,952, reasoning: 5,591
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_47bb174_deepseek_v3_2_reasoner_run_N8_20260304_202822
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.52
  • Token Usage: prompt: 312,851, completion: 7,198, cache_read: 104,522, reasoning: 5,222
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_47bb174_gemini_3_pro_run_N8_20260304_202818

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.44
  • Token Usage: prompt: 248,791, completion: 5,758, cache_read: 168,685, cache_write: 79,858, reasoning: 844
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_47bb174_claude_sonnet_4_6_run_N8_20260304_203424

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.07
  • Token Usage: prompt: 236,397, completion: 4,987, cache_read: 179,712
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_47bb174_kimi_k2_thinking_run_N8_20260304_202818
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.7%
Total Cost: $1.11
Models Tested: 4
Timestamp: 2026-03-04 20:46:41 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.04 635,066
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.57 418,384
litellm_proxy_anthropic_claude_sonnet_4_6 87.5% 7/8 0 8 $0.44 254,936
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.08 255,849

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.04
  • Token Usage: prompt: 620,741, completion: 14,325, cache_read: 569,856, reasoning: 6,064
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_47bb174_deepseek_v3_2_reasoner_run_N8_20260304_202829
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.57
  • Token Usage: prompt: 409,439, completion: 8,945, cache_read: 200,531, reasoning: 6,840
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_47bb174_gemini_3_pro_run_N8_20260304_202857

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.44
  • Token Usage: prompt: 249,258, completion: 5,678, cache_read: 169,238, cache_write: 79,772, reasoning: 945
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_47bb174_claude_sonnet_4_6_run_N8_20260304_202822

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.08
  • Token Usage: prompt: 251,029, completion: 4,820, cache_read: 194,816
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_47bb174_kimi_k2_thinking_run_N8_20260304_202829
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@xingyaoww
Collaborator

@OpenHands check the log #2302 (comment) and tell me why Claude Sonnet 4.6 is failing

@openhands-ai

openhands-ai bot commented Mar 4, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@enyst
Collaborator

enyst commented Mar 4, 2026

@OpenHands make a new issue to decide the desired behavior for the public API checks (griffe and oasdiff) workflows. Look at this PR's comments / CI to start with:

  • I see a comment saying the check Failed
  • it's a rel-* PR, so why doesn't it fail the workflow
  • the LLM class has breaking changes; are they the kind we should require deprecation for? why don't we require it now
  • look at all the CI logs for the checks to understand everything that happened
  • read issue 2252 as part of your investigation

Make an issue describing the current execution and compare it with the documented expectations. Give links to the source code on GitHub.

@openhands-ai

openhands-ai bot commented Mar 4, 2026

I'm on it! enyst can track my progress at all-hands.dev

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.7%
Total Cost: $1.32
Models Tested: 4
Timestamp: 2026-03-04 20:50:30 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.03 584,273
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.80 712,620
litellm_proxy_anthropic_claude_sonnet_4_6 87.5% 7/8 0 8 $0.42 236,250
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.07 226,273

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Token Usage: prompt: 571,182, completion: 13,091, cache_read: 520,768, reasoning: 5,377
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_47bb174_deepseek_v3_2_reasoner_run_N8_20260304_203921
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.80
  • Token Usage: prompt: 703,586, completion: 9,034, cache_read: 395,723, reasoning: 6,658
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_47bb174_gemini_3_pro_run_N8_20260304_203908

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.42
  • Token Usage: prompt: 231,306, completion: 4,944, cache_read: 151,797, cache_write: 79,277, reasoning: 794
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_47bb174_claude_sonnet_4_6_run_N8_20260304_203920

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.07
  • Token Usage: prompt: 221,860, completion: 4,413, cache_read: 171,008
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_47bb174_kimi_k2_thinking_run_N8_20260304_203955
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@openhands-ai

openhands-ai bot commented Mar 4, 2026

Summary

You asked me to check the CI log from PR #2302 and explain why Claude Sonnet 4.6 is failing. Here's the answer:

Claude Sonnet 4.6 failed the t02_add_bash_hello integration test because the agent created the shell script but never made it executable (chmod +x).

Specifically, the agent:

  1. ✅ Created the shell/ directory with mkdir -p
  2. ✅ Created shell/hello.sh with correct content (#!/bin/sh + echo 'hello')
  3. ✅ Verified the script runs by invoking sh /tmp/.../shell/hello.sh — which doesn't require the execute bit
  4. Never ran chmod +x on the script

The test's verify_result() checks os.access(self.script_path, os.X_OK), which fails because the file_editor's create command writes files with default 0o644 permissions (not executable). The test returns: "Shell script is not executable".

This is a non-deterministic agent behavior issue, not a code bug. Claude Sonnet 4.6 simply didn't think to set the execute permission. It may pass on a re-run. All other models (DeepSeek, Gemini, Kimi) passed at 100%.
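The failing check described above is easy to reproduce outside the test: a file written with default 0o644 permissions fails `os.access(path, os.X_OK)` until the execute bit is set. A minimal sketch (not the actual test code):

```python
import os
import tempfile

# Create a shell script the way a file editor would: content only,
# default (non-executable) permissions.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("#!/bin/sh\necho 'hello'\n")
    path = f.name

os.chmod(path, 0o644)            # default 0o644 mode: rw-r--r--
print(os.access(path, os.X_OK))  # False -> "Shell script is not executable"

os.chmod(path, 0o755)            # the chmod +x step the agent skipped
print(os.access(path, os.X_OK))  # True
os.unlink(path)
```

Running the script via `sh path` sidesteps the execute bit entirely, which is why the agent's own verification passed while the test's check failed.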

No code changes were made — this was purely an investigative/diagnostic task.

@openhands-ai

openhands-ai bot commented Mar 4, 2026

Summary of work completed:

No repository code changes were made, so no git diff or push is applicable.

@VascoSch92
Contributor

Failed Examples

Example Result Duration Cost
25_agent_delegation.py ❌ Timed out after 600s 10m 0s $0.30
38_browser_session_recording.py ❌ Timed out after 600s 10m 0s $0.04
40_acp_agent_example.py ❌ Exit code 1 18.2s --
41_task_tool_set.py ❌ Exit code 1 7.1s --

Root Causes

1. Browser launch failure (38_browser_session_recording.py)

Chrome/Chromium failed to start on the CI runner. The browser_use library could not connect to Chrome's CDP endpoint:

ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 52511)
TimeoutError: Event handler BrowserSession.on_BrowserStartEvent timed out after 30.0s

The browser watchdog retried for 30s, failed, then the test burned the remaining time until the 600s timeout.
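The failure above boils down to nothing listening on the CDP port. A hedged sketch of the health check involved (a simplified stand-in, not browser_use's actual implementation; the port number is the one from the log):

```python
import socket

# browser_use waits for Chrome's DevTools (CDP) port to accept TCP
# connections; if Chrome never starts, connect() is refused and the
# BrowserStartEvent handler eventually times out.
def cdp_reachable(host: str = "127.0.0.1", port: int = 52511,
                  timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ConnectionRefusedError, timeouts, etc.
        return False
```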

2. Agent delegation too slow (25_agent_delegation.py)

The delegation example spent $0.30 in LLM calls (using claude-haiku-4-5) but did not complete within the 600s budget. The delegated sub-agent may be stuck in a loop or the task is too complex for the timeout.

3. Immediate failures (40_acp_agent_example.py, 41_task_tool_set.py)

Both failed fast (18.2s and 7.1s respectively) with exit code 1, suggesting code errors, missing dependencies, or configuration issues rather than timeouts. Further investigation is needed into their specific error output.

Note: 41_task_tool_set.py is an example that I wrote. It is an interactive example, i.e., the user is expected to provide input. I suppose it is failing for that reason.
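That suspicion is easy to confirm in isolation: with no stdin attached (as in CI), `input()` raises EOFError and the unhandled exception exits with code 1. A small reproduction (the prompt string is illustrative, not from the example):

```python
import subprocess
import sys

# Run a one-liner that waits for user input, but with stdin closed,
# mimicking a non-interactive CI runner.
proc = subprocess.run(
    [sys.executable, "-c", "choice = input('pick a task: ')"],
    stdin=subprocess.DEVNULL,
    capture_output=True,
)
print(proc.returncode)  # 1 (unhandled EOFError)
```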

@enyst
Collaborator

enyst commented Mar 5, 2026

How about this one? Not sure if it's still basically your version, @simonrosenberg, or maybe you rewrote it, @VascoSch92. Just wondering if you guys have an idea what's up here. This used to work, I think; it's number 25:

  1. Agent delegation too slow (25_agent_delegation.py)
    The delegation example spent $0.30 in LLM calls (using claude-haiku-4-5) but did not complete within the 600s budget. The delegated sub-agent may be stuck in a loop or the task is too complex for the timeout.

Xingyao's agent excluded all 4; it just seems like this maybe shouldn't happen.

@xingyaoww
Collaborator

@OpenHands please revert 8a21c3f - most of the issues should've been fixed by the main branch that was just merged in. Can you re-tag test-examples and monitor the result?

@openhands-ai

openhands-ai bot commented Mar 5, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@xingyaoww xingyaoww removed the test-examples label Mar 5, 2026
@xingyaoww xingyaoww added the test-examples label Mar 5, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-05 11:03:18 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 21.6s $0.02
01_standalone_sdk/03_activate_skill.py ✅ PASS 19.5s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 12.6s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 37.9s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 15.9s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 30.2s $0.02
01_standalone_sdk/11_async.py ✅ PASS 31.3s $0.02
01_standalone_sdk/12_custom_secrets.py ✅ PASS 11.2s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 20.3s $0.02
01_standalone_sdk/14_context_condenser.py ❌ FAIL (Exit code 1) 1m 6s --
01_standalone_sdk/17_image_input.py ✅ PASS 16.7s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 29.1s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 13.5s $0.00
01_standalone_sdk/20_stuck_detector.py ✅ PASS 14.8s $0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 9.6s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 13.8s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 19s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 54s $0.38
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 17s $0.09
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.0s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 26.8s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 32.0s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 10.3s $0.00
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 9m 39s $0.72
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 19.4s $0.01
01_standalone_sdk/34_critic_example.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 12.1s $0.01
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.8s $0.00
01_standalone_sdk/38_browser_session_recording.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.04
01_standalone_sdk/39_llm_fallback.py ✅ PASS 10.3s $0.01
01_standalone_sdk/40_acp_agent_example.py ❌ FAIL (Exit code 1) 9.4s --
01_standalone_sdk/41_task_tool_set.py ✅ PASS 28.4s $0.02
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 1m 19s $0.06
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 1m 9s $0.06
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 36s $0.02
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 17s $0.00
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 18s $0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 32.0s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 23s $0.02
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.0s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 49s $0.08
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 12.9s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 23.8s $0.04

❌ Some tests failed

Total: 43 | Passed: 39 | Failed: 4 | Total Cost: $1.91

Failed examples:

  • examples/01_standalone_sdk/14_context_condenser.py: Exit code 1
  • examples/01_standalone_sdk/34_critic_example.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/38_browser_session_recording.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/40_acp_agent_example.py: Exit code 1

View full workflow run

@enyst
Collaborator

enyst commented Mar 5, 2026

@xingyaoww About the API breakage job failing:

I think it's not blocking release, but it's there to warn us and make us look (though we could make it mandatory anytime if it behaves as wanted)

These are the current failures:
[screenshot: the failing API breakage checks]

  • the browser export existed only one release. It was my mistake, I exported all tools, and we reverted it because the CLI doesn't support it.
  • I think the delegate agent methods were also very short-lived. Vasco knows more.

IMHO, I think maybe we could "sin" this time and let these pass without deprecation? 🤔

@xingyaoww
Collaborator

@OpenHands please check https://github.com/OpenHands/software-agent-sdk/actions/runs/22713720951/job/65857938260?pr=2302 and #2302 (comment)

Tell me which commit introduced these and which commit removed these? Tell me the release version before and after these commits. I want to assess the impact of ignoring these errors (eg not marking them deprecated since they are very short lived)

@openhands-ai

openhands-ai bot commented Mar 5, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Mar 5, 2026

No code changes were made — this was a pure investigation/analysis task. Here's the summary:

API Breakage Analysis for PR #2302 (Release v1.12.0)

The CI check SDK programmatic API (Griffe) is failing because 3 symbols were removed from openhands.tools.__all__ without prior deprecation:

Timeline

Symbol             Added by                    Removed by                  Lifespan
BrowserToolSet     23e167d6 (#2080), Feb 15    e25a1ef8 (#2197), Feb 25    1 release
get_agent_factory  23e167d6 (#2080), Feb 15    ab8f5f28 (#2183), Feb 25    1 release
register_agent     23e167d6 (#2080), Feb 15    ab8f5f28 (#2183), Feb 25    1 release

Key Findings

  • All 3 were introduced in a single commit (23e167d6, "ci: extend API breakage checks to openhands-tools"), which was the commit that created the __all__ for openhands-tools in the first place.
  • Before v1.11.5: openhands-tools/__init__.py had no __all__ at all — just a bare version-only init. These symbols were never in any tracked public API before.
  • v1.11.5 (Feb 20, 2026): The only released version containing all 3 symbols.
  • v1.12.0 (pending): All 3 removed.
  • Total lifespan: ~5 days, exactly 1 release.
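To make "no __all__ at all" vs. "tracked public API" concrete: __all__ is the explicit export list that `from pkg import *` honors, and it is the surface that API-diff tools such as Griffe compare between releases. A minimal runnable sketch using a hypothetical module (names are illustrative, not the real SDK symbols):

```python
import sys
import types

# Stand-in for a package __init__.py: only ToolA is declared public.
mod = types.ModuleType("demo_tools")
exec(
    "__all__ = ['ToolA']\n"        # the tracked public API
    "def ToolA(): return 'a'\n"
    "def ToolB(): return 'b'\n"    # importable directly, but not star-exported
    "def _private(): return 'p'\n",
    mod.__dict__,
)
sys.modules["demo_tools"] = mod

ns = {}
exec("from demo_tools import *", ns)
public = sorted(k for k in ns if not k.startswith("__"))
print(public)  # ['ToolA'] -- ToolB and _private are excluded by __all__
```

Before 23e167d6 there was no __all__, so nothing was "public" in this tracked sense; adding it (and then shrinking it) is exactly what the Griffe check flags.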

Removal Reasons

  • BrowserToolSet: Forces heavy browser-use dependencies on downstream consumers (CLI bundle bloat).
  • get_agent_factory + register_agent: Removed as part of delegation refactor — file-based agent definitions replaced programmatic registration.

Impact Assessment

Very low. These were essentially an accidental over-export — added as part of setting up the API breakage CI for openhands-tools, then immediately cleaned up in the next development cycle. They existed for exactly 1 release with no prior public API presence. Skipping the deprecation cycle for these seems safe.
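For completeness, had a deprecation cycle been preferred over outright removal, the usual pattern is a PEP 562 module-level __getattr__ shim that keeps the old name importable for one release while warning. A runnable sketch (module name, target version, and messages are hypothetical):

```python
import sys
import types
import warnings

# Stand-in for openhands/tools/__init__.py built dynamically so the
# sketch is self-contained.
mod = types.ModuleType("openhands_tools_demo")

_REMOVED = {
    # old name -> migration hint shown in the warning
    "BrowserToolSet": "install the browser extra and import it from there",
    "register_agent": "use file-based agent definitions instead",
}

def _getattr(name):
    if name in _REMOVED:
        warnings.warn(
            f"{name} is deprecated and will be removed in a future "
            f"release; {_REMOVED[name]}",
            DeprecationWarning,
            stacklevel=2,
        )
        return object()  # real code would return the legacy object
    raise AttributeError(f"module {mod.__name__!r} has no attribute {name!r}")

# PEP 562: a module-level __getattr__ is consulted for missing attributes.
mod.__getattr__ = _getattr
sys.modules[mod.__name__] = mod

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _ = mod.BrowserToolSet  # old access path still works, but warns
print(caught[0].category.__name__)  # DeprecationWarning
```

Given the ~1-release lifespan found above, skipping this shim is a reasonable call; the sketch just shows what the non-skipped path would have looked like.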

@xingyaoww
Collaborator

@enyst Agreed, it's not release-blocking :) We just need to get the example tests fixed and we'd be good to go.

@xingyaoww xingyaoww added test-examples Run all applicable "examples/" files. Expensive operation. integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results test-examples Run all applicable "examples/" files. Expensive operation. labels Mar 5, 2026
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

🧪 Integration Tests Results

Overall Success Rate: 93.3%
Total Cost: $1.13
Models Tested: 4
Timestamp: 2026-03-05 18:57:28 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Tests Passed | Skipped | Total | Cost | Tokens |
|---|---|---|---|---|---|---|
| litellm_proxy_deepseek_deepseek_reasoner | 100.0% | 7/7 | 1 | 8 | $0.03 | 560,607 |
| litellm_proxy_gemini_3_pro_preview | 100.0% | 8/8 | 0 | 8 | $0.57 | 328,813 |
| litellm_proxy_anthropic_claude_sonnet_4_6 | 87.5% | 7/8 | 0 | 8 | $0.44 | 254,527 |
| litellm_proxy_moonshot_kimi_k2_thinking | 85.7% | 6/7 | 1 | 8 | $0.09 | 289,602 |

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.03
  • Token Usage: prompt: 548,427, completion: 12,180, cache_read: 501,888, reasoning: 4,811
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_6e1cedc_deepseek_v3_2_reasoner_run_N8_20260305_185422
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.57
  • Token Usage: prompt: 320,012, completion: 8,801, cache_read: 95,422, reasoning: 5,729
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_6e1cedc_gemini_3_pro_run_N8_20260305_185422

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.44
  • Token Usage: prompt: 248,733, completion: 5,794, cache_read: 168,632, cache_write: 79,853, reasoning: 1,038
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_6e1cedc_claude_sonnet_4_6_run_N8_20260305_185423

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 85.7% (6/7)
  • Total Cost: $0.09
  • Token Usage: prompt: 283,151, completion: 6,451, cache_read: 224,256
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_6e1cedc_kimi_k2_thinking_run_N8_20260305_185422
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.01)

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-05 19:16:05 UTC

| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 26.6s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 20.6s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 13.2s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 34.3s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 17.8s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 29.1s | $0.02 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 36.4s | $0.04 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 10.1s | $0.01 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 22.4s | $0.01 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 3m 2s | $0.21 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 17.9s | $0.01 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 23.7s | $0.01 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 14.5s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 16.9s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 12.3s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 18.1s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 2m 40s | $0.03 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 2m 49s | $0.18 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 55.7s | $0.06 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 18.8s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 31.6s | $0.03 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 50.0s | $0.04 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 10.4s | $0.01 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 5m 24s | $0.36 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 21.1s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 1m 39s | $0.13 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 10.3s | $0.00 |
| 01_standalone_sdk/37_llm_profile_store.py | ✅ PASS | 5.1s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 56.1s | $0.02 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 11.8s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 34.7s | $0.14 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 32.0s | $0.03 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 1m 27s | $0.09 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 53.0s | $0.04 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 37s | $0.04 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 54.9s | $0.00 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 29s | $0.03 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 36.3s | $0.03 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 14s | $0.01 |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 26.0s | $0.03 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 58.7s | $0.05 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 14.1s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 23.6s | $0.03 |

✅ All tests passed!

Total: 43 | Passed: 43 | Failed: 0 | Total Cost: $1.91

View full workflow run

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

Evaluation Triggered

  • Trigger: Release v1.12.0
  • SDK: 6e1cedc
  • Eval limit: 50
  • Models: claude-sonnet-4-5-20250929

@xingyaoww xingyaoww merged commit db9f0a7 into main Mar 5, 2026
71 of 72 checks passed
@xingyaoww xingyaoww deleted the rel-1.12.0 branch March 5, 2026 19:27
zparnold pushed a commit to zparnold/software-agent-sdk that referenced this pull request Mar 5, 2026
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Xingyao Wang <xingyao@all-hands.dev>