Qwen3-Coder-480B-A35B-Instruct (Alibaba)

Provider: Synthetic.new (via Fireworks) Context window: 256k tokens Task: Multi-step tool calling chain (weather → Bulgarian description → markdown file → email with attachment)

Score: 4.5 / 10

Criteria	Score	Max
Correct tool call sequence	3	3
Tool call accuracy	1	3
Final output quality	0	2
Efficiency & communication	0.5	2

Execution

Time: ~1 min 30 sec
Cost: Fixed subscription ($20–60/mo on Synthetic.new)
User interventions: 1 (told model attachment was missing)

What happened

Steps 1–3 completed, but step 4 (email with attachment) failed twice:

Weather data — Retrieved successfully (9°C, fog, high humidity).
Bulgarian description — Created.
Markdown file — Created, but file path not provided to user — file was inaccessible.
Email — Sent twice, both times without the attachment. After user feedback ("файла нет"), the model retried but made the same mistake.

Critical failures

Attachment not sent — twice. The model either formed the API call incorrectly or couldn't handle the attachment parameter.
No error recovery — After being told the attachment was missing, it repeated the same broken approach instead of trying an alternative.
File inaccessible — The markdown file was created somewhere on the server but the path was never shared with the user.
User got nothing — No file, no attachment, weather data only visible in terminal output.

Notable behavior

Did not show thinking process — just produced final output directly.
Switched to Bulgarian in terminal output without being asked — violated the communication language (Russian).
Email body text was decent: "Това е автоматично генериран отчет за времето в София."

Verdict

Formally went through the steps in the correct order, but the user ended up with nothing. For an agent, this is a failure — the task was not completed, and the model couldn't self-correct even after explicit feedback. The coding-focused model paradoxically couldn't handle a practical tool-calling task.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qwen3-Coder-480B-A35B-Instruct (Alibaba)

Score: 4.5 / 10

Execution

What happened

Critical failures

Notable behavior

Verdict

FilesExpand file tree

qwen3-coder-480b.md

Latest commit

History

qwen3-coder-480b.md

File metadata and controls

Qwen3-Coder-480B-A35B-Instruct (Alibaba)

Score: 4.5 / 10

Execution

What happened

Critical failures

Notable behavior

Verdict