Skip to content

Latest commit

 

History

History
46 lines (32 loc) · 2.21 KB

File metadata and controls

46 lines (32 loc) · 2.21 KB

Qwen3-Coder-480B-A35B-Instruct (Alibaba)

Provider: Synthetic.new (via Fireworks) Context window: 256k tokens Task: Multi-step tool calling chain (weather → Bulgarian description → markdown file → email with attachment)

Score: 4.5 / 10

Criteria Score Max
Correct tool call sequence 3 3
Tool call accuracy 1 3
Final output quality 0 2
Efficiency & communication 0.5 2

Execution

  • Time: ~1 min 30 sec
  • Cost: Fixed subscription ($20–60/mo on Synthetic.new)
  • User interventions: 1 (told model attachment was missing)

What happened

Steps 1–3 completed, but step 4 (email with attachment) failed twice:

  1. Weather data — Retrieved successfully (9°C, fog, high humidity).
  2. Bulgarian description — Created.
  3. Markdown file — Created, but file path not provided to user — file was inaccessible.
  4. Email — Sent twice, both times without the attachment. After user feedback ("файла нет"), the model retried but made the same mistake.

Critical failures

  • Attachment not sent — twice. The model either formed the API call incorrectly or couldn't handle the attachment parameter.
  • No error recovery — After being told the attachment was missing, it repeated the same broken approach instead of trying an alternative.
  • File inaccessible — The markdown file was created somewhere on the server but the path was never shared with the user.
  • User got nothing — No file, no attachment, weather data only visible in terminal output.

Notable behavior

  • Did not show thinking process — just produced final output directly.
  • Switched to Bulgarian in terminal output without being asked — violated the communication language (Russian).
  • Email body text was decent: "Това е автоматично генериран отчет за времето в София."

Verdict

Formally went through the steps in the correct order, but the user ended up with nothing. For an agent, this is a failure — the task was not completed, and the model couldn't self-correct even after explicit feedback. The coding-focused model paradoxically couldn't handle a practical tool-calling task.