Pocket guIDE — Round 2 Benchmark: Complex Web Automation

Test Parameters

Parameter	Value
Task	Navigate Domino's Pizza website, explore menus, find prices, fill forms, take screenshots, document findings
Prompt	"Go to dominos.com and navigate through the pizza ordering process. Browse the menu pages, find specialty pizzas and prices, explore customization options, and fill out any forms you encounter (delivery address, pizza builder, etc.) without actually placing an order. Document all your findings — page titles, menu items, prices, forms, and options — in a file called dominos-research.md. Take screenshots of key pages."
Date	2025-07-22
Platform	Pocket guIDE (pocket.graysoft.dev)
Browser	Chromium (headless, Playwright-managed)

Results Summary

Model	Tools Used	Failed	Tokens	Output Size	Screenshots	Agent Plan	Rating
GPT-OSS 120B	56	3	1.4M	4.1 KB (78 lines)	6 actual PNGs	Yes (9 steps)	★★★★★
Qwen 3 32B	11	4	~800K	~2 KB	Referenced	Yes (9 steps)	★★★★☆
ZAI GLM 4.7	24	3	714K	~1.5 KB (40 lines)	None	No	★★★☆☆
Qwen 3 235B	7	2	1.2M	~1.5 KB	Placeholder refs	No	★★½☆☆
Llama 3.3 70B	8	3	731K	~0.8 KB (20 lines)	None	No	★★☆☆☆
Llama 3.1 8B	48	8	1.1M	~0.2 KB (4 lines)	None	No	★☆☆☆☆

Detailed Model Analysis

1. GPT-OSS 120B (Reasoning) — ★★★★★

The clear winner. Best output quality by a wide margin.

Tools: 56 (3 failed) — highest tool count, but purposeful and systematic
Tokens: 1.4M
Output: 78-line professional Markdown document with full page-by-page documentation
Highlights:
- Created an Agent Plan (9 steps) before executing — showed strategic thinking
- 6 actual screenshots saved to /workspace/screenshots/ (home, menu, pizza, specialty, build-your-own, customization)
- Specific prices with size breakdowns: Personal $5.99, Small $9.99, Medium $12.99, Large $14.99
- Form details documented at HTML level: radio buttons for size/crust/sauce/cheese, checkboxes for toppings, price per topping ($1.00 meat, $0.75 veggie), surcharges (Gluten-Free +$1.00)
- Complete specialty pizza list with all toppings for each variety (11 pizzas)
- Multiple pages visited: Home, Menu, Pizza, Specialty, Build Your Own, Deals, Store Locator, Careers, Contact
- Clean formatting with URLs, titles, and cross-references to screenshots
Weaknesses: Highest token consumption (1.4M), most tools used

2. Qwen 3 32B — ★★★★☆

Best efficiency-to-quality ratio. Strong structured output with minimal tool usage.

Tools: 11 (4 failed) — very efficient
Tokens: ~800K
Output: Well-structured doc with real prices and step-by-step documentation
Highlights:
- Created a 9-step Agent Plan before executing (like GPT-OSS)
- Real prices found: Medium $10.99, specialty crusts $12.49, toppings $2.50-$3.00
- Used thinking chains (<think> blocks) to reason about navigation strategy
- Good balance of breadth (multiple pages) and depth (prices, options)
- Most efficient model that still produced quality output
Weaknesses: Some failed tools (4/11), screenshot references but not actual files, less detail than GPT-OSS

3. ZAI GLM 4.7 (Default) — ★★★☆☆

Solid baseline performer. Functional output with reasonable coverage.

Tools: 24 (3 failed)
Tokens: 714K — most token-efficient of all models
Output: 40-line document with page titles, menu items, offers, and forms
Highlights:
- Found Mix & Match Deal at $6.99 each
- Filled a delivery address form (123 Main Street, ZIP 10001) — only model to actually fill address fields
- Documented offers, menu categories, and navigation structure
- Consistent and reliable execution
Weaknesses: No screenshots, no prices for individual pizzas, less depth than top performers

4. Qwen 3 235B (Preview) — ★★½☆☆

Underwhelming for its size. Generic output despite being the largest Qwen model.

Tools: 7 (2 failed) — fewest tools used
Tokens: 1.2M — high token usage for low output
Output: ~1.5 KB with page titles and category names but lacking depth
Highlights:
- Mentioned specific pizza names: "Create Your Own Pizza", "MeatZZa", "Honolulu Hawaiian"
- Documented page titles and general categories
- Fewest tool failures (2)
Weaknesses: No prices anywhere, placeholder screenshot references (not actual files), very generic descriptions, poor token efficiency (1.2M tokens for minimal output), did not explore customization or forms

5. Llama 3.3 70B — ★★☆☆☆

Minimal effort. Acknowledged limitations but didn't push through them.

Tools: 8 (3 failed)
Tokens: 731K
Output: 20-line document with basic page titles and a few prices
Highlights:
- Found Medium Handmade Pan Pizza at $13.99, toppings at $1.50
- Honest about limitations — noted "some pages weren't navigated"
- At least documented what it found accurately
Weaknesses: Only 20 lines of output, didn't explore customization forms, no screenshots, gave up too early on navigation challenges, low page coverage

6. Llama 3.1 8B (Fast) — ★☆☆☆☆

Failed the task. High activity, near-zero useful output.

Tools: 48 (8 failed) — second-highest tool count but wasteful
Tokens: 1.1M — very poor efficiency
Output: 4 lines — just a "Task Completion" checklist
Highlights:
- None — the model overwrote its own file multiple times, losing accumulated content each time
- High tool count (48) indicates frantic, uncoordinated activity
- 8 failed tools — highest failure rate (17%)
Weaknesses: Completely failed to produce useful documentation. Demonstrated inability to maintain state across tool calls. The file was rewritten from scratch multiple times, finally ending with a 4-line summary that contained no actual research data. Worst token efficiency of any model.

Key Findings

What Separates Good from Great

Agent Planning: Models that created an Agent Plan (GPT-OSS 120B, Qwen 3 32B) produced significantly better structured output
Screenshot capture: Only GPT-OSS 120B actually saved screenshot files — visual documentation is a differentiator
Price specificity: Top models found actual dollar amounts; weaker models only listed category names
Form interaction: Only ZAI GLM 4.7 filled address fields; GPT-OSS 120B documented form structure at HTML level
File management: Llama 3.1 8B's repeated file overwrites show that smaller models struggle with stateful multi-step workflows

Tool Efficiency Rankings

Model	Tools/KB Output	Assessment
Qwen 3 32B	~5.5	Most efficient
ZAI GLM 4.7	~16	Good
Qwen 3 235B	~4.7	Efficient tools, poor output
Llama 3.3 70B	~10	Average
GPT-OSS 120B	~13.7	Justified by quality
Llama 3.1 8B	~240	Extremely wasteful

Token Efficiency Rankings

Model	Tokens/KB Output	Assessment
ZAI GLM 4.7	~476K/KB	Best
Llama 3.3 70B	~914K/KB	Average
Qwen 3 32B	~400K/KB	Excellent
GPT-OSS 120B	~341K/KB	Best per KB
Qwen 3 235B	~800K/KB	Poor
Llama 3.1 8B	~5.5M/KB	Catastrophic

Combined Rankings (Round 1 + Round 2)

Model	Round 1 (Web+Game)	Round 2 (Web Automation)	Overall
GPT-OSS 120B	2nd	1st ★★★★★	1st
Qwen 3 32B	3rd	2nd ★★★★☆	2nd
ZAI GLM 4.7	1st	3rd ★★★☆☆	3rd
Qwen 3 235B	4th	4th ★★½☆☆	4th
Llama 3.3 70B	5th	5th ★★☆☆☆	5th
Llama 3.1 8B	6th	6th ★☆☆☆☆	6th

Analysis

GPT-OSS 120B rises from 2nd to 1st overall — its reasoning capabilities shine on complex multi-step tasks
Qwen 3 32B is the sleeper hit — excellent quality at lower cost, best efficiency-to-quality ratio
ZAI GLM 4.7 drops from 1st to 3rd — strong on creative tasks but less dominant on structured research
Qwen 3 235B underperforms its size — the "preview" label may indicate it's not fully optimized
Llama models consistently rank at the bottom for agentic web tasks — the 8B model in particular cannot handle stateful multi-step workflows

Benchmark conducted on Pocket guIDE v2.0 using Groq cloud inference. All models used identical prompts and browser environment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pocket guIDE — Round 2 Benchmark: Complex Web Automation

Test Parameters

Results Summary

Detailed Model Analysis

1. GPT-OSS 120B (Reasoning) — ★★★★★

2. Qwen 3 32B — ★★★★☆

3. ZAI GLM 4.7 (Default) — ★★★☆☆

4. Qwen 3 235B (Preview) — ★★½☆☆

5. Llama 3.3 70B — ★★☆☆☆

6. Llama 3.1 8B (Fast) — ★☆☆☆☆

Key Findings

What Separates Good from Great

Tool Efficiency Rankings

Token Efficiency Rankings

Combined Rankings (Round 1 + Round 2)

Analysis

FilesExpand file tree

BENCHMARK_ROUND2_WEB_AUTOMATION.md

Latest commit

History

BENCHMARK_ROUND2_WEB_AUTOMATION.md

File metadata and controls

Pocket guIDE — Round 2 Benchmark: Complex Web Automation

Test Parameters

Results Summary

Detailed Model Analysis

1. GPT-OSS 120B (Reasoning) — ★★★★★

2. Qwen 3 32B — ★★★★☆

3. ZAI GLM 4.7 (Default) — ★★★☆☆

4. Qwen 3 235B (Preview) — ★★½☆☆

5. Llama 3.3 70B — ★★☆☆☆

6. Llama 3.1 8B (Fast) — ★☆☆☆☆

Key Findings

What Separates Good from Great

Tool Efficiency Rankings

Token Efficiency Rankings

Combined Rankings (Round 1 + Round 2)

Analysis