feat: add 24 Karen sales scenario benchmarks by deepmasq · Pull Request #299 · smallcloudai/flexus-client-kit

deepmasq · 2026-04-09T11:31:48Z

Summary

Add 24 new very_limited__*.yaml scenario files covering diverse industry verticals
Each scenario includes: company context (cd_instruction), multi-turn conversation, expected tool calls, and judge instructions
Establishes baseline benchmark: avg 6.0/10 on staging with grok-4-1-fast-reasoning, measured across 25 scenarios (the 24 new ones plus the existing actual_support)
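As a rough illustration of the four parts each scenario is said to contain, here is a minimal Python sketch that checks a parsed scenario for them. Only `cd_instruction` is a name taken from this PR; `messages`, `expected_tool_calls`, `judge_instructions`, and the helper names are hypothetical stand-ins, and the actual YAML keys in the repository may differ.

```python
from dataclasses import dataclass

# Only "cd_instruction" comes from the PR text; the other three key names
# are assumptions made for this sketch.
REQUIRED_KEYS = (
    "cd_instruction",
    "messages",
    "expected_tool_calls",
    "judge_instructions",
)


@dataclass
class Scenario:
    cd_instruction: str        # company context
    messages: list             # multi-turn conversation
    expected_tool_calls: list  # e.g. vector search, CRM, kanban
    judge_instructions: str    # guidance for automated scoring


def missing_fields(raw: dict) -> list:
    """Return the required keys that are absent or empty in a parsed scenario."""
    return [key for key in REQUIRED_KEYS if not raw.get(key)]


def load_scenario(raw: dict) -> Scenario:
    """Build a Scenario from an already-parsed mapping, or fail loudly."""
    missing = missing_fields(raw)
    if missing:
        raise ValueError(f"scenario is missing: {', '.join(missing)}")
    return Scenario(**{key: raw[key] for key in REQUIRED_KEYS})
```

The sketch takes an already-parsed mapping so that it does not depend on any particular YAML loader.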

Verticals covered

adtech, aerospace, agriculture, automotive, construction, CPG, edtech, energy, fintech, gaming, government, healthcare, hospitality, insurance, legal, logistics, manufacturing, media, nonprofit, retail, sales qualification, shopify, support refund, telecom

Test plan

All 24 scenarios validated on staging — scores range 0-9/10
Review scenario quality (judge instructions, happy path completeness)
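To support the review step, per-scenario judge scores could be summarized along these lines. This is a sketch only: the `summarize` helper, the `review_below` threshold, and the idea of a name-to-score mapping are assumptions for illustration and are not part of this PR.

```python
from statistics import mean


def summarize(scores: dict, review_below: int = 5) -> dict:
    """Summarize 0-10 judge scores and list scenarios scoring under a threshold.

    `scores` maps a scenario name to its score; `review_below` is an
    arbitrary cut-off chosen for this sketch.
    """
    values = list(scores.values())
    return {
        "avg": round(mean(values), 1),
        "min": min(values),
        "max": max(values),
        "needs_review": sorted(
            name for name, score in scores.items() if score < review_below
        ),
    }
```

Scenarios listed under `needs_review` would be natural starting points when checking judge instructions and happy path completeness.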

🤖 Generated with Claude Code

Add very_limited expert scenarios covering 24 industry verticals: adtech, aerospace, agriculture, automotive, construction, CPG, edtech, energy, fintech, gaming, government, healthcare, hospitality, insurance, legal, logistics, manufacturing, media, nonprofit, retail, sales qualification, shopify, support refund, telecom.

Each scenario defines a realistic sales conversation with cd_instruction (company data), user messages, expected tool calls (vector search, CRM, kanban), and judge instructions for automated scoring.

Baseline benchmark on staging with grok-4-1-fast-reasoning: avg 6.0/10 across 25 scenarios (including existing actual_support).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

deepmasq wants to merge 1 commit into main from feat/karen-sales-scenario-suite