Skip to content

feat: add 24 Karen sales scenario benchmarks#299

Draft
deepmasq wants to merge 1 commit intomainfrom
feat/karen-sales-scenario-suite
Draft

feat: add 24 Karen sales scenario benchmarks#299
deepmasq wants to merge 1 commit intomainfrom
feat/karen-sales-scenario-suite

Conversation

@deepmasq
Copy link
Copy Markdown
Contributor

@deepmasq deepmasq commented Apr 9, 2026

Summary

  • Add 24 new very_limited__*.yaml scenario files covering diverse industry verticals
  • Each scenario includes: company context (cd_instruction), multi-turn conversation, expected tool calls, and judge instructions
  • Establishes baseline benchmark: avg 6.0/10 on staging with grok-4-1-fast-reasoning

Verticals covered

adtech, aerospace, agriculture, automotive, construction, CPG, edtech, energy, fintech, gaming, government, healthcare, hospitality, insurance, legal, logistics, manufacturing, media, nonprofit, retail, sales qualification, shopify, support refund, telecom

Test plan

  • All 24 scenarios validated on staging — scores range 0-9/10
  • Review scenario quality (judge instructions, happy path completeness)

🤖 Generated with Claude Code

Add very_limited expert scenarios covering 24 industry verticals:
adtech, aerospace, agriculture, automotive, construction, CPG, edtech,
energy, fintech, gaming, government, healthcare, hospitality, insurance,
legal, logistics, manufacturing, media, nonprofit, retail, sales
qualification, shopify, support refund, telecom.

Each scenario defines a realistic sales conversation with cd_instruction
(company data), user messages, expected tool calls (vector search, CRM,
kanban), and judge instructions for automated scoring.

Baseline benchmark on staging with grok-4-1-fast-reasoning: avg 6.0/10
across 25 scenarios (including existing actual_support).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant