feat: expand eval coverage for backend SDKs and SvelteKit #78
Switch all SKILL.md WebFetch URLs from github.com/blob/ (HTML) to raw.githubusercontent.com (plain text) for cleaner parsing. Add tests/fixtures/README.md documenting fixture state conventions and per-language guidance for upcoming backend SDK eval expansion.
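The URL switch described above follows GitHub's standard mapping from a blob page to its raw file. A minimal sketch of that mapping, assuming the usual owner/repo/blob/ref/path layout; the helper name toRawGitHubUrl and the example-org/example-repo URL are hypothetical and not part of this PR:

```typescript
// Hypothetical helper (not from this PR): convert a GitHub "blob" page URL,
// which returns HTML, into the raw.githubusercontent.com URL for the same
// file, which returns plain text.
function toRawGitHubUrl(blobUrl: string): string {
  const url = new URL(blobUrl);
  // Leave anything that is not a github.com URL untouched.
  if (url.hostname !== "github.com") return blobUrl;

  const parts = url.pathname.split("/").filter(Boolean);
  // Expected shape: owner / repo / "blob" / ref / ...path
  if (parts.length < 5 || parts[2] !== "blob") return blobUrl;

  // Skip the "blob" segment; `rest` holds the ref followed by the file path.
  const [owner, repo, , ...rest] = parts;
  return `https://raw.githubusercontent.com/${owner}/${repo}/${rest.join("/")}`;
}

// Placeholder repository, used only for illustration.
const rawReadme = toRawGitHubUrl(
  "https://github.com/example-org/example-repo/blob/main/README.md",
);
// rawReadme === "https://raw.githubusercontent.com/example-org/example-repo/main/README.md"
```

URLs that do not match the blob layout are returned unchanged, so the helper is safe to run over a mixed list of links.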
Add partial-install and conflicting-auth fixtures for all 8 backend SDKs (Node, Python, Ruby, Go, PHP, PHP-Laravel, Kotlin, Elixir) and expand SvelteKit from 1 to 5 test states. Each backend SDK now has 4 eval states (up from 2), matching frontend skill coverage. Includes:
- 20 new fixture directories with validated, buildable projects
- SKILL.md updates with partial install recovery and conflicting auth detection sections for 9 skills
- Grader bonus checks for preserved routes and conflicting auth
- 24 new eval scenarios registered in runner.ts
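The backend coverage described above can be pictured as a skill-by-state matrix. A rough sketch, not the actual contents of runner.ts: the Scenario shape, the buildBackendScenarios function, the lowercase skill identifiers, and the tests/fixtures/<skill>/<state> directory layout are all assumptions made for illustration.

```typescript
// Illustrative only: the real registration in runner.ts may look different.
// Skill identifiers and the fixture path layout are assumptions.
const BACKEND_SDKS = [
  "node", "python", "ruby", "go", "php", "php-laravel", "kotlin", "elixir",
] as const;

// The two pre-existing states plus the two states added by this PR.
const EXISTING_STATES = ["example", "example-auth0"] as const;
const NEW_STATES = ["partial-install", "conflicting-auth"] as const;

interface Scenario {
  skill: string;
  state: string;
  fixtureDir: string; // assumed layout: tests/fixtures/<skill>/<state>
}

function buildBackendScenarios(): Scenario[] {
  const states = [...EXISTING_STATES, ...NEW_STATES];
  // One scenario per (skill, state) pair.
  return BACKEND_SDKS.flatMap((skill) =>
    states.map((state) => ({
      skill,
      state,
      fixtureDir: `tests/fixtures/${skill}/${state}`,
    })),
  );
}

const scenarios = buildBackendScenarios();
const added = scenarios.filter((s) =>
  (NEW_STATES as readonly string[]).includes(s.state),
);
// 8 SDKs x 4 states = 32 backend scenarios, 16 of them new.
console.log(scenarios.length, added.length);
```

Generating the matrix from two lists keeps every SDK on the same set of states, which is the parity with frontend skills that this change is after.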
Align React, React Router, TanStack Start, and Vanilla JS skills to match the Next.js verification checklist pattern: numbered bash commands with comments, "(ALL MUST PASS)" header, and recovery guidance for critical checks.
v1.x-v2.x require illuminate/support ^5-9 via workos-php, which conflicts with Laravel 11. v5.x requires workos-php ^4.29 with no illuminate/support constraint, resolving the conflict.
Steps 3, 4, and error recovery still referenced the old @workos-inc namespace. The npm package is @workos/authkit-sveltekit.
Summary
Problem
Backend SDK skills (Node, Python, Ruby, Go, PHP, PHP-Laravel, Kotlin, Elixir) had only 2 eval test states each (example and example-auth0), while frontend skills had 4-6 states covering partial installs, conflicting middleware, auth migrations, and strict TypeScript. This left edge cases untested for backend integrations: partial installs, conflicting auth systems, and framework-specific gotchas that the evals didn't catch.
SvelteKit had just 1 test state (example), far behind all other frontend frameworks.
Additionally, skill content quality was inconsistent: README fetch URLs mixed HTML and plaintext formats, verification checklist structures varied across skills, and backend skills lacked the edge case coverage that makes frontend skills robust.
Changes
Eval Coverage Expansion (24 new scenarios)
Backend SDKs, 16 new fixtures (2 per SDK):
- partial-install
- conflicting-auth
SvelteKit, 4 new fixtures:
- example-auth0: Auth0 SPA auth with hooks and callback route
- partial-install: SDK installed but hooks.server.ts exports a passthrough, no callback
- conflicting-auth: full Lucia v3 auth with login, logout, dashboard, session cookies
- typescript-strict: strict TS with noImplicitAny, strictNullChecks, exactOptionalPropertyTypes
Skill Content Improvements (14 skills updated)
9 skills (8 backend SDKs plus SvelteKit): added two new sections to each:
- Partial install recovery
- Conflicting auth detection
5 frontend skills — standardized verification checklist format to match Next.js pattern (numbered bash commands, "(ALL MUST PASS)" header, recovery guidance)
Grader Updates (9 graders)
Added bonus checks (non-blocking) for:
- Preserved routes
- Conflicting auth
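"Non-blocking" here presumably means a failed bonus check is reported but does not fail the scenario. A minimal sketch of that idea; the Check and GradeResult types, the grade function, and the check names are hypothetical and do not reflect the graders' actual API:

```typescript
// Hypothetical grader shape, for illustration only.
interface Check {
  name: string;
  passed: boolean;
}

interface GradeResult {
  passed: boolean;        // determined by required checks only
  failures: string[];     // names of required checks that failed
  bonusEarned: string[];  // names of bonus checks that passed
}

function grade(required: Check[], bonus: Check[]): GradeResult {
  const failures = required.filter((c) => !c.passed).map((c) => c.name);
  const bonusEarned = bonus.filter((c) => c.passed).map((c) => c.name);
  // Bonus checks never affect `passed`; that is what makes them non-blocking.
  return { passed: failures.length === 0, failures, bonusEarned };
}

// All required checks pass and one bonus check fails: the scenario still passes.
const result = grade(
  [{ name: "sdk-installed", passed: true }],
  [
    { name: "preserved-routes", passed: true },
    { name: "conflicting-auth-detected", passed: false },
  ],
);
// result.passed === true, result.bonusEarned deep-equals ["preserved-routes"]
```

Keeping bonus results in a separate field lets the runner surface them in reports without changing pass/fail counts.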
Infrastructure
- Switched SKILL.md WebFetch URLs from github.com/blob/ (HTML) to raw.githubusercontent.com (plaintext)
- Added tests/fixtures/README.md documenting fixture state conventions and per-language guidance
- Registered the 24 new eval scenarios in runner.ts
Results
Full eval suite: 62/62 passing (98.4% first-attempt, 100% with-correction)
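As a quick check on those percentages: 98.4% is what 61 of 62 works out to, which suggests a single scenario needed a correction pass. The count of 61 is an inference from the rounded figure and is not stated in the PR.

```typescript
// Assumption: 61 first-attempt passes, inferred from the reported 98.4%.
const total = 62;
const firstAttemptPasses = 61;

// Format a pass count as a percentage with one decimal place.
function passRate(passes: number, outOf: number): string {
  return ((passes / outOf) * 100).toFixed(1) + "%";
}

const firstAttemptRate = passRate(firstAttemptPasses, total); // "98.4%"
const withCorrectionRate = passRate(total, total);            // "100.0%"
console.log(firstAttemptRate, withCorrectionRate);
```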