test(eval): add golden fixtures and pnpm eval harness #72
amirbahador-hub wants to merge 21 commits into
Shared, validated contract for CV Builder surfaces: Resume, JobDescription, Archetype, Issue, Claim, and EvalResult (with required rubric/archetype versions). Closes #47
Scoring brain the prompts reference: rubric v1 (six weighted dimensions with 0-5 anchors), three role archetypes (Software Engineer, Product Manager, Data & ML Engineer), keyword-based detectArchetype, and the ATS and claim validator specs. Closes #63
The three Phase 1 prompts a power user's agent runs. Markdown templates are the source of truth; renderers inject the live rubric, archetype weights, and claim rules so each assembled prompt is self-contained and stays in sync with the intelligence package. Every prompt asks for JSON matching its schema. Closes #64
Clone the repo, open Claude Code, run /evaluate-cv: the cv-evaluation skill runs extract -> detect -> score -> validate-claims locally, reading the prompts and rubric straight from the repo (no build). A SessionStart hook greets the user with the available commands on open. Skill subfiles point at the package sources to avoid drift. Closes #65
Five fixtures across the three archetypes (including an ATS-hostile resume) guard the deterministic layer: for each, the detected archetype and ATS read must match expectations and the resume must parse against the schema. No LLM involved. Wired as a pnpm eval task and a CI step. Closes #66
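A minimal sketch of the kind of deterministic fixture check described above. The fixture shape, the keyword lists, and the two stand-in functions are assumptions made for illustration; the repo's real `detectArchetype`, ATS check, and `expected.json` layout may differ.

```typescript
// Hypothetical shapes: these field names are assumptions, not the repo's schema.
type Expected = { archetype: string; atsFriendly: boolean };
type Fixture = { name: string; resume: string; expected: Expected };

// Stand-in keyword table (illustrative only).
const KEYWORDS: Record<string, string[]> = {
  "software-engineer": ["typescript", "api", "backend"],
  "product-manager": ["roadmap", "stakeholder", "okr"],
};

// Stand-in detector: picks the archetype with the most keywords present.
function detectArchetypeSketch(text: string): string {
  const lower = text.toLowerCase();
  let best = "";
  let bestScore = -1;
  for (const [id, words] of Object.entries(KEYWORDS)) {
    const score = words.filter((w) => lower.includes(w)).length;
    if (score > bestScore) {
      best = id;
      bestScore = score;
    }
  }
  return best;
}

// Stand-in ATS read: any markdown table row counts as ATS-hostile.
function atsFriendlySketch(text: string): boolean {
  return !/^\s*\|.*\|\s*$/m.test(text);
}

// Compares the deterministic outputs with the golden expectations.
// Returns one message per mismatch; an empty array means the fixture passes.
function checkFixture(f: Fixture): string[] {
  const failures: string[] = [];
  const archetype = detectArchetypeSketch(f.resume);
  if (archetype !== f.expected.archetype) {
    failures.push(`${f.name}: archetype ${archetype} != ${f.expected.archetype}`);
  }
  const ats = atsFriendlySketch(f.resume);
  if (ats !== f.expected.atsFriendly) {
    failures.push(`${f.name}: ATS read ${ats} != ${f.expected.atsFriendly}`);
  }
  return failures;
}
```

Because nothing here calls an LLM, the same inputs always produce the same result, which is what makes a check like this suitable as a CI gate.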
@coderabbitai review
✅ Action performed: Review finished.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Adds Phase 1 local CV evaluation: Zod schemas, intelligence (rubric, archetypes, detectors, validators), prompt templates and renderers, eval golden fixtures and tests, Claude Code skill/commands and docs, and CI/turbo wiring to run deterministic evals.
Changes
CV Builder Phase 1: Schemas, Intelligence, Prompts, Eval, and CLI
Sequence Diagram(s)
sequenceDiagram
participant UserCLI
participant PromptRenderer
participant ExtractPrompt
participant ScorePrompt
participant ValidateClaimsPrompt
participant EvalHarness
UserCLI->>PromptRenderer: request renderExtractPrompt()
PromptRenderer->>ExtractPrompt: load template
UserCLI->>PromptRenderer: request renderScorePrompt(archetype, jdKeywords?)
PromptRenderer->>ScorePrompt: inject RUBRIC, WEIGHTS, JD_KEYWORDS
UserCLI->>PromptRenderer: request renderValidateClaimsPrompt()
PromptRenderer->>ValidateClaimsPrompt: inject CLAIM_RULES
EvalHarness->>ScorePrompt: run scoring prompt (local LLM/adapter)
EvalHarness->>ValidateClaimsPrompt: run claims validation (local LLM/adapter)
EvalHarness-->>UserCLI: validated EvalResult JSON
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 6
🧹 Nitpick comments (2)
packages/prompts/src/index.ts (1)
50-58: ⚡ Quick win: Use .replaceAll() consistently across all template placeholder replacements.
Line 55 uses .replace() for {{WEIGHTS}}, while lines 51–54 and 56–57 use .replaceAll(). Functionally equivalent (each placeholder appears once), but standardizing to .replaceAll() improves readability and prevents bugs if placeholders are ever duplicated in the template.
♻️ Proposed fix: standardize to `.replaceAll()`
  return loadTemplate("score")
    .replaceAll("{{ARCHETYPE_NAME}}", archetype.name)
    .replaceAll("{{ARCHETYPE_ID}}", archetype.id)
    .replaceAll("{{ARCHETYPE_VERSION}}", archetype.version)
    .replaceAll("{{RUBRIC_VERSION}}", RUBRIC_VERSION)
-   .replace("{{WEIGHTS}}", weights)
+   .replaceAll("{{WEIGHTS}}", weights)
-   .replace("{{RUBRIC}}", rubric)
+   .replaceAll("{{RUBRIC}}", rubric)
-   .replace("{{JD_KEYWORDS}}", keywords);
+   .replaceAll("{{JD_KEYWORDS}}", keywords);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/prompts/src/index.ts` around lines 50 - 58, The template replacement chain returned from loadTemplate("score") mixes .replace() and .replaceAll(), so change the .replace("{{WEIGHTS}}", weights) call to .replaceAll("{{WEIGHTS}}", weights) to be consistent with the other calls (e.g., the existing .replaceAll for "{{ARCHETYPE_NAME}}", "{{ARCHETYPE_ID}}", "{{ARCHETYPE_VERSION}}", "{{RUBRIC_VERSION}}", "{{RUBRIC}}", and "{{JD_KEYWORDS}}") to avoid subtle bugs if placeholders are duplicated.
CLAUDE.md (1)
3-3: 💤 Low value: Consider hyphenating "open source" as a compound adjective.
The LanguageTool linter suggests using a hyphen when joining words that form a compound adjective. In this context, "open-source resume evaluator" would be more formally correct than "open source resume evaluator" when used as a compound modifier before the noun.
✏️ Proposed improvement
-Open source, privacy-first resume evaluator. This repo doubles as a **power-user
+Open-source, privacy-first resume evaluator. This repo doubles as a **power-user
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CLAUDE.md` at line 3, The phrase "open source resume evaluator" in CLAUDE.md should be hyphenated when used as a compound adjective; update the text to "open-source resume evaluator" (and scan for any other occurrences of "open source" used as a modifier) to conform to compound-adjective style.
Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/skills/cv-evaluation/SKILL.md:
- Line 8: Update the description line that currently reads "Run a resume through
the same three-step evaluation" to "Run a resume through the same four-step
evaluation" so it matches the Pipeline section which lists Extract, Detect,
Score, and Validate Claims; locate the phrase in SKILL.md (the sentence
beginning "Run a resume through the same three-step evaluation") and change
"three-step" to "four-step."
In `@apps/cli/README.md`:
- Around line 21-23: The README code fences for the CLI examples are missing
language specifiers (MD040); update the backtick fences around the commands
shown (the examples at the blocks containing "/evaluate-cv ./my-resume.pdf" and
the other block around lines 27-29) to include a language tag such as "bash"
(e.g., change ``` to ```bash) so the code blocks are properly declared for
markdownlint and syntax highlighting.
In `@packages/eval/src/__tests__/fixtures.test.ts`:
- Around line 43-69: Add deterministic acceptance assertions to each fixture
test: for each fixture f in fixtures, assert that f.expected.rubricWeights
exists and is valid (e.g., keys present and numeric weights sum to 1 within a
small epsilon) and that f.expected.requiredFindings is present and every key
from that array appears in the parsed findings for the resume; locate the test
loop over fixtures and add checks after the "parses as a Resume" assertion using
the existing symbols fixtures, f.expected.rubricWeights,
f.expected.requiredFindings, ResumeSchema (use the parsed result), and any
existing parsing function that yields findings to verify required keys and
numeric weight validity deterministically.
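A hedged sketch of the two deterministic checks requested above. The helper names are hypothetical, and treating negative weights as invalid is an added assumption; only the "sum to 1 within a small epsilon" and "every required key appears" rules come from the comment.

```typescript
// Hypothetical helper: true when the weights are finite, non-negative
// (an assumption), and sum to 1 within `epsilon`.
function weightsAreValid(
  weights: Record<string, number>,
  epsilon = 1e-6,
): boolean {
  const values = Object.values(weights);
  if (values.length === 0) return false;
  if (!values.every((w) => Number.isFinite(w) && w >= 0)) return false;
  const sum = values.reduce((acc, w) => acc + w, 0);
  // Compare with a tolerance: 0.1 + 0.2 + 0.7 is not exactly 1 in floating point.
  return Math.abs(sum - 1) <= epsilon;
}

// Hypothetical helper: returns the required keys absent from the actual findings.
function missingFindings(required: string[], actual: string[]): string[] {
  const seen = new Set(actual);
  return required.filter((key) => !seen.has(key));
}
```

In a test, asserting that `missingFindings(...)` returns an empty array gives a failure message that names the missing keys, which is easier to debug than a bare boolean.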
In `@packages/intelligence/src/archetypes/index.ts`:
- Around line 8-12: The ARCHETYPES array is exported and listArchetypes()
currently returns it directly, allowing callers to mutate the internal registry;
change listArchetypes() to return an immutable copy (e.g., a shallow copy or
Object.freeze'd array) instead of ARCHETYPES so external code cannot alter the
module state—update references to ARCHETYPES and the listArchetypes function
accordingly to ensure callers receive a safe, non-mutating view.
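One way the suggestion above could look, as a sketch only: the type, the `Sketch` names, and the ids are placeholders rather than the module's actual exports.

```typescript
// Placeholder type and ids for illustration; the real Archetype type has more fields.
type ArchetypeSketch = { id: string; name: string };

// Frozen so accidental in-module mutation of the array throws or is ignored.
const ARCHETYPES_SKETCH: readonly ArchetypeSketch[] = Object.freeze([
  { id: "software-engineer", name: "Software Engineer" },
  { id: "product-manager", name: "Product Manager" },
  { id: "data-ml-engineer", name: "Data & ML Engineer" },
]);

// Returns a fresh shallow copy: callers may push, sort, or splice their copy
// without changing the registry's length or order.
function listArchetypesSketch(): ArchetypeSketch[] {
  return [...ARCHETYPES_SKETCH];
}
```

The copy is shallow, so the element objects are still shared. If callers must not be able to edit an archetype's fields either, the elements would also need to be frozen or cloned.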
In `@packages/intelligence/src/detect.ts`:
- Around line 4-6: countMatches currently uses plain substring includes which
causes false positives; update it to perform case-insensitive whole-word (or
phrase) matching instead: normalize the input text once (e.g., textLower =
text.toLowerCase()), escape each keyword for regex, build a RegExp using
word-boundaries (\b) around the escaped keyword (or a safe alternative for
multi-word phrases), test globally and count matches (match?.length || 0) per
keyword rather than using text.includes; reference the countMatches function and
add a small helper (e.g., escapeRegExp) to safely escape keyword characters
before constructing the RegExp.
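A sketch of whole-word, case-insensitive counting along the lines described above. It swaps `\b` for lookarounds, which is an assumption on my part: `\b` does not fire after a non-word character, so a keyword such as `c++` would never match with plain word boundaries. The function names are illustrative, not the repo's.

```typescript
// Escape regex metacharacters so a keyword like "c++" is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Counts case-insensitive occurrences of each keyword that are not
// immediately preceded or followed by a word character.
function countMatchesSketch(text: string, keywords: string[]): number {
  let total = 0;
  for (const keyword of keywords) {
    const pattern = new RegExp(
      `(?<!\\w)${escapeRegExp(keyword)}(?!\\w)`,
      "gi",
    );
    total += text.match(pattern)?.length ?? 0;
  }
  return total;
}
```

With substring matching, the keyword `pm` would be found inside "development"; with the lookarounds it is not, while multi-word phrases such as "machine learning" still match as a unit.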
In `@packages/intelligence/src/validators/ats.ts`:
- Around line 2-8: The ATS_RULES includes a "no-graphics" rule but
checkAtsCompatibility never checks for images; update the checkAtsCompatibility
function to enforce the 'no-graphics' rule by scanning the input content for
markdown image syntax (e.g., `![alt](url)`), inline HTML <img ...> tags, and
common image container tags (e.g., <figure>, <picture>) and mark the rule as
failed when any are found; ensure the rule id 'no-graphics' from ATS_RULES is
used when producing the compatibility result and update/add tests to cover
markdown and HTML image cases.
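A possible shape for the image scan requested above, offered as a sketch. The result object and function name are hypothetical; only the rule id 'no-graphics' and the three kinds of markup to detect come from the comment.

```typescript
// Patterns are deliberately non-global so RegExp.test() stays stateless.
const IMAGE_PATTERNS: RegExp[] = [
  /!\[[^\]]*\]\([^)]*\)/, // markdown image: ![alt](url)
  /<img\b[^>]*>/i, // inline HTML <img ...>
  /<(figure|picture)\b[^>]*>/i, // common image container tags
];

// Hypothetical result shape; the real checkAtsCompatibility output may differ.
type RuleResultSketch = { ruleId: "no-graphics"; passed: boolean };

function checkNoGraphicsSketch(content: string): RuleResultSketch {
  const found = IMAGE_PATTERNS.some((pattern) => pattern.test(content));
  return { ruleId: "no-graphics", passed: !found };
}
```

A regular markdown link such as `[site](https://example.com)` does not trip the first pattern because it lacks the leading `!`.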
---
Nitpick comments:
In `@CLAUDE.md`:
- Line 3: The phrase "open source resume evaluator" in CLAUDE.md should be
hyphenated when used as a compound adjective; update the text to "open-source
resume evaluator" (and scan for any other occurrences of "open source" used as a
modifier) to conform to compound-adjective style.
In `@packages/prompts/src/index.ts`:
- Around line 50-58: The template replacement chain returned from
loadTemplate("score") mixes .replace() and .replaceAll(), so change the
.replace("{{WEIGHTS}}", weights) call to .replaceAll("{{WEIGHTS}}", weights) to
be consistent with the other calls (e.g., the existing .replaceAll for
"{{ARCHETYPE_NAME}}", "{{ARCHETYPE_ID}}", "{{ARCHETYPE_VERSION}}",
"{{RUBRIC_VERSION}}", "{{RUBRIC}}", and "{{JD_KEYWORDS}}") to avoid subtle bugs
if placeholders are duplicated.
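To make the difference concrete, a small standalone illustration. The template string and placeholder name are invented for the example and are not taken from `score.md`.

```typescript
// Invented template with the same placeholder appearing twice.
const template = "Score as {{NAME}}. Reminder: {{NAME}} weights apply.";

// With a string pattern, replace() substitutes only the first occurrence.
const first = template.replace("{{NAME}}", "Software Engineer");

// replaceAll() substitutes every occurrence.
const all = template.replaceAll("{{NAME}}", "Software Engineer");

// Simple guard a renderer could run to catch leftover placeholders.
function hasUnresolvedPlaceholder(rendered: string): boolean {
  return /\{\{[A-Z_]+\}\}/.test(rendered);
}

console.log(hasUnresolvedPlaceholder(first)); // true: second {{NAME}} remains
console.log(hasUnresolvedPlaceholder(all)); // false
```

While each placeholder appears once, both calls behave the same, which is why the review treats this as a consistency fix rather than a bug.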
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 202f8977-49bb-458b-9bba-5f13b4fad38e
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml, !**/pnpm-lock.yaml
📒 Files selected for processing (64)
.claude/commands/evaluate-cv.md
.claude/commands/setup-profile.md
.claude/settings.json
.claude/skills/cv-evaluation/SKILL.md
.claude/skills/cv-evaluation/archetypes.md
.claude/skills/cv-evaluation/claim-validation.md
.claude/skills/cv-evaluation/rubric.md
.claude/skills/cv-evaluation/scoring.md
.claude/welcome.sh
.github/workflows/ci.yml
CLAUDE.md
apps/cli/README.md
apps/cli/package.json
package.json
packages/eval/README.md
packages/eval/fixtures/data-ml-strong/expected.json
packages/eval/fixtures/data-ml-strong/resume.md
packages/eval/fixtures/pm-strong/expected.json
packages/eval/fixtures/pm-strong/resume.md
packages/eval/fixtures/swe-strong/expected.json
packages/eval/fixtures/swe-strong/resume.md
packages/eval/fixtures/swe-weak-table/expected.json
packages/eval/fixtures/swe-weak-table/resume.md
packages/eval/fixtures/swe-with-jd/expected.json
packages/eval/fixtures/swe-with-jd/jd.md
packages/eval/fixtures/swe-with-jd/resume.md
packages/eval/package.json
packages/eval/src/__tests__/fixtures.test.ts
packages/eval/tsconfig.json
packages/eval/vitest.config.ts
packages/intelligence/README.md
packages/intelligence/package.json
packages/intelligence/src/__tests__/intelligence.test.ts
packages/intelligence/src/archetypes/data-ml-engineer.ts
packages/intelligence/src/archetypes/index.ts
packages/intelligence/src/archetypes/product-manager.ts
packages/intelligence/src/archetypes/software-engineer.ts
packages/intelligence/src/detect.ts
packages/intelligence/src/index.ts
packages/intelligence/src/rubric.ts
packages/intelligence/src/validators/ats.ts
packages/intelligence/src/validators/claims.ts
packages/intelligence/tsconfig.json
packages/intelligence/vitest.config.ts
packages/prompts/README.md
packages/prompts/package.json
packages/prompts/prompts/extract.md
packages/prompts/prompts/score.md
packages/prompts/prompts/validate-claims.md
packages/prompts/src/__tests__/prompts.test.ts
packages/prompts/src/index.ts
packages/prompts/tsconfig.json
packages/prompts/vitest.config.ts
packages/schemas/README.md
packages/schemas/package.json
packages/schemas/src/__tests__/schemas.test.ts
packages/schemas/src/archetype.ts
packages/schemas/src/evaluation.ts
packages/schemas/src/index.ts
packages/schemas/src/job-description.ts
packages/schemas/src/resume.ts
packages/schemas/tsconfig.json
packages/schemas/vitest.config.ts
turbo.json
@coderabbitai review
✅ Action performed: Review finished.
…to feature/intelligence-rubric-archetypes
…etypes' into feature/prompts-pack
…le readme example
…re/cli-power-user-pack
…o feature/eval-harness # Conflicts: # package.json
What does this PR do?
Adds packages/eval: golden fixtures that guard the deterministic intelligence layer, run via pnpm eval and in CI.
- Each fixture asserts that the detected archetype and the ATS read match expected.json, and that the resume parses against the schema.
- Adds an eval turbo task, a root pnpm eval script, and a new CI step.
Scoring quality (the LLM EvalResult) is out of scope here; that needs a provider adapter, a separate surface.
Related issue
Closes #66
Type of change
Checklist
pnpm eval runs 5 fixtures green in CI
Note for reviewers
Stacked on #65 → #64 → #63 → #47.
Summary by CodeRabbit
New Features
Documentation
Chores / Tests