test(eval): add golden fixtures and pnpm eval harness #72

Open
amirbahador-hub wants to merge 21 commits into main from feature/eval-harness

Conversation

@amirbahador-hub

@amirbahador-hub amirbahador-hub commented Jun 7, 2026

Collaborator

What does this PR do?

Adds packages/eval — golden fixtures that guard the deterministic intelligence
layer, run via pnpm eval and in CI.

  • Five fixtures across the three archetypes (each with `resume.md`, an
    optional `jd.md`, and `expected.json`), including an ATS-hostile resume.
  • For each fixture the harness asserts — with no LLM — that the detected
    archetype and the ATS read match expected.json, and that the resume parses
    against the schema.
  • Wired as a first-class eval turbo task, a root pnpm eval script, and a new
    CI step.
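
As a rough illustration of the per-fixture assertions described above, the following minimal sketch compares a detected archetype and an ATS read against an expected record. The `Fixture` and `Expected` shapes, the `checkFixture` helper, and the toy detector and ATS check are all hypothetical stand-ins; the real harness runs under Vitest against the functions in packages/intelligence, and its `expected.json` layout may differ.

```typescript
// Hypothetical shapes; the real ones live in packages/eval and packages/intelligence.
type ArchetypeId = "software-engineer" | "product-manager" | "data-ml-engineer";

interface Expected {
  archetype: ArchetypeId;
  atsCompatible: boolean;
}

interface Fixture {
  name: string;
  resume: string; // contents of resume.md
  jd?: string; // contents of the optional jd.md
  expected: Expected; // parsed expected.json (assumed layout)
}

// Assumed signatures for the deterministic functions under test.
type Detect = (resume: string, jd?: string) => ArchetypeId;
type AtsCheck = (resume: string) => boolean;

// Returns human-readable mismatches; an empty list means the fixture is green.
function checkFixture(f: Fixture, detect: Detect, ats: AtsCheck): string[] {
  const failures: string[] = [];
  const archetype = detect(f.resume, f.jd);
  if (archetype !== f.expected.archetype) {
    failures.push(`${f.name}: archetype ${archetype} !== ${f.expected.archetype}`);
  }
  const compatible = ats(f.resume);
  if (compatible !== f.expected.atsCompatible) {
    failures.push(`${f.name}: atsCompatible ${compatible} !== ${f.expected.atsCompatible}`);
  }
  return failures;
}

// Toy implementations, only to make the sketch runnable without an LLM.
const toyDetect: Detect = (resume) =>
  /\broadmap\b/i.test(resume) ? "product-manager" : "software-engineer";
const toyAts: AtsCheck = (resume) => !resume.includes("|"); // assumption: tables are ATS-hostile

const fixtures: Fixture[] = [
  {
    name: "swe-strong",
    resume: "Built APIs in TypeScript.",
    expected: { archetype: "software-engineer", atsCompatible: true },
  },
  {
    name: "swe-weak-table",
    resume: "| Skill | Years |\n| TS | 5 |",
    expected: { archetype: "software-engineer", atsCompatible: false },
  },
];

const failures = fixtures.flatMap((f) => checkFixture(f, toyDetect, toyAts));
console.log(failures.length === 0 ? "all fixtures green" : failures.join("\n"));
```

Editing an `expected` value so it no longer matches produces a non-empty failure list, which is the regression-guard behavior the checklist below refers to.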

Scoring quality (the LLM EvalResult) is out of scope here — that needs a
provider adapter, a separate surface.

Related issue

Closes #66

Type of change

  • New feature / test infrastructure (non-breaking)

Checklist

  • pnpm eval runs 5 fixtures green in CI
  • Regression guard verified: breaking a fixture fails the harness
  • Lint/test/build all green

Note for reviewers

Stacked on #65, #64, #63, #47.

Summary by CodeRabbit

  • New Features

    • Added /evaluate-cv and /setup-profile workflows for fully local resume evaluation (0–5 overall score, per-dimension breakdown, ATS compatibility, unsupported-claim detection, ≥3 quoted issues with fixes) and an in-session CLI menu.
  • Documentation

    • New guides, READMEs, and prompt docs detailing the CV Builder flow, expected outputs, privacy-first local usage, and power-user quickstarts.
  • Chores / Tests

    • Added local eval harness, deterministic fixtures, and CI step to run eval checks.

AmirBahador Bahadori and others added 6 commits June 7, 2026 18:22
Shared, validated contract for CV Builder surfaces: Resume, JobDescription,
Archetype, Issue, Claim, and EvalResult (with required rubric/archetype
versions).

Closes #47

Scoring brain the prompts reference: rubric v1 (six weighted dimensions
with 0-5 anchors), three role archetypes (Software Engineer, Product
Manager, Data & ML Engineer), keyword-based detectArchetype, and the ATS
and claim validator specs.

Closes #63

The three Phase 1 prompts a power user's agent runs. Markdown templates are
the source of truth; renderers inject the live rubric, archetype weights, and
claim rules so each assembled prompt is self-contained and stays in sync with
the intelligence package. Every prompt asks for JSON matching its schema.

Closes #64

Clone the repo, open Claude Code, run /evaluate-cv: the cv-evaluation skill
runs extract -> detect -> score -> validate-claims locally, reading the prompts
and rubric straight from the repo (no build). A SessionStart hook greets the
user with the available commands on open. Skill subfiles point at the package
sources to avoid drift.

Closes #65

Five fixtures across the three archetypes (including an ATS-hostile resume)
guard the deterministic layer: for each, the detected archetype and ATS read
must match expectations and the resume must parse against the schema. No LLM
involved. Wired as a pnpm eval task and a CI step.

Closes #66

@amirbahador-hub

Collaborator Author

@coderabbitai

@amirbahador-hub

Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8b75dee0-9623-4c30-8bb3-63288b1a430d

📥 Commits

Reviewing files that changed from the base of the PR and between 46ffd70 and 3329bb3.

📒 Files selected for processing (1)
  • .claude/skills/cv-evaluation/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .claude/skills/cv-evaluation/SKILL.md

📝 Walkthrough

Adds Phase 1 local CV evaluation: Zod schemas, intelligence (rubric, archetypes, detectors, validators), prompt templates and renderers, eval golden fixtures and tests, Claude Code skill/commands and docs, and CI/turbo wiring to run deterministic evals.

Changes

CV Builder Phase 1: Schemas, Intelligence, Prompts, Eval, and CLI

| Layer / File(s) | Summary |
| --- | --- |
| **Schema Foundation: Zod Validation & Types**<br>`packages/schemas/README.md`, `packages/schemas/package.json`, `packages/schemas/tsconfig.json`, `packages/schemas/vitest.config.ts`, `packages/schemas/src/archetype.ts`, `packages/schemas/src/evaluation.ts`, `packages/schemas/src/job-description.ts`, `packages/schemas/src/resume.ts`, `packages/schemas/src/index.ts`, `packages/schemas/src/__tests__/schemas.test.ts` | Zod schemas and tests for Resume, JobDescription, Archetype, EvaluationDimension, Issue, Claim, and EvalResult; barrel exports and README document Phase 1 validation guarantees. |
| **Intelligence Layer: Rubric & Archetypes**<br>`packages/intelligence/README.md`, `packages/intelligence/package.json`, `packages/intelligence/tsconfig.json`, `packages/intelligence/vitest.config.ts`, `packages/intelligence/src/rubric.ts`, `packages/intelligence/src/archetypes/*`, `packages/intelligence/src/__tests__/intelligence.test.ts` | Rubric v1 and three archetypes (software-engineer, product-manager, data-ml-engineer) with evaluation weights and metadata; tests verify weight coverage and schema conformance. |
| **Intelligence Layer: Archetype Detection & Validators**<br>`packages/intelligence/src/detect.ts`, `packages/intelligence/src/validators/ats.ts`, `packages/intelligence/src/validators/claims.ts`, `packages/intelligence/src/index.ts` | Keyword-counting archetype detection (whole-word aware) with fallback to Software Engineer; deterministic ATS compatibility checks and claim-rule definitions exported from package root; unit tests cover detection and ATS heuristics. |
| **Prompts Package: Templates & Rendering**<br>`packages/prompts/README.md`, `packages/prompts/package.json`, `packages/prompts/tsconfig.json`, `packages/prompts/vitest.config.ts`, `packages/prompts/prompts/extract.md`, `packages/prompts/prompts/score.md`, `packages/prompts/prompts/validate-claims.md`, `packages/prompts/src/index.ts`, `packages/prompts/src/__tests__/prompts.test.ts` | Phase 1 prompt templates for extract, score, and validate-claims plus renderers that inject archetype weights, rubric dimensions, and claim rules; tests ensure placeholders are rendered and schema markers are present. |
| **Eval Package: Golden Fixtures & Deterministic Tests**<br>`packages/eval/README.md`, `packages/eval/package.json`, `packages/eval/tsconfig.json`, `packages/eval/vitest.config.ts`, `packages/eval/src/__tests__/fixtures.test.ts`, `packages/eval/fixtures/{swe-strong,swe-weak-table,swe-with-jd,pm-strong,data-ml-strong}/*` | Vitest harness and five golden fixtures validating archetype detection, ATS compatibility, and Resume schema parsing without LLMs; `pnpm eval` runs this harness. |
| **Claude Code Skill: Evaluate-CV & Setup-Profile Commands**<br>`.claude/welcome.sh`, `.claude/settings.json`, `.claude/commands/evaluate-cv.md`, `.claude/commands/setup-profile.md`, `.claude/skills/cv-evaluation/SKILL.md`, `.claude/skills/cv-evaluation/rubric.md`, `.claude/skills/cv-evaluation/archetypes.md`, `.claude/skills/cv-evaluation/scoring.md`, `.claude/skills/cv-evaluation/claim-validation.md` | Skill and command docs implement /evaluate-cv (extract → detect → score → validate-claims locally) and the /setup-profile helper; welcome script emits a session-start menu; configs declare local-only behavior. |
| **Power-User Documentation: CLI Guide & Quickstart**<br>`apps/cli/package.json`, `apps/cli/README.md`, `CLAUDE.md` | CLI README and CLAUDE.md document how to run local evaluation, optional `--jd`, expected outputs (0–5 score, per-dimension breakdown, quoted issues with fixes, unsupported claims, ATS feedback), and privacy guarantees. |
| **Build Infrastructure: Turbo, CI, and npm Scripts**<br>`package.json`, `turbo.json`, `.github/workflows/ci.yml` | Adds a root `scripts.eval` that runs `turbo eval`, makes the turbo `eval` task depend on `^build`, and inserts `pnpm eval` between `pnpm test` and `pnpm build` in CI to run the deterministic fixture checks. |
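
The detection row above describes keyword counting that is whole-word aware and falls back to Software Engineer. A minimal sketch of that idea follows; the `Archetype` shape, the keyword lists, and the tie-breaking rule (first archetype wins) are assumptions for illustration, not the contents of `packages/intelligence/src/detect.ts`.

```typescript
// Hypothetical archetype shape and keyword lists, for illustration only.
interface Archetype {
  id: string;
  keywords: string[];
}

const SOFTWARE_ENGINEER: Archetype = {
  id: "software-engineer",
  keywords: ["typescript", "api", "java"],
};

const ARCHETYPES: Archetype[] = [
  SOFTWARE_ENGINEER,
  { id: "product-manager", keywords: ["roadmap", "stakeholder", "product strategy"] },
  { id: "data-ml-engineer", keywords: ["pytorch", "feature store", "etl"] },
];

// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(keyword: string): string {
  return keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count case-insensitive whole-word (or whole-phrase) occurrences of each keyword.
function countMatches(text: string, keywords: string[]): number {
  let total = 0;
  for (const keyword of keywords) {
    const pattern = new RegExp(`\\b${escapeRegExp(keyword)}\\b`, "gi");
    total += text.match(pattern)?.length ?? 0;
  }
  return total;
}

// Pick the archetype with the most matches; fall back to Software Engineer on zero.
function detectArchetype(text: string): Archetype {
  let best = SOFTWARE_ENGINEER;
  let bestScore = 0;
  for (const archetype of ARCHETYPES) {
    const score = countMatches(text, archetype.keywords);
    if (score > bestScore) {
      best = archetype;
      bestScore = score;
    }
  }
  return best;
}

console.log(countMatches("JavaScript developer", ["java"])); // 0
console.log(detectArchetype("Owned the roadmap and ran stakeholder reviews").id); // product-manager
console.log(detectArchetype("Keen gardener").id); // software-engineer
```

One caveat with this approach: `\b` only marks a boundary next to a word character, so keywords that start or end with punctuation (for example `C++` or `.NET`) would need a different boundary rule.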

Sequence Diagram(s)

sequenceDiagram
  participant UserCLI
  participant PromptRenderer
  participant ExtractPrompt
  participant ScorePrompt
  participant ValidateClaimsPrompt
  participant EvalHarness

  UserCLI->>PromptRenderer: request renderExtractPrompt()
  PromptRenderer->>ExtractPrompt: load template
  UserCLI->>PromptRenderer: request renderScorePrompt(archetype, jdKeywords?)
  PromptRenderer->>ScorePrompt: inject RUBRIC, WEIGHTS, JD_KEYWORDS
  UserCLI->>PromptRenderer: request renderValidateClaimsPrompt()
  PromptRenderer->>ValidateClaimsPrompt: inject CLAIM_RULES
  EvalHarness->>ScorePrompt: run scoring prompt (local LLM/adapter)
  EvalHarness->>ValidateClaimsPrompt: run claims validation (local LLM/adapter)
  EvalHarness-->>UserCLI: validated EvalResult JSON

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • #15 — Archetype detection overlap: adds detectArchetype and archetypes; related.
  • #65 — Power-user pack and eval harness in CI: PR implements pnpm eval, fixtures, and CI step.
  • #64 — Prompt pack and renderers: PR adds extract/score/validate-claims templates and render functions.
  • #67 — Power-user quickstart docs: PR adds CLI README and CLAUDE.md quickstart.
  • #40 — Add data-ML archetype: PR adds data-ml-engineer archetype and registers it.

"I hopped through code and left a trail,
Schemas snug in a carrot-mail,
Archetypes baked and prompts in tune,
Fixtures glitter like the moon,
Local, safe—a rabbit's cheerful tale."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main change: adding golden fixtures and a pnpm eval harness for testing purposes. |
| Linked Issues check | ✅ Passed | The PR fully satisfies issue #66 requirements: adds packages/eval with Vitest harness, creates 5+ fixtures spanning 3+ archetypes, implements deterministic assertions without LLM calls, includes pnpm eval script and CI integration, and validates archetype detection and schema parsing against expected results. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue #66 scope: the eval harness, golden fixtures, Turbo task configuration, CI integration, and supporting infrastructure (schemas, intelligence layer, prompts) are all necessary dependencies for the stated objectives. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (2)
packages/prompts/src/index.ts (1)

50-58: ⚡ Quick win

Use .replaceAll() consistently across all template placeholder replacements.

Line 55 uses .replace() for {{WEIGHTS}}, while lines 51–54 and 56–57 use .replaceAll(). Functionally equivalent (each placeholder appears once), but standardizing to .replaceAll() improves readability and prevents bugs if placeholders are ever duplicated in the template.

♻️ Proposed fix: standardize to `.replaceAll()`
  return loadTemplate("score")
    .replaceAll("{{ARCHETYPE_NAME}}", archetype.name)
    .replaceAll("{{ARCHETYPE_ID}}", archetype.id)
    .replaceAll("{{ARCHETYPE_VERSION}}", archetype.version)
    .replaceAll("{{RUBRIC_VERSION}}", RUBRIC_VERSION)
-   .replace("{{WEIGHTS}}", weights)
+   .replaceAll("{{WEIGHTS}}", weights)
    .replaceAll("{{RUBRIC}}", rubric)
    .replaceAll("{{JD_KEYWORDS}}", keywords);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/prompts/src/index.ts` around lines 50 - 58, The template replacement
chain returned from loadTemplate("score") mixes .replace() and .replaceAll(), so
change the .replace("{{WEIGHTS}}", weights) call to .replaceAll("{{WEIGHTS}}",
weights) to be consistent with the other calls (e.g., the existing .replaceAll
for "{{ARCHETYPE_NAME}}", "{{ARCHETYPE_ID}}", "{{ARCHETYPE_VERSION}}",
"{{RUBRIC_VERSION}}", "{{RUBRIC}}", and "{{JD_KEYWORDS}}") to avoid subtle bugs
if placeholders are duplicated.
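
To make the "subtle bugs if placeholders are duplicated" point concrete, here is a small sketch. With a string pattern, `String.prototype.replace` substitutes only the first occurrence, while `replaceAll` (ES2021) substitutes every one. The template text and the weight value are made up for the example.

```typescript
// Hypothetical template in which the same placeholder appears twice.
const template = "Weights: {{WEIGHTS}}\nRecap: {{WEIGHTS}}";

// With a string pattern, replace() only touches the first occurrence.
const once = template.replace("{{WEIGHTS}}", "impact=0.3");

// replaceAll() substitutes every occurrence.
const all = template.replaceAll("{{WEIGHTS}}", "impact=0.3");

console.log(once.includes("{{WEIGHTS}}")); // true: the second placeholder leaked through
console.log(all.includes("{{WEIGHTS}}")); // false
```

A leaked placeholder would reach the model verbatim inside the assembled prompt, which is why standardizing on `.replaceAll()` is a cheap safeguard even while each placeholder appears only once.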
CLAUDE.md (1)

3-3: 💤 Low value

Consider hyphenating "open source" as a compound adjective.

The LanguageTool linter suggests using a hyphen when joining words that form a compound adjective. In this context, "open-source resume evaluator" would be more formally correct than "open source resume evaluator" when used as a compound modifier before the noun.

✏️ Proposed improvement
-Open source, privacy-first resume evaluator. This repo doubles as a **power-user
+Open-source, privacy-first resume evaluator. This repo doubles as a **power-user
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CLAUDE.md` at line 3, The phrase "open source resume evaluator" in CLAUDE.md
should be hyphenated when used as a compound adjective; update the text to
"open-source resume evaluator" (and scan for any other occurrences of "open
source" used as a modifier) to conform to compound-adjective style.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/cv-evaluation/SKILL.md:
- Line 8: Update the description line that currently reads "Run a resume through
the same three-step evaluation" to "Run a resume through the same four-step
evaluation" so it matches the Pipeline section which lists Extract, Detect,
Score, and Validate Claims; locate the phrase in SKILL.md (the sentence
beginning "Run a resume through the same three-step evaluation") and change
"three-step" to "four-step."

In `@apps/cli/README.md`:
- Around line 21-23: The README code fences for the CLI examples are missing
language specifiers (MD040); update the backtick fences around the commands
shown (the examples at the blocks containing "/evaluate-cv ./my-resume.pdf" and
the other block around lines 27-29) to include a language tag such as "bash"
(e.g., change ``` to ```bash) so the code blocks are properly declared for
markdownlint and syntax highlighting.

In `@packages/eval/src/__tests__/fixtures.test.ts`:
- Around line 43-69: Add deterministic acceptance assertions to each fixture
test: for each fixture f in fixtures, assert that f.expected.rubricWeights
exists and is valid (e.g., keys present and numeric weights sum to 1 within a
small epsilon) and that f.expected.requiredFindings is present and every key
from that array appears in the parsed findings for the resume; locate the test
loop over fixtures and add checks after the "parses as a Resume" assertion using
the existing symbols fixtures, f.expected.rubricWeights,
f.expected.requiredFindings, ResumeSchema (use the parsed result), and any
existing parsing function that yields findings to verify required keys and
numeric weight validity deterministically.

In `@packages/intelligence/src/archetypes/index.ts`:
- Around line 8-12: The ARCHETYPES array is exported and listArchetypes()
currently returns it directly, allowing callers to mutate the internal registry;
change listArchetypes() to return an immutable copy (e.g., a shallow copy or
Object.freeze'd array) instead of ARCHETYPES so external code cannot alter the
module state—update references to ARCHETYPES and the listArchetypes function
accordingly to ensure callers receive a safe, non-mutating view.

In `@packages/intelligence/src/detect.ts`:
- Around line 4-6: countMatches currently uses plain substring includes which
causes false positives; update it to perform case-insensitive whole-word (or
phrase) matching instead: normalize the input text once (e.g., textLower =
text.toLowerCase()), escape each keyword for regex, build a RegExp using
word-boundaries (\b) around the escaped keyword (or a safe alternative for
multi-word phrases), test globally and count matches (match?.length || 0) per
keyword rather than using text.includes; reference the countMatches function and
add a small helper (e.g., escapeRegExp) to safely escape keyword characters
before constructing the RegExp.

In `@packages/intelligence/src/validators/ats.ts`:
- Around line 2-8: The ATS_RULES includes a "no-graphics" rule but
checkAtsCompatibility never checks for images; update the checkAtsCompatibility
function to enforce the 'no-graphics' rule by scanning the input content for
markdown image syntax (e.g., ![alt](url)), inline HTML <img ...> tags, and
common image container tags (e.g., <figure>, <picture>) and mark the rule as
failed when any are found; ensure the rule id 'no-graphics' from ATS_RULES is
used when producing the compatibility result and update/add tests to cover
markdown and HTML image cases.
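
For the `no-graphics` item above, a check along these lines might work. This is a sketch under assumptions: the `AtsFinding` shape and the `checkNoGraphics` name are hypothetical, and the real `checkAtsCompatibility` may report results differently.

```typescript
// Hypothetical result shape; the real ATS validator may report findings differently.
interface AtsFinding {
  ruleId: string;
  passed: boolean;
}

// Patterns are deliberately non-global so RegExp.test() carries no state between calls.
const GRAPHIC_PATTERNS: RegExp[] = [
  /!\[[^\]]*\]\([^)]*\)/, // markdown image: ![alt](url)
  /<img\b[^>]*>/i, // inline HTML image tag
  /<(figure|picture)\b/i, // common image container tags
];

// Fails the 'no-graphics' rule when any image syntax is present in the content.
function checkNoGraphics(content: string): AtsFinding {
  const hasGraphics = GRAPHIC_PATTERNS.some((pattern) => pattern.test(content));
  return { ruleId: "no-graphics", passed: !hasGraphics };
}

console.log(checkNoGraphics("![logo](logo.png)").passed); // false
console.log(checkNoGraphics("See my [portfolio](https://example.com)").passed); // true
```

Ordinary markdown links pass because the image pattern requires the leading `!`.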

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 202f8977-49bb-458b-9bba-5f13b4fad38e

📥 Commits

Reviewing files that changed from the base of the PR and between facea8e and a8661ec.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (64)
  • .claude/commands/evaluate-cv.md
  • .claude/commands/setup-profile.md
  • .claude/settings.json
  • .claude/skills/cv-evaluation/SKILL.md
  • .claude/skills/cv-evaluation/archetypes.md
  • .claude/skills/cv-evaluation/claim-validation.md
  • .claude/skills/cv-evaluation/rubric.md
  • .claude/skills/cv-evaluation/scoring.md
  • .claude/welcome.sh
  • .github/workflows/ci.yml
  • CLAUDE.md
  • apps/cli/README.md
  • apps/cli/package.json
  • package.json
  • packages/eval/README.md
  • packages/eval/fixtures/data-ml-strong/expected.json
  • packages/eval/fixtures/data-ml-strong/resume.md
  • packages/eval/fixtures/pm-strong/expected.json
  • packages/eval/fixtures/pm-strong/resume.md
  • packages/eval/fixtures/swe-strong/expected.json
  • packages/eval/fixtures/swe-strong/resume.md
  • packages/eval/fixtures/swe-weak-table/expected.json
  • packages/eval/fixtures/swe-weak-table/resume.md
  • packages/eval/fixtures/swe-with-jd/expected.json
  • packages/eval/fixtures/swe-with-jd/jd.md
  • packages/eval/fixtures/swe-with-jd/resume.md
  • packages/eval/package.json
  • packages/eval/src/__tests__/fixtures.test.ts
  • packages/eval/tsconfig.json
  • packages/eval/vitest.config.ts
  • packages/intelligence/README.md
  • packages/intelligence/package.json
  • packages/intelligence/src/__tests__/intelligence.test.ts
  • packages/intelligence/src/archetypes/data-ml-engineer.ts
  • packages/intelligence/src/archetypes/index.ts
  • packages/intelligence/src/archetypes/product-manager.ts
  • packages/intelligence/src/archetypes/software-engineer.ts
  • packages/intelligence/src/detect.ts
  • packages/intelligence/src/index.ts
  • packages/intelligence/src/rubric.ts
  • packages/intelligence/src/validators/ats.ts
  • packages/intelligence/src/validators/claims.ts
  • packages/intelligence/tsconfig.json
  • packages/intelligence/vitest.config.ts
  • packages/prompts/README.md
  • packages/prompts/package.json
  • packages/prompts/prompts/extract.md
  • packages/prompts/prompts/score.md
  • packages/prompts/prompts/validate-claims.md
  • packages/prompts/src/__tests__/prompts.test.ts
  • packages/prompts/src/index.ts
  • packages/prompts/tsconfig.json
  • packages/prompts/vitest.config.ts
  • packages/schemas/README.md
  • packages/schemas/package.json
  • packages/schemas/src/__tests__/schemas.test.ts
  • packages/schemas/src/archetype.ts
  • packages/schemas/src/evaluation.ts
  • packages/schemas/src/index.ts
  • packages/schemas/src/job-description.ts
  • packages/schemas/src/resume.ts
  • packages/schemas/tsconfig.json
  • packages/schemas/vitest.config.ts
  • turbo.json

Comment thread .claude/skills/cv-evaluation/SKILL.md Outdated
Comment thread apps/cli/README.md Outdated
Comment thread packages/eval/src/__tests__/fixtures.test.ts
Comment thread packages/intelligence/src/archetypes/index.ts Outdated
Comment thread packages/intelligence/src/detect.ts
Comment thread packages/intelligence/src/validators/ats.ts
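
For the `packages/intelligence/src/archetypes/index.ts` thread listed above, one way to keep callers from mutating the registry is to freeze the module-level array and hand out a copy. The sketch below assumes a simplified `Archetype` shape; the ids and names come from this PR, the rest is illustrative.

```typescript
// Simplified, hypothetical archetype shape for illustration.
interface Archetype {
  readonly id: string;
  readonly name: string;
}

// Frozen registry: push/pop/splice on this array throw in strict mode.
const ARCHETYPES: readonly Archetype[] = Object.freeze([
  { id: "software-engineer", name: "Software Engineer" },
  { id: "product-manager", name: "Product Manager" },
  { id: "data-ml-engineer", name: "Data & ML Engineer" },
]);

// Returns a shallow copy, so changes to the returned array never reach the registry.
// The element objects are shared and not deep-frozen; readonly fields guard them at compile time only.
function listArchetypes(): readonly Archetype[] {
  return [...ARCHETYPES];
}

// Even if a caller casts away readonly and mutates its copy, the registry is intact.
const copy = listArchetypes() as Archetype[];
copy.pop();
console.log(copy.length, listArchetypes().length); // 2 3
```

Returning a copy costs an allocation per call, which is negligible for a three-entry registry; returning the frozen array itself would also be safe against array mutation.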
@amirbahador-hub

Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Development

Successfully merging this pull request may close these issues.

Golden fixtures + pnpm eval harness in CI

1 participant