
feat: introduce evaluation feature #518

Open
notgitika wants to merge 8 commits into aws:feature/introduce-eval-support from notgitika:feat/eval-support

Conversation

@notgitika
Contributor

Description

Add evaluator and online eval config support to the CLI as new resource primitives. This includes:

  • New EvaluatorPrimitive and OnlineEvalConfigPrimitive for add/remove lifecycle
  • TUI screens and wizards for creating evaluators and online eval configs
  • Schema definitions for evaluator and online eval config resources
  • Deploy flow updates to handle eval resources
  • Status command updates to display eval resource status
  • Remove flow support for evaluators and online eval configs

Related Issue

Closes #517

Documentation PR

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Other (please describe):

Testing

How have you tested the change?

I manually created a custom evaluator and deployed it successfully

  • I ran npm run test:unit and npm run test:integ
  • I ran npm run typecheck
  • I ran npm run lint
  • If I modified src/assets/, I ran npm run test:update-snapshots and committed the updated snapshots

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.

@notgitika notgitika requested a review from a team March 9, 2026 15:56
@github-actions github-actions bot added the size/xl PR size: XL label Mar 9, 2026
@notgitika
Contributor Author

E2E failures occur because the vended cdk.test.ts includes the evaluators and onlineEvalConfigs fields, but the published @aws/agentcore-l3-cdk-constructs package doesn't have eval support yet. This will resolve once the CDK package is published.

@notgitika
Contributor Author

/strands review

@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Review Summary

I've completed a comprehensive review of this PR. The implementation is well-structured and follows the existing patterns correctly. The main concern is missing unit tests for the new primitives, which should be addressed before merge.


✅ Architecture & Design

Excellent work following the established primitive pattern:

  • ✅ Both EvaluatorPrimitive and OnlineEvalConfigPrimitive properly extend BasePrimitive
  • ✅ Correctly registered in registry.ts alongside existing primitives
  • ✅ Integrated into add/remove/status flows consistently
  • ✅ CLI commands follow the same pattern as other primitives with --json support
  • ✅ TUI screens follow the existing wizard pattern

✅ Schema Design

The schema definitions are well-designed with proper validation:

  • EvaluatorSchema includes type, name, level, description (optional), and config
  • OnlineEvalConfigSchema properly validates agents, evaluators, sampling rate, and enable flag
  • ✅ Cross-validation in AgentCoreProjectSpec checks that online eval configs reference valid agents/evaluators (lines 197-223 in agentcore-project.ts)
  • ✅ Properly handles built-in evaluators with BUILTIN_EVALUATOR_PREFIX check (line 215)
  • ✅ Rating scale validation enforces numerical XOR categorical (lines 38-50 in evaluator.ts)
  • ✅ Name validation patterns match other primitives (alphanumeric + underscore, 48 char max)
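The numerical-XOR-categorical constraint noted above can be sketched as a small standalone check. The types and function name below are illustrative assumptions, not the PR's actual schema code (which uses Zod refinements):

```typescript
// Minimal sketch of a "numerical XOR categorical" rating-scale check.
// The interface shape and function name are hypothetical.
interface RatingScale {
  numerical?: { value: number; definition: string }[];
  categorical?: { label: string; definition: string }[];
}

// Returns an error message when the XOR constraint is violated, else null.
function validateRatingScale(scale: RatingScale): string | null {
  const hasNumerical = scale.numerical !== undefined;
  const hasCategorical = scale.categorical !== undefined;
  // Exactly one of the two variants must be present.
  if (hasNumerical === hasCategorical) {
    return "Provide exactly one of 'numerical' or 'categorical'";
  }
  return null;
}

console.log(validateRatingScale({ numerical: [{ value: 1, definition: "poor" }] })); // null
console.log(validateRatingScale({})); // error message
```

In the PR itself this lives in a Zod `.refine()` on the schema; the standalone function just makes the rule explicit.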

✅ Primitives Implementation

Well-implemented with proper safeguards:

EvaluatorPrimitive

  • ✅ Checks if evaluator is referenced by online eval configs before removal (lines 46-54)
  • ✅ Provides meaningful error messages when removal is blocked
  • ✅ Implements all required BasePrimitive methods
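The removal guard described above can be illustrated in isolation. The interface and helper name here are assumptions; only the error message mirrors the one quoted later in this review:

```typescript
// Sketch of the evaluator-removal guard: block removal while any online
// eval config still references the evaluator. Types are hypothetical.
interface OnlineEvalConfig {
  name: string;
  evaluators: string[];
}

function checkRemovable(
  evaluatorName: string,
  configs: OnlineEvalConfig[],
): { success: boolean; error?: string } {
  const referencing = configs.filter((c) => c.evaluators.includes(evaluatorName));
  if (referencing.length > 0) {
    const configNames = referencing.map((c) => c.name).join(", ");
    return {
      success: false,
      error: `Evaluator "${evaluatorName}" is referenced by online eval config(s): ${configNames}. Remove those references first.`,
    };
  }
  return { success: true };
}
```

The actual primitive performs this check inside remove() against the project spec rather than a plain array.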

OnlineEvalConfigPrimitive

  • ✅ Validates sampling rate range (0.01-100) in CLI command (line 139)
  • ✅ Defaults enableOnCreate to true (line 211)
  • ✅ Properly validates required CLI options for non-interactive mode

✅ Integration

Comprehensive integration across all surfaces:

  • ✅ Added to AddScreen with appropriate descriptions (lines 8-9 in AddScreen.tsx)
  • AddFlow routes to evaluator and online-eval wizards (lines 179-186, 379-402)
  • ✅ ResourceGraph includes distinct icons for evaluator and online-eval resources
  • ✅ Remove hooks implemented (useRemoveEvaluator, useRemoveOnlineEvalConfig)
  • ✅ Status command displays evaluators and online eval configs (lines 155-167 in status/action.ts)
  • ✅ Deploy preflight validation checks for evaluators (line 85 in preflight.ts)

⚠️ Testing (Blocking Issue)

Critical gap: No unit tests for the new primitives.

Missing:

  • ❌ No tests for EvaluatorPrimitive.ts
  • ❌ No tests for OnlineEvalConfigPrimitive.ts
  • ❌ No tests verifying the cross-reference validation (evaluator removal blocked by online eval configs)
  • ❌ No tests for edge cases like invalid sampling rates, duplicate names, or referencing non-existent agents/evaluators

Existing test updates are good:

  • preflight.test.ts updated with new error message
  • cdk.test.ts includes empty evaluators/onlineEvalConfigs arrays
  • ✅ Snapshot test updated

Recommendation:

// Suggested test file: src/cli/primitives/__tests__/EvaluatorPrimitive.test.ts
// Should test:
// - add() with valid config
// - add() with duplicate name
// - remove() when not referenced
// - remove() when referenced by online eval config (should fail)
// - previewRemove() generates correct summary
// - getRemovable() returns all evaluators

// Similar for OnlineEvalConfigPrimitive.test.ts

❓ Questions (Non-blocking)

1. Deployed State Tracking

File: src/cli/commands/status/action.ts (lines 155-167)

Evaluators and online eval configs are hardcoded as 'local-only':

const evaluators: ResourceStatusEntry[] = (project.evaluators ?? []).map(e => ({
  resourceType: 'evaluator',
  name: e.name,
  deploymentState: 'local-only' as ResourceDeploymentState,
  detail: `${e.level} — LLM-as-a-Judge`,
}));

Question: Is deployed state tracking for eval resources coming in a follow-up PR, or is this intentional for the initial release? The comment implies they will always be local-only, which seems inconsistent with other resources.

2. Sampling Rate Range

File: OnlineEvalConfigPrimitive.ts (line 139)

The validation allows 0.01 to 100 for sampling rate:

if (isNaN(samplingRate) || samplingRate < 0.01 || samplingRate > 100)

Question: Is this a percentage (0.01% to 100%)? Does this match the backend expectations? Consider adding a comment or JSDoc to clarify the units.

3. Default Model

File: src/cli/tui/screens/evaluator/types.ts (line 34)

export const DEFAULT_MODEL = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0';

Question: Is this model ID correct and available in all regions? Should it match the default from agent creation, or is the cross-region ID intentional for evaluators?


🔧 Minor Suggestions (Nit-level)

1. Article Consistency

File: OnlineEvalConfigPrimitive.ts (line 26)

override readonly article = 'an';

✅ Correct usage of "an" for "an Online Eval Config"

2. Error Message Clarity

File: EvaluatorPrimitive.ts (line 52)

error: `Evaluator "${evaluatorName}" is referenced by online eval config(s): ${configNames}. Remove those references first.`

Suggestion: Consider being more specific about how to remove references:

error: `Cannot remove evaluator "${evaluatorName}" because it is referenced by online eval config(s): ${configNames}. Remove it from those configs first, or delete the configs.`

3. TUI Consistency

The evaluator wizard includes helpful default templates and placeholder validation (lines 40-72 in evaluator/types.ts). This is excellent UX! Consider documenting these placeholders in user-facing docs.


📝 Documentation

Checklist items unchecked:

  • ⬜ "I have updated the documentation accordingly"
  • ⬜ "I have added an appropriate example to the documentation"

Recommendation: Update AGENTS.md with:

  • Instructions for adding new primitive types (evaluator and online-eval already follow the pattern)
  • The evaluator and online-eval primitives as examples in the registry

🎯 Overall Assessment

Status: Approve with required changes

This is a well-architected addition that follows existing patterns and includes proper validation. The schema design is solid with good cross-reference checking. The main blocker is the missing unit tests for the new primitives.

Before merge:

  1. blocking: Add unit tests for EvaluatorPrimitive and OnlineEvalConfigPrimitive
  2. blocking: Add tests for the cross-reference validation (evaluator removal blocked by online eval config)
  3. suggestion: Clarify the sampling rate units with a comment or JSDoc
  4. suggestion: Address the deployed state tracking question
  5. suggestion: Update AGENTS.md documentation

After addressing the testing gap, this PR will be ready to merge. 🚀

@notgitika notgitika changed the base branch from main to feature/introduce-eval-support March 12, 2026 00:13
@notgitika
Contributor Author

/strands review

@notgitika
Contributor Author

/strands review

@github-actions
Contributor

Review Summary

I've completed a comprehensive review of this PR. The implementation is well-architected and follows existing patterns correctly. The previous review's main concern about missing unit tests has been fully addressed — both EvaluatorPrimitive.test.ts and OnlineEvalConfigPrimitive.test.ts now exist with comprehensive coverage.


✅ Strengths

1. Architecture & Pattern Adherence

  • ✅ Both primitives properly extend BasePrimitive with all required methods implemented
  • ✅ Correctly registered in registry.ts alongside existing primitives
  • ✅ Integrated into add/remove/status flows consistently
  • ✅ CLI commands follow established patterns with --json support
  • ✅ TUI screens follow the existing wizard pattern

2. Schema Design

  • EvaluatorSchema includes proper validation (name regex, rating scale XOR validation)
  • OnlineEvalConfigSchema validates agents, evaluators, sampling rate (0.01-100), enable flag
  • ✅ Cross-validation in AgentCoreProjectSpec checks that online eval configs reference valid agents/evaluators
  • ✅ Properly handles built-in evaluators with Builtin.* prefix check
  • ✅ Rating scale validation enforces numerical XOR categorical (lines 38-50 in evaluator.ts)

3. Test Coverage ✅

The missing tests concern from the previous review is now resolved:

  • EvaluatorPrimitive.test.ts tests add(), remove(), cross-reference blocking
  • OnlineEvalConfigPrimitive.test.ts tests add(), remove(), validation
  • run-eval.test.ts has 900+ lines covering ARN mode, session filtering, batch processing
  • logs-eval.test.ts, pause-resume.test.ts, storage.test.ts all present
  • ✅ AWS SDK wrapper tests: agentcore-control.test.ts, agentcore-evaluate.test.ts
  • ✅ Integration tests updated: preflight.test.ts, status/action.test.ts

4. Security & Robustness

  • ✅ Query sanitization in run-eval.ts (line 318): removes single quotes to prevent injection
  • ✅ Evaluator removal blocked when referenced by online eval configs (EvaluatorPrimitive.ts:46-54)
  • ✅ Error handling throughout with try/catch and getErrorMessage()
  • ✅ Input validation: sampling rate range enforced (OnlineEvalConfigPrimitive.ts:139)

5. Operations Implementation

  • run-eval supports both project mode and ARN mode (bypasses project config)
  • ✅ CloudWatch Logs querying with proper scope filtering (SUPPORTED_SCOPES)
  • ✅ Evaluator level detection (SESSION/TRACE/TOOL_CALL) with API fallback
  • ✅ Batching for targetSpanIds (max 10 per API call)
  • ✅ Aggregate score calculation excludes errored sessions
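The batching behavior noted above (at most 10 targetSpanIds per API call) amounts to chunking the id list before issuing requests. The helper below is a sketch; the name and call shape are assumptions, not the PR's code:

```typescript
// Split a list into batches of at most `size` items, so each API call
// stays within the 10-span limit the review mentions.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const spanIds = Array.from({ length: 23 }, (_, i) => `span-${i}`);
const batches = chunk(spanIds, 10);
// 23 ids -> batches of 10, 10, and 3
console.log(batches.map((b) => b.length)); // [10, 10, 3]
```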

⚠️ Issues to Address

blocking: Missing Documentation Examples

Files: AGENTS.md, user-facing docs

The PR checklist shows:

  • ⬜ "I have added an appropriate example to the documentation to outline the feature, or no new docs are needed"

Required:

  1. Add examples to AGENTS.md showing how to:

    • Add an evaluator: agentcore add eval --name MyEval --level SESSION
    • Add an online eval config: agentcore add online-eval
    • Run on-demand evaluation: agentcore run eval --agent myagent --evaluator Builtin.GoalSuccessRate
    • Pause/resume online eval: agentcore pause online-eval MyConfig
    • View eval logs: agentcore logs eval --agent myagent
  2. Document the new primitive types in AGENTS.md "Current primitives" section:

    - `EvaluatorPrimitive` — evaluator creation/removal with cross-reference validation
    - `OnlineEvalConfigPrimitive` — online eval config creation/removal

suggestion: Clarify Sampling Rate Units

File: OnlineEvalConfigPrimitive.ts (line 139)

The validation allows 0.01 to 100:

if (isNaN(samplingRate) || samplingRate < 0.01 || samplingRate > 100)

Suggestion: Add a JSDoc comment to clarify units:

/**
 * Sampling rate as a percentage.
 * Valid range: 0.01% (1 in 10,000) to 100% (all requests)
 */
if (isNaN(samplingRate) || samplingRate < 0.01 || samplingRate > 100) {

This prevents confusion about whether 100 means "100%" or "10,000%".

suggestion: Deployed State Tracking

File: src/cli/commands/status/action.ts (lines 157-175)

Currently, evaluators and online eval configs show deployment state and are enriched with live status (API calls to getEvaluator and getOnlineEvaluationConfig). This is correct behavior and addresses a concern from the previous review.

The implementation properly:

  • ✅ Shows deployed when found in both local spec and deployed state
  • ✅ Shows local-only when not deployed
  • ✅ Shows pending-removal when deployed but removed from spec
  • ✅ Enriches with live status (lines 266-301)

Suggestion: Consider adding failureReason to the status detail for failed configs:

const detail = entry.detail
  ? `${entry.detail}${statusLabel}${configResult.failureReason ? ` (${configResult.failureReason})` : ''}`
  : statusLabel;

question: Default Model Cross-Region Compatibility

File: src/cli/tui/screens/evaluator/types.ts (line 34)

export const DEFAULT_MODEL = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0';

Question: Is this cross-region model ID available in all regions where AgentCore is deployed? If not, consider using a non-region-specific ID or checking the user's region and providing a region-appropriate default.

nit: Typo in Error Message

File: OnlineEvalConfigPrimitive.ts (line 99)

throw new Error(`At least one agent is required`);

The period is missing at the end. For consistency with other error messages:

throw new Error('At least one agent is required.');

📝 Additional Observations

Well-Handled Edge Cases

  • ✅ Tool call spans require gen_ai.tool.name attribute (not just kind=CLIENT) to avoid false positives
  • ✅ ARN mode validates ARN format before proceeding
  • ✅ Custom evaluators fetch level from API, default to SESSION on failure
  • ✅ CloudWatch query timeout (60s) with polling loop
  • ✅ ResourceNotFoundException handled gracefully in logs/streaming
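The graceful ResourceNotFoundException handling called out above typically looks like the sketch below: a missing log group is treated as "no events yet" rather than a failure. The wrapper name and fetch signature are hypothetical; only the error-name check reflects how AWS SDK v3 surfaces service error codes:

```typescript
// Treat a missing CloudWatch log group as an empty result instead of an error.
// AWS SDK for JavaScript v3 exposes the service error code on `err.name`.
async function safeFetchLogs(fetch: () => Promise<string[]>): Promise<string[]> {
  try {
    return await fetch();
  } catch (err) {
    if (err instanceof Error && err.name === "ResourceNotFoundException") {
      return []; // log group not created yet -> nothing to show
    }
    throw err; // any other failure still propagates
  }
}
```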

Code Quality

  • ✅ Proper use of TypeScript discriminated unions for result types
  • ✅ Consistent error handling pattern: { success: boolean, error?: string }
  • ✅ Schema validation with Zod including custom refinements
  • ✅ Comprehensive mocking strategy in tests

🎯 Overall Assessment

Status: Approve with minor documentation updates

This is a high-quality PR that follows the established patterns and includes proper validation. The test coverage is comprehensive (the main blocker from the previous review has been resolved). The only blocking item is adding documentation examples to AGENTS.md.

Before merge:

  1. blocking: Add examples to AGENTS.md as outlined above
  2. suggestion: Address the sampling rate clarity and other suggestions

After these updates, this PR will be ready to merge. 🚀


Testing Notes

The PR author states:

"I manually created a custom evaluator and deployed it successfully"

This manual testing, combined with the comprehensive unit and integration test suite, gives high confidence in the implementation.

@notgitika notgitika changed the title from "feat: add evals control plane operations" to "feat: introduce evaluation feature" Mar 13, 2026

Labels

size/xl PR size: XL


Development

Successfully merging this pull request may close these issues.

feat: add evals control plane operations
