
feat: introduce evaluation feature #518

Open
notgitika wants to merge 8 commits into aws:feature/introduce-eval-support from notgitika:feat/eval-support

Conversation

@notgitika
Contributor

Description

Add evaluator and online eval config support to the CLI as new resource primitives. This includes:

  • New EvaluatorPrimitive and OnlineEvalConfigPrimitive for add/remove lifecycle
  • TUI screens and wizards for creating evaluators and online eval configs
  • Schema definitions for evaluator and online eval config resources
  • Deploy flow updates to handle eval resources
  • Status command updates to display eval resource status
  • Remove flow support for evaluators and online eval configs

Related Issue

Closes #517

Documentation PR

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Other (please describe):

Testing

How have you tested the change?

I manually created a custom evaluator and deployed it successfully

  • I ran npm run test:unit and npm run test:integ
  • I ran npm run typecheck
  • I ran npm run lint
  • If I modified src/assets/, I ran npm run test:update-snapshots and committed the updated snapshots

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.

@notgitika notgitika requested a review from a team March 9, 2026 15:56
@github-actions github-actions bot added the size/xl PR size: XL label Mar 9, 2026
@notgitika
Contributor Author

E2E failures occur because the vended cdk.test.ts includes the evaluators and onlineEvalConfigs fields, but the published @aws/agentcore-l3-cdk-constructs package doesn't have eval support yet. This will resolve once the CDK package is published.

@notgitika
Contributor Author

/strands review

@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Review Summary

I've completed a comprehensive review of this PR. The implementation is well-structured and follows the existing patterns correctly. The main concern is missing unit tests for the new primitives, which should be addressed before merge.


✅ Architecture & Design

Excellent work following the established primitive pattern:

  • ✅ Both EvaluatorPrimitive and OnlineEvalConfigPrimitive properly extend BasePrimitive
  • ✅ Correctly registered in registry.ts alongside existing primitives
  • ✅ Integrated into add/remove/status flows consistently
  • ✅ CLI commands follow the same pattern as other primitives with --json support
  • ✅ TUI screens follow the existing wizard pattern

✅ Schema Design

The schema definitions are well-designed with proper validation:

  • EvaluatorSchema includes type, name, level, description (optional), and config
  • OnlineEvalConfigSchema properly validates agents, evaluators, sampling rate, and enable flag
  • ✅ Cross-validation in AgentCoreProjectSpec checks that online eval configs reference valid agents/evaluators (lines 197-223 in agentcore-project.ts)
  • ✅ Properly handles built-in evaluators with BUILTIN_EVALUATOR_PREFIX check (line 215)
  • ✅ Rating scale validation enforces numerical XOR categorical (lines 38-50 in evaluator.ts)
  • ✅ Name validation patterns match other primitives (alphanumeric + underscore, 48 char max)
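The numerical-XOR-categorical constraint noted above can be sketched as a small standalone check. The types and function name below are illustrative assumptions, not the PR's actual schema code (which uses Zod refinements):

```typescript
// Minimal sketch of a "numerical XOR categorical" rating-scale check.
// The interface shape and function name are hypothetical.
interface RatingScale {
  numerical?: { value: number; definition: string }[];
  categorical?: { label: string; definition: string }[];
}

// Returns an error message when the XOR constraint is violated, else null.
function validateRatingScale(scale: RatingScale): string | null {
  const hasNumerical = scale.numerical !== undefined;
  const hasCategorical = scale.categorical !== undefined;
  // Exactly one of the two variants must be present.
  if (hasNumerical === hasCategorical) {
    return "Provide exactly one of 'numerical' or 'categorical'";
  }
  return null;
}

console.log(validateRatingScale({ numerical: [{ value: 1, definition: "poor" }] })); // null
console.log(validateRatingScale({})); // error message
```

In the PR itself this lives in a Zod `.refine()` on the schema; the standalone function just makes the rule explicit.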

✅ Primitives Implementation

Well-implemented with proper safeguards:

EvaluatorPrimitive

  • ✅ Checks if evaluator is referenced by online eval configs before removal (lines 46-54)
  • ✅ Provides meaningful error messages when removal is blocked
  • ✅ Implements all required BasePrimitive methods
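The removal guard described above can be illustrated in isolation. The interface and helper name here are assumptions; only the error message mirrors the one quoted later in this review:

```typescript
// Sketch of the evaluator-removal guard: block removal while any online
// eval config still references the evaluator. Types are hypothetical.
interface OnlineEvalConfig {
  name: string;
  evaluators: string[];
}

function checkRemovable(
  evaluatorName: string,
  configs: OnlineEvalConfig[],
): { success: boolean; error?: string } {
  const referencing = configs.filter((c) => c.evaluators.includes(evaluatorName));
  if (referencing.length > 0) {
    const configNames = referencing.map((c) => c.name).join(", ");
    return {
      success: false,
      error: `Evaluator "${evaluatorName}" is referenced by online eval config(s): ${configNames}. Remove those references first.`,
    };
  }
  return { success: true };
}
```

The actual primitive performs this check inside remove() against the project spec rather than a plain array.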

OnlineEvalConfigPrimitive

  • ✅ Validates sampling rate range (0.01-100) in CLI command (line 139)
  • ✅ Defaults enableOnCreate to true (line 211)
  • ✅ Properly validates required CLI options for non-interactive mode

✅ Integration

Comprehensive integration across all surfaces:

  • ✅ Added to AddScreen with appropriate descriptions (lines 8-9 in AddScreen.tsx)
  • AddFlow routes to evaluator and online-eval wizards (lines 179-186, 379-402)
  • ✅ ResourceGraph includes distinct icons for evaluator and online-eval resources
  • ✅ Remove hooks implemented (useRemoveEvaluator, useRemoveOnlineEvalConfig)
  • ✅ Status command displays evaluators and online eval configs (lines 155-167 in status/action.ts)
  • ✅ Deploy preflight validation checks for evaluators (line 85 in preflight.ts)

⚠️ Testing (Blocking Issue)

Critical gap: No unit tests for the new primitives.

Missing:

  • ❌ No tests for EvaluatorPrimitive.ts
  • ❌ No tests for OnlineEvalConfigPrimitive.ts
  • ❌ No tests verifying the cross-reference validation (evaluator removal blocked by online eval configs)
  • ❌ No tests for edge cases like invalid sampling rates, duplicate names, or referencing non-existent agents/evaluators

Existing test updates are good:

  • preflight.test.ts updated with new error message
  • cdk.test.ts includes empty evaluators/onlineEvalConfigs arrays
  • ✅ Snapshot test updated

Recommendation:

// Suggested test file: src/cli/primitives/__tests__/EvaluatorPrimitive.test.ts
// Should test:
// - add() with valid config
// - add() with duplicate name
// - remove() when not referenced
// - remove() when referenced by online eval config (should fail)
// - previewRemove() generates correct summary
// - getRemovable() returns all evaluators

// Similar for OnlineEvalConfigPrimitive.test.ts

❓ Questions (Non-blocking)

1. Deployed State Tracking

File: src/cli/commands/status/action.ts (lines 155-167)

Evaluators and online eval configs are hardcoded as 'local-only':

const evaluators: ResourceStatusEntry[] = (project.evaluators ?? []).map(e => ({
  resourceType: 'evaluator',
  name: e.name,
  deploymentState: 'local-only' as ResourceDeploymentState,
  detail: `${e.level} — LLM-as-a-Judge`,
}));

Question: Is deployed state tracking for eval resources coming in a follow-up PR, or is this intentional for the initial release? The comment implies they will always be local-only, which seems inconsistent with other resources.

2. Sampling Rate Range

File: OnlineEvalConfigPrimitive.ts (line 139)

The validation allows 0.01 to 100 for sampling rate:

if (isNaN(samplingRate) || samplingRate < 0.01 || samplingRate > 100)

Question: Is this a percentage (0.01% to 100%)? Does this match the backend expectations? Consider adding a comment or JSDoc to clarify the units.

3. Default Model

File: src/cli/tui/screens/evaluator/types.ts (line 34)

export const DEFAULT_MODEL = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0';

Question: Is this model ID correct and available in all regions? Should it match the default from agent creation, or is the cross-region ID intentional for evaluators?


🔧 Minor Suggestions (Nit-level)

1. Article Consistency

File: OnlineEvalConfigPrimitive.ts (line 26)

override readonly article = 'an';

✅ Correct usage of "an" for "an Online Eval Config"

2. Error Message Clarity

File: EvaluatorPrimitive.ts (line 52)

error: `Evaluator "${evaluatorName}" is referenced by online eval config(s): ${configNames}. Remove those references first.`

Suggestion: Consider being more specific about how to remove references:

error: `Cannot remove evaluator "${evaluatorName}" because it is referenced by online eval config(s): ${configNames}. Remove it from those configs first, or delete the configs.`

3. TUI Consistency

The evaluator wizard includes helpful default templates and placeholder validation (lines 40-72 in evaluator/types.ts). This is excellent UX! Consider documenting these placeholders in user-facing docs.


📝 Documentation

Checklist items unchecked:

  • ⬜ "I have updated the documentation accordingly"
  • ⬜ "I have added an appropriate example to the documentation"

Recommendation: Update AGENTS.md with:

  • Instructions for adding new primitive types (evaluator and online-eval already follow the pattern)
  • The evaluator and online-eval primitives as examples in the registry

🎯 Overall Assessment

Status: Approve with required changes

This is a well-architected addition that follows existing patterns and includes proper validation. The schema design is solid with good cross-reference checking. The main blocker is the missing unit tests for the new primitives.

Before merge:

  1. blocking: Add unit tests for EvaluatorPrimitive and OnlineEvalConfigPrimitive
  2. blocking: Add tests for the cross-reference validation (evaluator removal blocked by online eval config)
  3. suggestion: Clarify the sampling rate units with a comment or JSDoc
  4. suggestion: Address the deployed state tracking question
  5. suggestion: Update AGENTS.md documentation

After addressing the testing gap, this PR will be ready to merge. 🚀

@notgitika notgitika changed the base branch from main to feature/introduce-eval-support March 12, 2026 00:13
@notgitika
Contributor Author

/strands review

@notgitika
Contributor Author

/strands review

@github-actions
Contributor

Review Summary

I've completed a comprehensive review of this PR. The implementation is well-architected and follows existing patterns correctly. The previous review's main concern about missing unit tests has been fully addressed — both EvaluatorPrimitive.test.ts and OnlineEvalConfigPrimitive.test.ts now exist with comprehensive coverage.


✅ Strengths

1. Architecture & Pattern Adherence

  • ✅ Both primitives properly extend BasePrimitive with all required methods implemented
  • ✅ Correctly registered in registry.ts alongside existing primitives
  • ✅ Integrated into add/remove/status flows consistently
  • ✅ CLI commands follow established patterns with --json support
  • ✅ TUI screens follow the existing wizard pattern

2. Schema Design

  • EvaluatorSchema includes proper validation (name regex, rating scale XOR validation)
  • OnlineEvalConfigSchema validates agents, evaluators, sampling rate (0.01-100), enable flag
  • ✅ Cross-validation in AgentCoreProjectSpec checks that online eval configs reference valid agents/evaluators
  • ✅ Properly handles built-in evaluators with Builtin.* prefix check
  • ✅ Rating scale validation enforces numerical XOR categorical (lines 38-50 in evaluator.ts)

3. Test Coverage ✅

The missing tests concern from the previous review is now resolved:

  • EvaluatorPrimitive.test.ts tests add(), remove(), cross-reference blocking
  • OnlineEvalConfigPrimitive.test.ts tests add(), remove(), validation
  • run-eval.test.ts has 900+ lines covering ARN mode, session filtering, batch processing
  • logs-eval.test.ts, pause-resume.test.ts, storage.test.ts all present
  • ✅ AWS SDK wrapper tests: agentcore-control.test.ts, agentcore-evaluate.test.ts
  • ✅ Integration tests updated: preflight.test.ts, status/action.test.ts

4. Security & Robustness

  • ✅ Query sanitization in run-eval.ts (line 318): removes single quotes to prevent injection
  • ✅ Evaluator removal blocked when referenced by online eval configs (EvaluatorPrimitive.ts:46-54)
  • ✅ Error handling throughout with try/catch and getErrorMessage()
  • ✅ Input validation: sampling rate range enforced (OnlineEvalConfigPrimitive.ts:139)

5. Operations Implementation

  • run-eval supports both project mode and ARN mode (bypasses project config)
  • ✅ CloudWatch Logs querying with proper scope filtering (SUPPORTED_SCOPES)
  • ✅ Evaluator level detection (SESSION/TRACE/TOOL_CALL) with API fallback
  • ✅ Batching for targetSpanIds (max 10 per API call)
  • ✅ Aggregate score calculation excludes errored sessions
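The batching behavior noted above (at most 10 targetSpanIds per API call) amounts to chunking the id list before issuing requests. The helper below is a sketch; the name and call shape are assumptions, not the PR's code:

```typescript
// Split a list into batches of at most `size` items, so each API call
// stays within the 10-span limit the review mentions.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const spanIds = Array.from({ length: 23 }, (_, i) => `span-${i}`);
const batches = chunk(spanIds, 10);
// 23 ids -> batches of 10, 10, and 3
console.log(batches.map((b) => b.length)); // [10, 10, 3]
```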

⚠️ Issues to Address

blocking: Missing Documentation Examples

Files: AGENTS.md, user-facing docs

The PR checklist shows:

  • ⬜ "I have added an appropriate example to the documentation to outline the feature, or no new docs are needed"

Required:

  1. Add examples to AGENTS.md showing how to:

    • Add an evaluator: agentcore add eval --name MyEval --level SESSION
    • Add an online eval config: agentcore add online-eval
    • Run on-demand evaluation: agentcore run eval --agent myagent --evaluator Builtin.GoalSuccessRate
    • Pause/resume online eval: agentcore pause online-eval MyConfig
    • View eval logs: agentcore logs eval --agent myagent
  2. Document the new primitive types in AGENTS.md "Current primitives" section:

    - `EvaluatorPrimitive` — evaluator creation/removal with cross-reference validation
    - `OnlineEvalConfigPrimitive` — online eval config creation/removal

suggestion: Clarify Sampling Rate Units

File: OnlineEvalConfigPrimitive.ts (line 139)

The validation allows 0.01 to 100:

if (isNaN(samplingRate) || samplingRate < 0.01 || samplingRate > 100)

Suggestion: Add a JSDoc comment to clarify units:

/**
 * Sampling rate as a percentage.
 * Valid range: 0.01% (1 in 10,000) to 100% (all requests)
 */
if (isNaN(samplingRate) || samplingRate < 0.01 || samplingRate > 100) {

This prevents confusion about whether 100 means "100%" or "10,000%".

suggestion: Deployed State Tracking

File: src/cli/commands/status/action.ts (lines 157-175)

Currently, evaluators and online eval configs show deployment state and are enriched with live status (API calls to getEvaluator and getOnlineEvaluationConfig). This is correct behavior and addresses a concern from the previous review.

The implementation properly:

  • ✅ Shows deployed when found in both local spec and deployed state
  • ✅ Shows local-only when not deployed
  • ✅ Shows pending-removal when deployed but removed from spec
  • ✅ Enriches with live status (lines 266-301)

Suggestion: Consider adding failureReason to the status detail for failed configs:

const detail = entry.detail
  ? `${entry.detail}${statusLabel}${configResult.failureReason ? ` (${configResult.failureReason})` : ''}`
  : statusLabel;

question: Default Model Cross-Region Compatibility

File: src/cli/tui/screens/evaluator/types.ts (line 34)

export const DEFAULT_MODEL = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0';

Question: Is this cross-region model ID available in all regions where AgentCore is deployed? If not, consider using a non-region-specific ID or checking the user's region and providing a region-appropriate default.

nit: Typo in Error Message

File: OnlineEvalConfigPrimitive.ts (line 99)

throw new Error(`At least one agent is required`);

The period is missing at the end. For consistency with other error messages:

throw new Error('At least one agent is required.');

📝 Additional Observations

Well-Handled Edge Cases

  • ✅ Tool call spans require gen_ai.tool.name attribute (not just kind=CLIENT) to avoid false positives
  • ✅ ARN mode validates ARN format before proceeding
  • ✅ Custom evaluators fetch level from API, default to SESSION on failure
  • ✅ CloudWatch query timeout (60s) with polling loop
  • ✅ ResourceNotFoundException handled gracefully in logs/streaming
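The graceful ResourceNotFoundException handling called out above typically looks like the sketch below: a missing log group is treated as "no events yet" rather than a failure. The wrapper name and fetch signature are hypothetical; only the error-name check reflects how AWS SDK v3 surfaces service error codes:

```typescript
// Treat a missing CloudWatch log group as an empty result instead of an error.
// AWS SDK for JavaScript v3 exposes the service error code on `err.name`.
async function safeFetchLogs(fetch: () => Promise<string[]>): Promise<string[]> {
  try {
    return await fetch();
  } catch (err) {
    if (err instanceof Error && err.name === "ResourceNotFoundException") {
      return []; // log group not created yet -> nothing to show
    }
    throw err; // any other failure still propagates
  }
}
```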

Code Quality

  • ✅ Proper use of TypeScript discriminated unions for result types
  • ✅ Consistent error handling pattern: { success: boolean, error?: string }
  • ✅ Schema validation with Zod including custom refinements
  • ✅ Comprehensive mocking strategy in tests

🎯 Overall Assessment

Status: Approve with minor documentation updates

This is a high-quality PR that follows the established patterns and includes proper validation. The test coverage is comprehensive (the main blocker from the previous review has been resolved). The only blocking item is adding documentation examples to AGENTS.md.

Before merge:

  1. blocking: Add examples to AGENTS.md as outlined above
  2. suggestion: Address the sampling rate clarity and other suggestions

After these updates, this PR will be ready to merge. 🚀


Testing Notes

The PR author states:

"I manually created a custom evaluator and deployed it successfully"

This manual testing, combined with the comprehensive unit and integration test suite, gives high confidence in the implementation.

@notgitika notgitika changed the title from "feat: add evals control plane operations" to "feat: introduce evaluation feature" Mar 13, 2026

Labels

size/xl PR size: XL


Development

Successfully merging this pull request may close these issues.

feat: add evals control plane operations
