feat: Add generic BaseAdapter framework for third-party evaluator integration (DeepEval + Autoevals)#528
stone-coding wants to merge 10 commits into
Introduces a new integrations/deepeval/ module that adapts AgentCore Lambda evaluation events into DeepEval LLMTestCase objects, runs any BaseMetric, and returns structured score/label/explanation responses.
…leTurnParams deprecation
…d EvaluatorInput support
| @@ -0,0 +1,5 @@ | |||
| """DeepEval integration for AgentCore Evaluation.""" | |||
|
|
|||
| from bedrock_agentcore.evaluation.integrations.deepeval.handler import DeepEvalHandler | |||
Since this is using eval's custom code evaluator, please put this under custom_code_based_evaluators.
| import threading | ||
| from typing import Any, Callable, Dict, Optional | ||
|
|
||
| from deepeval.metrics import BaseMetric |
Please add an integ test for this. Look into tests_integ for examples.
| import threading | ||
| from typing import Any, Callable, Dict, Optional | ||
|
|
||
| from deepeval.metrics import BaseMetric |
Also, let's add this in our pyproject as an optional dependency, so customers know which deepeval version we support.
|
|
||
|
|
||
| @dataclass | ||
| class ParsedEvaluationEvent: |
Please use EvaluatorInput from our code_based_evaluator. No need to duplicate lambda logic.
| Error: {"errorCode": str, "errorMessage": str} | ||
| """ | ||
| try: | ||
| if isinstance(event, EvaluatorInput): |
ParsedEvaluationEvent and EvaluatorInput look like they're doing the same job: both just turn the raw lambda event into a structured input. __call__ even copies one into the other field-for-field. Is there a reason we need a second type instead of reusing EvaluatorInput?
Proposal: make it a requirement that customers place these adapters within the @code_based_evaluators decorator. That way the adapter stops owning input/output validation and the decorator does it instead. Keeps the adapter focused on just running the eval.
| } | ||
|
|
||
|
|
||
| def _get_required_params(metric: BaseMetric) -> List[str]: |
metric.measure() already calls check_llm_test_case_params with the metric's own _required_params and raises MissingTestCaseParamsError.
So we can drop the registry: build the LLMTestCase with whatever fields we have, call measure(), and catch that error.
By doing this, we let customers use GEval too — its required fields aren't fixed on the class, they're whatever the customer passes to evaluation_params at construction, so a static registry can never cover it. Letting the metric validate itself handles that case for free.
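A rough sketch of the "let the metric validate itself" flow, to make the proposal concrete. Everything here is a stand-in so the snippet runs without deepeval installed: StubTestCase plays the role of LLMTestCase, MissingParamsError plays the role of MissingTestCaseParamsError, and the "MISSING_FIELDS" error code is a hypothetical value, not something the PR defines.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


class MissingParamsError(Exception):
    """Stand-in for deepeval's MissingTestCaseParamsError (assumption)."""


@dataclass
class StubTestCase:
    """Stand-in for LLMTestCase: everything except input is optional."""
    input: str
    actual_output: Optional[str] = None
    expected_output: Optional[str] = None
    retrieval_context: Optional[List[str]] = None


class StubMetric:
    """Stand-in metric that checks its own required params inside measure()."""

    def __init__(self, required: List[str]) -> None:
        self.required = required
        self.score: Optional[float] = None
        self.reason: Optional[str] = None

    def measure(self, test_case: StubTestCase) -> float:
        missing = [p for p in self.required if getattr(test_case, p) is None]
        if missing:
            raise MissingParamsError("missing params: " + ", ".join(missing))
        self.score, self.reason = 1.0, "all required params present"
        return self.score


def run_metric(metric: StubMetric, fields: Dict[str, Any]) -> Dict[str, Any]:
    # Build the test case from whatever fields were extracted; no registry lookup.
    test_case = StubTestCase(
        input=fields.get("input", ""),
        actual_output=fields.get("actual_output"),
        expected_output=fields.get("expected_output"),
        retrieval_context=fields.get("retrieval_context"),
    )
    try:
        metric.measure(test_case)
    except MissingParamsError as exc:
        # The metric reported which fields it needs; surface that to the caller.
        return {"errorCode": "MISSING_FIELDS", "errorMessage": str(exc)}
    return {"score": metric.score, "explanation": metric.reason}
```

Because the required list lives on the metric instance rather than in a static table, a metric whose requirements are set at construction time (the GEval case above) is handled by the same path.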
| ) -> Dict[str, Any]: | ||
| """Extract evaluation fields from AgentCore session spans. | ||
|
|
||
| Parses _eval_log_records from span attributes, filters by target_trace_id, |
Which OTel agent semantic convention are you following here? I haven't seen any agent SDK emit _eval_log_records.
| self.validate_fields(fields) | ||
| return fields | ||
|
|
||
| def validate_fields(self, fields: Dict[str, Any]) -> None: |
Can you add @abstractmethod here please? The no-op default means a subclass that forgets to override it silently skips validation, and bad fields fail deeper in execute instead. Both adapters override it anyway, so abstract just makes each one declare its required fields on purpose.
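For illustration, a minimal sketch of what the abstract version buys us, using only the standard library. The class names ForgetfulAdapter and CarefulAdapter are hypothetical, and the BaseAdapter here is a stripped-down stand-in, not the class from this PR.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseAdapter(ABC):
    """Stripped-down stand-in for the PR's BaseAdapter."""

    @abstractmethod
    def validate_fields(self, fields: Dict[str, Any]) -> None:
        """Raise ValueError if a required field is missing."""


class ForgetfulAdapter(BaseAdapter):
    """Forgets to override validate_fields."""


class CarefulAdapter(BaseAdapter):
    def validate_fields(self, fields: Dict[str, Any]) -> None:
        if "input" not in fields:
            raise ValueError("missing required field: input")


# Instantiating the forgetful subclass fails immediately with TypeError,
# instead of silently skipping validation at evaluation time.
try:
    ForgetfulAdapter()
    instantiation_error = None
except TypeError as exc:
    instantiation_error = exc
```

The failure moves from "somewhere inside execute at runtime" to "the moment the adapter is constructed".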
|
|
||
| thread = threading.Thread(target=target, daemon=True) | ||
| thread.start() | ||
| thread.join(timeout=self.timeout) |
When the thread "times out" here, it doesn't actually end; join just returns to the caller while the worker keeps running. So if Lambda reuses the same container, we can have a background thread from a previous invocation still executing during the next one. I've heard this is a real failure case, so let's drop the thread machinery and let the AWS Lambda timeout handle it for us instead.
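A small standard-library demonstration of the behavior described above. The worker here blocks on an Event as a stand-in for a slow metric call; the 0.05s timeout is arbitrary.

```python
import threading

release = threading.Event()


def worker() -> None:
    # Stand-in for a slow metric call: block until explicitly released.
    release.wait()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

# join() with a timeout only stops *waiting*; it does not stop the worker.
thread.join(timeout=0.05)
still_running = thread.is_alive()

# Clean up so the demo does not leave the thread behind.
release.set()
thread.join()
finished = not thread.is_alive()
```

After the timed join, still_running is True: the caller has moved on, but the worker is still executing in the same process.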
|
|
||
| def __init__( | ||
| self, | ||
| field_mapper: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None, |
Can we make extract_fields_from_spans the default value of field_mapper in the constructor? Then we have one extraction path instead of the if-field_mapper-else branch.
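Something along these lines, as a sketch only. The body of extract_fields_from_spans below is a placeholder and its real signature in the PR may differ; the Adapter class and its extract method are hypothetical names used to show the single extraction path.

```python
from typing import Any, Callable, Dict

FieldMapper = Callable[[Dict[str, Any]], Dict[str, Any]]


def extract_fields_from_spans(event: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder for the PR's default extractor (real signature may differ)."""
    return {"input": event.get("input"), "actual_output": event.get("output")}


class Adapter:
    """Hypothetical adapter showing the default-argument approach."""

    def __init__(self, field_mapper: FieldMapper = extract_fields_from_spans) -> None:
        # No Optional and no None check: there is always exactly one mapper.
        self.field_mapper = field_mapper

    def extract(self, event: Dict[str, Any]) -> Dict[str, Any]:
        return self.field_mapper(event)
```

Customers who need custom extraction still pass their own callable; everyone else gets the default without a separate branch.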
Issue P446281164 — Third-Party Evaluator Integration (Phase 1)
Description of changes: Generic BaseAdapter framework that adapts third-party (3P) evaluation libraries into AgentCore-compatible Lambda handlers. Supports DeepEval and Autoevals, and is extensible to future libraries (RAGAS, etc.).
Key components:
- BaseAdapter: shared orchestration. Parse event (supports EvaluatorInput from the @custom_code_based_evaluator() decorator) → extract fields → validate → execute with timeout → error handling
- DeepEvalAdapter: runs any DeepEval BaseMetric. DeepEvalHandler is kept as an alias for backward compatibility.
- AutoevalsAdapter: runs any Autoevals scorer
- Field extraction from _eval_log_records in ADOT spans (input, actual_output, retrieval_context, expected_output)
- Thread-based timeout (default 290s)
- field_mapper escape hatch for custom span extraction
Design decisions:
- Composes with the @custom_code_based_evaluator() decorator: accepts EvaluatorInput directly
- Never raises unhandled exceptions: always returns a valid response dict
- Adding a new library = one ~20 line subclass with an execute() method
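To give a feel for the last two design decisions, here is a minimal sketch, not the code in this PR. SketchBaseAdapter and ExactMatchAdapter are hypothetical names, the exact-match scoring stands in for a real third-party library call, and using the exception class name as errorCode is an assumption. The response keys (score, label, explanation, errorCode, errorMessage) follow the shapes described in the PR.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchBaseAdapter(ABC):
    """Hypothetical base: wraps execute() so callers always get a dict back."""

    def __call__(self, fields: Dict[str, Any]) -> Dict[str, Any]:
        try:
            return self.execute(fields)
        except Exception as exc:
            # Assumption: the exception class name doubles as the error code.
            return {"errorCode": type(exc).__name__, "errorMessage": str(exc)}

    @abstractmethod
    def execute(self, fields: Dict[str, Any]) -> Dict[str, Any]:
        """Run the library-specific evaluation and return a response dict."""


class ExactMatchAdapter(SketchBaseAdapter):
    """Hypothetical new-library adapter: only execute() is needed."""

    def execute(self, fields: Dict[str, Any]) -> Dict[str, Any]:
        matched = fields["actual_output"] == fields["expected_output"]
        return {
            "score": 1.0 if matched else 0.0,
            "label": "match" if matched else "mismatch",
            "explanation": "exact string comparison",
        }
```

Calling ExactMatchAdapter() with a fields dict that lacks actual_output returns an error dict rather than raising, which is the "never raises" behavior in miniature.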