
fix: Remove evaluation metric key from schema which failed on some LLMs#105

Open
jsonbailey wants to merge 6 commits into main from
jb/aic-1897/remove-keys-from-evaluation-structure

Conversation

@jsonbailey (Contributor) commented Mar 11, 2026

fix: Improve metric token collection for Judge evaluations when using LangChain
fix: Include raw response when performing Judge evaluations


Note

Medium Risk
Changes the structured-output contract for Judge evaluations and modifies LangChain structured invocation/metrics extraction, which could affect evaluation parsing and reported token usage across providers.

Overview
Judge structured evaluation output is simplified from a dynamic evaluations[{metricKey}] shape to a fixed evaluation { score, reasoning } schema, and parsing/validation is updated to key results by evaluation_metric_key at runtime (with tests adjusted accordingly).
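A rough sketch of the fixed shape and the runtime keying described above (the dataclass and helper below are hypothetical illustrations, not the SDK's actual types):

```python
from dataclasses import dataclass, asdict

@dataclass
class Evaluation:
    # Fixed schema: just score and reasoning, with no dynamic
    # metric-key field, which some LLMs failed to emit reliably.
    score: float
    reasoning: str

def key_result(evaluation_metric_key: str, evaluation: Evaluation) -> dict:
    # Keying by metric now happens at parse time rather than inside
    # the structured-output schema itself.
    return {evaluation_metric_key: asdict(evaluation)}
```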

LangChain structured invocations now request include_raw=True, propagate the raw model message + token usage into StructuredResponse, treat parsing errors as failures, and improve provider handling by mapping Bedrock bedrock:* to bedrock_converse (including passing the original provider string via parameters when needed) and reading token usage from usage_metadata when available.
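In LangChain, `with_structured_output(..., include_raw=True)` makes the model return a dict with `raw`, `parsed`, and `parsing_error` keys, and providers that report token counts populate `usage_metadata` on the raw message. A sketch of folding such a result into a response object (the output shape here is a hypothetical stand-in for `StructuredResponse`):

```python
def to_structured_response(result: dict) -> dict:
    """Fold a LangChain include_raw=True result into a response dict.

    `result` has keys: "raw" (the AIMessage), "parsed" (the schema
    instance, or None on failure), and "parsing_error" (an exception
    or None). The output shape below is a hypothetical stand-in for
    StructuredResponse, not the SDK's real type.
    """
    raw = result.get("raw")
    # usage_metadata is populated by providers that report token counts.
    usage = getattr(raw, "usage_metadata", None) or {}
    return {
        # Treat parsing errors as failures.
        "success": result.get("parsing_error") is None,
        "data": result.get("parsed"),
        # Propagate the raw model message content alongside the parse.
        "raw": getattr(raw, "content", None),
        "usage": {
            "input": usage.get("input_tokens", 0),
            "output": usage.get("output_tokens", 0),
            "total": usage.get("total_tokens", 0),
        },
    }
```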

Written by Cursor Bugbot for commit 1ed23cf.

@jsonbailey jsonbailey requested a review from a team as a code owner March 11, 2026 22:40

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.



# Bedrock requires the foundation provider (e.g. Bedrock:Anthropic) passed in
# parameters separately from model_provider, which is used for LangChain routing.
if mapped_provider == 'bedrock_converse' and 'provider' not in parameters:
    parameters['provider'] = provider


Bedrock provider parameter passes wrong format to LangChain

High Severity

The provider variable holds the raw LaunchDarkly provider name (e.g., "Bedrock:Anthropic" or "Bedrock"), which gets passed directly as parameters['provider'] to init_chat_model / ChatBedrockConverse. However, ChatBedrockConverse expects the provider parameter to be just the model family name in lowercase (e.g., "anthropic"), not the full LD-formatted name. Passing "Bedrock:Anthropic" will cause incorrect provider inference and likely break Bedrock model initialization.
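One possible fix is to strip the LaunchDarkly prefix and lowercase the family name before passing it on. A sketch only (whether a bare "Bedrock" should omit the parameter entirely is for the maintainers to decide):

```python
from typing import Optional

def bedrock_family(ld_provider: str) -> Optional[str]:
    # "Bedrock:Anthropic" -> "anthropic", the lowercase model-family
    # name ChatBedrockConverse expects. A bare "Bedrock" carries no
    # family, so return None and let LangChain infer it from the
    # model id instead.
    _, sep, family = ld_provider.partition(":")
    return family.lower() if sep else None
```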


usage=TokenUsage(total=0, input=0, output=0),
),
)
return structured_response


Exception handler may return success=True after partial mutation

Low Severity

The except handler returns the shared mutable structured_response without resetting metrics.success. After line 110, get_ai_metrics_from_response replaces the metrics with success=True. If any exception occurs between that point and the explicit returns, the handler returns a response indicating success despite the failure. The previous code defensively created a fresh StructuredResponse with success=False in the handler.
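The defensive pattern the reviewer is describing — building a fresh failure response inside the handler rather than returning the shared, possibly half-mutated object — might look like this (all names here are hypothetical, simplified to plain dicts):

```python
def invoke_structured(call):
    # Hypothetical wrapper illustrating the defensive pattern: on any
    # exception, construct a fresh failure response instead of
    # returning a shared object whose metrics may already have been
    # flipped to success=True before the error occurred.
    try:
        response = call()
        response["metrics"]["success"] = True
        return response
    except Exception:
        # Fresh object: no partially mutated state can leak out.
        return {
            "data": None,
            "metrics": {
                "success": False,
                "usage": {"total": 0, "input": 0, "output": 0},
            },
        }
```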

Additional Locations (1)
