
Fix _TaskNavigationEfficiencyEvaluator rejecting valid inputs from cloud Foundry/ACA runtime #46745

Draft

Copilot wants to merge 3 commits into main from copilot/fix-response-list-validation

Conversation

Copilot AI commented May 6, 2026

The cloud Foundry/ACA evaluation runtime serializes list/object dataMapping fields to JSON-encoded strings before invoking the Python evaluator entry point. As a result, _TaskNavigationEfficiencyEvaluator always fails with "'response' must be a list of messages." even when the dataset contains valid lists.

Changes

  • _task_navigation_efficiency.py

    • Add _maybe_json_decode(value, field_name) static method — transparently decodes JSON-string inputs; raises EvaluationException(category=INVALID_VALUE) on malformed JSON
    • Override _real_call to decode response and ground_truth before validation, making cloud-runtime and in-process call paths equivalent
    • Extend _do_eval to treat a 2-element [list, dict] as equivalent to a native (list, dict) tuple — required because JSON has no tuple type, so ground_truth=(["a"], {...}) round-trips to [["a"], {...}]
  • _task_navigation_efficiency_validator.py

    • Extract shared tuple-validation logic into _validate_tool_names_and_params(tool_names, parameters) helper
    • Update _validate_ground_truth to accept the JSON round-tripped 2-element [list, dict] form alongside native tuples and plain string lists; all existing error messages and blame/category/target values preserved
  • tests/unittests/test_task_navigation_efficiency_string_inputs.py (new)

    • Native list inputs (existing behavior unchanged)
    • JSON-string response + ground_truth are accepted and produce results identical to native inputs
    • JSON-string tuple-form ground_truth triggers parameter matching
    • Invalid JSON string → EvaluationException(category=INVALID_VALUE) with the new message
    • Non-string/non-list value → the original "'response' must be a list of messages." error is preserved
evaluator = _TaskNavigationEfficiencyEvaluator()

# Previously failed in cloud runtime (strings from dataMapping substitution):
result = evaluator(
    response='[{"role":"assistant","content":[{"type":"tool_call","name":"search","arguments":{}}]}]',
    ground_truth='["search"]',
)
# Now succeeds — same result as passing the decoded lists directly
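
The JSON round-tripped tuple form is handled the same way. A hypothetical continuation of the example above (the tool name and parameter shape are illustrative, not taken from the PR):

# Tuple-form ground_truth: JSON has no tuple type, so (["search"], {...})
# arrives as the 2-element list form shown here. The evaluator now treats
# it as (tool_names_list, parameters_dict) and enables parameter matching.
result = evaluator(
    response='[{"role":"assistant","content":[{"type":"tool_call","name":"search","arguments":{"query":"cats"}}]}]',
    ground_truth='[["search"], {"search": {"query": "cats"}}]',
)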

Original prompt

Summary

_TaskNavigationEfficiencyEvaluator fails in cloud Foundry / ACA evaluation runs with:

(UserError) 'response' must be a list of messages.

even when the user's dataset rows contain perfectly valid response lists and ground_truth lists. (The published asset in Azure/azureml-assets — same code, with parameter names renamed to actions / expected_actions — exhibits the identical bug; a parallel PR is being opened against that repo.)

Root cause

When this evaluator is invoked through the OpenAI-Evals dataMapping pipeline as a code evaluator, the runtime delivers list/object fields to the Python entry point as JSON-encoded strings (e.g. "[{\"role\":\"user\",...}]") instead of Python lists. This happens because the templating substitution ${data.response} does not preserve container types for complex values; the runtime calls json.dumps before invoking the grader.
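
A standalone sketch of what the substitution does to a list-valued field (illustrative, not the runtime's actual code):

import json

# A dataset row with a genuine Python list...
row = {"response": [{"role": "user", "content": "hi"}]}

# ...is delivered to the evaluator entry point as a JSON-encoded string:
substituted = json.dumps(row["response"])
print(type(substituted).__name__)  # str, not list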

The validator at:

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_task_navigation_efficiency_validator.py

rejects strings outright in _validate_response:

if not isinstance(response, list):
    return EvaluationException(
        message="'response' must be a list of messages.",
        ...
    )

…and in _validate_ground_truth, the list branch insists every element is a str, which also rejects the JSON-decoded tuple-form [[...names...], {...params...}] (JSON has no tuple type, so the documented tuple form (list, dict) round-trips to a 2-element list).
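
The tuple loss is easy to reproduce in isolation:

import json

ground_truth = (["search"], {"search": {"query": "cats"}})
decoded = json.loads(json.dumps(ground_truth))
print(decoded)  # [['search'], {'search': {'query': 'cats'}}]
# The documented tuple form comes back as a plain 2-element list,
# which the current "list of strings" branch rejects.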

This contradicts (a) the evaluator's own typing — response: Union[str, List[Dict[str, Any]]] in the __call__ overloads at _task_navigation_efficiency.py — and (b) the behavior of peer evaluators (ToolCallAccuracy, TaskAdherence, IntentResolution, etc.) which already accept stringified inputs from the same runtime.

Required changes

1. sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py

Add a small JSON-decode helper and call it at the top of _real_call, before self._validator.validate_eval_input(kwargs), so both the validator and _do_eval see Python objects:

# Assumes the module-level imports already present in this file:
# json, typing.Any, typing_extensions.override, and EvaluationException /
# ErrorTarget / ErrorCategory from the SDK's exception types.
@staticmethod
def _maybe_json_decode(value: Any, field_name: str) -> Any:
    """Decode a JSON-encoded string into a Python object.

    The cloud Foundry / ACA evaluation runtime delivers list/object fields
    to code-type evaluators as JSON-encoded strings via dataMapping
    templating (e.g. ${data.response}). Accept that shape transparently
    so that callers using either the in-process Python SDK or the cloud
    runtime work.
    """
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError as exc:
            raise EvaluationException(
                message=(
                    f"'{field_name}' arrived as a string but is not valid JSON: {exc}"
                ),
                internal_message=str(exc),
                target=ErrorTarget.TASK_NAVIGATION_EFFICIENCY_EVALUATOR,
                category=ErrorCategory.INVALID_VALUE,
            )
    return value

@override
async def _real_call(self, **kwargs):
    if "response" in kwargs:
        kwargs["response"] = self._maybe_json_decode(kwargs["response"], "response")
    if "ground_truth" in kwargs:
        kwargs["ground_truth"] = self._maybe_json_decode(kwargs["ground_truth"], "ground_truth")
    self._validator.validate_eval_input(kwargs)
    return await super()._real_call(**kwargs)

_do_eval already accepts ground_truth as either a tuple or a 2-element list (see existing isinstance(ground_truth, tuple) and len(ground_truth) == 2 and the list branch). No changes are required to _do_eval itself; please verify the existing logic still works after the fix.
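
For reference, the existing shape handling in _do_eval amounts to roughly the following (a paraphrase of the checks cited above, not the exact source):

# Paraphrased: accept both the native tuple and the JSON round-tripped list.
if isinstance(ground_truth, (tuple, list)) and len(ground_truth) == 2 \
        and isinstance(ground_truth[0], list) and isinstance(ground_truth[1], dict):
    tool_names, parameters = ground_truth        # tuple form: names + expected params
else:
    tool_names, parameters = ground_truth, None  # plain list of tool names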

2. sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_task_navigation_efficiency_validator.py

Update _validate_ground_truth so that, in addition to the existing tuple and "list of strings" branches, a 2-element list/tuple whose first element is a list and whose second element is a dict is treated as the tuple-form (tool_names_list, parameters_dict) and validated using the same rules as the existing tuple branch.

Refactor the existing tuple-validation logic into a private helper (e.g. _validate_tool_names_and_params(self, tool_names, parameters) -> Optional[EvaluationException]) so it is shared between the tuple path and the new "2-element list" path. Keep all existing user-facing error messages and ErrorBlame / ErrorCategory / target values for cases already covered. The fall-through "must be either a list of tool names or a tuple of (tool_names_list, parameters_dict)" error must still be raised for genuinely invalid shapes.

The decision tree should be roughly:

  1. ground_truth is None or empty → existing missing-field error.
  2. ground_truth is tuple → ex...
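
A minimal sketch of the refactored validator, assuming the existing return-the-exception convention and that ErrorBlame.USER_ERROR is the blame value already used for these cases (both assumptions; the real messages and values must stay exactly as they are today):

def _validate_tool_names_and_params(self, tool_names, parameters):
    # Shared by the native tuple path and the JSON round-tripped
    # 2-element-list path. Returns an exception (does not raise),
    # matching the existing validator convention; None means valid.
    if not isinstance(tool_names, list) or not all(isinstance(n, str) for n in tool_names):
        return EvaluationException(
            message="'ground_truth' tool names must be a list of strings.",  # illustrative text
            target=ErrorTarget.TASK_NAVIGATION_EFFICIENCY_EVALUATOR,
            category=ErrorCategory.INVALID_VALUE,
            blame=ErrorBlame.USER_ERROR,
        )
    if not isinstance(parameters, dict):
        return EvaluationException(
            message="'ground_truth' parameters must be a dict of tool parameters.",  # illustrative text
            target=ErrorTarget.TASK_NAVIGATION_EFFICIENCY_EVALUATOR,
            category=ErrorCategory.INVALID_VALUE,
            blame=ErrorBlame.USER_ERROR,
        )
    return None

# Inside _validate_ground_truth, after the existing tuple branch:
if (
    isinstance(ground_truth, list)
    and len(ground_truth) == 2
    and isinstance(ground_truth[0], list)
    and isinstance(ground_truth[1], dict)
):
    # JSON round-trip: (tool_names, params) came back as [tool_names, params]
    return self._validate_tool_names_and_params(ground_truth[0], ground_truth[1])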

This pull request was created from Copilot chat.

Copilot AI and others added 2 commits May 6, 2026 03:59
Copilot AI changed the title from "[WIP] Fix response validation for TaskNavigationEfficiencyEvaluator" to "Fix _TaskNavigationEfficiencyEvaluator rejecting valid inputs from cloud Foundry/ACA runtime" on May 6, 2026
Copilot AI requested a review from m7md7sien May 6, 2026 04:00