
Fix _TaskNavigationEfficiencyEvaluator rejecting valid inputs from cloud Foundry/ACA runtime #46745

Draft

Copilot wants to merge 3 commits into main from copilot/fix-response-list-validation

Conversation

Copilot AI commented May 6, 2026

The cloud Foundry/ACA evaluation runtime serializes list/object dataMapping fields to JSON-encoded strings before invoking the Python evaluator entry point. As a result, _TaskNavigationEfficiencyEvaluator always fails with "'response' must be a list of messages." even when the dataset contains valid lists.

Changes

  • _task_navigation_efficiency.py

    • Add _maybe_json_decode(value, field_name) static method — transparently decodes JSON-string inputs; raises EvaluationException(category=INVALID_VALUE) on malformed JSON
    • Override _real_call to decode response and ground_truth before validation, making cloud-runtime and in-process call paths equivalent
    • Extend _do_eval to treat a 2-element [list, dict] as equivalent to a native (list, dict) tuple — required because JSON has no tuple type, so ground_truth=(["a"], {...}) round-trips to [["a"], {...}]
  • _task_navigation_efficiency_validator.py

    • Extract shared tuple-validation logic into _validate_tool_names_and_params(tool_names, parameters) helper
    • Update _validate_ground_truth to accept the JSON round-tripped 2-element [list, dict] form alongside native tuples and plain string lists; all existing error messages and blame/category/target values preserved
  • tests/unittests/test_task_navigation_efficiency_string_inputs.py (new)

    • Native list inputs (existing behavior unchanged)
    • JSON-string response + ground_truth are accepted and produce results identical to native inputs
    • JSON-string tuple-form ground_truth triggers parameter matching
    • Invalid JSON string → EvaluationException(category=INVALID_VALUE) with the new message
    • Non-string/non-list value → the original "'response' must be a list of messages." error is preserved
evaluator = _TaskNavigationEfficiencyEvaluator()

# Previously failed in cloud runtime (strings from dataMapping substitution):
result = evaluator(
    response='[{"role":"assistant","content":[{"type":"tool_call","name":"search","arguments":{}}]}]',
    ground_truth='["search"]',
)
# Now succeeds — same result as passing the decoded lists directly
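
The JSON round-tripped tuple form is handled the same way. A hypothetical continuation of the example above (the tool name and parameter shape are illustrative, not taken from the PR):

# Tuple-form ground_truth: JSON has no tuple type, so (["search"], {...})
# arrives as the 2-element list form shown here. The evaluator now treats
# it as (tool_names_list, parameters_dict) and enables parameter matching.
result = evaluator(
    response='[{"role":"assistant","content":[{"type":"tool_call","name":"search","arguments":{"query":"cats"}}]}]',
    ground_truth='[["search"], {"search": {"query": "cats"}}]',
)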

Original prompt

Summary

_TaskNavigationEfficiencyEvaluator fails in cloud Foundry / ACA evaluation runs with:

(UserError) 'response' must be a list of messages.

even when the user's dataset rows contain perfectly valid response lists and ground_truth lists. (The published asset in Azure/azureml-assets — same code, with parameter names renamed to actions / expected_actions — exhibits the identical bug; a parallel PR is being opened against that repo.)

Root cause

When this evaluator is invoked through the OpenAI-Evals dataMapping pipeline as a code evaluator, the runtime delivers list/object fields to the Python entry point as JSON-encoded strings (e.g. "[{\"role\":\"user\",...}]") instead of Python lists. This happens because the templating substitution ${data.response} does not preserve container types for complex values; the runtime calls json.dumps before invoking the grader.
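
A standalone sketch of what the substitution does to a list-valued field (illustrative, not the runtime's actual code):

import json

# A dataset row with a genuine Python list...
row = {"response": [{"role": "user", "content": "hi"}]}

# ...is delivered to the evaluator entry point as a JSON-encoded string:
substituted = json.dumps(row["response"])
print(type(substituted).__name__)  # str, not list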

The validator at:

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_task_navigation_efficiency_validator.py

rejects strings outright in _validate_response:

if not isinstance(response, list):
    return EvaluationException(
        message="'response' must be a list of messages.",
        ...
    )

…and in _validate_ground_truth, the list branch insists every element is a str, which also rejects the JSON-decoded tuple-form [[...names...], {...params...}] (JSON has no tuple type, so the documented tuple form (list, dict) round-trips to a 2-element list).
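
The tuple loss is easy to reproduce in isolation:

import json

ground_truth = (["search"], {"search": {"query": "cats"}})
decoded = json.loads(json.dumps(ground_truth))
print(decoded)  # [['search'], {'search': {'query': 'cats'}}]
# The documented tuple form comes back as a plain 2-element list,
# which the current "list of strings" branch rejects.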

This contradicts (a) the evaluator's own typing — response: Union[str, List[Dict[str, Any]]] in the __call__ overloads at _task_navigation_efficiency.py — and (b) the behavior of peer evaluators (ToolCallAccuracy, TaskAdherence, IntentResolution, etc.) which already accept stringified inputs from the same runtime.

Required changes

1. sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.py

Add a small JSON-decode helper and call it at the top of _real_call, before self._validator.validate_eval_input(kwargs), so both the validator and _do_eval see Python objects:

# Assumes the module-level imports already present in this file:
# json, typing.Any, typing_extensions.override, and EvaluationException /
# ErrorTarget / ErrorCategory from the SDK's exception types.
@staticmethod
def _maybe_json_decode(value: Any, field_name: str) -> Any:
    """Decode a JSON-encoded string into a Python object.

    The cloud Foundry / ACA evaluation runtime delivers list/object fields
    to code-type evaluators as JSON-encoded strings via dataMapping
    templating (e.g. ${data.response}). Accept that shape transparently
    so that callers using either the in-process Python SDK or the cloud
    runtime work.
    """
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError as exc:
            raise EvaluationException(
                message=(
                    f"'{field_name}' arrived as a string but is not valid JSON: {exc}"
                ),
                internal_message=str(exc),
                target=ErrorTarget.TASK_NAVIGATION_EFFICIENCY_EVALUATOR,
                category=ErrorCategory.INVALID_VALUE,
            )
    return value

@override
async def _real_call(self, **kwargs):
    if "response" in kwargs:
        kwargs["response"] = self._maybe_json_decode(kwargs["response"], "response")
    if "ground_truth" in kwargs:
        kwargs["ground_truth"] = self._maybe_json_decode(kwargs["ground_truth"], "ground_truth")
    self._validator.validate_eval_input(kwargs)
    return await super()._real_call(**kwargs)

_do_eval already accepts ground_truth as either a tuple or a 2-element list (see existing isinstance(ground_truth, tuple) and len(ground_truth) == 2 and the list branch). No changes are required to _do_eval itself; please verify the existing logic still works after the fix.
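
For reference, the existing shape handling in _do_eval amounts to roughly the following (a paraphrase of the checks cited above, not the exact source):

# Paraphrased: accept both the native tuple and the JSON round-tripped list.
if isinstance(ground_truth, (tuple, list)) and len(ground_truth) == 2 \
        and isinstance(ground_truth[0], list) and isinstance(ground_truth[1], dict):
    tool_names, parameters = ground_truth        # tuple form: names + expected params
else:
    tool_names, parameters = ground_truth, None  # plain list of tool names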

2. sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_task_navigation_efficiency_validator.py

Update _validate_ground_truth so that, in addition to the existing tuple and "list of strings" branches, a 2-element list/tuple whose first element is a list and whose second element is a dict is treated as the tuple-form (tool_names_list, parameters_dict) and validated using the same rules as the existing tuple branch.

Refactor the existing tuple-validation logic into a private helper (e.g. _validate_tool_names_and_params(self, tool_names, parameters) -> Optional[EvaluationException]) so it is shared between the tuple path and the new "2-element list" path. Keep all existing user-facing error messages and ErrorBlame / ErrorCategory / target values for cases already covered. The fall-through "must be either a list of tool names or a tuple of (tool_names_list, parameters_dict)" error must still be raised for genuinely invalid shapes.

The decision tree should be roughly:

  1. ground_truth is None or empty → existing missing-field error.
  2. ground_truth is tuple → ex...
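
A minimal sketch of the refactored validator, assuming the existing return-the-exception convention and that ErrorBlame.USER_ERROR is the blame value already used for these cases (both assumptions; the real messages and values must stay exactly as they are today):

def _validate_tool_names_and_params(self, tool_names, parameters):
    # Shared by the native tuple path and the JSON round-tripped
    # 2-element-list path. Returns an exception (does not raise),
    # matching the existing validator convention; None means valid.
    if not isinstance(tool_names, list) or not all(isinstance(n, str) for n in tool_names):
        return EvaluationException(
            message="'ground_truth' tool names must be a list of strings.",  # illustrative text
            target=ErrorTarget.TASK_NAVIGATION_EFFICIENCY_EVALUATOR,
            category=ErrorCategory.INVALID_VALUE,
            blame=ErrorBlame.USER_ERROR,
        )
    if not isinstance(parameters, dict):
        return EvaluationException(
            message="'ground_truth' parameters must be a dict of tool parameters.",  # illustrative text
            target=ErrorTarget.TASK_NAVIGATION_EFFICIENCY_EVALUATOR,
            category=ErrorCategory.INVALID_VALUE,
            blame=ErrorBlame.USER_ERROR,
        )
    return None

# Inside _validate_ground_truth, after the existing tuple branch:
if (
    isinstance(ground_truth, list)
    and len(ground_truth) == 2
    and isinstance(ground_truth[0], list)
    and isinstance(ground_truth[1], dict)
):
    # JSON round-trip: (tool_names, params) came back as [tool_names, params]
    return self._validate_tool_names_and_params(ground_truth[0], ground_truth[1])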

This pull request was created from Copilot chat.

Copilot AI and others added 2 commits May 6, 2026 03:59
Copilot AI changed the title from "[WIP] Fix response validation for TaskNavigationEfficiencyEvaluator" to "Fix _TaskNavigationEfficiencyEvaluator rejecting valid inputs from cloud Foundry/ACA runtime" on May 6, 2026
Copilot AI requested a review from m7md7sien May 6, 2026 04:00