Feat/gptoss accuracy agentic by tianmu-li · Pull Request #376 · mlcommons/endpoints

tianmu-li · 2026-06-26T01:55:01Z

What does this PR do?

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

…nchmarks Add aime25::gptoss, gpqa::gptoss, and livecodebench::gptoss as accuracy evaluation datasets alongside the agentic performance phase in both the Kimi and Qwen agentic benchmark configs. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…ceConfig defaults Scale the perf-phase progress total by num_trajectories_to_issue when set, so the TUI shows ~31 expected turns instead of the full 30,335-turn dataset size for a single-trajectory run. The scaling uses proportional estimation: num_trajectories * (total_client_turns / num_conversations). Change enable_salt and inject_tool_delay defaults to True — these are required for all benchmark runs and were always being set explicitly in every config. Remove the now-redundant fields from qwen_agentic_benchmark.yaml (turn_timeout_s: 14400 remains explicit as it differs from the 24h default). Update test fixtures to pass enable_salt=False explicitly where tests use minimal metadata without pre-built messages. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…PUT burst Ported from origin/feat/tool_sequences (72c20f5). Issuing all accuracy requests simultaneously exhausted the KV pool and produced truncated outputs. Accuracy phases now inherit the performance load pattern; AGENTIC_INFERENCE is downgraded to CONCURRENCY with the same target_concurrency since plain accuracy datasets are single-turn. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-26T01:55:11Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request adds new agentic benchmark configurations for Qwen and Kimi, refactors runtime settings to scale sample counts proportionally for agentic inference, and updates accuracy phases to downgrade agentic load patterns to concurrency. Additionally, default settings for enable_salt and inject_tool_delay are changed to True. The reviewer feedback highlights a hardcoded local dataset path in the new Qwen configuration, warns against using assert for runtime validation in execute.py (suggesting config.get_single_dataset() for reusability), and notes that enabling salt by default could break backward compatibility for datasets lacking system prompts.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-26T01:56:31Z

+datasets:
+  - name: agentic_coding
+    type: performance
+    path: /home/tianmuli/vllm_test/datasets/agentic_combined_v4.jsonl


The dataset path is hardcoded to a local absolute path (/home/tianmuli/vllm_test/datasets/agentic_combined_v4.jsonl). This makes the example configuration non-portable and will fail for other users. Please use a placeholder path (e.g., /path/to/agentic_combined_v4.jsonl) or a relative path instead.

path: /path/to/agentic_combined_v4.jsonl

gemini-code-assist · 2026-06-26T01:56:31Z

+    agentic_overrides: dict = {}
+    if isinstance(dataloader, AgenticInferenceDataset):
+        perf_cfgs = [d for d in config.datasets if d.type == DatasetType.PERFORMANCE]
+        agentic_cfg = perf_cfgs[0].agentic_inference if perf_cfgs else None
+        assert dataloader.conversation_metadata is not None
+        agentic_overrides = {
+            "agentic_num_conversations": dataloader.conversation_metadata.num_conversations,
+            "agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue
+            if agentic_cfg is not None
+            else None,
+        }
+    rt_settings = RuntimeSettings.from_config(
+        config, dataloader.num_samples(), **agentic_overrides
+    )


There are two improvement opportunities here:

Defensive Programming: Using assert for runtime validation/control flow is discouraged because assertions can be globally disabled in Python (e.g., via the -O flag or PYTHONOPTIMIZE environment variable). If assertions are disabled, dataloader.conversation_metadata could be None and lead to an unhandled AttributeError on the next line. It is safer to explicitly check for None and raise a SetupError.

Code Reusability: Instead of manually filtering config.datasets to find the performance dataset, you can leverage the existing config.get_single_dataset() helper method.

Suggested change

agentic_overrides: dict = {}

if isinstance(dataloader, AgenticInferenceDataset):

perf_cfgs = [d for d in config.datasets if d.type == DatasetType.PERFORMANCE]

agentic_cfg = perf_cfgs[0].agentic_inference if perf_cfgs else None

assert dataloader.conversation_metadata is not None

agentic_overrides = {

"agentic_num_conversations": dataloader.conversation_metadata.num_conversations,

"agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue

if agentic_cfg is not None

else None,

}

rt_settings = RuntimeSettings.from_config(

config, dataloader.num_samples(), **agentic_overrides

)

agentic_overrides: dict = {}

if isinstance(dataloader, AgenticInferenceDataset):

perf_ds = config.get_single_dataset()

agentic_cfg = perf_ds.agentic_inference if perf_ds is not None else None

metadata = dataloader.conversation_metadata

if metadata is None:

raise SetupError("AgenticInferenceDataset is missing conversation metadata.")

agentic_overrides = {

"agentic_num_conversations": metadata.num_conversations,

"agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue

if agentic_cfg is not None

else None,

}

rt_settings = RuntimeSettings.from_config(

config, dataloader.num_samples(), **agentic_overrides

)

gemini-code-assist · 2026-06-26T01:56:32Z

    enable_salt: bool = Field(
-        False,
+        True,
        description=(
            "Add deterministic salt markers before and after the system prompt "
            "to prevent KV cache reuse across trajectories in agentic inference setting."
        ),
    )


Changing the default of enable_salt to True introduces a strict requirement that every conversation in the dataset must have a system prompt. If a dataset lacks a system prompt, AgenticInferenceStrategy._validate_salt_system_prompts will raise an InputValidationError, breaking backward compatibility for simpler datasets.

To make this default robust, consider updating AgenticInferenceStrategy (in a separate PR or file) to automatically prepend an empty system prompt if none is present, rather than failing. Otherwise, keeping the default as False might be safer to avoid unexpected validation failures for users.

tianmu-li and others added 3 commits June 25, 2026 11:17

gemini-code-assist Bot reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat/gptoss accuracy agentic#376

Feat/gptoss accuracy agentic#376
tianmu-li wants to merge 3 commits into
mlcommons:mainfrom
tianmu-li:feat/gptoss-accuracy-agentic

tianmu-li commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tianmu-li commented Jun 26, 2026

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant