Skip to content

Feat/gptoss accuracy agentic#376

Draft
tianmu-li wants to merge 3 commits into
mlcommons:mainfrom
tianmu-li:feat/gptoss-accuracy-agentic
Draft

Feat/gptoss accuracy agentic#376
tianmu-li wants to merge 3 commits into
mlcommons:mainfrom
tianmu-li:feat/gptoss-accuracy-agentic

Conversation

@tianmu-li

Copy link
Copy Markdown
Collaborator

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

tianmu-li and others added 3 commits June 25, 2026 11:17
…nchmarks

Add aime25::gptoss, gpqa::gptoss, and livecodebench::gptoss as accuracy
evaluation datasets alongside the agentic performance phase in both the
Kimi and Qwen agentic benchmark configs.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ceConfig defaults

Scale the perf-phase progress total by num_trajectories_to_issue when set,
so the TUI shows ~31 expected turns instead of the full 30,335-turn dataset
size for a single-trajectory run. The scaling uses proportional estimation:
num_trajectories * (total_client_turns / num_conversations).

Change enable_salt and inject_tool_delay defaults to True — these are
required for all benchmark runs and were always being set explicitly in every
config. Remove the now-redundant fields from qwen_agentic_benchmark.yaml
(turn_timeout_s: 14400 remains explicit as it differs from the 24h default).

Update test fixtures to pass enable_salt=False explicitly where tests use
minimal metadata without pre-built messages.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…PUT burst

Ported from origin/feat/tool_sequences (72c20f5). Issuing all accuracy
requests simultaneously exhausted the KV pool and produced truncated
outputs. Accuracy phases now inherit the performance load pattern;
AGENTIC_INFERENCE is downgraded to CONCURRENCY with the same
target_concurrency since plain accuracy datasets are single-turn.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds new agentic benchmark configurations for Qwen and Kimi, refactors runtime settings to scale sample counts proportionally for agentic inference, and updates accuracy phases to downgrade agentic load patterns to concurrency. Additionally, default settings for enable_salt and inject_tool_delay are changed to True. The reviewer feedback highlights a hardcoded local dataset path in the new Qwen configuration, warns against using assert for runtime validation in execute.py (suggesting config.get_single_dataset() for reusability), and notes that enabling salt by default could break backward compatibility for datasets lacking system prompts.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

datasets:
- name: agentic_coding
type: performance
path: /home/tianmuli/vllm_test/datasets/agentic_combined_v4.jsonl

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The dataset path is hardcoded to a local absolute path (/home/tianmuli/vllm_test/datasets/agentic_combined_v4.jsonl). This makes the example configuration non-portable and will fail for other users. Please use a placeholder path (e.g., /path/to/agentic_combined_v4.jsonl) or a relative path instead.

    path: /path/to/agentic_combined_v4.jsonl

Comment on lines +398 to +411
agentic_overrides: dict = {}
if isinstance(dataloader, AgenticInferenceDataset):
perf_cfgs = [d for d in config.datasets if d.type == DatasetType.PERFORMANCE]
agentic_cfg = perf_cfgs[0].agentic_inference if perf_cfgs else None
assert dataloader.conversation_metadata is not None
agentic_overrides = {
"agentic_num_conversations": dataloader.conversation_metadata.num_conversations,
"agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue
if agentic_cfg is not None
else None,
}
rt_settings = RuntimeSettings.from_config(
config, dataloader.num_samples(), **agentic_overrides
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There are two improvement opportunities here:

  1. Defensive Programming: Using assert for runtime validation/control flow is discouraged because assertions can be globally disabled in Python (e.g., via the -O flag or PYTHONOPTIMIZE environment variable). If assertions are disabled, dataloader.conversation_metadata could be None and lead to an unhandled AttributeError on the next line. It is safer to explicitly check for None and raise a SetupError.
  2. Code Reusability: Instead of manually filtering config.datasets to find the performance dataset, you can leverage the existing config.get_single_dataset() helper method.
Suggested change
agentic_overrides: dict = {}
if isinstance(dataloader, AgenticInferenceDataset):
perf_cfgs = [d for d in config.datasets if d.type == DatasetType.PERFORMANCE]
agentic_cfg = perf_cfgs[0].agentic_inference if perf_cfgs else None
assert dataloader.conversation_metadata is not None
agentic_overrides = {
"agentic_num_conversations": dataloader.conversation_metadata.num_conversations,
"agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue
if agentic_cfg is not None
else None,
}
rt_settings = RuntimeSettings.from_config(
config, dataloader.num_samples(), **agentic_overrides
)
agentic_overrides: dict = {}
if isinstance(dataloader, AgenticInferenceDataset):
perf_ds = config.get_single_dataset()
agentic_cfg = perf_ds.agentic_inference if perf_ds is not None else None
metadata = dataloader.conversation_metadata
if metadata is None:
raise SetupError("AgenticInferenceDataset is missing conversation metadata.")
agentic_overrides = {
"agentic_num_conversations": metadata.num_conversations,
"agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue
if agentic_cfg is not None
else None,
}
rt_settings = RuntimeSettings.from_config(
config, dataloader.num_samples(), **agentic_overrides
)

Comment on lines 275 to 281
enable_salt: bool = Field(
False,
True,
description=(
"Add deterministic salt markers before and after the system prompt "
"to prevent KV cache reuse across trajectories in agentic inference setting."
),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Changing the default of enable_salt to True introduces a strict requirement that every conversation in the dataset must have a system prompt. If a dataset lacks a system prompt, AgenticInferenceStrategy._validate_salt_system_prompts will raise an InputValidationError, breaking backward compatibility for simpler datasets.

To make this default robust, consider updating AgenticInferenceStrategy (in a separate PR or file) to automatically prepend an empty system prompt if none is present, rather than failing. Otherwise, keeping the default as False might be safer to avoid unexpected validation failures for users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant