Feat/gptoss accuracy agentic#376
Conversation
…nchmarks Add aime25::gptoss, gpqa::gptoss, and livecodebench::gptoss as accuracy evaluation datasets alongside the agentic performance phase in both the Kimi and Qwen agentic benchmark configs. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ceConfig defaults Scale the perf-phase progress total by num_trajectories_to_issue when set, so the TUI shows ~31 expected turns instead of the full 30,335-turn dataset size for a single-trajectory run. The scaling uses proportional estimation: num_trajectories * (total_client_turns / num_conversations). Change enable_salt and inject_tool_delay defaults to True — these are required for all benchmark runs and were always being set explicitly in every config. Remove the now-redundant fields from qwen_agentic_benchmark.yaml (turn_timeout_s: 14400 remains explicit as it differs from the 24h default). Update test fixtures to pass enable_salt=False explicitly where tests use minimal metadata without pre-built messages. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…PUT burst Ported from origin/feat/tool_sequences (72c20f5). Issuing all accuracy requests simultaneously exhausted the KV pool and produced truncated outputs. Accuracy phases now inherit the performance load pattern; AGENTIC_INFERENCE is downgraded to CONCURRENCY with the same target_concurrency since plain accuracy datasets are single-turn. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request adds new agentic benchmark configurations for Qwen and Kimi, refactors runtime settings to scale sample counts proportionally for agentic inference, and updates accuracy phases to downgrade agentic load patterns to concurrency. Additionally, default settings for enable_salt and inject_tool_delay are changed to True. The reviewer feedback highlights a hardcoded local dataset path in the new Qwen configuration, warns against using assert for runtime validation in execute.py (suggesting config.get_single_dataset() for reusability), and notes that enabling salt by default could break backward compatibility for datasets lacking system prompts.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| datasets: | ||
| - name: agentic_coding | ||
| type: performance | ||
| path: /home/tianmuli/vllm_test/datasets/agentic_combined_v4.jsonl |
There was a problem hiding this comment.
The dataset path is hardcoded to a local absolute path (/home/tianmuli/vllm_test/datasets/agentic_combined_v4.jsonl). This makes the example configuration non-portable and will fail for other users. Please use a placeholder path (e.g., /path/to/agentic_combined_v4.jsonl) or a relative path instead.
path: /path/to/agentic_combined_v4.jsonl| agentic_overrides: dict = {} | ||
| if isinstance(dataloader, AgenticInferenceDataset): | ||
| perf_cfgs = [d for d in config.datasets if d.type == DatasetType.PERFORMANCE] | ||
| agentic_cfg = perf_cfgs[0].agentic_inference if perf_cfgs else None | ||
| assert dataloader.conversation_metadata is not None | ||
| agentic_overrides = { | ||
| "agentic_num_conversations": dataloader.conversation_metadata.num_conversations, | ||
| "agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue | ||
| if agentic_cfg is not None | ||
| else None, | ||
| } | ||
| rt_settings = RuntimeSettings.from_config( | ||
| config, dataloader.num_samples(), **agentic_overrides | ||
| ) |
There was a problem hiding this comment.
There are two improvement opportunities here:
- Defensive Programming: Using
assertfor runtime validation/control flow is discouraged because assertions can be globally disabled in Python (e.g., via the-Oflag orPYTHONOPTIMIZEenvironment variable). If assertions are disabled,dataloader.conversation_metadatacould beNoneand lead to an unhandledAttributeErroron the next line. It is safer to explicitly check forNoneand raise aSetupError. - Code Reusability: Instead of manually filtering
config.datasetsto find the performance dataset, you can leverage the existingconfig.get_single_dataset()helper method.
| agentic_overrides: dict = {} | |
| if isinstance(dataloader, AgenticInferenceDataset): | |
| perf_cfgs = [d for d in config.datasets if d.type == DatasetType.PERFORMANCE] | |
| agentic_cfg = perf_cfgs[0].agentic_inference if perf_cfgs else None | |
| assert dataloader.conversation_metadata is not None | |
| agentic_overrides = { | |
| "agentic_num_conversations": dataloader.conversation_metadata.num_conversations, | |
| "agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue | |
| if agentic_cfg is not None | |
| else None, | |
| } | |
| rt_settings = RuntimeSettings.from_config( | |
| config, dataloader.num_samples(), **agentic_overrides | |
| ) | |
| agentic_overrides: dict = {} | |
| if isinstance(dataloader, AgenticInferenceDataset): | |
| perf_ds = config.get_single_dataset() | |
| agentic_cfg = perf_ds.agentic_inference if perf_ds is not None else None | |
| metadata = dataloader.conversation_metadata | |
| if metadata is None: | |
| raise SetupError("AgenticInferenceDataset is missing conversation metadata.") | |
| agentic_overrides = { | |
| "agentic_num_conversations": metadata.num_conversations, | |
| "agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue | |
| if agentic_cfg is not None | |
| else None, | |
| } | |
| rt_settings = RuntimeSettings.from_config( | |
| config, dataloader.num_samples(), **agentic_overrides | |
| ) |
| enable_salt: bool = Field( | ||
| False, | ||
| True, | ||
| description=( | ||
| "Add deterministic salt markers before and after the system prompt " | ||
| "to prevent KV cache reuse across trajectories in agentic inference setting." | ||
| ), | ||
| ) |
There was a problem hiding this comment.
Changing the default of enable_salt to True introduces a strict requirement that every conversation in the dataset must have a system prompt. If a dataset lacks a system prompt, AgenticInferenceStrategy._validate_salt_system_prompts will raise an InputValidationError, breaking backward compatibility for simpler datasets.
To make this default robust, consider updating AgenticInferenceStrategy (in a separate PR or file) to automatically prepend an empty system prompt if none is present, rather than failing. Otherwise, keeping the default as False might be safer to avoid unexpected validation failures for users.
What does this PR do?
Type of change
Related issues
Testing
Checklist