Skip to content

Add CVDP benchmark resource server with apptainer instead of docker#928

Merged
cmunley1 merged 16 commits intoNVIDIA-NeMo:mainfrom
arti4nvj:arti/cvdp_v2
Apr 7, 2026
Merged

Add CVDP benchmark resource server with apptainer instead of docker#928
cmunley1 merged 16 commits intoNVIDIA-NeMo:mainfrom
arti4nvj:arti/cvdp_v2

Conversation

@arti4nvj
Copy link
Copy Markdown
Contributor

Part of Customer Eval Bench, this is adding CVDP (non-agentic, non-commercial) support to Gym. This is a single pass evaluation using vLLM as a backend. The current code matches the existing public CVDP infra as of 3/5.

resources_server/cvdp --> All the helper scripts and files needed to run the benchmark
resources_server/cvdp/cvdp_lib --> contain files and code straight from CVDP Public Github that are needed for the final report generation
resources_server/cvdp/scripts --> contain all the helper scripts to convert the dataset to gym, create the final report
responses_api_agents/cvdp_agent --> copy of Simple Agent except added support for retries

Instead of Docker, this runs with Apptainer

Comment thread benchmarks/aime24/config.yaml Outdated
Comment thread resources_servers/cvdp/configs/cvdp.yaml
Comment thread resources_servers/cvdp/env.yaml.example Outdated
Comment thread resources_servers/cvdp/README.md Outdated
Comment thread responses_api_agents/cvdp_agent/app.py
@arti4nvj arti4nvj force-pushed the arti/cvdp_v2 branch 3 times, most recently from d0044b0 to 7bec70a Compare March 24, 2026 03:52
@arti4nvj arti4nvj requested review from cmunley1, jmabry and roclark March 24, 2026 04:07
Comment thread resources_servers/cvdp/README.md Outdated
Comment thread resources_servers/cvdp/README.md Outdated
Comment thread resources_servers/cvdp/README.md Outdated
Comment thread resources_servers/cvdp/README.md
@jmabry
Copy link
Copy Markdown

jmabry commented Mar 24, 2026

Code review

Found 2 issues:

  1. Usage double-counting in responses_api_agents/cvdp_agent/app.pyusage = model_response.usage on the first iteration stores a reference, then the if usage: block immediately adds model_response.usage.input_tokens to itself, doubling the first call's token counts. PR feat: Fix duplicated usage counting and errors on empty usage in subsequent model calls #939 (commit c7bb3191) fixed this exact pattern in simple_agent by adding model_response.usage = None immediately after capture and guarding with if usage and model_response.usage:. The cvdp_agent was copied from the pre-fix version.

if not usage:
usage = model_response.usage
if usage:
usage.input_tokens += model_response.usage.input_tokens
usage.output_tokens += model_response.usage.output_tokens
usage.total_tokens += model_response.usage.total_tokens

  1. responses_api_agents/cvdp_agent/client.py is an unmodified copy-paste from example_single_tool_call / simple_agent — it hardcodes server_name="example_single_tool_call_simple_agent", calls a get_weather tool, and uses "going out in sf tn" as example input. It has module-level executable code that runs on import and would connect to the wrong server. This file should be removed or updated for CVDP.

server_client = ServerClient.load_from_global_config()
task = server_client.post(
server_name="example_single_tool_call_simple_agent",
url_path="/v1/responses",
json=NeMoGymResponseCreateParamsNonStreaming(
input=[

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

jmabry
jmabry previously approved these changes Mar 24, 2026
Copy link
Copy Markdown

@jmabry jmabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG%C

roclark
roclark previously approved these changes Mar 24, 2026
Copy link
Copy Markdown
Contributor

@roclark roclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @arti4nvj!

@arti4nvj arti4nvj force-pushed the arti/cvdp_v2 branch 2 times, most recently from 6c42b52 to be4eff7 Compare March 31, 2026 22:13
@roclark roclark requested a review from cmunley1 April 6, 2026 15:28
roclark
roclark previously approved these changes Apr 6, 2026
arti4nvj added 8 commits April 7, 2026 12:42
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
arti4nvj and others added 8 commits April 7, 2026 12:42
…a copy of simple agent and irrelevent

Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Route comprehension categories to BLEU/ROUGE scoring instead of
docker-compose harness. Code-generation categories (2-5, 7, 12-14, 16)
are unchanged. Also updates convert_to_gym.py to handle comprehension
data and adds tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Mirrors CVDP's validate_commercial_eda_setup() — warns at startup if
eda_sim_image is not set, since categories 12/13/14 will fail at runtime
when harness files reference __VERIF_EDA_IMAGE__.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
Signed-off-by: Arti Jain <artij@nvidia.com>
class SimpleAgent(SimpleResponsesAPIAgent):
config: SimpleAgentConfig

async def responses(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be better to import rather than duplicate if you can

@@ -0,0 +1,9 @@
# Description


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a sentence or two here at least pointing to the resources server readme


# Licensing information
Code: Apache 2.0
Data: N/A
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and clarify data availability

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just pointing to resources server docs is fine i guess

@cmunley1 cmunley1 merged commit 3a9db6f into NVIDIA-NeMo:main Apr 7, 2026
5 checks passed
cmunley1 pushed a commit that referenced this pull request Apr 8, 2026
…928)

Part of Customer Eval Bench, this is adding CVDP (non-agentic,
non-commercial) support to Gym. This is a single pass evaluation using
vLLM as a backend. The current code matches the existing [public CVDP
infra](https://github.com/NVlabs/cvdp_benchmark) as of 3/5.

resources_server/cvdp --> All the helper scripts and files needed to run
the benchmark
resources_server/cvdp/cvdp_lib --> contain files and code straight from
CVDP Public Github that are needed for the final report generation
resources_server/cvdp/scripts --> contain all the helper scripts to
convert the dataset to gym, create the final report
responses_api_agents/cvdp_agent --> copy of Simple Agent except added
support for retries

Instead of Docker, this runs with Apptainer

---------

Signed-off-by: Arti Jain <artij@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants