Add pre-built evaluation recipes for common benchmarks #1357
Changes from all commits
b9f579d
1a626bc
5158da7
e239f42
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -40,6 +40,14 @@ Test that `nel` is installed with `nel --version`. If not, instruct the user to | |||||||||||||||||
|
|
||||||||||||||||||
| If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running. | ||||||||||||||||||
|
|
||||||||||||||||||
| **Shortcut: use pre-built task snippets.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching task snippet. Available: mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. Task snippets contain only the task-specific config (name, params, repeats) — not the full NEL config. To use them: | ||||||||||||||||||
|
|
||||||||||||||||||
| 1. Read the task snippet(s) the user wants | ||||||||||||||||||
| 2. Use `recipes/examples/example_eval.yaml` as the base config template | ||||||||||||||||||
|
Contributor
I haven't found that I need to be this prescriptive. The agent can figure out the rest of the template from the |
||||||||||||||||||
| 3. Replace the `tasks:` section with the selected snippet(s) | ||||||||||||||||||
| 4. Do Step 3 (auto-detect model settings from checkpoint) and Step 4 (fill in `???` values) | ||||||||||||||||||
| 5. Proceed to Step 7.5/8 | ||||||||||||||||||
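
The shortcut steps above can be sketched in a few lines. This is a minimal illustration, assuming the base config and each task snippet have already been parsed from YAML into plain Python objects (a dict for the base config, a list of task dicts per snippet). The helper name `compose_config` is hypothetical and not part of NEL.

```python
import copy


def compose_config(base_config: dict, snippets: list) -> dict:
    """Return a copy of base_config whose evaluation.tasks holds only the snippet tasks.

    Hypothetical helper: each snippet is the parsed content of one file in
    recipes/tasks/, i.e. a list of task dicts.
    """
    composed = copy.deepcopy(base_config)
    tasks = []
    for snippet in snippets:
        tasks.extend(copy.deepcopy(snippet))
    # Replace (not append to) the template's tasks section.
    composed.setdefault("evaluation", {})["tasks"] = tasks
    return composed


# Trimmed stand-ins for example_eval.yaml and two task snippets.
base = {
    "deployment": {"extra_args": "--max-model-len 32768"},
    "evaluation": {"tasks": [{"name": "ns_mmlu_pro"}, {"name": "ns_ifbench"}]},
}
gpqa_snippet = [{"name": "ns_gpqa"}]
aime_snippet = [{"name": "ns_aime2025"}]

config = compose_config(base, [gpqa_snippet, aime_snippet])
print([task["name"] for task in config["evaluation"]["tasks"]])
```

The base template is deep-copied so that the original stays reusable for the next run; the remaining `???` values still have to be filled in afterwards.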
|
|
||||||||||||||||||
| **Step 2: Build the base config file** | ||||||||||||||||||
|
|
||||||||||||||||||
| Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion: | ||||||||||||||||||
|
|
@@ -123,6 +131,29 @@ If no `hf_quant_config.json`, also check `config.json` for a `quantization_confi | |||||||||||||||||
|
|
||||||||||||||||||
| > **Note:** Some models require additional env vars for deployment (e.g., `VLLM_NVFP4_GEMM_BACKEND=marlin` for Nemotron Super). These are not in `hf_quant_config.json` — they are discovered during model card research below. | ||||||||||||||||||
|
|
||||||||||||||||||
| **Auto-detect deployment settings from checkpoint:** | ||||||||||||||||||
|
|
||||||||||||||||||
| Read `config.json` from the checkpoint (or HF model card) and build `deployment.extra_args` dynamically: | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| cat <checkpoint_path>/config.json 2>/dev/null | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| | Field in `config.json` | What to set | Example | | ||||||||||||||||||
| | --- | --- | --- | | ||||||||||||||||||
| | `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` | | ||||||||||||||||||
| | `auto_map` exists | `--trust-remote-code` | Only add if model has custom code | | ||||||||||||||||||
|
|
||||||||||||||||||
|
Comment on lines +142 to +146
Contributor
🧩 Analysis chain
🌐 Web query:
💡 Result: Using --trust-remote-code in vLLM allows execution of untrusted Python code from Hugging Face model repositories during model loading and tokenizer initialization. This poses severe security risks, including remote code execution (RCE), as attackers can craft malicious models that execute arbitrary code on the host system when loaded. Multiple CVEs highlight this danger, even when the flag is intended to be False:
- CVE-2025-66448 and GHSA-8fr4-5q9j-m8gm: auto_map in config.json triggers get_class_from_dynamic_module, fetching and executing remote code bypassing trust_remote_code=False (fixed in v0.11.1+ with domain validation).
- CVE-2026-27893 and GHSA-7972-pg2x-xr59: Hardcoded trust_remote_code=True in model files (e.g., Nemotron-VL, Kimi-K25) overrides user setting (fixed in v0.18.0+).
- GHSA-2pc9-4j83-qjmr: Unconditional auto_map resolution without trust_remote_code check (fixed in v0.14.0).
Official vLLM docs confirm the flag defaults to False in both LLM class and vllm serve CLI, explicitly for trusting remote code from Hugging Face. It is not recommended to enable --trust-remote-code by default, especially based solely on config.json auto_map, as this field has been exploited for RCE. Only enable explicitly for trusted models requiring custom code, after verifying the repository. Use the latest vLLM version (e.g., v0.20.0 as of 2026-04-27) with patches applied, and prefer models without remote code needs.
Citations:
Don't auto-enable Automatically enabling this flag when
Suggested wording adjustment
-| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
||||||||||||||||||
| Then use WebSearch to check the model card (HuggingFace page) for deployment-specific settings: | ||||||||||||||||||
|
|
||||||||||||||||||
| | Model card signal | What to set | | ||||||||||||||||||
| | --- | --- | | ||||||||||||||||||
| | Reasoning model (thinking/CoT) | `--reasoning-parser` and `--reasoning-parser-plugin` if a custom parser is provided | | ||||||||||||||||||
| | Tool-calling support | `--enable-auto-tool-choice --tool-call-parser <parser>` | | ||||||||||||||||||
| | Custom vLLM flags documented | Add as specified (e.g., `--mamba_ssm_cache_dtype float32`) | | ||||||||||||||||||
|
|
||||||||||||||||||
| Combine all detected flags into a single `deployment.extra_args` override. The recipe's default `--max-model-len 32768` is a fallback — always prefer the value from `config.json`. | ||||||||||||||||||
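
A minimal sketch of how the detected `config.json` fields could be combined into a single `deployment.extra_args` string. The function name `build_extra_args` and the `trust_confirmed` parameter are hypothetical and not part of NEL. Gating `--trust-remote-code` on explicit user confirmation follows the reviewer suggestion in this thread, not the table as written in the diff.

```python
import shlex


def build_extra_args(model_config: dict, trust_confirmed: bool = False,
                     fallback_max_len: int = 32768) -> str:
    """Build a vLLM extra_args string from a parsed config.json (hypothetical helper)."""
    args = []
    # Prefer the checkpoint's context length; use the recipe default only as a fallback.
    max_len = model_config.get("max_position_embeddings") or fallback_max_len
    args += ["--max-model-len", str(max_len)]
    # auto_map signals custom model code; add the flag only after the user confirms.
    if "auto_map" in model_config and trust_confirmed:
        args.append("--trust-remote-code")
    return shlex.join(args)


# Illustrative config.json content; the auto_map value is made up.
checkpoint_config = {
    "max_position_embeddings": 131072,
    "auto_map": {"AutoConfig": "configuration_example.ExampleConfig"},
}

print(build_extra_args(checkpoint_config))
print(build_extra_args(checkpoint_config, trust_confirmed=True))
print(build_extra_args({}))
```

Flags discovered on the model card (reasoning parser, tool-call parser, custom vLLM flags) would be appended to the same list before joining.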
|
|
||||||||||||||||||
| **Quantization-aware benchmark defaults:** | ||||||||||||||||||
|
|
||||||||||||||||||
| When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include. | ||||||||||||||||||
|
|
@@ -218,7 +249,13 @@ ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null" | |||||||||||||||||
|
|
||||||||||||||||||
| Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run. | ||||||||||||||||||
|
|
||||||||||||||||||
| **Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands. | ||||||||||||||||||
| **Important**: Export required environment variables based on your config. If any tokens or keys are missing, point the user to `recipes/env.example` — it lists all possible keys with notes on which tasks need them. Ask the user to copy it, fill in their keys, and source it: | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| cp recipes/env.example .env | ||||||||||||||||||
| # Edit .env with your keys | ||||||||||||||||||
| set -a && source .env && set +a | ||||||||||||||||||
| ``` | ||||||||||||||||||
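
Before running `nel run`, it can help to confirm that the keys listed in the env file are actually exported. The sketch below is an illustration only: `parse_env_keys` and `missing_keys` are hypothetical helpers, and the sample text is a trimmed copy of `recipes/env.example` from this PR.

```python
import os


def parse_env_keys(text: str) -> list:
    """Return the variable names of all uncommented KEY=VALUE lines."""
    keys = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.append(line.split("=", 1)[0].strip())
    return keys


def missing_keys(required: list, environ=None) -> list:
    """Return the required keys that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [key for key in required if not environ.get(key)]


ENV_EXAMPLE = """\
# Required for all tasks (model/dataset downloads)
HF_TOKEN=hf_...
# Required for nemo_skills.* tasks (dummy value, not a real key)
DUMMY_API_KEY=dummy
# Required for NEL pre_cmd execution
NEMO_EVALUATOR_TRUST_PRE_CMD=1
# JUDGE_API_KEY=
"""

required = parse_env_keys(ENV_EXAMPLE)
print(missing_keys(required, {"HF_TOKEN": "hf_example"}))
```

Commented-out optional keys such as `JUDGE_API_KEY` are skipped, so only the keys the user has uncommented are treated as required.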
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| # If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands): | ||||||||||||||||||
|
|
||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| # Evaluation API Keys | ||
| # | ||
| # Copy this file and fill in the keys you need: | ||
| # cp recipes/env.example .env | ||
| # # Edit .env with your keys | ||
| # set -a && source .env && set +a | ||
| # | ||
| # Not all keys are required — only fill in what your tasks need. | ||
|
|
||
| # Required for all tasks (model/dataset downloads) | ||
| HF_TOKEN=hf_... | ||
|
|
||
| # Required for nemo_skills.* tasks (dummy value, not a real key) | ||
| DUMMY_API_KEY=dummy | ||
|
|
||
| # Required for NEL pre_cmd execution | ||
| NEMO_EVALUATOR_TRUST_PRE_CMD=1 | ||
|
|
||
| # --- Optional: task-specific keys --- | ||
|
|
||
| # AIME 2025 (simple_evals variant only, not ns_aime2025) | ||
| # JUDGE_API_KEY= | ||
|
|
||
| # tau2_bench_telecom (LLM judge) | ||
| # JUDGE_API_KEY_NVDEV_QWEN235B= | ||
|
|
||
| # terminal-bench-hard (AWS sandbox) | ||
| # AWS_ACCESS_KEY_ID= | ||
| # AWS_SECRET_ACCESS_KEY= |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,122 @@ | ||
| # Example: Quantization Validation Suite | ||
|
Collaborator
NEL config YAML may change quite frequently. Is this YAML for demo purposes or for day-0 model evals (internal usage)? Also, some of the evals require a pinned eval Docker image and specific settings for apples-to-apples comparison.
Contributor
Author
These recipes are for demo purposes: they are task snippets that the agent composes into a working config. If NEL configs change and something breaks, the agent will diagnose and fix the incompatibility at runtime.
Contributor
Author
For evals that require pinned Docker images and specific settings for apples-to-apples comparison, users can override the container. |
||
| # | ||
| # A balanced set of benchmarks for validating quantized model quality. | ||
| # Copy this file and customize for your needs. | ||
| # Task snippets in recipes/tasks/ define per-task configs — the agent | ||
| # composes them into a runnable config like this one. | ||
| # | ||
| # Includes: | ||
| # - MMLU-Pro (knowledge, chat) | ||
| # - GPQA Diamond (reasoning, chat, 5 repeats) | ||
| # - LiveCodeBench v6 (code, chat, 3 repeats) | ||
| # - IFBench (instruction following, chat, 8 repeats) | ||
| # | ||
| # Usage: | ||
| # nel run --config recipes/examples/example_eval.yaml \ | ||
| # -o deployment.checkpoint_path=/path/to/quantized/checkpoint \ | ||
| # -o deployment.served_model_name=my-model-nvfp4 \ | ||
| # -o execution.hostname=<slurm_host> \ | ||
| # -o execution.account=<slurm_account> \ | ||
| # -o execution.output_dir=/path/to/output | ||
| # | ||
| # For quantized checkpoints, also add the quantization flag: | ||
| # -o 'deployment.extra_args=--max-model-len 32768 --trust-remote-code --quantization modelopt_fp4' | ||
| # | ||
| # Run a single task: | ||
| # nel run --config ... -t ns_gpqa | ||
| # | ||
| # Smoke test (2 samples): | ||
| # nel run --config ... -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=2 | ||
| defaults: | ||
| - execution: slurm/default | ||
| - deployment: vllm | ||
| - _self_ | ||
| execution: | ||
| hostname: ??? | ||
| username: ${oc.env:USER} | ||
| account: ??? | ||
| output_dir: ??? | ||
| walltime: "04:00:00" | ||
| mounts: | ||
| mount_home: false | ||
| deployment: | ||
| env_vars: | ||
| HF_TOKEN: host:HF_TOKEN | ||
| checkpoint_path: ??? | ||
| hf_model_handle: | ||
| served_model_name: ??? | ||
| tensor_parallel_size: 1 | ||
| data_parallel_size: 1 | ||
| # For models with custom code, add: --trust-remote-code | ||
| extra_args: --max-model-len 32768 | ||
|
Contributor
Can we let the agent decide the extra_args required based on the model card/config of the checkpoint? e.g., model-len, tool-call-parser, reasoning-parser ...
Contributor
Author
I've added a section for auto-detecting deployment settings from checkpoint. |
||
| evaluation: | ||
| env_vars: | ||
| HF_TOKEN: host:HF_TOKEN | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| request_timeout: 3600 | ||
| max_retries: 10 | ||
| parallelism: 16 | ||
| target: | ||
| api_endpoint: | ||
| api_key_name: DUMMY_API_KEY | ||
| tasks: | ||
| # Knowledge (chat endpoint, short) | ||
| - name: ns_mmlu_pro | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| extra: | ||
| num_repeats: 1 | ||
| args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens | ||
|
|
||
| # Reasoning (chat endpoint, 5 repeats, short) | ||
| - name: ns_gpqa | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| extra: | ||
| args: ++prompt_config=eval/aai/mcq-4choices | ||
| num_repeats: 5 | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens | ||
|
|
||
| # Code (chat endpoint, 3 repeats, medium) | ||
| - name: ns_livecodebench | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| extra: | ||
| dataset_split: test_v6_2408_2505 | ||
| num_repeats: 3 | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens | ||
|
|
||
| # Instruction following (chat endpoint, 8 repeats, super short) | ||
| - name: ns_ifbench | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| extra: | ||
| num_repeats: 8 | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens | ||
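
The `???` placeholders in this template must all be filled in or overridden before the run. A minimal sketch of a pre-flight check follows, assuming the YAML has already been parsed into nested dicts and lists; `find_unset` is a hypothetical helper, not an NEL function.

```python
def find_unset(node, prefix: str = "") -> list:
    """Return dotted paths of every value that is still the '???' placeholder."""
    unset = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            unset.extend(find_unset(value, path))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            unset.extend(find_unset(item, f"{prefix}[{index}]"))
    elif node == "???":
        unset.append(prefix)
    return unset


# Trimmed stand-in for the parsed example config above.
parsed = {
    "execution": {"hostname": "???", "account": "???", "output_dir": "???",
                  "walltime": "04:00:00"},
    "deployment": {"checkpoint_path": "???", "served_model_name": "???",
                   "tensor_parallel_size": 1},
}

for path in find_unset(parsed):
    print(f"missing: {path}")
```

Each reported path matches the key syntax used by the `-o` overrides in the usage comment, for example `-o execution.hostname=<slurm_host>`.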
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| # AIME 2025 (NeMo Skills, chat) | ||
|
Collaborator
Are these task configurations fixed or just examples?
Contributor
Author
These are preconfigured settings recommended by internal benchmarks, but users are free to modify the configs.
Collaborator
OK, if that's the case, I feel we can move these out of `.claude` into a top-level location for evaluations using NEL, because these settings are not just for the agents. |
||
| # Primary metric: pass@1[avg-of-16] symbolic_correct | ||
| # Run time: Long (reasoning models generate lengthy thinking traces) | Repeats: 16 | ||
| # Note: The AA variant (simple_evals.AIME_2025) requires JUDGE_API_KEY. | ||
| # This NeMo Skills variant uses symbolic scoring — no external API keys needed. | ||
| - name: ns_aime2025 | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| request_timeout: 100000 | ||
| max_retries: 10 | ||
| extra: | ||
| num_repeats: 16 | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens | ||
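
The metric name `pass@1[avg-of-16]` is read here as the mean of per-repeat pass@1 accuracy over 16 independent repeats. That reading is an assumption based on the name and the `num_repeats` setting, not something this PR states. Under that assumption, the arithmetic can be sketched as follows, with a made-up 3-repeat, 4-problem result grid:

```python
def pass_at_1_avg_of_k(results: list) -> float:
    """Average per-repeat accuracy, where results[r][i] is True if problem i
    was solved in repeat r (assumed interpretation of pass@1[avg-of-k])."""
    per_repeat = [sum(run) / len(run) for run in results]
    return sum(per_repeat) / len(per_repeat)


# Made-up data: 3 repeats over 4 problems.
grid = [
    [True, True, False, False],   # repeat 1: 2/4 correct
    [True, False, False, False],  # repeat 2: 1/4 correct
    [True, True, True, False],    # repeat 3: 3/4 correct
]

score = pass_at_1_avg_of_k(grid)
print(f"pass@1[avg-of-{len(grid)}] = {score:.3f}")
```

More repeats reduce the variance of this average, which is presumably why small benchmarks such as AIME use a higher `num_repeats` than large ones such as MMLU-Pro.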
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| # GPQA Diamond (NeMo Skills, chat) | ||
| # Primary metric: pass@1[avg-of-5] symbolic_correct | ||
| # Run time: Short | Repeats: 5 | ||
| - name: ns_gpqa | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| extra: | ||
| args: ++prompt_config=eval/aai/mcq-4choices | ||
| num_repeats: 5 | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| # IFBench (NeMo Skills, chat) | ||
| # Primary metric: pass@1[avg-of-8] prompt_strict_accuracy | ||
| # Run time: Super Short | Repeats: 8 | ||
| - name: ns_ifbench | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| extra: | ||
| num_repeats: 8 | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| # LiveCodeBench v6 (NeMo Skills, chat) | ||
| # Primary metric: pass@1[avg-of-3] accuracy | ||
| # Run time: Medium | Repeats: 3 | ||
| - name: ns_livecodebench | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| max_retries: 10 | ||
| extra: | ||
| dataset_split: test_v6_2408_2505 | ||
| num_repeats: 3 | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| # MMLU-Pro (NeMo Skills, chat) | ||
| # Primary metric: symbolic_correct | ||
| # Run time: Short | Repeats: 1 | ||
| - name: ns_mmlu_pro | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| extra: | ||
| num_repeats: 1 | ||
| args: ++prompt_config=eval/aai/mcq-10choices-boxed ++inference.tokens_to_generate=null | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| # SciCode (NeMo Skills, chat) | ||
| # Primary metric: pass@1[avg-of-3] subtask_accuracy | ||
| # Run time: Long | Repeats: 3 | ||
| - name: ns_scicode | ||
| nemo_evaluator_config: | ||
| config: | ||
| params: | ||
| max_retries: 10 | ||
| extra: | ||
| num_repeats: 3 | ||
| target: | ||
| api_endpoint: | ||
| adapter_config: | ||
| params_to_remove: | ||
| - max_new_tokens | ||
| - max_completion_tokens |
We need to think some about how we can merge the public skills with internal skills & configs. I'll ping offline.
Happy to discuss offline. The split is intentional for now (public = demo task snippets, internal = pinned numerics configs), but we should align on a unified structure long-term.