feat(okr30): add EAGLE3 Claude Code skills for triage, validation, and new-model support #1429
File: `.claude/skills/eagle3-new-model/SKILL.md`
---
name: eagle3-new-model
description: >
  Add a new model to the EAGLE3 offline pipeline. Generates an hf_offline_eagle3.yaml
  launcher config for a new model checkpoint, choosing the right hidden state dump
  backend (TRT-LLM / HF / vLLM) and GPU configuration.
  Use when the user wants to run EAGLE3 on a model that does not yet have a YAML in
  tools/launcher/examples/ or asks how to configure the pipeline for a new checkpoint.
user_invocable: true
---
# EAGLE3 New Model Configuration

This skill guides you through creating `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`
for a new model.
## Step 1 — Look up the model architecture

Determine these values from the HuggingFace model card, `config.json`, and vLLM docs:
| Property | Where to find it |
|---|---|
| Total / active parameters | Model card |
| Dense or MoE? | `config.json` → `num_experts`, `num_experts_per_tok` |
| Attention type (MHA / GQA / MLA / SWA) | Model card |
| Multimodal? (vision encoder) | Model card |
| BF16 weight size (GB) | `total_params × 2 bytes` |
| Special serving flags | vLLM docs, model README (`--trust-remote-code`, parsers) |
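A quick way to pull these fields out of a downloaded `config.json` (a sketch only: key names such as `num_experts` vs. `n_routed_experts` vary by architecture, and the inline sample file is purely illustrative):

```shell
# Write a tiny sample config; in practice, point cfg at the checkpoint's real config.json.
printf '{"num_experts": 64, "num_key_value_heads": 8}' > /tmp/sample_config.json
cfg=/tmp/sample_config.json

python3 - "$cfg" <<'EOF'
import json, sys

c = json.load(open(sys.argv[1]))
experts = c.get("num_experts") or c.get("n_routed_experts")
print("moe" if experts else "dense")               # MoE vs. dense hint
print("kv_heads:", c.get("num_key_value_heads"))   # GQA hint
print("sliding_window:", c.get("sliding_window"))  # SWA hint
EOF
```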

## Step 2 — Calculate GPU requirements (OCI-HSG / GB200)

OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node**
```text
BF16 weight size = total_params × 2 bytes
GPUs needed      = ceil(weight_size_GB / 192)
nodes            = ceil(gpus_needed / 4)
tp               = min(gpus_needed, 4)   # minimum; the example configs allocate the full node (tp = 4)
```

| Model | Weights | GPUs | nodes | tp |
|---|---|---|---|---|
| 8B dense | ~16 GB | 1 | 1 | 4 |
| 70B dense | ~140 GB | 1 | 1 | 4 |
| 685B MoE | ~340 GB | 2 | 1 | 4 |
| 1T MoE | ~595 GB | 4 | 1 | 4 |
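The sizing formula can be checked with plain shell arithmetic. This is a sketch of the minimum-resource calculation for a BF16 checkpoint (`ceil(a/b)` is `(a + b - 1) / b` in integer math); real configs may round `tp` up to the full node:

```shell
# Minimum GPUs/nodes/tp for a BF16 checkpoint, following the formula above.
size_model() {  # usage: size_model <total_params_in_billions>
  local params_b=$1
  local weight_gb=$(( params_b * 2 ))        # BF16 = 2 bytes per parameter
  local gpus=$(( (weight_gb + 191) / 192 ))  # ceil over 192 GB HBM per GPU
  local nodes=$(( (gpus + 3) / 4 ))          # ceil over 4 GPUs per node
  local tp=$(( gpus < 4 ? gpus : 4 ))
  echo "${weight_gb}GB gpus=${gpus} nodes=${nodes} tp=${tp}"
}

size_model 8    # → 16GB gpus=1 nodes=1 tp=1
size_model 70   # → 140GB gpus=1 nodes=1 tp=1
```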

## Step 3 — Choose the hidden state dump backend
| Backend | Script | When to use |
|---------|--------|-------------|
| vLLM | `common/eagle3/dump_offline_data_vllm.sh` | Default; broad coverage via vLLM + speculators |
| HF | `common/eagle3/dump_offline_data_hf.sh` | VLMs, custom-code models, SWA attention |
| TRT-LLM | `common/eagle3/dump_offline_data.sh` | Pure-text models with TRT-LLM support (needs `--tp`/`--moe-ep`) |

Use **HF** when the model is a VLM or uses sliding window attention (TRT-LLM does not support these).
Use **vLLM**, the default, for everything else.
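The decision rule above can be sketched as a small helper (illustrative only; the trait flags are hypothetical, but the script paths are the ones from the table):

```shell
# Map a model trait to the hidden-state dump script, per the table above.
choose_dump_backend() {  # usage: choose_dump_backend [vlm|swa|trtllm|<anything else>]
  case "$1" in
    vlm|swa) echo "common/eagle3/dump_offline_data_hf.sh" ;;    # TRT-LLM lacks support
    trtllm)  echo "common/eagle3/dump_offline_data.sh" ;;       # needs --tp / --moe-ep
    *)       echo "common/eagle3/dump_offline_data_vllm.sh" ;;  # default
  esac
}

choose_dump_backend vlm    # → common/eagle3/dump_offline_data_hf.sh
choose_dump_backend dense  # → common/eagle3/dump_offline_data_vllm.sh
```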

## Step 4 — Write the YAML

Create `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`.
Use an existing config as a reference (e.g., `tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml`).

### Header comment

```yaml
# EAGLE3 offline speculative decoding pipeline for <org>/<model>.
#
# <Model> is a <size> <dense|MoE> model. <brief notes: attention type, special reqs>
# BF16 weights ~<size> GB — fits on <N> GB200 node(s) (<N> × 192 GB).
#
# <Special requirements (if any)>
#
# 4-step pipeline:
#   task_0: Data synthesis — query vLLM server to generate prompt samples
#   task_1: Dump hidden states — run target model to capture hidden states
#   task_2: Offline training — train the EAGLE3 draft head
#   task_3: Benchmark — evaluate speculative decoding speedup via vLLM
#
# Usage:
#   uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes
#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes

job_name: <Model>_EAGLE3_offline
pipeline:
  allow_to_fail: false
  skip: false
  note:

global_vars:
  hf_model: /hf-local/<org>/<model>
```

### task_0 — Data synthesis (`common/vllm/query.sh`)

Args before `--` go to the vLLM server; args after `--` go to `query.py`.
```yaml
task_0:
  script: common/vllm/query.sh
  args:
    - --model <<global_vars.hf_model>>
    - --tensor-parallel-size <TP>
    - --trust-remote-code   # add only if required
    - --                    # separator
    - --data /hf-local/modelopt/Speculative-Decoding-Dataset-v2-default
    - --save /scratchspace/data
  environment:
    - HF_LOCAL: /hf-local
  slurm_config:
    _factory_: "slurm_factory"
    nodes: <nodes>
    ntasks_per_node: 1
    gpus_per_node: 4
    container: vllm/vllm-openai:latest
```

### task_1 — Hidden states (vLLM backend, default)
```yaml
task_1:
  script: common/eagle3/dump_offline_data_vllm.sh
  args:
    - --input-data /scratchspace/data
    - --output-dir /scratchspace/offline_hidden_states
    - --max-seq-len 8192
  environment:
    - HF_MODEL_CKPT: <<global_vars.hf_model>>
  slurm_config:
    _factory_: "slurm_factory"
    nodes: <nodes>
    ntasks_per_node: 1
    gpus_per_node: 4
    container: vllm/vllm-openai:latest
```

For the **HF backend** (VLMs, SWA models), use `dump_offline_data_hf.sh` instead — same args, no TP flags needed.

For the **TRT-LLM backend**, use `dump_offline_data.sh` and add `--tp <TP>` and `--moe-ep 1` (or an appropriate EP).

### task_2 — Offline training (`common/eagle3/train_eagle.sh`)
```yaml
task_2:
  script: common/eagle3/train_eagle.sh
  args:
    - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
    - model.model_name_or_path=<<global_vars.hf_model>>
    - data.offline_data_path=/scratchspace/offline_hidden_states
    - training.output_dir=/scratchspace/eagle3
    - training.training_seq_len=4096
    - training.disable_tqdm=true
    - training.ar_validate_steps=500000
  slurm_config:
    _factory_: "slurm_factory"
    nodes: 1
    ntasks_per_node: 1
    gpus_per_node: 4
    container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
```

> **MoE note:** For MoE models with large per-expert hidden dims, consider increasing
> `intermediate_size` in `eagle_config.json` to match the model's `moe_intermediate_size`.
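For example, if the target's `config.json` reports `"moe_intermediate_size": 2048` (a hypothetical value), the draft head's `eagle_config.json` could carry the matching override:

```json
{
  "intermediate_size": 2048
}
```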

### task_3 — Benchmark (`common/specdec_bench/quick_check.sh`)
```yaml
task_3:
  script: common/specdec_bench/quick_check.sh
  args:
    - --draft_model_dir /scratchspace/export
    - --draft_length 3
    - --output_length 4096
    - --engine VLLM
    - --tp_size <TP>
    - --ep_size 1
    - --speculative_algorithm EAGLE3
    - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl
    - --concurrency 1
  environment:
    - HF_LOCAL: /hf-local
    - HF_MODEL_CKPT: <<global_vars.hf_model>>
  slurm_config:
    _factory_: "slurm_factory"
    nodes: <nodes>
    ntasks_per_node: 1
    gpus_per_node: 4
    container: vllm/vllm-openai:latest
```

## Step 5 — Common model-specific adjustments
| Situation | What to change |
|---|---|
| Requires `--trust-remote-code` | Add to task_0 vLLM args (before `--`) |
| VLM / multimodal | Use `dump_offline_data_hf.sh` for task_1 |
| Sliding window attention | Use `dump_offline_data_hf.sh` or `_vllm.sh` for task_1 |
| MoE with large expert hidden dim | Increase `intermediate_size` in `eagle_config.json` |
| Non-standard attention (MLA) | Verify `eagle_decoder_type` in the eagle3 recipe YAML |
| Custom tokenizer (e.g., tiktoken) | Set the `TIKTOKEN_RS_CACHE_DIR` env var in task_0 and task_1 |
| NVFP4 quant model | task_0/task_3 use the quant container; task_1/task_2 use the BF16 base model — add an `hf_model_bf16` global_var |
| Model needs `trust_remote_code` at benchmark | Add `--trust-remote-code` to task_3 args |

## Step 6 — Test with a dry run

Preview the resolved config before submitting:
```bash
uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --dryrun --yes -v
```

## Step 7 — Update triage chart

After adding a new model, add a row to the test matrix in
`tools/launcher/examples/EAGLE3_TRIAGE.md` with status 🔲 (not yet tested).
Fill in results after running.

File: `.claude/skills/eagle3-review-logs/SKILL.md`
---
name: eagle3-review-logs
description: >
  Review EAGLE3 pipeline experiment logs from the launcher's experiments/ directory.
  Summarizes pass/fail status for all 4 tasks, diagnoses failures with root causes
  and fixes, and flags warnings. Use when the user asks to review job logs,
  check experiment results, or diagnose why a specific task failed.
user_invocable: true
---

# Review EAGLE3 Experiment Logs

Analyze output logs from an EAGLE3 pipeline run launched via `launch.py` or `slurm.py`.
## Step 0 — Find experiment logs

Locate the experiment directory. The default is `experiments/` relative to the launcher root,
or wherever `--job-dir` was pointed.
```bash
ls -td experiments/cicd/cicd_* | head -10
```
Each experiment has one subdirectory per task (0–3). Logs are `sbatch_*.out` files inside:

```bash
find experiments/<exp_id>/ -name "sbatch_*.out" | sort
```

Do this in a single Bash call. If no experiments exist, ask the user for the directory.

## Step 1 — Read all task logs

Read the last 200 lines of each log; errors usually appear near the end:
```bash
for f in $(find experiments/<exp_id>/ -name "sbatch_*.out" | sort); do
  echo "=== $f ==="; tail -200 "$f"; echo
done
```

## Step 2 — Analyze

For each task log, check:
- **Exit / cancellation**: `DUE TO TIME LIMIT`, `FAILED`, signal (e.g., `signal 15`)
- **Python exceptions / tracebacks**: the last exception is usually the root cause
- **CUDA errors**: OOM, NCCL timeout
- **Slurm state**: COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY
- **Success indicators**: "Saved N samples", "Successfully processed N conversations", training loss line, AR output
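These checks can be scripted as a first pass. A sketch against a fabricated sample log (the grep patterns are the failure signals listed above; real logs will need human review of the context around each hit):

```shell
# Build a small fake log, then grep it for the failure signals listed above.
LOG=/tmp/sample_sbatch.out
cat > "$LOG" <<'EOF'
Saved 3277 samples
slurmstepd: error: *** JOB 123 CANCELLED AT 2025-01-01T00:00:00 DUE TO TIME LIMIT ***
EOF

grep -nE 'DUE TO TIME LIMIT|FAILED|Traceback|CUDA out of memory|NCCL' "$LOG" \
  || echo "no failure markers"
```

Here the grep reports a hit on line 2, so the task would be classified as TIMEOUT.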

## Step 3 — Produce report

Output a structured markdown report:
### Summary

- Overall status: PASSED / FAILED / MIXED / PARTIAL
- Task breakdown: e.g., task_0 TIMEOUT, task_1 FAIL, task_2 skipped, task_3 skipped
### Task Results

For each task (0–3):
**Task N — \<name\>: PASS / FAIL / TIMEOUT**
- Key output: (e.g., "3277/3295 samples generated" or "Script not found")
- Error (if failed): quoted error message, max 10 lines
- Root cause: one-line diagnosis
- Suggested fix: actionable step

### Warnings

Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput).

## Step 4 — Suggest next steps

Based on results:
- If a task failed due to a known issue, suggest the fix and how to re-run from that task:

  ```bash
  uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml \
    pipeline.task_0.skip=true \
    --yes
  ```

- If the failure pattern is new (not in `tools/launcher/examples/EAGLE3_TRIAGE.md`),
  suggest adding it to the triage chart using `/eagle3-triage` guidance.

- If all tasks passed, suggest running `/eagle3-validate` to confirm AR meets threshold.

## Known benign patterns (do NOT mark as failures)
| Pattern | Explanation |
|---|---|
| vLLM server exit code 143 | SIGTERM — server was killed after queries completed. Expected. |
| `CANCELLED AT ... DUE TO TASK FAILURE` after `exit code: 0` | Slurm cleanup of worker nodes after the main task succeeded. |
| `destroy_process_group() was not called` | Benign PyTorch shutdown warning. |
| `tokenizer class ... not equal to the registered tokenizer class` | Harmless tokenizer mismatch warning. |
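One way to apply this table mechanically is to filter the benign lines out before counting real errors. A sketch with a fabricated log (the allow-list regex mirrors the table above and is not exhaustive):

```shell
# A log whose only "errors" are known-benign shutdown noise.
LOG=/tmp/sample_clean.out
cat > "$LOG" <<'EOF'
srun: error: node1: task 0: Exited with exit code 143
slurmstepd: error: *** STEP 9.0 CANCELLED AT 2025-01-01 DUE TO TASK FAILURE ***
WARNING: destroy_process_group() was not called
EOF

# Count error-like lines that survive the benign-pattern allow-list.
real_errors=$(grep -iE 'error|fail|traceback' "$LOG" \
  | grep -vE 'exit code 143|DUE TO TASK FAILURE|destroy_process_group|registered tokenizer class' \
  | wc -l)
echo "real errors: $real_errors"   # → real errors: 0
```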