215 changes: 215 additions & 0 deletions .claude/skills/eagle3-new-model/SKILL.md
@@ -0,0 +1,215 @@
---
name: eagle3-new-model
description: >
  Add a new model to the EAGLE3 offline pipeline. Generates an hf_offline_eagle3.yaml
  launcher config for a new model checkpoint, choosing the right hidden state dump
  backend (TRT-LLM / HF / vLLM) and GPU configuration.
  Use when user wants to run EAGLE3 on a model that does not yet have a YAML in
  tools/launcher/examples/ or asks how to configure the pipeline for a new checkpoint.
---
Comment on lines +1 to +9
**Contributor** commented:

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

**Missing `user_invocable: true` in skill frontmatter**

This skill is described as user-invocable in the PR objective, but the metadata omits the flag; without it, the skill may not be directly callable.
Suggested fix:

```diff
 ---
 name: eagle3-new-model
 description: >
   Add a new model to the EAGLE3 offline pipeline. Generates an hf_offline_eagle3.yaml
   launcher config for a new model checkpoint, choosing the right hidden state dump
   backend (TRT-LLM / HF / vLLM) and GPU configuration.
   Use when user wants to run EAGLE3 on a model that does not yet have a YAML in
   tools/launcher/examples/ or asks how to configure the pipeline for a new checkpoint.
+user_invocable: true
 ---
```


# EAGLE3 New Model Configuration

This skill guides you through creating `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`
for a new model.

## Step 1 — Look up the model architecture

Determine these values from the HuggingFace model card, `config.json`, and vLLM docs:

| Property | Where to find it |
|---|---|
| Total / active parameters | Model card |
| Dense or MoE? | `config.json` → `num_experts`, `num_experts_per_tok` |
| Attention type (MHA / GQA / MLA / SWA) | Model card |
| Multimodal? (vision encoder) | Model card |
| BF16 weight size (GB) | `total_params × 2 bytes` |
| Special serving flags | vLLM docs, model README (`--trust-remote-code`, parsers) |

## Step 2 — Calculate GPU requirements (OCI-HSG / GB200)

OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node**

```text
BF16 weight size = total_params × 2 bytes
gpus_needed      = ceil(weight_size_GB / 192)
nodes            = ceil(gpus_needed / 4)
tp               = nodes × 4    # nodes are allocated whole, so use every GPU
```

(A runnable version of this arithmetic follows the table below.)

| Model | Weights on disk | GPUs | nodes | tp |
|---|---|---|---|---|
| 8B dense (BF16) | ~16 GB | 1 | 1 | 4 |
| 70B dense (BF16) | ~140 GB | 1 | 1 | 4 |
| 685B MoE (quantized) | ~340 GB | 2 | 1 | 4 |
| 1T MoE (quantized) | ~595 GB | 4 | 1 | 4 |

> **Note:** the MoE rows reflect quantized (FP8/NVFP4) checkpoints as commonly served; at BF16, a 685B model would weigh ~1370 GB.
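
The arithmetic is mechanical enough to script. A minimal sketch — the function name and the whole-node, zero-KV-headroom assumptions are mine, not the pipeline's:

```python
import math

def gb200_sizing(total_params_b: float, bytes_per_param: float = 2.0,
                 hbm_gb: int = 192, gpus_per_node: int = 4) -> dict:
    """Rough EAGLE3 sizing for GB200 nodes; reserves no KV-cache headroom."""
    weight_gb = total_params_b * bytes_per_param           # 70B * 2 bytes -> 140 GB
    gpus_needed = max(1, math.ceil(weight_gb / hbm_gb))
    nodes = math.ceil(gpus_needed / gpus_per_node)
    return {"weight_gb": weight_gb, "gpus_needed": gpus_needed,
            "nodes": nodes, "tp": nodes * gpus_per_node}   # nodes allocated whole

print(gb200_sizing(70))         # {'weight_gb': 140.0, 'gpus_needed': 1, 'nodes': 1, 'tp': 4}
print(gb200_sizing(1000, 0.5))  # ~500 GB at NVFP4 -> 1 node, tp=4
```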

## Step 3 — Choose the hidden state dump backend

| Backend | Script | When to use |
|---------|--------|-------------|
| vLLM | `common/eagle3/dump_offline_data_vllm.sh` | Default; broad coverage via vLLM + speculators |
| HF | `common/eagle3/dump_offline_data_hf.sh` | VLMs, custom-code models, SWA attention |
| TRT-LLM | `common/eagle3/dump_offline_data.sh` | Pure-text models with TRT-LLM support (needs `--tp`/`--moe-ep`) |

Use **HF** when the model is a VLM or uses sliding window attention (TRT-LLM supports neither).
Use **vLLM** as the default for everything else.
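
If you want to automate the choice, here is a heuristic sketch; the `vision_config` and `sliding_window` keys are common HF `config.json` conventions but not universal, so treat the result as a starting point and confirm against the model card:

```python
import json

def pick_dump_backend(config_path: str) -> str:
    """Heuristic task_1 backend choice from a model's config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    is_vlm = "vision_config" in cfg                     # multimodal models carry a vision sub-config
    has_swa = cfg.get("sliding_window") is not None     # sliding window attention
    if is_vlm or has_swa:
        return "common/eagle3/dump_offline_data_hf.sh"  # TRT-LLM path supports neither
    return "common/eagle3/dump_offline_data_vllm.sh"    # default
```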

## Step 4 — Write the YAML

Create `tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml`.
Use an existing config as a reference (e.g., `tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml`).

### Header comment

```yaml
# EAGLE3 offline speculative decoding pipeline for <org>/<model>.
#
# <Model> is a <size> <dense|MoE> model. <brief notes: attention type, special reqs>
# BF16 weights ~<size> GB — fits on <N> GB200 node(s) (<N> × 192 GB).
#
# <Special requirements (if any)>
#
# 4-step pipeline:
# task_0: Data synthesis — query vLLM server to generate prompt samples
# task_1: Dump hidden states — run target model to capture hidden states
# task_2: Offline training — train the EAGLE3 draft head
# task_3: Benchmark — evaluate speculative decoding speedup via vLLM
#
# Usage:
# uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes
# uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/<Org>/<Model>/hf_offline_eagle3.yaml --yes

job_name: <Model>_EAGLE3_offline
pipeline:
  allow_to_fail: false
  skip: false
  note:

global_vars:
  hf_model: /hf-local/<org>/<model>
```

### task_0 — Data synthesis (`common/vllm/query.sh`)

Args before `--` go to the vLLM server; args after `--` go to `query.py`.

```yaml
task_0:
  script: common/vllm/query.sh
  args:
    - --model <<global_vars.hf_model>>
    - --tensor-parallel-size <TP>
    - --trust-remote-code   # add only if required
    - --                    # separator
    - --data /hf-local/modelopt/Speculative-Decoding-Dataset-v2-default
    - --save /scratchspace/data
  environment:
    - HF_LOCAL: /hf-local
  slurm_config:
    _factory_: "slurm_factory"
    nodes: <nodes>
    ntasks_per_node: 1
    gpus_per_node: 4
    container: vllm/vllm-openai:latest
```

### task_1 — Hidden states (vLLM backend, default)

```yaml
task_1:
  script: common/eagle3/dump_offline_data_vllm.sh
  args:
    - --input-data /scratchspace/data
    - --output-dir /scratchspace/offline_hidden_states
    - --max-seq-len 8192
  environment:
    - HF_MODEL_CKPT: <<global_vars.hf_model>>
  slurm_config:
    _factory_: "slurm_factory"
    nodes: <nodes>
    ntasks_per_node: 1
    gpus_per_node: 4
    container: vllm/vllm-openai:latest
```

For **HF backend** (VLMs, SWA models), use `dump_offline_data_hf.sh` instead — same args, no TP flags needed.

For **TRT-LLM backend**, use `dump_offline_data.sh` and add `--tp <TP>` and `--moe-ep 1` (or appropriate EP).

### task_2 — Offline training (`common/eagle3/train_eagle.sh`)

```yaml
task_2:
script: common/eagle3/train_eagle.sh
args:
- --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
- model.model_name_or_path=<<global_vars.hf_model>>
- data.offline_data_path=/scratchspace/offline_hidden_states
- training.output_dir=/scratchspace/eagle3
- training.training_seq_len=4096
- training.disable_tqdm=true
- training.ar_validate_steps=500000
slurm_config:
_factory_: "slurm_factory"
nodes: 1
ntasks_per_node: 1
gpus_per_node: 4
container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
```

> **MoE note:** For MoE models with large per-expert hidden dims, consider increasing
> `intermediate_size` in `eagle_config.json` to match the model's `moe_intermediate_size`.
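
A hypothetical helper for that adjustment — the `eagle_config.json` key names here are assumptions on my part, so check the actual draft-config schema before applying:

```python
import json

def widen_draft_ffn(target_config: str, eagle_config: str) -> None:
    """Raise the draft head's intermediate_size to the target's moe_intermediate_size."""
    target = json.load(open(target_config))
    moe_dim = target.get("moe_intermediate_size")
    if moe_dim is None:
        return  # dense model: nothing to do
    eagle = json.load(open(eagle_config))
    if eagle.get("intermediate_size", 0) < moe_dim:
        eagle["intermediate_size"] = moe_dim
        with open(eagle_config, "w") as f:
            json.dump(eagle, f, indent=2)
```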

### task_3 — Benchmark (`common/specdec_bench/quick_check.sh`)

```yaml
task_3:
script: common/specdec_bench/quick_check.sh
args:
- --draft_model_dir /scratchspace/export
- --draft_length 3
- --output_length 4096
- --engine VLLM
- --tp_size <TP>
- --ep_size 1
- --speculative_algorithm EAGLE3
- --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl
- --concurrency 1
environment:
- HF_LOCAL: /hf-local
- HF_MODEL_CKPT: <<global_vars.hf_model>>
slurm_config:
_factory_: "slurm_factory"
nodes: <nodes>
ntasks_per_node: 1
gpus_per_node: 4
container: vllm/vllm-openai:latest
```

## Step 5 — Common model-specific adjustments

| Situation | What to change |
|---|---|
| Requires `--trust-remote-code` | Add to task_0 vLLM args (before `--`) |
| VLM / multimodal | Use `dump_offline_data_hf.sh` for task_1 |
| Sliding window attention | Use `dump_offline_data_hf.sh` or `_vllm.sh` for task_1 |
| MoE with large expert hidden dim | Increase `intermediate_size` in eagle_config.json |
| Non-standard attention (MLA) | Verify `eagle_decoder_type` in the eagle3 recipe YAML |
| Custom tokenizer (e.g., tiktoken) | Set `TIKTOKEN_RS_CACHE_DIR` env var in task_0 and task_1 |
| NVFP4 quant model | task_0/task_3 use quant container; task_1/task_2 use BF16 base model — add `hf_model_bf16` global_var |
| Model needs `trust_remote_code` at benchmark | Add `--trust-remote-code` to task_3 args |

## Step 6 — Test with dry run

Preview the resolved config before submitting:

```bash
uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml --dryrun --yes -v
```

## Step 7 — Update triage chart

After adding a new model, add a row to the test matrix in
`tools/launcher/examples/EAGLE3_TRIAGE.md` with status 🔲 (not yet tested).
Fill in results after running.
99 changes: 99 additions & 0 deletions .claude/skills/eagle3-review-logs/SKILL.md
@@ -0,0 +1,99 @@
---
name: eagle3-review-logs
description: >
  Review EAGLE3 pipeline experiment logs from the launcher's experiments/ directory.
  Summarizes pass/fail status for all 4 tasks, diagnoses failures with root causes
  and fixes, and flags warnings. Use when the user asks to review job logs,
  check experiment results, or diagnose why a specific task failed.
user_invocable: true
---

# Review EAGLE3 Experiment Logs

Analyze output logs from an EAGLE3 pipeline run launched via `launch.py` or `slurm.py`.

## Step 0 — Find experiment logs

Locate the experiment directory. The default is `experiments/` relative to the launcher root,
or wherever `--job-dir` was pointed.

```bash
ls -td experiments/cicd/cicd_* | head -10
```

Each experiment has one subdirectory per task (0–3). Logs are `sbatch_*.out` files inside:

```bash
find experiments/<exp_id>/ -name "sbatch_*.out" | sort
```

Do this in a single Bash call. If no experiments exist, ask the user for the directory.

## Step 1 — Read all task logs

Read the last 200 lines of each log in parallel. Errors appear at the end:

```bash
for f in $(find experiments/<exp_id>/ -name "sbatch_*.out" | sort); do
  echo "=== $f ==="; tail -200 "$f"; echo
done
```
Comment on lines +34 to +40
**Contributor** commented:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

**"Read in parallel" does not match the provided command**

The snippet is sequential, so the instruction is currently inaccurate. Either remove "in parallel" or update the command to actually parallelize the log tails.

Suggested doc fix (true parallel read):

```diff
-Read the last 200 lines of each log in parallel. Errors appear at the end:
+Read the last 200 lines of each log. Errors appear at the end:

-for f in $(find experiments/<exp_id>/ -name "sbatch_*.out" | sort); do
-  echo "=== $f ==="; tail -200 "$f"; echo
-done
+find experiments/<exp_id>/ -name "sbatch_*.out" | sort | \
+  xargs -I{} -P 8 sh -c 'echo "=== {} ==="; tail -200 "{}"; echo'
```


## Step 2 — Analyze

For each task log, check the following (a pattern-scan sketch follows the list):

- **Exit / cancellation**: `DUE TO TIME LIMIT`, `FAILED`, signal (e.g., `signal 15`)
- **Python exceptions / tracebacks**: last exception is usually the root cause
- **CUDA errors**: OOM, NCCL timeout
- **Slurm state**: COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY
- **Success indicators**: "Saved N samples", "Successfully processed N conversations", training loss line, AR output
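
A minimal sketch of that first pass; the patterns paraphrase the checklist, are not exhaustive, and the exact log wording varies by container:

```python
import re
from pathlib import Path

FAILURE_PATTERNS = [
    ("TIMEOUT",   re.compile(r"DUE TO TIME LIMIT")),
    ("OOM",       re.compile(r"CUDA out of memory|OUT_OF_MEMORY")),
    ("NCCL",      re.compile(r"NCCL.*(error|timed? ?out)", re.IGNORECASE)),
    ("EXCEPTION", re.compile(r"Traceback \(most recent call last\)")),
]

def triage_log(path: str, tail: int = 200) -> str:
    """Classify a task log from its last `tail` lines; 'PASS?' means no known failure signature."""
    text = "\n".join(Path(path).read_text(errors="replace").splitlines()[-tail:])
    for verdict, pattern in FAILURE_PATTERNS:
        if pattern.search(text):
            return verdict
    return "PASS?"  # still confirm success indicators before reporting PASS
```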

## Step 3 — Produce report

Output a structured markdown report:

### Summary

- Overall status: PASSED / FAILED / MIXED / PARTIAL
- Task breakdown: e.g., task_0 TIMEOUT, task_1 FAIL, task_2 skipped, task_3 skipped

### Task Results

For each task (0–3):

**Task N — \<name\>: PASS / FAIL / TIMEOUT**
- Key output: (e.g., "3277/3295 samples generated" or "Script not found")
- Error (if failed): quoted error message, max 10 lines
- Root cause: one-line diagnosis
- Suggested fix: actionable step

### Warnings

Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput).

## Step 4 — Suggest next steps

Based on results:

- If a task failed due to a known issue, suggest the fix and how to re-run from that task:

```bash
uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml \
  pipeline.task_0.skip=true \
  --yes
```

- If the failure pattern is new (not in `tools/launcher/examples/EAGLE3_TRIAGE.md`),
suggest adding it to the triage chart using `/eagle3-triage` guidance.

- If all tasks passed, suggest running `/eagle3-validate` to confirm AR meets threshold.

## Known benign patterns (do NOT mark as failures)

| Pattern | Explanation |
|---|---|
| vLLM server exit code 143 | SIGTERM — server was killed after queries completed. Expected. |
| `CANCELLED AT ... DUE TO TASK FAILURE` after `exit code: 0` | Slurm cleanup of worker nodes after main task succeeded. |
| `destroy_process_group() was not called` | Benign PyTorch shutdown warning. |
| `tokenizer class ... not equal to the registered tokenizer class` | Harmless tokenizer mismatch warning. |
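
When scripting the report, these can be filtered out before quoting errors. A sketch — the substrings paraphrase the table, and exact log wording may differ:

```python
BENIGN_SUBSTRINGS = [
    "exit code 143",                                # vLLM server SIGTERM after queries complete
    "destroy_process_group() was not called",       # benign PyTorch shutdown warning
    "not equal to the registered tokenizer class",  # harmless tokenizer mismatch
    # "DUE TO TASK FAILURE" is benign only after the main task reported exit code 0,
    # so it needs the context check from the table rather than a blanket filter.
]

def drop_benign(lines: list[str]) -> list[str]:
    """Remove known-benign lines from a log tail before quoting errors in a report."""
    return [ln for ln in lines if not any(s in ln for s in BENIGN_SUBSTRINGS)]
```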