From 6c580d8e1d1556f88f422809a85fafda0c742144 Mon Sep 17 00:00:00 2001 From: Ye Yu Date: Mon, 11 May 2026 10:55:40 -0700 Subject: [PATCH 1/3] feat(okr30): add EAGLE3 Claude Code skills for triage, validation, and new-model support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four user-invocable skills for the EAGLE3 offline pipeline: - eagle3-triage: diagnose failed pipeline runs step-by-step; failure tables for all 4 tasks (vLLM data synthesis, hidden state dump with 3 backends, training, benchmark); new-model-specific issue checklist - eagle3-validate: verify completed runs; artifact checks; AR threshold (>= 2.1); structured validation report with next-step guidance - eagle3-new-model: guided workflow for adding a new model; architecture lookup, GPU/TP calculation for GB200, backend selection, full YAML template with correct public-launcher script paths - eagle3-review-logs: lightweight log reader; finds sbatch .out files, reads all task logs, produces pass/fail summary with root causes Skills use public launcher paths (common/eagle3/, common/vllm/, etc.) and read sbatch .out files directly — no sandbox-specific tooling required. 
Co-Authored-By: Claude Sonnet 4.6 Signed-off-by: Ye Yu --- .claude/skills/eagle3-new-model/SKILL.md | 215 +++++++++++++++++++++ .claude/skills/eagle3-review-logs/SKILL.md | 96 +++++++++ .claude/skills/eagle3-triage/SKILL.md | 177 +++++++++++++++++ .claude/skills/eagle3-validate/SKILL.md | 121 ++++++++++++ 4 files changed, 609 insertions(+) create mode 100644 .claude/skills/eagle3-new-model/SKILL.md create mode 100644 .claude/skills/eagle3-review-logs/SKILL.md create mode 100644 .claude/skills/eagle3-triage/SKILL.md create mode 100644 .claude/skills/eagle3-validate/SKILL.md diff --git a/.claude/skills/eagle3-new-model/SKILL.md b/.claude/skills/eagle3-new-model/SKILL.md new file mode 100644 index 00000000000..df9cc90a9a9 --- /dev/null +++ b/.claude/skills/eagle3-new-model/SKILL.md @@ -0,0 +1,215 @@ +--- +name: eagle3-new-model +description: > + Add a new model to the EAGLE3 offline pipeline. Generates an hf_offline_eagle3.yaml + launcher config for a new model checkpoint, choosing the right hidden state dump + backend (TRT-LLM / HF / vLLM) and GPU configuration. + Use when user wants to run EAGLE3 on a model that does not yet have a YAML in + tools/launcher/examples/ or asks how to configure the pipeline for a new checkpoint. +--- + +# EAGLE3 New Model Configuration + +This skill guides you through creating `tools/launcher/examples///hf_offline_eagle3.yaml` +for a new model. + +## Step 1 — Look up the model architecture + +Determine these values from the HuggingFace model card, `config.json`, and vLLM docs: + +| Property | Where to find it | +|---|---| +| Total / active parameters | Model card | +| Dense or MoE? | `config.json` → `num_experts`, `num_experts_per_tok` | +| Attention type (MHA / GQA / MLA / SWA) | Model card | +| Multimodal? 
(vision encoder) | Model card | +| BF16 weight size (GB) | `total_params × 2 bytes` | +| Special serving flags | vLLM docs, model README (`--trust-remote-code`, parsers) | + +## Step 2 — Calculate GPU requirements (OCI-HSG / GB200) + +OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node** + +``` +BF16 weight size = total_params × 2 bytes +GPUs needed = ceil(weight_size_GB / 192) +nodes = ceil(gpus_needed / 4) +tp = min(gpus_needed, 4) +``` + +| Model | Weights (BF16) | GPUs | nodes | tp | +|---|---|---|---|---| +| 8B dense | ~16 GB | 1 | 1 | 4 | +| 70B dense | ~140 GB | 1 | 1 | 4 | +| 685B MoE | ~340 GB | 2 | 1 | 4 | +| 1T MoE | ~595 GB | 4 | 1 | 4 | + +## Step 3 — Choose the hidden state dump backend + +| Backend | Script | When to use | +|---------|--------|-------------| +| vLLM | `common/eagle3/dump_offline_data_vllm.sh` | Default; broad coverage via vLLM + speculators | +| HF | `common/eagle3/dump_offline_data_hf.sh` | VLMs, custom-code models, SWA attention | +| TRT-LLM | `common/eagle3/dump_offline_data.sh` | Pure-text models with TRT-LLM support (needs `--tp`/`--moe-ep`) | + +Use **HF** when the model is a VLM or uses sliding window attention (TRT-LLM does not support these). +Use **vLLM** for everything else as the default. + +## Step 4 — Write the YAML + +Create `tools/launcher/examples///hf_offline_eagle3.yaml`. +Use an existing config as a reference (e.g., `tools/launcher/examples/Qwen/Qwen3.5-35B-A3B/hf_offline_eagle3.yaml`). + +### Header comment + +```yaml +# EAGLE3 offline speculative decoding pipeline for /. +# +# is a model. +# BF16 weights ~ GB — fits on GB200 node(s) ( × 192 GB). 
+# +# +# +# 4-step pipeline: +# task_0: Data synthesis — query vLLM server to generate prompt samples +# task_1: Dump hidden states — run target model to capture hidden states +# task_2: Offline training — train the EAGLE3 draft head +# task_3: Benchmark — evaluate speculative decoding speedup via VLLM +# +# Usage: +# uv run launch.py --yaml examples///hf_offline_eagle3.yaml --yes +# uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples///hf_offline_eagle3.yaml --yes + +job_name: _EAGLE3_offline +pipeline: + allow_to_fail: false + skip: false + note: + + global_vars: + hf_model: /hf-local// +``` + +### task_0 — Data synthesis (`common/vllm/query.sh`) + +Args before `--` go to the vLLM server; args after `--` go to `query.py`. + +```yaml + task_0: + script: common/vllm/query.sh + args: + - --model <> + - --tensor-parallel-size + - --trust-remote-code # add only if required + - -- # separator + - --data /hf-local/modelopt/Speculative-Decoding-Dataset-v2-default + - --save /scratchspace/data + environment: + - HF_LOCAL: /hf-local + slurm_config: + _factory_: "slurm_factory" + nodes: + ntasks_per_node: 1 + gpus_per_node: 4 + container: vllm/vllm-openai:latest +``` + +### task_1 — Hidden states (vLLM backend, default) + +```yaml + task_1: + script: common/eagle3/dump_offline_data_vllm.sh + args: + - --input-data /scratchspace/data + - --output-dir /scratchspace/offline_hidden_states + - --max-seq-len 8192 + environment: + - HF_MODEL_CKPT: <> + slurm_config: + _factory_: "slurm_factory" + nodes: + ntasks_per_node: 1 + gpus_per_node: 4 + container: vllm/vllm-openai:latest +``` + +For **HF backend** (VLMs, SWA models), use `dump_offline_data_hf.sh` instead — same args, no TP flags needed. + +For **TRT-LLM backend**, use `dump_offline_data.sh` and add `--tp ` and `--moe-ep 1` (or appropriate EP). 
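The GPU sizing from Step 2 feeds directly into the `nodes`/`tp` values used in these task configs. A minimal Python sketch of that arithmetic, under the stated BF16 and GB200 assumptions (the helper name is illustrative):

```python
import math

GPU_HBM_GB = 192    # GB200: HBM3e per GPU
GPUS_PER_NODE = 4   # OCI-HSG node geometry

def gpu_plan(total_params: float) -> tuple[int, int, int]:
    """Apply the Step 2 formulas: (gpus_needed, nodes, tp) for BF16 weights."""
    weight_gb = total_params * 2 / 1e9           # 2 bytes per parameter
    gpus = math.ceil(weight_gb / GPU_HBM_GB)
    nodes = math.ceil(gpus / GPUS_PER_NODE)
    tp = min(gpus, GPUS_PER_NODE)
    return gpus, nodes, tp

print(gpu_plan(70e9))  # 70B dense: 140 GB of weights fits on a single 192 GB GPU
```

Note that the formula's `tp = min(gpus_needed, 4)` is a lower bound; the example configs in this repo allocate a full node (`gpus_per_node: 4`, `tp: 4`) even when fewer GPUs would hold the weights.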
+ +### task_2 — Offline training (`common/eagle3/train_eagle.sh`) + +```yaml + task_2: + script: common/eagle3/train_eagle.sh + args: + - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml + - model.model_name_or_path=<> + - data.offline_data_path=/scratchspace/offline_hidden_states + - training.output_dir=/scratchspace/eagle3 + - training.training_seq_len=4096 + - training.disable_tqdm=true + - training.ar_validate_steps=500000 + slurm_config: + _factory_: "slurm_factory" + nodes: 1 + ntasks_per_node: 1 + gpus_per_node: 4 + container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10 +``` + +> **MoE note:** For MoE models with large per-expert hidden dims, consider increasing +> `intermediate_size` in `eagle_config.json` to match the model's `moe_intermediate_size`. + +### task_3 — Benchmark (`common/specdec_bench/quick_check.sh`) + +```yaml + task_3: + script: common/specdec_bench/quick_check.sh + args: + - --draft_model_dir /scratchspace/export + - --draft_length 3 + - --output_length 4096 + - --engine VLLM + - --tp_size + - --ep_size 1 + - --speculative_algorithm EAGLE3 + - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl + - --concurrency 1 + environment: + - HF_LOCAL: /hf-local + - HF_MODEL_CKPT: <> + slurm_config: + _factory_: "slurm_factory" + nodes: + ntasks_per_node: 1 + gpus_per_node: 4 + container: vllm/vllm-openai:latest +``` + +## Step 5 — Common model-specific adjustments + +| Situation | What to change | +|---|---| +| Requires `--trust-remote-code` | Add to task_0 vLLM args (before `--`) | +| VLM / multimodal | Use `dump_offline_data_hf.sh` for task_1 | +| Sliding window attention | Use `dump_offline_data_hf.sh` or `_vllm.sh` for task_1 | +| MoE with large expert hidden dim | Increase `intermediate_size` in eagle_config.json | +| Non-standard attention (MLA) | Verify `eagle_decoder_type` in the eagle3 recipe YAML | +| Custom tokenizer (e.g., tiktoken) | Set `TIKTOKEN_RS_CACHE_DIR` env var in 
task_0 and task_1 | +| NVFP4 quant model | task_0/task_3 use quant container; task_1/task_2 use BF16 base model — add `hf_model_bf16` global_var | +| Model needs `trust_remote_code` at benchmark | Add `--trust-remote-code` to task_3 args | + +## Step 6 — Test with dry run + +Preview the resolved config before submitting: + +```bash +uv run launch.py --yaml examples///hf_offline_eagle3.yaml --dryrun --yes -v +``` + +## Step 7 — Update triage chart + +After adding a new model, add a row to the test matrix in +`tools/launcher/examples/EAGLE3_TRIAGE.md` with status 🔲 (not yet tested). +Fill in results after running. diff --git a/.claude/skills/eagle3-review-logs/SKILL.md b/.claude/skills/eagle3-review-logs/SKILL.md new file mode 100644 index 00000000000..e9e519c5a5d --- /dev/null +++ b/.claude/skills/eagle3-review-logs/SKILL.md @@ -0,0 +1,96 @@ +--- +name: eagle3-review-logs +description: > + Review EAGLE3 pipeline experiment logs from the launcher's experiments/ directory. + Summarizes pass/fail status for all 4 tasks, diagnoses failures with root causes + and fixes, and flags warnings. Use when the user asks to review job logs, + check experiment results, or diagnose why a specific task failed. +user_invocable: true +--- + +# Review EAGLE3 Experiment Logs + +Analyze output logs from an EAGLE3 pipeline run launched via `launch.py` or `slurm.py`. + +## Step 0 — Find experiment logs + +Locate the experiment directory. The default is `experiments/` relative to the launcher root, +or wherever `--job-dir` was pointed. + +```bash +ls -td experiments/cicd/cicd_* | head -10 +``` + +Each experiment has one subdirectory per task (0–3). Logs are `sbatch_*.out` files inside: + +```bash +find experiments// -name "sbatch_*.out" | sort +``` + +Do this in a single Bash call. If no experiments exist, ask the user for the directory. + +## Step 1 — Read all task logs + +Read the last 200 lines of each log in parallel. 
Errors appear at the end: + +```bash +for f in $(find experiments// -name "sbatch_*.out" | sort); do + echo "=== $f ==="; tail -200 "$f"; echo +done +``` + +## Step 2 — Analyze + +For each task log, check: + +- **Exit / cancellation**: `DUE TO TIME LIMIT`, `FAILED`, signal (e.g., `signal 15`) +- **Python exceptions / tracebacks**: last exception is usually the root cause +- **CUDA errors**: OOM, NCCL timeout +- **Slurm state**: COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY +- **Success indicators**: "Saved N samples", "Successfully processed N conversations", training loss line, AR output + +## Step 3 — Produce report + +Output a structured markdown report: + +### Summary +- Overall status: PASSED / FAILED / MIXED / PARTIAL +- Task breakdown: e.g., task_0 TIMEOUT, task_1 FAIL, task_2 skipped, task_3 skipped + +### Task Results + +For each task (0–3): + +**Task N — \: PASS / FAIL / TIMEOUT** +- Key output: (e.g., "3277/3295 samples generated" or "Script not found") +- Error (if failed): quoted error message, max 10 lines +- Root cause: one-line diagnosis +- Suggested fix: actionable step + +### Warnings +Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput). + +## Step 4 — Suggest next steps + +Based on results: + +- If a task failed due to a known issue, suggest the fix and how to re-run from that task: + ```bash + uv run launch.py --yaml examples///hf_offline_eagle3.yaml \ + pipeline.task_0.skip=true \ + --yes + ``` + +- If the failure pattern is new (not in `tools/launcher/examples/EAGLE3_TRIAGE.md`), + suggest adding it to the triage chart using `/eagle3-triage` guidance. + +- If all tasks passed, suggest running `/eagle3-validate` to confirm AR meets threshold. + +## Known benign patterns (do NOT mark as failures) + +| Pattern | Explanation | +|---|---| +| vLLM server exit code 143 | SIGTERM — server was killed after queries completed. Expected. | +| `CANCELLED AT ... 
DUE TO TASK FAILURE` after `exit code: 0` | Slurm cleanup of worker nodes after main task succeeded. | +| `destroy_process_group() was not called` | Benign PyTorch shutdown warning. | +| `tokenizer class ... not equal to the registered tokenizer class` | Harmless tokenizer mismatch warning. | diff --git a/.claude/skills/eagle3-triage/SKILL.md b/.claude/skills/eagle3-triage/SKILL.md new file mode 100644 index 00000000000..7009b4d0523 --- /dev/null +++ b/.claude/skills/eagle3-triage/SKILL.md @@ -0,0 +1,177 @@ +--- +name: eagle3-triage +description: > + Triage a failed EAGLE3 pipeline run. Identifies which step failed (data synthesis, + hidden state dump, training, or benchmark), diagnoses root cause from logs, and + suggests fixes. Use when user reports an EAGLE3 pipeline failure or asks why a + specific step failed. Also helps debug new model support issues. +user_invocable: true +--- + +# EAGLE3 Pipeline Triage + +Diagnose failures in the 4-step EAGLE3 offline pipeline. This skill walks through +each step, identifies the failure point, and provides actionable fixes. 
+ +## Pipeline Overview + +| Step | Script | Purpose | Common failure area | +|------|--------|---------|---------------------| +| task_0 | `common/vllm/query.sh` | Data synthesis via vLLM server | Server startup, model loading, OOM | +| task_1 | `common/eagle3/dump_offline_data_vllm.sh` (or `_hf.sh` / `.sh`) | Dump hidden states | Backend selection, OOM, unsupported arch | +| task_2 | `common/eagle3/train_eagle.sh` | Train EAGLE3 draft head | Dependencies, training crash, export | +| task_3 | `common/specdec_bench/quick_check.sh` | Benchmark acceptance rate | Engine startup, draft model loading | + +## Step 0 — Locate the experiment + +Ask the user for one of: +- Experiment directory (e.g., the `--job-dir` passed to `launch.py` or `slurm.py`) +- The model name / YAML they ran + +Find recent experiments under the job directory: + +```bash +ls -td experiments/cicd/cicd_* | head -10 +# or wherever --job-dir was pointed +``` + +Each experiment directory contains one subdirectory per task (task_0 through task_3), +each with a `sbatch_*.out` log file. + +## Step 1 — Fetch logs for the failed task + +Locate and read the Slurm output file for the failed task: + +```bash +find experiments/ -name "sbatch_*.out" | sort +``` + +Read the last 200 lines — errors appear at the end: + +```bash +tail -200 experiments///sbatch__.out +``` + +Look for the first task with a non-zero exit code or error message. + +## Step 2 — Diagnose by step + +### task_0 failures (Data Synthesis) + +**How it works:** Launches a vLLM OpenAI-compatible server, polls `/health` until ready, +then runs `query.py` to generate synthetic prompt/response pairs. +Output goes to `/scratchspace/data/`. + +| Error pattern | Root cause | Fix | +|---|---|---| +| Server never becomes healthy (hangs at health check) | Model too large for allocated GPUs, or vLLM startup crash | Check BF16 weight size vs GPU memory. GB200: 192 GB/GPU × 4 GPUs/node = 768 GB. Increase TP. 
| +| `CUDA out of memory` during model load | Insufficient GPU memory | Reduce `--max-model-len` or increase `--tensor-parallel-size` | +| `trust_remote_code` error | Model requires custom code but flag not set | Add `--trust-remote-code` before the `--` separator in task_0 args | +| Vocab / tokenizer error | Missing tokenizer cache (e.g., GPT-OSS-20B needs `TIKTOKEN_RS_CACHE_DIR`) | Set `TIKTOKEN_RS_CACHE_DIR` to a pre-populated cache path in the environment | +| Architecture not supported | vLLM version doesn't support this model | Try a newer vLLM container (`vllm/vllm-openai:latest`) | +| `CANCELLED ... DUE TO TIME LIMIT` | Job wall-clock limit too short | Increase Slurm `--time`. Note: `afterany` deps let task_1 still start. | +| Empty `/scratchspace/data/` | query.py ran but produced no output | Check `--data` path exists and contains prompts. Check query.py logs. | + +### task_1 failures (Hidden State Dump) + +**How it works:** Loads the target model and runs a forward pass on each conversation, +saving hidden states as `.pt` files in `/scratchspace/offline_hidden_states/`. 
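A quick sanity check on the dump output can be sketched as follows (the path handling and helper name are illustrative; run it against the actual output directory on the cluster):

```python
from pathlib import Path

def hidden_state_summary(output_dir: str) -> tuple[int, int]:
    """Count dumped .pt shards and their total bytes.

    (0, 0) means the dump ran but extracted nothing — check --max-seq-len
    and the input data format before re-running.
    """
    shards = list(Path(output_dir).glob("*.pt"))
    return len(shards), sum(p.stat().st_size for p in shards)
```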
+ +Three backends are available: + +| Backend | Script | When to use | +|---------|--------|-------------| +| vLLM | `dump_offline_data_vllm.sh` | Broad model coverage; uses `speculators.VllmHiddenStatesGenerator` | +| HF | `dump_offline_data_hf.sh` | VLMs, custom-code models, SWA attention; uses `device_map="auto"` | +| TRT-LLM | `dump_offline_data.sh` | Pure-text models with TRT-LLM support; needs `--tp`/`--moe-ep` args | + +| Error pattern | Root cause | Fix | +|---|---|---| +| `No such file or directory: dump_offline_data_vllm.sh` | Wrong script path in YAML | Use the correct path under `common/eagle3/` | +| `FileNotFoundError: /scratchspace/data` | task_0 failed or produced no output | Re-run task_0 first, or point `--input-data` to existing data | +| `CUDA out of memory` | Model too large | Switch to `_hf.sh` (device_map="auto") or increase TP | +| `RuntimeError` / unsupported arch | Model not supported by TRT-LLM backend | Switch to `dump_offline_data_hf.sh` or `dump_offline_data_vllm.sh` | +| `NCCL timeout` / `NCCL error` | Multi-node communication failure | Retry. Reduce EP. | +| No `.pt` files in output dir | Script ran but extraction produced nothing | Check `--max-seq-len` and input data format | +| `pyxis: child terminated with signal 15` | SIGTERM — likely OOM | Increase TP or switch backends | + +### task_2 failures (Training) + +**How it works:** Installs requirements, runs `launch_train.sh` (Accelerate + FSDP) with the +config from `modelopt_recipes/general/speculative_decoding/eagle3.yaml`, then exports via +`export_hf_checkpoint.py`. Output: `/scratchspace/eagle3/` and `/scratchspace/export/`. 
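Loss health can be checked mechanically from the training log. The sketch below assumes HF-Trainer-style log lines such as `{'loss': 1.234, ...}` — adapt the regex if this training stack logs losses differently; both helper names are illustrative.

```python
import math
import re

def loss_series(log_text: str) -> list[float]:
    """Pull loss values from lines like "{'loss': 1.234, 'lr': ...}" (assumed format)."""
    return [float(v) for v in re.findall(r"'loss':\s*(nan|[0-9.eE+-]+)", log_text)]

def looks_unhealthy(losses: list[float]) -> bool:
    """Flag NaN anywhere, or a last loss above the first (not decreasing overall)."""
    if not losses:
        return True
    return any(math.isnan(x) for x in losses) or losses[-1] > losses[0]
```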
+ +| Error pattern | Root cause | Fix | +|---|---|---| +| `pip install` failure | Network issue or incompatible dependency | Check container has network access | +| `ImportError: modelopt` | ModelOpt not installed or path issue | Check container version | +| `FileNotFoundError: /scratchspace/offline_hidden_states` | task_1 failed or produced no output | Re-run task_1 first | +| `CUDA out of memory` during training | Batch size too large | Reduce `training.train_bs` or `training.training_seq_len` | +| `KeyError` / `AttributeError` in model loading | Model architecture not recognized by EAGLE3 | Check `eagle_decoder_type` in config. Model may need code changes in modelopt. | +| `HFValidationError: Repo id must be in the form...` | Old `offline_training.sh` trying to upload to HF Hub | Use `train_eagle.sh` which does local export only | +| Loss is NaN or diverges | LR too high or data quality issue | Reduce `training.lr`. Check hidden state data. | +| `export_hf_checkpoint.py` fails | Training produced incomplete checkpoint | Check `/scratchspace/eagle3/` for `model.safetensors` | + +### task_3 failures (Benchmark) + +**How it works:** Launches vLLM with the target + draft model, runs acceptance rate and +throughput benchmarks. Output: JSON files. + +| Error pattern | Root cause | Fix | +|---|---|---| +| `FileNotFoundError: /scratchspace/export` | task_2 failed or export step failed | Re-run task_2. Check export output. 
| +| `trust_remote_code` error at benchmark | Model requires it but `quick_check.sh` doesn't forward the flag | Pass `--trust-remote-code` in task_3 args | +| Server fails with draft model | Draft model config incompatible with engine | Check `eagle_config.json` and engine version | +| AR below threshold / exit code 1 | Draft model quality too low | More epochs, data, or hyperparameter tuning | +| `CUDA out of memory` | Target + draft exceeds GPU memory | Increase TP | +| vLLM EAGLE3 not supported | vLLM version too old | Use `vllm/vllm-openai:latest` (≥ v0.15.0 for NVFP4) | + +## Step 3 — Check for new-model-specific issues + +If the user is adding support for a new model, also check: + +1. **Is the model a VLM?** → Use `dump_offline_data_hf.sh` (text-only path, no vision encoder invoked) +2. **Does the model use sliding window attention (SWA)?** → TRT-LLM backend won't work; use HF or vLLM +3. **Does the model need `trust_remote_code`?** → Add to task_0 args AND task_3 args +4. **Is the model MoE?** → Check `eagle_config.json` `intermediate_size` matches model's `moe_intermediate_size` +5. **Is the model architecture recognized by EAGLE3 training?** → Check `modelopt/torch/speculative/` for the model type +6. **Custom tokenizer?** → May need additional environment vars (e.g., `TIKTOKEN_RS_CACHE_DIR`) + +## Step 4 — Suggest fix and next steps + +After diagnosis, provide: + +1. **Root cause** — one-line summary +2. **Fix** — specific config change, code edit, or command to run +3. 
**How to re-run** — skip earlier successful steps by pointing to existing scratchspace artifacts + +To skip task_0 and task_1 and re-run from task_2: +```bash +uv run launch.py --yaml examples///hf_offline_eagle3.yaml \ + pipeline.task_0.skip=true \ + pipeline.task_1.skip=true \ + --yes +``` + +To run only task_1 standalone (using existing task_0 data): +```bash +uv run launch.py --yaml examples///hf_offline_eagle3.yaml \ + pipeline.task_0.skip=true \ + pipeline.task_2.skip=true \ + pipeline.task_3.skip=true \ + --yes +``` + +If the fix requires code changes in ModelOpt (e.g., adding a new `eagle_decoder_type`), +note that a separate PR in the modelopt repo is needed. + +## Step 5 — Update triage chart + +If you encounter a failure pattern not in the triage chart at +`tools/launcher/examples/EAGLE3_TRIAGE.md`, add it: + +1. Add a new branch in the Mermaid flowchart under the relevant step node +2. Add a new issue entry in the "Known Issues" section +3. Update the model's row in the test matrix + +This keeps the chart current for the next engineer debugging the same issue. diff --git a/.claude/skills/eagle3-validate/SKILL.md b/.claude/skills/eagle3-validate/SKILL.md new file mode 100644 index 00000000000..1b3665cf8b1 --- /dev/null +++ b/.claude/skills/eagle3-validate/SKILL.md @@ -0,0 +1,121 @@ +--- +name: eagle3-validate +description: > + Validate that an EAGLE3 pipeline run completed successfully end-to-end. + Checks all 4 steps produced expected artifacts, verifies acceptance rate + meets threshold (>= 2.1), and produces a summary report. + Use when user wants to verify a pipeline run or check benchmark results. +user_invocable: true +--- + +# EAGLE3 Pipeline Validation + +Verify that an EAGLE3 pipeline run completed successfully and meets quality criteria. 
+ +## Step 0 — Identify the experiment + +Find the most recent experiment directory (or ask the user for the path): + +```bash +ls -td experiments/cicd/cicd_* | head -5 +``` + +Each experiment directory has one subdirectory per task (numbered 0–3), each containing +a `sbatch_*.out` log file. + +## Step 1 — Check task outcomes + +Read the last 50 lines of each task's log file: + +```bash +find experiments// -name "sbatch_*.out" | sort | while read f; do + echo "=== $f ==="; tail -50 "$f"; echo +done +``` + +All 4 tasks must complete without error. Look for: +- `exit code: 0` or no error — success +- `DUE TO TIME LIMIT` — timeout +- `FAILED` / `signal` / exception traceback — failure + +If any task failed, suggest running `/eagle3-triage` instead. + +## Step 2 — Verify artifacts exist + +Check each step produced the expected output (artifacts live on the cluster at `/scratchspace/`). +Confirm via log messages: + +| Step | Expected log evidence | Artifact | +|------|-----------------------|----------| +| task_0 | "Saved N samples" or progress bar completing | `/scratchspace/data/*.jsonl` | +| task_1 | "Successfully processed N conversations" | `/scratchspace/offline_hidden_states/*.pt` | +| task_2 | Training loss decreasing, "export complete" | `/scratchspace/eagle3/model.safetensors`, `/scratchspace/export/` | +| task_3 | `Average Acceptance Length ... ratio: X.XX` | JSON result files | + +## Step 3 — Check acceptance rate + +In the task_3 log, find: + +``` +Average Acceptance Length {'accept': X, 'count': Y, 'ratio': Z.ZZ} +``` + +The `ratio` field is the acceptance rate (AR). + +| Criterion | Threshold | Status | +|-----------|-----------|--------| +| AR (MT-Bench) | >= 2.1 | PASS / FAIL | + +If the log shows `AR ... < lower bound`, the run already triggered a threshold failure (exit code 1). 
+ +## Step 4 — Check training quality + +In the task_2 log look for: +- **Final training loss** — should be decreasing, not NaN +- **AR validation during training** (if `training.ar_validate_steps` was set) +- **Number of training steps** — confirms full training duration + +## Step 5 — Produce validation report + +```markdown +## EAGLE3 Pipeline Validation Report + +**Experiment:** +**Model:** +**Date:** +**Pipeline config:** + +### Step Status +| Step | Task | Status | Notes | +|------|------|--------|-------| +| 0 | Data synthesis | PASS/FAIL/TIMEOUT | N samples generated | +| 1 | Hidden state dump | PASS/FAIL | N .pt files | +| 2 | Training + export | PASS/FAIL | Final loss: X.XX | +| 3 | Benchmark | PASS/FAIL | AR: X.XX | + +### Acceptance Rate +- MT-Bench AR: X.XX (threshold: >= 2.1) — PASS/FAIL + +### Training Summary +- Final loss: X.XX +- Training steps: N +- AR during training: X.XX (if validated) + +### Overall: PASS / FAIL + +``` + +## Step 6 — Suggest next steps + +**If PASS:** +- The model's row in `tools/launcher/examples/EAGLE3_TRIAGE.md` can be updated to ✅ +- Note the checkpoint path for downstream use + +**If FAIL:** +- Identify which step or metric failed +- Suggest running `/eagle3-triage` for diagnosis +- If AR is close but below threshold, suggest: + - More training epochs (`training.num_epochs` override) + - More training data (re-run task_0 with larger dataset) + - Larger draft head (`num_hidden_layers` in `eagle_config.json`) + - Hyperparameter tuning (`training.lr`, `training.train_bs`) From 01724172100ff3f9d1b9474e6c53e5727f26b0de Mon Sep 17 00:00:00 2001 From: Ye Yu Date: Mon, 11 May 2026 11:47:29 -0700 Subject: [PATCH 2/3] fix(okr30): fix markdownlint MD040 in eagle3 skill files Add `text` language specifiers to bare fenced code blocks: - eagle3-new-model/SKILL.md: GPU calculation formula block - eagle3-validate/SKILL.md: acceptance rate log output block Co-Authored-By: Claude Sonnet 4.6 Signed-off-by: Ye Yu --- 
.claude/skills/eagle3-new-model/SKILL.md | 2 +- .claude/skills/eagle3-validate/SKILL.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/.claude/skills/eagle3-new-model/SKILL.md b/.claude/skills/eagle3-new-model/SKILL.md index df9cc90a9a9..c908ae7393e 100644 --- a/.claude/skills/eagle3-new-model/SKILL.md +++ b/.claude/skills/eagle3-new-model/SKILL.md @@ -30,7 +30,7 @@ Determine these values from the HuggingFace model card, `config.json`, and vLLM OCI-HSG nodes: **4 GPUs × 192 GB HBM3e = 768 GB per node** -``` +```text BF16 weight size = total_params × 2 bytes GPUs needed = ceil(weight_size_GB / 192) nodes = ceil(gpus_needed / 4) diff --git a/.claude/skills/eagle3-validate/SKILL.md b/.claude/skills/eagle3-validate/SKILL.md index 1b3665cf8b1..2a37318f3a4 100644 --- a/.claude/skills/eagle3-validate/SKILL.md +++ b/.claude/skills/eagle3-validate/SKILL.md @@ -56,7 +56,7 @@ Confirm via log messages: In the task_3 log, find: -``` +```text Average Acceptance Length {'accept': X, 'count': Y, 'ratio': Z.ZZ} ``` From 7ccae94f137949d83293511f2186a67ed945b48c Mon Sep 17 00:00:00 2001 From: Ye Yu Date: Mon, 11 May 2026 11:55:02 -0700 Subject: [PATCH 3/3] fix(okr30): fix markdownlint MD031 in eagle3 skill files Add blank lines before fenced code blocks as required by MD031: - eagle3-triage/SKILL.md: two re-run command blocks - eagle3-review-logs/SKILL.md: suggested fix block and section headers Co-Authored-By: Claude Sonnet 4.6 Signed-off-by: Ye Yu --- .claude/skills/eagle3-review-logs/SKILL.md | 3 +++ .claude/skills/eagle3-triage/SKILL.md | 2 ++ 2 files changed, 5 insertions(+) diff --git a/.claude/skills/eagle3-review-logs/SKILL.md b/.claude/skills/eagle3-review-logs/SKILL.md index e9e519c5a5d..18027a69096 100644 --- a/.claude/skills/eagle3-review-logs/SKILL.md +++ b/.claude/skills/eagle3-review-logs/SKILL.md @@ -54,6 +54,7 @@ For each task log, check: Output a structured markdown report: ### Summary + - Overall status: PASSED / FAILED / MIXED / PARTIAL - 
Task breakdown: e.g., task_0 TIMEOUT, task_1 FAIL, task_2 skipped, task_3 skipped @@ -68,6 +69,7 @@ For each task (0–3): - Suggested fix: actionable step ### Warnings + Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput). ## Step 4 — Suggest next steps @@ -75,6 +77,7 @@ Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput). Based on results: - If a task failed due to a known issue, suggest the fix and how to re-run from that task: + ```bash uv run launch.py --yaml examples///hf_offline_eagle3.yaml \ pipeline.task_0.skip=true \ diff --git a/.claude/skills/eagle3-triage/SKILL.md b/.claude/skills/eagle3-triage/SKILL.md index 7009b4d0523..ed2422e6f63 100644 --- a/.claude/skills/eagle3-triage/SKILL.md +++ b/.claude/skills/eagle3-triage/SKILL.md @@ -146,6 +146,7 @@ After diagnosis, provide: 3. **How to re-run** — skip earlier successful steps by pointing to existing scratchspace artifacts To skip task_0 and task_1 and re-run from task_2: + ```bash uv run launch.py --yaml examples///hf_offline_eagle3.yaml \ pipeline.task_0.skip=true \ @@ -154,6 +155,7 @@ uv run launch.py --yaml examples///hf_offline_eagle3.yaml \ ``` To run only task_1 standalone (using existing task_0 data): + ```bash uv run launch.py --yaml examples///hf_offline_eagle3.yaml \ pipeline.task_0.skip=true \