AstraFlow v0.1.1: Megatron backend, offline math, recursive/spawn agents #16
Merged
Upgrade the inference/runtime stack to the latest sglang and the dependency versions it requires, validated end-to-end on the FSDP backend (qwen3-1.7b math example, 2x L40).
Version pins (pyproject.toml, docs, Docker):
- sglang 0.5.5.post1 -> 0.5.12.post1
- torch 2.8.0 -> 2.11.0; torch_memory_saver 0.0.9 -> 0.0.9.post1
- transformers 4.57.1 -> 5.6.1 (sglang pins ==5.6.0, which has a flash-attention s_aux=None crash for non-sink models; 5.6.1 is the upstream patch release. Forced via [tool.uv] override-dependencies, which requires uv >= 0.10 -- documented in installation.md)
- peft -> >=0.18.0 (required by transformers 5.x)
- CUDA base image 12.9.1 -> 13.0.0
sglang 0.5.12 API compatibility:
- remove LoRAAbortReleasePatch (the abort-path lora_registry.release() it added is now fixed upstream; keeping it would double-release)
- remove enable_ep_moe from SGLangConfig (field dropped from ServerArgs)
- kernel package rename sgl_kernel -> sglang_kernel in the installation validator
transformers 5.x / sglang 0.5.12 runtime fixes (surfaced by the run):
- rlvr workflow: apply_chat_template now returns a BatchEncoding; pass return_dict=False to get the flat list[int] the rollout path expects
- fsdp apply_fsdp2: model._no_split_modules is a set in transformers 5.x; coerce to list before indexing
- raas free-port range capped at 55535 so sglang's derived gRPC port (port + 10000) stays <= 65535
Scope: FSDP backend only. Megatron / VL paths are intentionally not covered here.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
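The free-port cap in the runtime fixes is plain arithmetic. A minimal sketch, assuming only what the message states (the gRPC port is derived as port + 10000); `pick_base_port` and the lower bound of 20000 are hypothetical and are not the raas implementation:

```python
import random

MAX_PORT = 65535                        # highest valid TCP port
GRPC_OFFSET = 10000                     # derived gRPC port = base port + 10000
MAX_BASE_PORT = MAX_PORT - GRPC_OFFSET  # 55535, the cap named in the message


def pick_base_port(low=20000, high=MAX_BASE_PORT, rng=random):
    """Pick a base port whose derived gRPC port is still a valid port."""
    if high > MAX_BASE_PORT:
        raise ValueError(f"high must be <= {MAX_BASE_PORT}")
    return rng.randint(low, high)


port = pick_base_port()
print(port, port + GRPC_OFFSET)
```

Any base port above 55535 would push the derived port past 65535, which is why the range is capped rather than validated after the fact.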
chore: bump sglang 0.5.5.post1 -> 0.5.12.post1 (FSDP path)
sglang 0.5.12's /health round-trips through the scheduler, which stays saturated for ~30-40s during the initial unchunked prefill of ~2048 requests/engine. The old 3-strike / 30s watchdog (5s probe timeout) hard-exited a busy-but-alive engine before the first rollout batch completed, hanging the rollout pipeline at step 0. Raise the /health probe timeout 5s -> 20s so a slow-but-alive endpoint isn't marked failed, and the failure budget 3 -> 5 strikes. A crashed engine refuses connections instantly, so real-death detection stays ~50s (worst case ~100s) while the prefill ramp is tolerated. Verified: math and code qwen3-8b-m2po-delta recipes train through the ramp with zero watchdog strikes.
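The quoted detection times can be reproduced with a small model of the watchdog. This is a sketch under assumptions that the message does not state outright: probes run sequentially, and the probe interval is 10s (inferred from "3-strike / 30s"). `detection_window` is a hypothetical helper, not the project's watchdog.

```python
def detection_window(strikes, probe_interval_s, probe_timeout_s):
    """Return (fast, slow) seconds until an engine is declared dead.

    fast: every probe fails immediately (connection refused).
    slow: every probe runs into the full timeout.
    """
    fast = strikes * probe_interval_s
    slow = strikes * max(probe_interval_s, probe_timeout_s)
    return fast, slow


old = detection_window(strikes=3, probe_interval_s=10, probe_timeout_s=5)
new = detection_window(strikes=5, probe_interval_s=10, probe_timeout_s=20)
print("old:", old)
print("new:", new)
```

Under these assumptions the new settings give 50s for a crashed engine and 100s when every probe times out, consistent with the ~50s and ~100s figures above.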
…ution
Two from-scratch install blockers with the sglang 0.5.12 / torch 2.11 stack:
- sglang 0.5.12 depends on flash-attn-4>=4.0.0b9 (a pre-release pulled in as a dependency), so resolution fails unless pre-releases are allowed. Add prerelease = "allow" to [tool.uv] so `uv pip install -e ".[sglang]"` resolves on both the conda and Docker paths.
- flash-attn 2.8.3 builds from source; nvcc writes GBs of intermediates to $TMPDIR. When $TMPDIR is a small/NFS-quota'd home the build fails with "nvFatbin error: empty input" / "Disk quota exceeded" from truncated temps. Document setting CUDA_HOME and a roomy TMPDIR, switch the sglang step to the project-extra form, and clarify flash-attn (FA2, trainer) vs flash-attn-4 (pulled in by sglang).
sglang requires an unbounded "kernels", so uv resolved the latest (0.15), but transformers 5.6.1 only supports kernels<0.13 — its hub_kernels module constructs LayerRepository() without a revision/version, which kernels 0.15 rejects, so `import sglang` crashes with "Either a revision or a version must be specified." Pin to the range transformers 5.6.1 expects (0.12.x). Verified on a from-scratch env: kernels resolves to 0.12.3 and the math recipe trains.
Add export_hf_named_params: a streaming generator that reconstructs the global model from Megatron's TP/PP/EP/ETP/VPP layout and yields HF-named, HF-layout CPU tensors one at a time (OOM-safe for large / MoE models). The gather + mcore->HF conversion is delegated to mbridge's export_weights (the same bridge the engine already uses to load/save); this module adds the consumer concerns: CPU move, byte-bounded bucketing, and a metadata-only path for transfer-buffer sizing.
This is the foundation for correct sparse weight sync under full Megatron parallelism. The design (the "delta is computed in HF byte space" invariant) is documented in docs/en/architecture/megatron-weight-sync.md.
The Megatron backend needs two extra compiled deps beyond the base install (megatron-core / mbridge are already there): Transformer Engine (fused LayerNorm + sequence parallelism) and apex (optional fused LayerNorm/Adam). These are kept out of the default image: a separate docker/Dockerfile.sglang.megatron layers them on top of Dockerfile.sglang, and installation.md gains an optional "Step 5: Install the Megatron training backend" under Option A. The FSDP backend and inference are unaffected.
Validated (exact bf16 match vs the HF reference checkpoint):
- Qwen3-0.6B TP=2: 310 tensors, 0 mismatch
- Qwen3-0.6B PP=2: 311 tensors, 0 mismatch
- Qwen3-0.6B TP=2 PP=2: 311 tensors, 0 mismatch
- Qwen3-30B-A3B TP=2 EP=2 PP=2: 18867 tensors, 0 mismatch
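Byte-bounded bucketing over a streaming generator can be sketched without any tensors. The helper below is hypothetical (`bucket_by_bytes` is not a name from the source) and works on (name, nbytes) pairs, which is also roughly what a metadata-only sizing pass would see:

```python
from typing import Iterable, Iterator, List, Tuple

Item = Tuple[str, int]  # (parameter name, size in bytes)


def bucket_by_bytes(items: Iterable[Item], max_bytes: int) -> Iterator[List[Item]]:
    """Group a stream of items into buckets of at most max_bytes each.

    An item larger than max_bytes is emitted alone in its own bucket.
    Only one bucket is held in memory at a time.
    """
    bucket: List[Item] = []
    used = 0
    for name, nbytes in items:
        if bucket and used + nbytes > max_bytes:
            yield bucket
            bucket, used = [], 0
        bucket.append((name, nbytes))
        used += nbytes
    if bucket:
        yield bucket


stream = [("a", 4), ("b", 4), ("c", 10), ("d", 2)]
for b in bucket_by_bytes(stream, max_bytes=8):
    print(b)
```

Because the function is itself a generator, it composes with a streaming export without materializing the full model.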
Replace the TP-only shard-direct weight transfer with the HF-export path:
- MegatronEngine.export_hf_named_params() / get_hf_weight_metadata() stream gathered HF tensors via mbridge (handles TP/PP/EP/ETP/VPP). The previous PP>1 / EP>1 NotImplementedError guards are removed.
- WeightManager gains "megatron_hf_meta" mode: the transfer buffer is sized for the full HF model and offload() streams HF tensors into the inactive half on the writer rank, while the gather collectives run on all ranks in lockstep. The sender receives megatron_metadata=None and runs the plain full/delta path used by FSDP. Because the buffer now holds HF-layout bytes, the sparse delta is computed in HF space and is correct under any parallelism, fixing the latent corruption where the delta was computed in mcore layout but applied by the receiver in HF layout.
- ppo_trainer wires the generator + HF metadata through. The legacy CPU shard-reassembly in the sender agent is now unused for Megatron (kept only for the deprecated megatron_metadata path).
Validated (buffer roundtrip == HF reference, bit-exact):
- Qwen3-0.6B TP=2: 310 tensors, 0 mismatch, 1.19 GB
- Qwen3-0.6B TP=2 PP=2: 311 tensors, 0 mismatch, 1.50 GB
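Why the delta must be computed in the same layout the receiver applies it in can be shown with a toy byte buffer. This is an illustrative sketch only: the per-byte granularity, `compute_delta`, `apply_delta`, and the use of byte reversal as a stand-in for a different memory layout are all assumptions, not the project's delta format.

```python
def compute_delta(old: bytes, new: bytes):
    """Return (indices, values) for every byte position that changed."""
    if len(old) != len(new):
        raise ValueError("buffers must have the same length")
    idx = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    vals = bytes(new[i] for i in idx)
    return idx, vals


def apply_delta(buf: bytearray, idx, vals) -> None:
    """Overwrite the changed positions in place."""
    for i, v in zip(idx, vals):
        buf[i] = v


hf_old = bytes([1, 2, 3, 4])
hf_new = bytes([1, 9, 3, 4])

# Delta computed and applied in the same (HF) layout: receiver ends up correct.
receiver = bytearray(hf_old)
apply_delta(receiver, *compute_delta(hf_old, hf_new))
print(bytes(receiver) == hf_new)

# Delta computed in a different layout (reversed bytes) but applied in HF layout.
other_old, other_new = hf_old[::-1], hf_new[::-1]
corrupted = bytearray(hf_old)
apply_delta(corrupted, *compute_delta(other_old, other_new))
print(bytes(corrupted) == hf_new)
```

The second case writes the right value at the wrong offset without raising anything, which is the kind of silent corruption the HF-space invariant rules out.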
Add examples/math/qwen3-8b-megatron-delta: the FSDP qwen3-8b-m2po-delta recipe with the trainer engine switched to the Megatron backend (backend: megatron, tensor_parallel_size: 4). Identical data, algorithm, and weight-transfer path, so it doubles as a clean FSDP-vs-Megatron A/B.
End-to-end validation (single 8-GPU node, 4 RaaS + 4 trainer TP=4, delta TCP weight sync, DeepScaleR/M2PO):
- Qwen3-8B (this recipe): 25 steps, 0 errors; weight_transfer/delta_sparsity ~0.92 (delta computed in HF space); task_reward/avg rose 0.535 (first half) -> 0.585 (last half), recent steps 0.61-0.64. Per-step weight offload 0.59s.
- Qwen3-30B-A3B MoE (TP2/PP3/EP2 trainer on 6 GPUs + SGLang TP2 on 2 GPUs): 21 steps, 0 errors; full MoE export (18867 tensors, 61 GB) gathered across TP/PP/EP each step; task_reward/avg ~0.64 -> 0.66 (recent steps 0.70-0.77).
The Megatron HF-export offload materialized each gathered tensor in
pageable host memory via .to("cpu") before copying it into the pinned
shared-memory transfer buffer — a ~1 GB/s bounce that cost ~13s/step for
an 8B model on the RL critical path.
Now the engine yields the gathered tensors on GPU (export_hf_named_params
to_cpu=False) and WeightManager copies each tensor's uint8 view directly
into the pinned buffer slice (non_blocking=True), fenced by a single
cuda.synchronize() before the cross-rank barrier. The pinned buffer is
already cudaHostRegister'd, so this hits the PCIe DMA engine.
Copying through uint8 views on both sides keeps the copy alignment-free
(robust to mixed-dtype models) and byte-identical for contiguous sources.
Measured (Qwen3-8B, TP=4, 16.38 GB):
pageable (old): 12.6s (1.3 GB/s)
direct DMA: 0.56s (29.3 GB/s) ~23x
Byte-equivalence verified (new buffer == old pageable path == HF
reference, bit-exact) across TP=2, PP=2, TP=2/PP=2, and MoE TP=2/EP=2/PP=2
(Qwen3-30B-A3B, 61 GB). Adds tests/test_direct_dma_offload.py (equivalence)
and tests/bench_offload_dma.py (throughput).
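The reported bandwidths and speedup follow directly from the measured size and timings. A quick arithmetic check using only the numbers quoted above:

```python
size_gb = 16.38   # Qwen3-8B, TP=4
old_s = 12.6      # pageable bounce
new_s = 0.56      # direct DMA

old_bw = size_gb / old_s   # about 1.3 GB/s
new_bw = size_gb / new_s   # about 29.3 GB/s
speedup = old_s / new_s    # about 22.5x, reported as ~23x

print(f"old: {old_bw:.2f} GB/s, new: {new_bw:.2f} GB/s, speedup: {speedup:.1f}x")
```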
The hardcoded sglang inference defaults assume Hopper and crash on
non-Hopper GPUs (verified on L40 / Ada sm_89), while the identical
package stack runs on H100. Two Hopper-only kernel paths were forced
regardless of hardware:
- attention_backend="fa3": FlashAttention-3 is Hopper-only; on Ada/Ampere
it fails CUDA-graph capture ("scheduler_metadata must have shape").
- flashinfer 0.6.x CuTe-DSL RMSNorm: no Ada/Ampere kernel, JITs into an
incompatible nvidia-cutlass-dsl and crashes (GPUModuleOp TypeError).
Make both arch-aware so one image/env runs on both:
- SGLangConfig.attention_backend defaults to None, which omits the
--attention-backend flag and lets sglang auto-select per GPU (fa3 on
Hopper, an Ada/Ampere-compatible backend below sm_90).
- raas/entrypoint.py sets FLASHINFER_USE_CUDA_NORM=1 on non-Hopper GPUs
before sglang/flashinfer import, selecting flashinfer's CUDA-JIT norm.
Detection uses NVML (no CUDA context in the launcher) and respects an
existing env override.
Hopper behavior is unchanged (fa3 + CuTe norm). Recipes and YAMLs are
untouched. Verified end-to-end on L40 with the qwen3-1.7b-m2po-2gpus-delta
example (sglang init, CUDA-graph capture, live generation).
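The two arch-aware defaults can be sketched as pure functions. This is a hedged illustration, not the code in raas/entrypoint.py: the function names are hypothetical, the compute-capability major version is passed in rather than queried through NVML, and treating "major != 9" as "non-Hopper" is an assumption about how the check is written.

```python
import os

HOPPER_MAJOR = 9  # sm_90


def configure_norm_env(cc_major, env=None):
    """Select flashinfer's CUDA-JIT norm on non-Hopper GPUs.

    An existing FLASHINFER_USE_CUDA_NORM value is left untouched.
    """
    env = os.environ if env is None else env
    if cc_major != HOPPER_MAJOR:
        env.setdefault("FLASHINFER_USE_CUDA_NORM", "1")
    return env


def attention_backend_args(attention_backend=None):
    """Omit --attention-backend when unset so sglang auto-selects per GPU."""
    if attention_backend is None:
        return []
    return ["--attention-backend", attention_backend]


print(configure_norm_env(8, {}))                                 # Ada, sm_89
print(configure_norm_env(9, {}))                                 # Hopper
print(configure_norm_env(8, {"FLASHINFER_USE_CUDA_NORM": "0"}))  # override kept
print(attention_backend_args(), attention_backend_args("fa3"))
```

Setting the variable before sglang/flashinfer are imported matters because the choice is read at import time, per the message above.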
Training co-locates trainer + RaaS + SGLang in one container and drives many concurrent rollouts, which surfaces two docker run requirements that the docs were inconsistent or silent about:
- shm-size: docker/README.md still showed 16g, which causes "[Errno 28] No space left on device" when RaaS stages weights under /dev/shm. Bump it to 512g to match the install guide, with a note.
- nofile: the container default soft limit (1024) is too low for the reward worker process pool and fails with "[Errno 24] Too many open files". Add --ulimit nofile=65536:65536 to the run commands in both docker/README.md and docs/en/get-started/installation.md, with a note.
Feat/megatron weight sync dev
Introduces ``spawn_rlvr``, a single-shot tool-call workflow where the main
agent may emit one ``<spawn>{"tasks": [...]}</spawn>`` block per trajectory
to dispatch up to 4 sub-agents in parallel against the same RaaS pool.
Sub-agent outputs are spliced back into main's context inside a
``<spawn_result>`` block and the main agent continues to a final answer.
Training scheme: one trajectory per episode contains 1 main + N sub-agent
sequences, all sharing the team reward (math_verify on main's final
answer). No model_ids tagging — single-trainer regime. Mirrors
solve_and_check.py's multi-sequence-shared-reward precedent.
Implementation notes:
- Phase-1 generates freely (no SGLang string-stop; SGLang runs with
--skip-tokenizer-init and crashes on string-based stop matching).
The workflow detects <spawn>...</spawn> post-hoc via regex and
truncates phase-1 tokens at the close-tag boundary.
- Over-spec'd payloads (>4 tasks) are silently capped to 4; malformed
JSON degrades the trajectory to vanilla single-turn RLVR.
- Rollout dumps under {fileroot}/rollout_dumps/{version}/{qid}.txt
contain decoded phase-1, per-sub-agent task+output, and phase-2 text
for sanity checking. dump_prob configurable per workflow.
Recipe: examples/math/spawn/qwen3-8b-spawn/ — Qwen3-8B, 8 GPUs, M2PO,
ctx 16k, offline math datasets. Main max_new_tokens=3000, sub-agent
max_new_tokens=1500 (max aggregated injection 4*1500=6000) so phase-1
+ aggregated sub-results + phase-2 fits in the 16k SGLang window.
Smoke-tested on 8x H100 (3 train steps + eval-at-start on the full
4768-item eval suite). Untrained Qwen3-8B emits valid spawn payloads
~40% of the time without SFT. Over 100 train steps, overall eval rose
from 40.6% (v0) to 46.3% (v75), surpassing the vanilla rlvr baseline
(44.3%) by 2.0 percentage points, which suggests the team-reward
gradient on shared trajectories is productive.
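The post-hoc spawn detection described in the implementation notes can be sketched as follows. This is a minimal illustration under stated assumptions: it operates on decoded text rather than tokens, and `parse_spawn` and its return convention are hypothetical, not the workflow's actual code.

```python
import json
import re

SPAWN_RE = re.compile(r"<spawn>(.*?)</spawn>", re.DOTALL)
MAX_TASKS = 4  # payloads with more tasks are capped, per the notes above


def parse_spawn(text):
    """Return (text, tasks).

    On a valid spawn block, text is truncated at the close tag and tasks
    holds at most MAX_TASKS entries. Otherwise tasks is None, meaning the
    trajectory falls back to plain single-turn RLVR.
    """
    match = SPAWN_RE.search(text)
    if match is None:
        return text, None
    try:
        tasks = json.loads(match.group(1))["tasks"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return text, None
    if not isinstance(tasks, list) or not tasks:
        return text, None
    return text[: match.end()], tasks[:MAX_TASKS]


raw = 'plan <spawn>{"tasks": ["a", "b", "c", "d", "e"]}</spawn> extra text'
truncated, tasks = parse_spawn(raw)
print(truncated)
print(tasks)
print(parse_spawn("<spawn>{not json}</spawn>"))
```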
Port of platoon's TextCraft RL setup. Adds a new workflow class
`recursive_agent` that lets a root agent spawn 1-4 sub-agents in
parallel via asyncio.gather, each inheriting the parent's inventory
by reference. Trees are bounded by max_depth=3, max_breadth=4.
- workflow.py: ParsedAction dispatcher (get_info / view_inventory /
craft / spawn / finish), per-agent BudgetTracker, trajectory dump
format with full message logs for debugging.
- env.py: stateful TextCraftEnv with forkable inventory aliasing
(sub-agents and root share one mutable dict); binary all-or-nothing
evaluate() against task.misc["target_items"].
- recipe_loader.py + recipes/: bundled Minecraft recipe DB (~860
recipes) so no HF download is needed.
- tasks.py + bundled textcraft_{train,val}.jsonl: 1000 train / 100
val tasks synthesized from the recipe DB; deterministic seed.
- dataflow/dataset/textcraft.py: dataset loaders for RL training
and eval splits.
- reward/textcraft_success.py: stub registered for parity; actual
reward comes from env.evaluate().
- examples/textcraft/qwen3-4b-recursive/: full recipe (yaml +
scripts) for Qwen3-4B-Instruct-2507 with M2PO, FSDP, SGLang,
ctx32k, TCP weight transfer.
- docs/recipes/textcraft-recursive.md: design overview.
Also ignores claude-doc/ in .gitignore.
Ports platoon's OOLONG recursive-agent design (arxiv 2605.06639) to AstraFlow as a new workflow (oolong_recursive) with reward fn (oolong_success), HF dataset loader, and Qwen3-4B-Instruct-2507 recipe under examples/oolong/qwen3-4b-recursive/. Sub-agent grading is currently rule-based for oolong-synth and a placeholder (score=0) for oolong-real until an LLM judge is wired.
Two sweep variants of qwen3-4b-recursive:
- gen4k: per-turn max_new_tokens 1024 -> 4096 for longer agent thinking
- lr5e6: lr 3e-6 -> 5e-6 (earlier exploration)
Recipe-only additions, no library code changes.
Two public functions:
- judge(system, user, ...) -> str: posts a (system, user) pair to Fireworks and returns the raw assistant content. Retries on 429/5xx with exponential backoff (3 attempts). Falls back to reasoning_content when content is empty (handles gpt-oss-120b's reasoning-model quirk).
- extract_json(text) -> dict: parses JSON out of an LLM response, tolerating ```json``` and plain ``` fence wrapping.
Default model: accounts/fireworks/models/gpt-oss-120b (2s/call avg, vs 4s for deepseek-v4-pro and 8s for kimi-k2p6 on the same Fireworks account). max_tokens default 2048 to give reasoning models enough headroom.
Each caller writes its own system prompt and parses what it expects -- no central rubric registry, no JudgeRewardEnv mixin, no caching, no budget gate. Matches platoon's pattern.
Includes:
- test_judge.py: 7 unit tests for extract_json + API-key guard, plus one live end-to-end test skipped without FIREWORKS_API_KEY.
- judge_example.py: runnable script with 9 calibration cases, prints full input/output for each, supports --user / --system / --model flags for custom cases.
See claude-doc/minimal-llm-judge-plan.md for design rationale.
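A fence-tolerant JSON extractor along the lines described for extract_json might look like the sketch below. It is an independent illustration (named `extract_json_sketch` to avoid implying it is the project's function), and the fallback of slicing from the first "{" to the last "}" is an assumption.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json_sketch(text: str) -> dict:
    """Parse a JSON object from an LLM reply, with or without fence wrapping."""
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(candidate[start : end + 1])


fenced = FENCE + 'json\n{"score": 1}\n' + FENCE
plain_fence = FENCE + '\n{"score": 0.5}\n' + FENCE
print(extract_json_sketch(fenced))
print(extract_json_sketch(plain_fence))
print(extract_json_sketch('Sure, here it is: {"score": 0}'))
```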
Two user-facing reward systems, selected by reward_mode in workflow_spec:
team_credit (default)
All agents share the root's rule-based reward. No LLM judge calls.
Cheap, simple, every agent gets some signal.
per_agent_judge
Root keeps its rule-based reward; each sub-agent is scored by an
LLM judge (astraEnv.judge) on its own (goal, output). True per-agent
credit assignment at the cost of API calls per sub-agent.
env.py changes:
- evaluate() is now async; sub-agent branch routes to the LLM judge
when use_llm_judge=True, else returns the legacy 0.0 placeholder.
- _grade_subagent_with_llm() catches all exceptions and clamps to
[0, 1] so a flaky judge never crashes a rollout.
- Adds use_llm_judge and judge_model kwargs.
workflow.py changes:
- New reward_mode kwarg with validation (raises ValueError on unknown
modes, including the now-dropped 'root_only').
- use_llm_judge is derived from reward_mode -- never set independently.
- Sequence emission no longer filters out sub-agent trajectories; all
agents are emitted with their own reward from _reward_for_agent(),
which picks root vs own based on reward_mode.
Tests:
- test_env.py: 10 tests (judge enabled/disabled, parse failures,
network failures, clamping, model override, synth path unchanged).
- test_workflow.py: 12 tests (default mode, validation, two-mode tree
matrix). All LLM calls are mocked -- no API key required to run.
Backwards-compatible defaults: yaml without reward_mode gets
team_credit, which corresponds to "broadcast root reward to all
agents". Note this differs from v5's root-only training (sub-agents
were filtered out); root_only is no longer a supported mode.
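The two reward modes reduce to a small selection rule. A minimal sketch, assuming the semantics stated above; the function names and signatures are hypothetical and simplified relative to the workflow's `_reward_for_agent()`:

```python
VALID_REWARD_MODES = ("team_credit", "per_agent_judge")


def validate_reward_mode(mode="team_credit"):
    """Reject unknown modes, including the dropped 'root_only'."""
    if mode not in VALID_REWARD_MODES:
        raise ValueError(f"unknown reward_mode {mode!r}; expected one of {VALID_REWARD_MODES}")
    return mode


def reward_for_agent(mode, is_root, root_reward, own_reward):
    """Root always keeps its rule-based reward; sub-agents depend on the mode."""
    if is_root or mode == "team_credit":
        return root_reward
    return own_reward  # per_agent_judge: the sub-agent's own judge score


mode = validate_reward_mode()          # yaml without reward_mode
use_llm_judge = mode == "per_agent_judge"  # derived, never set independently
print(mode, use_llm_judge)
print(reward_for_agent("team_credit", False, 1.0, 0.3))
print(reward_for_agent("per_agent_judge", False, 1.0, 0.3))
```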
…raEnv
Two new utilities shared by the DeepDive pipeline:
- search.py: minimal httpx client for the CMU RAG server with a 256-concurrency semaphore and 3-attempt exponential backoff. Backoff sleeps release the semaphore so degraded server periods don't thundering-herd onto starved slots.
- checklist.py: ChecklistGrader, an ai-rubric-style RubricChecklistFast port. A single LLM call generates 3-5 atomic criteria from the goal and scores them in one response, returning a holistic overall_score in [0, 1]. System prompt is verbatim from ai_rubric 0.2.4.
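The "backoff sleeps release the semaphore" pattern can be sketched with asyncio alone. This is an illustration, not search.py: `call_with_backoff`, the base delay, and the use of ConnectionError as the retryable error are assumptions, and no HTTP client is involved.

```python
import asyncio


async def call_with_backoff(sem, fn, attempts=3, base_delay=0.5):
    """Run fn() under sem, retrying with exponential backoff.

    The semaphore is held only while fn() runs; the sleep between
    attempts happens after the slot has been released.
    """
    for attempt in range(attempts):
        async with sem:
            try:
                return await fn()
            except ConnectionError:
                if attempt == attempts - 1:
                    raise
        # Slot released here, so other callers can use it while we wait.
        await asyncio.sleep(base_delay * 2 ** attempt)


calls = {"n": 0}


async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server busy")
    return "ok"


async def main():
    sem = asyncio.Semaphore(1)
    result = await call_with_backoff(sem, flaky, attempts=3, base_delay=0.0)
    return result, sem.locked()


outcome, still_locked = asyncio.run(main())
print(outcome, calls["n"], still_locked)
```

Sleeping inside the `async with` block would keep the slot occupied during the wait, which is the behavior the message says it avoids.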
Port of platoon's DeepDive RL recipe to AstraFlow.
- workflow_cls "deepdive_recursive": recursive web-research agent loop
with <action type="search|spawn|finish">{JSON}</action> format.
- env.py: search (CMU RAG), spawn (sub-agent), finish actions. Reward
routing: root task -> binary-success LLM judge against ground truth
(verbatim platoon rubric); sub-agent task -> ChecklistGrader.
- Workflow stamps sample-weighted group_reward_mean over root rewards
and group_reward_std=1.0 on every emitted sequence, matching
platoon's mean-only centering (no std normalization).
- reward_mode selector (team_credit | per_agent_judge) for credit
assignment experiments.
- Recipe: qwen3-4b-recursive with bs=256, lr=3e-6, total 500 steps,
filter_zero_adv re-enabled, delegation_lambda=0.
- dnd_process_response: type-aware rule-based scoring ported faithfully
from platoon. int->int uses 0.75^|gap| partial credit; str->str is
exact-match after strip().lower(); list->list is Jaccard overlap.
\boxed{...} extraction with parse_confidence label.
- Fix env.py routing: sub-agent task ids inherit the parent's dataset
prefix and were incorrectly hitting the rule-based grader. Now any
id containing "/sub_" routes to the LLM judge regardless of prefix.
- Add qwen3-4b-recursive-real recipe targeting the D&D split.
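The type-aware scoring rules listed above translate into a short function. A sketch under assumptions: `score_answer` is a hypothetical name, mismatched types score 0.0, and two empty lists score 1.0; none of those edge cases are specified in the message.

```python
def score_answer(pred, truth):
    """Score a prediction against ground truth by type."""
    if isinstance(pred, int) and isinstance(truth, int):
        return 0.75 ** abs(pred - truth)            # partial credit by distance
    if isinstance(pred, str) and isinstance(truth, str):
        return 1.0 if pred.strip().lower() == truth.strip().lower() else 0.0
    if isinstance(pred, list) and isinstance(truth, list):
        a, b = set(pred), set(truth)
        if not a and not b:
            return 1.0                              # assumption: both empty counts as a match
        return len(a & b) / len(a | b)              # Jaccard overlap
    return 0.0                                      # assumption: type mismatch scores zero


print(score_answer(3, 5))
print(score_answer("  Yes ", "yes"))
print(score_answer(["a", "b"], ["b", "c"]))
```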
The producer was unconditionally overwriting group_reward_mean and group_reward_std on every emitted sequence, blocking workflows from publishing their own baseline. Now the producer only fills these fields when the workflow has not already stamped them. Motivation: recursive agents emit a variable number of sequences per prompt (root + N sub-agents). Letting the producer compute group stats over all sequences sequence-weights the baseline, pulling it toward samples that happened to spawn more sub-agents. The DeepDive workflow now stamps a sample-weighted mean over root rewards and std=1.0.
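A small numeric example shows how sequence weighting shifts the baseline, and how fill-if-missing leaves workflow-stamped values alone. The helper names and the dict-based sequence record are hypothetical; only the field names group_reward_mean and group_reward_std come from the message.

```python
def sequence_weighted_mean(episodes):
    """Mean over all emitted sequences; episodes is a list of (root_reward, n_sequences)."""
    total = sum(n for _, n in episodes)
    return sum(r * n for r, n in episodes) / total


def sample_weighted_mean(episodes):
    """Mean over root rewards only, one vote per sampled episode."""
    return sum(r for r, _ in episodes) / len(episodes)


def fill_group_stats(seq, mean, std):
    """Fill the stats only if the workflow has not already stamped them."""
    seq.setdefault("group_reward_mean", mean)
    seq.setdefault("group_reward_std", std)
    return seq


# One successful episode that spawned 4 sub-agents, one failed episode with none.
episodes = [(1.0, 5), (0.0, 1)]
print(sequence_weighted_mean(episodes))  # pulled toward the episode with more sequences
print(sample_weighted_mean(episodes))

stamped = {"group_reward_mean": 0.5, "group_reward_std": 1.0}
print(fill_group_stats(dict(stamped), mean=0.83, std=0.37))
print(fill_group_stats({}, mean=0.83, std=0.37))
```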
Include the question and ground_truth in episode dump files so rollouts can be inspected without cross-referencing the dataset. Bump dump_prob 0.05 -> 0.25 (train + eval) for denser sampling during debugging (~64 dumps/step at bs=256). Add qwen3-4b-recursive-v7 recipe variant (trial_name suffixed -v7).
…rs 5 transformers>=5 makes apply_chat_template(tokenize=True) return a BatchEncoding (a Mapping) instead of a flat list[int]. Workflows that did list(apply_chat_template(...)) then got the dict keys (['input_ids','attention_mask']), which were sent to the inference engine and rejected with HTTP 400, breaking every agent/recursive/multi-agent recipe on a transformers-5 env. Add a shared apply_chat_template_to_ids() helper in hf_utils that defaults tokenize=True, forwards enable_thinking with a TypeError fallback, and extracts input_ids when a Mapping is returned. Route every workflow that builds token ids through it; the recursive textcraft/oolong/deepdive workflows keep an equivalent inline guard. rlvr already passed return_dict=False and is unaffected.
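The failure mode and the guard can be reproduced with a plain dict standing in for the returned mapping. This sketch does not call transformers; `to_token_ids` is a hypothetical reduction of the guard described above, not the apply_chat_template_to_ids helper itself.

```python
from collections.abc import Mapping


def to_token_ids(encoded):
    """Return a flat list of token ids from either a list or a mapping with input_ids."""
    if isinstance(encoded, Mapping):
        encoded = encoded["input_ids"]
    return list(encoded)


# Stand-in for what a mapping-style return value looks like.
encoded = {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]}

print(list(encoded))          # the bug: iterating a mapping yields its keys
print(to_token_ids(encoded))  # the guard: extract input_ids first
print(to_token_ids([101, 2023, 102]))
```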
At 2048 the recursive agent tree fans out to ~16k concurrent generate requests, saturating the 4-GPU SGLang so no episode ever completes and the trainer hangs at step 0 waiting for data. 512 keeps the live agent count ~2.4k and lets episodes finish; validated across a full 105-step run and a clean 5-recipe smoke test on sglang 0.5.12.
Oolong is not supported in the latest version. Remove the oolong_recursive workflow (env, tasks, eval_helpers, tests, jsonl data), the oolong_success reward stub, the oolong dataset loaders, and both examples/oolong recipes. Drop the two oolong import lines from the workflow package __init__ and a stale "Oolong" mention in the deepdive recipe comments. The workflow and reward registries load cleanly without oolong; no references remain in source.
Drop examples/deepdive (qwen3-4b-recursive, -v7): DeepDive needs a search
server we don't currently run. The deepdive workflow, reward, and dataset
code stay in place and registered, ready to use again once a search
backend is available.
Also drop the experimental TextCraft variants
qwen3-4b-recursive-{gen4k,lr5e6}, keeping the base qwen3-4b-recursive
(the recipe actually in use).
Update the package version (astraflow/version.py and train_worker/version.py) and all version references in the docs and READMEs from 0.1.0 to 0.1.1: docs Sphinx version, sidebar badge, docs index title, and the astraflowai/astraflow image tag in docker/README and the installation guide. Add a v0.1.1 News entry to README (keeping the v0.1.0 record). Matches the already-published astraflowai/astraflow:v0.1.1 image (CUDA 13 / SGLang 0.5.12); no Docker rebuild required.
Spawn solution
Move examples/textcraft -> examples/textcraft-recursive-agent (the recipe is now examples/textcraft-recursive-agent/qwen3-4b-recursive) and update the path references in the recipe scripts' usage comments and the textcraft-recursive doc. Directory depth is unchanged, so the scripts' repo-root resolution is unaffected. The workflow code (astraflow/core/workflow/impl/textcraft) and the experiment_name/ trial_name identifiers are intentionally left as-is.
Add docs/web/examples/textcraft-recursive-8agent-episode.txt — a real rollout dump from the textcraft recursive-agent recipe where the root agent spawns 7 sub-agents (all succeed, reward 1.0) to craft the dye and material intermediates in parallel. Placed under docs/web so the animation pages can fetch it; the ROOT/SUB depth/parent markers and spawn actions provide the agent tree and timing for visualization.
The prebuilt `transformer-engine[pytorch]` wheels link libcublas.so.12 and fail to load on the CUDA 13 base image (and a CUDA 13 host install) with `ImportError: libcublas.so.12: cannot open shared object file`. The astraflow v0.1.1 stack is torch 2.11+cu130 / CUDA 13, so the wheel path is broken. Build TE from source (release_v2.13) against the CUDA 13 toolkit instead, with `nvidia-mathdx==25.6.0` supplying the build-time cuBLASDx / cuDNN frontend headers (mirrors slime's CUDA-13 recipe). Apply the same fix in docker/Dockerfile.sglang.megatron and the optional Megatron step in the installation guide. Verified: image astraflowai/astraflow:v0.1.1-megatron builds and `import transformer_engine.pytorch` succeeds on CUDA 13 (was the libcublas.so.12 ImportError before); apex and MegatronEngine also import.
The astraflowai/astraflow:v0.1.1.megatron tag (Transformer Engine + apex, built from Dockerfile.sglang.megatron) was published but never referenced in the docs. Map both images to their training backend so users pick the right one: v0.1.1 for the FSDP backend (default), v0.1.1.megatron for the Megatron-LM backend (TP/PP/EP, MoE).
- docker/README.md: list both tags in the pull and run sections
- installation.md: split Option B by backend; cross-link from Step 5
- qwen3-8b-megatron-delta/README.md: note the recipe needs the .megatron image
Two pre-merge correctness fixes for v0.1.1:
- weight_manager: add a buffer-overflow guard in _offload_megatron_hf, mirroring _copy_all_gather. Without it, an export/metadata size disagreement at inactive-buffer index 0 silently spills into the active half the sender is shipping, corrupting weights with no error. The guard raises at the write site instead.
- textcraft: default depth_level_weighting to False. The 1/(depth+1) weighting on raw reward gives sub-agents backwards credit; the shipped recipe already disabled it, but the in-code default would silently apply it to any recipe that omits the flag.
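The write-site guard for a two-half buffer can be illustrated with a bytearray. This is a CPU-only, stdlib sketch: `write_guarded` and the 16-byte buffer are hypothetical, and it stands in for, rather than reproduces, the check in _offload_megatron_hf.

```python
def write_guarded(buffer, offset, limit, src):
    """Copy src's bytes into buffer at offset, refusing to write past limit.

    Returns the next free offset. Raises before touching the buffer if the
    write would cross into the other half.
    """
    view = memoryview(src).cast("B")
    end = offset + view.nbytes
    if end > limit:
        raise ValueError(f"write of {view.nbytes} bytes at {offset} exceeds limit {limit}")
    buffer[offset:end] = view
    return end


half = 8
buf = bytearray(2 * half)          # [0, 8) inactive half, [8, 16) active half
buf[half:] = b"\xff" * half        # pretend the active half is being shipped

cursor = write_guarded(buf, 0, half, b"\x01" * 6)
try:
    write_guarded(buf, cursor, half, b"\x02" * 4)   # would spill 2 bytes into the active half
except ValueError as exc:
    print("guard raised:", exc)

print(bytes(buf[half:]) == b"\xff" * half)          # active half untouched
```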
Add a runnable README for the qwen3-4b-recursive TextCraft recipe and flesh out the docs recipe page with a spawn animation, validation-accuracy curve, and a reference to the Recursive Agent Optimization paper (arXiv:2605.06639). Correct the settings table to match experiment.yaml (lr, batch size, max_staleness, total_train_steps, depth_level_weighting) and link the example from the examples index. Also remove the math-spawn recipe page from the docs and its toctree entry.
Relocate the qwen3-4b-recursive README up to examples/textcraft-recursive-agent/ so it's the landing page for the recipe family, and fix the asset paths for the new depth.
Add a news entry linking the TextCraft recursive-agent recipe docs, and fix the v0.1.1 release date to 2026/06.
fix(megatron): build Transformer Engine from source for CUDA 13
Summary
Rolls up v0.1.1 onto main: a Megatron-LM training backend, offline math datasets, new multi-agent workflows, and a toolchain bump.
Highlights
direct-DMA offload path (~23× faster).
(TextCraft, Oolong, DeepDive) with Qwen3-4B recipes; offline math datasets
client, rubric grader.
weight-sync page, CUDA 13 install steps.
~860 of the changed files are bundled TextCraft (Minecraft) recipe JSONs.