AstraFlow v0.1.1: Megatron backend, offline math, recursive/spawn agents by haizhongzheng · Pull Request #16 · Infini-AI-Lab/astraflow

haizhongzheng · 2026-06-05T18:33:05Z

Summary

Rolls up v0.1.1 onto main: a Megatron-LM training backend, offline math
datasets, new multi-agent workflows, and a toolchain bump.

Highlights

- Megatron-LM backend with HF-space weight sync (PP/EP/VPP) and a
  direct-DMA offload path (~23× faster).
- Toolchain: CUDA 13, SGLang 0.5.12, transformers 5.
- Workflows: spawn-sub-agents for math RL; recursive-agent workflows
  (TextCraft, Oolong, DeepDive) with Qwen3-4B recipes; offline math datasets
  + recipes.
- astraEnv: LLM-as-judge library, reward_mode selector, RAG search
  client, rubric grader.
- Docs: TextCraft recursive-agent and offline-math recipe pages, Megatron
  weight-sync page, CUDA 13 install steps.

~860 of the changed files are bundled TextCraft (Minecraft) recipe JSONs.

Upgrade the inference/runtime stack to the latest sglang and the dependency versions it requires, validated end-to-end on the FSDP backend (qwen3-1.7b math example, 2x L40).

Version pins (pyproject.toml, docs, Docker):
- sglang 0.5.5.post1 -> 0.5.12.post1
- torch 2.8.0 -> 2.11.0; torch_memory_saver 0.0.9 -> 0.0.9.post1
- transformers 4.57.1 -> 5.6.1 (sglang pins ==5.6.0, which has a flash-attention s_aux=None crash for non-sink models; 5.6.1 is the upstream patch release. Forced via [tool.uv] override-dependencies, which requires uv >= 0.10 -- documented in installation.md)
- peft -> >=0.18.0 (required by transformers 5.x)
- CUDA base image 12.9.1 -> 13.0.0

sglang 0.5.12 API compatibility:
- remove LoRAAbortReleasePatch (the abort-path lora_registry.release() it added is now fixed upstream; keeping it would double-release)
- remove enable_ep_moe from SGLangConfig (field dropped from ServerArgs)
- kernel package rename sgl_kernel -> sglang_kernel in the installation validator

transformers 5.x / sglang 0.5.12 runtime fixes (surfaced by the run):
- rlvr workflow: apply_chat_template now returns a BatchEncoding; pass return_dict=False to get the flat list[int] the rollout path expects
- fsdp apply_fsdp2: model._no_split_modules is a set in transformers 5.x; coerce to list before indexing
- raas free-port range capped at 55535 so sglang's derived gRPC port (port + 10000) stays <= 65535

Scope: FSDP backend only. Megatron / VL paths are intentionally not covered here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump sglang 0.5.5.post1 -> 0.5.12.post1 (FSDP path)

sglang 0.5.12's /health round-trips through the scheduler, which stays saturated for ~30-40s during the initial unchunked prefill of ~2048 requests/engine. The old 3-strike / 30s watchdog (5s probe timeout) hard-exited a busy-but-alive engine before the first rollout batch completed, hanging the rollout pipeline at step 0. Raise the /health probe timeout 5s -> 20s so a slow-but-alive endpoint isn't marked failed, and the failure budget 3 -> 5 strikes. A crashed engine refuses connections instantly, so real-death detection stays ~50s (worst case ~100s) while the prefill ramp is tolerated. Verified: math and code qwen3-8b-m2po-delta recipes train through the ramp with zero watchdog strikes.
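The detection-time figures above follow from the strike budget. A minimal sketch of that arithmetic, assuming a 10 s probe interval (inferred from the old "3-strike / 30s" figure, not stated in the commit), consecutive-failure counting, and hypothetical names throughout:

```python
from dataclasses import dataclass


@dataclass
class WatchdogConfig:
    probe_interval_s: float = 10.0  # assumption: inferred from "3 strikes / 30s"
    probe_timeout_s: float = 20.0   # raised from 5s in this commit
    max_strikes: int = 5            # raised from 3 in this commit


def detection_window(cfg: WatchdogConfig) -> tuple[float, float]:
    """Return (typical, worst-case) seconds before a dead engine is declared failed.

    Simplified model: a crashed engine refuses connections immediately, so each
    strike costs about one probe interval; a hung engine costs a full timeout
    per strike.
    """
    typical = cfg.max_strikes * cfg.probe_interval_s
    worst = cfg.max_strikes * cfg.probe_timeout_s
    return typical, worst


def should_kill(probe_results: list[bool], cfg: WatchdogConfig) -> bool:
    """True once max_strikes consecutive probes have failed."""
    strikes = 0
    for ok in probe_results:
        strikes = 0 if ok else strikes + 1
        if strikes >= cfg.max_strikes:
            return True
    return False
```

Under these assumptions a single successful probe during the prefill ramp resets the count, which is why a busy-but-alive engine survives.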

…ution

Two from-scratch install blockers with the sglang 0.5.12 / torch 2.11 stack:
- sglang 0.5.12 depends on flash-attn-4>=4.0.0b9 (a pre-release pulled in as a dependency), so resolution fails unless pre-releases are allowed. Add prerelease = "allow" to [tool.uv] so `uv pip install -e ".[sglang]"` resolves on both the conda and Docker paths.
- flash-attn 2.8.3 builds from source; nvcc writes GBs of intermediates to $TMPDIR. When $TMPDIR is a small/NFS-quota'd home the build fails with "nvFatbin error: empty input" / "Disk quota exceeded" from truncated temps.

Document setting CUDA_HOME and a roomy TMPDIR, switch the sglang step to the project-extra form, and clarify flash-attn (FA2, trainer) vs flash-attn-4 (pulled in by sglang).

sglang requires an unbounded "kernels", so uv resolved the latest (0.15), but transformers 5.6.1 only supports kernels<0.13 — its hub_kernels module constructs LayerRepository() without a revision/version, which kernels 0.15 rejects, so `import sglang` crashes with "Either a revision or a version must be specified." Pin to the range transformers 5.6.1 expects (0.12.x). Verified on a from-scratch env: kernels resolves to 0.12.3 and the math recipe trains.

Add export_hf_named_params: a streaming generator that reconstructs the global model from Megatron's TP/PP/EP/ETP/VPP layout and yields HF-named, HF-layout CPU tensors one at a time (OOM-safe for large / MoE models). The gather + mcore->HF conversion is delegated to mbridge's export_weights (the same bridge the engine already uses to load/save); this module adds the consumer concerns: CPU move, byte-bounded bucketing, and a metadata-only path for transfer-buffer sizing.

This is the foundation for correct sparse weight sync under full Megatron parallelism. The design (the "delta is computed in HF byte space" invariant) is documented in docs/en/architecture/megatron-weight-sync.md.

The Megatron backend needs two extra compiled deps beyond the base install (megatron-core / mbridge are already there): Transformer Engine (fused LayerNorm + sequence parallelism) and apex (optional fused LayerNorm/Adam). These are kept out of the default image: a separate docker/Dockerfile.sglang.megatron layers them on top of Dockerfile.sglang, and installation.md gains an optional "Step 5: Install the Megatron training backend" under Option A. The FSDP backend and inference are unaffected.

Validated (exact bf16 match vs the HF reference checkpoint):
- Qwen3-0.6B TP=2: 310 tensors, 0 mismatch
- Qwen3-0.6B PP=2: 311 tensors, 0 mismatch
- Qwen3-0.6B TP=2 PP=2: 311 tensors, 0 mismatch
- Qwen3-30B-A3B TP=2 EP=2 PP=2: 18867 tensors, 0 mismatch
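To illustrate the "byte-bounded bucketing" consumer concern, here is a minimal sketch of grouping a stream of named payloads under a byte budget. The function name `bucket_by_bytes` and the use of plain `bytes` in place of tensors are assumptions for illustration; this is not the module's actual code.

```python
from typing import Iterable, Iterator


def bucket_by_bytes(
    named_payloads: Iterable[tuple[str, bytes]], max_bucket_bytes: int
) -> Iterator[list[tuple[str, bytes]]]:
    """Group a stream of (name, payload) pairs into buckets of bounded size.

    Only one bucket is held in memory at a time. A single payload larger
    than the budget is emitted alone rather than split.
    """
    bucket: list[tuple[str, bytes]] = []
    used = 0
    for name, payload in named_payloads:
        # Flush the current bucket before it would exceed the budget.
        if bucket and used + len(payload) > max_bucket_bytes:
            yield bucket
            bucket, used = [], 0
        bucket.append((name, payload))
        used += len(payload)
    if bucket:
        yield bucket
```

Because the input is consumed lazily, peak memory is bounded by the budget plus the largest single payload, which is the property that matters for large or MoE models.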

Replace the TP-only shard-direct weight transfer with the HF-export path:
- MegatronEngine.export_hf_named_params() / get_hf_weight_metadata() stream gathered HF tensors via mbridge (handles TP/PP/EP/ETP/VPP). The previous PP>1 / EP>1 NotImplementedError guards are removed.
- WeightManager gains "megatron_hf_meta" mode: the transfer buffer is sized for the full HF model and offload() streams HF tensors into the inactive half on the writer rank, while the gather collectives run on all ranks in lockstep. The sender receives megatron_metadata=None and runs the plain full/delta path used by FSDP. Because the buffer now holds HF-layout bytes, the sparse delta is computed in HF space and is correct under any parallelism, fixing the latent corruption where the delta was computed in mcore layout but applied by the receiver in HF layout.
- ppo_trainer wires the generator + HF metadata through.

The legacy CPU shard-reassembly in the sender agent is now unused for Megatron (kept only for the deprecated megatron_metadata path).

Validated (buffer roundtrip == HF reference, bit-exact):
- Qwen3-0.6B TP=2: 310 tensors, 0 mismatch, 1.19 GB
- Qwen3-0.6B TP=2 PP=2: 311 tensors, 0 mismatch, 1.50 GB
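The layout invariant can be shown with a toy byte-level delta: offsets are only meaningful when sender and receiver buffers share one layout. This is a sketch with hypothetical helper names and a per-byte granularity chosen for simplicity; the real sender presumably works on tensors.

```python
def compute_delta(old: bytes, new: bytes) -> list[tuple[int, int]]:
    """Return (offset, new_byte) pairs for every byte that changed."""
    if len(old) != len(new):
        raise ValueError("buffers must have the same layout and size")
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


def apply_delta(base: bytearray, delta: list[tuple[int, int]]) -> None:
    """Patch the receiver's buffer in place.

    Correct only if `base` uses the same layout the delta was computed in.
    """
    for offset, value in delta:
        base[offset] = value


def delta_sparsity(delta: list[tuple[int, int]], total_bytes: int) -> float:
    """Fraction of bytes that did not need to be sent."""
    return 1.0 - len(delta) / total_bytes
```

If the delta were computed against a differently ordered buffer, the same offsets would patch the wrong bytes without raising, which is the corruption mode the commit describes.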

Add examples/math/qwen3-8b-megatron-delta: the FSDP qwen3-8b-m2po-delta recipe with the trainer engine switched to the Megatron backend (backend: megatron, tensor_parallel_size: 4). Identical data, algorithm, and weight-transfer path, so it doubles as a clean FSDP-vs-Megatron A/B.

End-to-end validation (single 8-GPU node, 4 RaaS + 4 trainer TP=4, delta TCP weight sync, DeepScaleR/M2PO):
- Qwen3-8B (this recipe): 25 steps, 0 errors; weight_transfer/delta_sparsity ~0.92 (delta computed in HF space); task_reward/avg rose 0.535 (first half) -> 0.585 (last half), recent steps 0.61-0.64. Per-step weight offload 0.59s.
- Qwen3-30B-A3B MoE (TP2/PP3/EP2 trainer on 6 GPUs + SGLang TP2 on 2 GPUs): 21 steps, 0 errors; full MoE export (18867 tensors, 61 GB) gathered across TP/PP/EP each step; task_reward/avg ~0.64 -> 0.66 (recent steps 0.70-0.77).

The Megatron HF-export offload materialized each gathered tensor in pageable host memory via .to("cpu") before copying it into the pinned shared-memory transfer buffer, a ~1 GB/s bounce that cost ~13s/step for an 8B model on the RL critical path.

Now the engine yields the gathered tensors on GPU (export_hf_named_params to_cpu=False) and WeightManager copies each tensor's uint8 view directly into the pinned buffer slice (non_blocking=True), fenced by a single cuda.synchronize() before the cross-rank barrier. The pinned buffer is already cudaHostRegister'd, so this hits the PCIe DMA engine. Copying through uint8 views on both sides keeps the copy alignment-free (robust to mixed-dtype models) and byte-identical for contiguous sources.

Measured (Qwen3-8B, TP=4, 16.38 GB):
- pageable (old): 12.6s (1.3 GB/s)
- direct DMA: 0.56s (29.3 GB/s), ~23x

Byte-equivalence verified (new buffer == old pageable path == HF reference, bit-exact) across TP=2, PP=2, TP=2/PP=2, and MoE TP=2/EP=2/PP=2 (Qwen3-30B-A3B, 61 GB).

Adds tests/test_direct_dma_offload.py (equivalence) and tests/bench_offload_dma.py (throughput).
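The reported bandwidths and speedup can be re-derived from the transfer size and the two timings. A small arithmetic check using only the numbers quoted above (the variable names are illustrative):

```python
TRANSFER_GB = 16.38   # Qwen3-8B, TP=4, from the measurement above
PAGEABLE_S = 12.6     # old path: pageable host bounce, then copy to pinned buffer
DIRECT_DMA_S = 0.56   # new path: GPU tensor copied straight into the pinned buffer


def throughput_gb_s(gigabytes: float, seconds: float) -> float:
    """Effective bandwidth of one offload."""
    return gigabytes / seconds


old_bw = throughput_gb_s(TRANSFER_GB, PAGEABLE_S)    # about 1.3 GB/s
new_bw = throughput_gb_s(TRANSFER_GB, DIRECT_DMA_S)  # about 29.25 GB/s
speedup = PAGEABLE_S / DIRECT_DMA_S                  # about 22.5x, quoted as ~23x
saved_per_step_s = PAGEABLE_S - DIRECT_DMA_S         # about 12 s off each RL step
```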

The hardcoded sglang inference defaults assume Hopper and crash on non-Hopper GPUs (verified on L40 / Ada sm_89), while the identical package stack runs on H100. Two Hopper-only kernel paths were forced regardless of hardware:
- attention_backend="fa3": FlashAttention-3 is Hopper-only; on Ada/Ampere it fails CUDA-graph capture ("scheduler_metadata must have shape").
- flashinfer 0.6.x CuTe-DSL RMSNorm: no Ada/Ampere kernel, JITs into an incompatible nvidia-cutlass-dsl and crashes (GPUModuleOp TypeError).

Make both arch-aware so one image/env runs on both:
- SGLangConfig.attention_backend defaults to None, which omits the --attention-backend flag and lets sglang auto-select per GPU (fa3 on Hopper, an Ada/Ampere-compatible backend below sm_90).
- raas/entrypoint.py sets FLASHINFER_USE_CUDA_NORM=1 on non-Hopper GPUs before sglang/flashinfer import, selecting flashinfer's CUDA-JIT norm. Detection uses NVML (no CUDA context in the launcher) and respects an existing env override.

Hopper behavior is unchanged (fa3 + CuTe norm). Recipes and YAMLs are untouched. Verified end-to-end on L40 with the qwen3-1.7b-m2po-2gpus-delta example (sglang init, CUDA-graph capture, live generation).
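The two arch-aware decisions reduce to small pure functions. A sketch, assuming the compute capability has already been read (for example via NVML) and treating "Hopper" as major version 9; the function names are hypothetical and not the ones in raas/entrypoint.py:

```python
from typing import Mapping, Optional

HOPPER_MAJOR = 9  # sm_90


def flashinfer_env_overrides(
    compute_capability: tuple[int, int], env: Mapping[str, str]
) -> dict[str, str]:
    """Env vars to set before importing sglang/flashinfer on this GPU."""
    overrides: dict[str, str] = {}
    is_hopper = compute_capability[0] == HOPPER_MAJOR
    # Respect a value the user has already exported.
    if not is_hopper and "FLASHINFER_USE_CUDA_NORM" not in env:
        overrides["FLASHINFER_USE_CUDA_NORM"] = "1"
    return overrides


def attention_backend_args(attention_backend: Optional[str]) -> list[str]:
    """Omit the flag when unset so sglang auto-selects a backend per GPU."""
    if attention_backend is None:
        return []
    return ["--attention-backend", attention_backend]
```

Keeping the decision free of any CUDA call means it can run in the launcher before a CUDA context exists, matching the NVML-based detection described above.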

Training co-locates trainer + RaaS + SGLang in one container and drives many concurrent rollouts, which surfaces two docker run requirements that the docs were inconsistent or silent about:
- shm-size: docker/README.md still showed 16g, which causes "[Errno 28] No space left on device" when RaaS stages weights under /dev/shm. Bump it to 512g to match the install guide, with a note.
- nofile: the container default soft limit (1024) is too low for the reward worker process pool and fails with "[Errno 24] Too many open files". Add --ulimit nofile=65536:65536 to the run commands in both docker/README.md and docs/en/get-started/installation.md, with a note.

Feat/megatron weight sync dev

Introduces ``spawn_rlvr``, a single-shot tool-call workflow where the main agent may emit one ``<spawn>{"tasks": [...]}</spawn>`` block per trajectory to dispatch up to 4 sub-agents in parallel against the same RaaS pool. Sub-agent outputs are spliced back into main's context inside a ``<spawn_result>`` block and the main agent continues to a final answer.

Training scheme: one trajectory per episode contains 1 main + N sub-agent sequences, all sharing the team reward (math_verify on main's final answer). No model_ids tagging: single-trainer regime. Mirrors solve_and_check.py's multi-sequence-shared-reward precedent.

Implementation notes:
- Phase-1 generates freely (no SGLang string-stop; SGLang runs with --skip-tokenizer-init and crashes on string-based stop matching). The workflow detects <spawn>...</spawn> post-hoc via regex and truncates phase-1 tokens at the close-tag boundary.
- Over-spec'd payloads (>4 tasks) are silently capped to 4; malformed JSON degrades the trajectory to vanilla single-turn RLVR.
- Rollout dumps under {fileroot}/rollout_dumps/{version}/{qid}.txt contain decoded phase-1, per-sub-agent task+output, and phase-2 text for sanity checking. dump_prob configurable per workflow.

Recipe: examples/math/spawn/qwen3-8b-spawn/ (Qwen3-8B, 8 GPUs, M2PO, ctx 16k, offline math datasets). Main max_new_tokens=3000, sub-agent max_new_tokens=1500 (max aggregated injection 4*1500=6000) so phase-1 + aggregated sub-results + phase-2 fits in the 16k SGLang window.

Smoke-tested on 8x H100 (3 train steps + eval-at-start on the full 4768-item eval suite). Untrained Qwen3-8B emits valid spawn payloads ~40% of the time without SFT. Over 100 train steps, overall eval rose from 40.6% (v0) to 46.3% (v75), surpassing the vanilla rlvr baseline (44.3%) by +2.0%, confirming the team-reward gradient on shared trajectories is productive.
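The post-hoc detection described in the implementation notes might look like the following sketch. It works on decoded text rather than tokens, treats an empty task list as "no spawn", and uses a hypothetical function name; all three are assumptions, not the workflow's actual behavior.

```python
import json
import re
from typing import Optional

SPAWN_RE = re.compile(r"<spawn>(.*?)</spawn>", re.DOTALL)
MAX_SUB_AGENTS = 4


def parse_spawn(text: str) -> Optional[tuple[str, list[str]]]:
    """Return (text truncated at </spawn>, tasks), or None to fall back to single-turn."""
    match = SPAWN_RE.search(text)
    if match is None:
        return None
    try:
        tasks = json.loads(match.group(1))["tasks"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed payload: degrade to vanilla RLVR
    if not isinstance(tasks, list) or not tasks:
        return None
    # Over-specified payloads are silently capped.
    return text[: match.end()], [str(t) for t in tasks[:MAX_SUB_AGENTS]]
```

Returning `None` for every failure mode keeps the caller's fallback path to a single branch.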

Port of platoon's TextCraft RL setup. Adds a new workflow class `recursive_agent` that lets a root agent spawn 1-4 sub-agents in parallel via asyncio.gather, each inheriting the parent's inventory by reference. Trees are bounded by max_depth=3, max_breadth=4.
- workflow.py: ParsedAction dispatcher (get_info / view_inventory / craft / spawn / finish), per-agent BudgetTracker, trajectory dump format with full message logs for debugging.
- env.py: stateful TextCraftEnv with forkable inventory aliasing (sub-agents and root share one mutable dict); binary all-or-nothing evaluate() against task.misc["target_items"].
- recipe_loader.py + recipes/: bundled Minecraft recipe DB (~860 recipes) so no HF download is needed.
- tasks.py + bundled textcraft_{train,val}.jsonl: 1000 train / 100 val tasks synthesized from the recipe DB; deterministic seed.
- dataflow/dataset/textcraft.py: dataset loaders for RL training and eval splits.
- reward/textcraft_success.py: stub registered for parity; actual reward comes from env.evaluate().
- examples/textcraft/qwen3-4b-recursive/: full recipe (yaml + scripts) for Qwen3-4B-Instruct-2507 with M2PO, FSDP, SGLang, ctx32k, TCP weight transfer.
- docs/recipes/textcraft-recursive.md: design overview.

Also ignores claude-doc/ in .gitignore.

Ports platoon's OOLONG recursive-agent design (arxiv 2605.06639) to AstraFlow as a new workflow (oolong_recursive) with reward fn (oolong_success), HF dataset loader, and Qwen3-4B-Instruct-2507 recipe under examples/oolong/qwen3-4b-recursive/. Sub-agent grading is currently rule-based for oolong-synth and a placeholder (score=0) for oolong-real until an LLM judge is wired.

Two sweep variants of qwen3-4b-recursive: - gen4k: per-turn max_new_tokens 1024 -> 4096 for longer agent thinking - lr5e6: lr 3e-6 -> 5e-6 (earlier exploration) Recipe-only additions, no library code changes.

Two public functions:
- judge(system, user, ...) -> str: posts a (system, user) pair to Fireworks and returns the raw assistant content. Retries on 429/5xx with exponential backoff (3 attempts). Falls back to reasoning_content when content is empty (handles gpt-oss-120b's reasoning-model quirk).
- extract_json(text) -> dict: parses JSON out of an LLM response, tolerating ```json``` and plain ``` fence wrapping.

Default model: accounts/fireworks/models/gpt-oss-120b (2s/call avg, vs 4s for deepseek-v4-pro and 8s for kimi-k2p6 on the same Fireworks account). max_tokens default 2048 to give reasoning models enough headroom.

Each caller writes its own system prompt and parses what it expects -- no central rubric registry, no JudgeRewardEnv mixin, no caching, no budget gate. Matches platoon's pattern.

Includes:
- test_judge.py: 7 unit tests for extract_json + API-key guard, plus one live end-to-end test skipped without FIREWORKS_API_KEY.
- judge_example.py: runnable script with 9 calibration cases, prints full input/output for each, supports --user / --system / --model flags for custom cases.

See claude-doc/minimal-llm-judge-plan.md for design rationale.
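A fence-tolerant parser in the spirit of extract_json could be sketched as below. This is an illustrative reimplementation, not the library's code; the fallback to the outermost braces is an added assumption.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Parse a JSON object from an LLM reply, with or without code fences."""
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    # Assumption: tolerate surrounding chatter by taking the outermost braces.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(candidate[start : end + 1])
```

`json.JSONDecodeError` subclasses `ValueError`, so callers can catch a single exception type for both "nothing found" and "found but invalid".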

Two user-facing reward systems, selected by reward_mode in workflow_spec:
- team_credit (default): All agents share the root's rule-based reward. No LLM judge calls. Cheap, simple, every agent gets some signal.
- per_agent_judge: Root keeps its rule-based reward; each sub-agent is scored by an LLM judge (astraEnv.judge) on its own (goal, output). True per-agent credit assignment at the cost of API calls per sub-agent.

env.py changes:
- evaluate() is now async; sub-agent branch routes to the LLM judge when use_llm_judge=True, else returns the legacy 0.0 placeholder.
- _grade_subagent_with_llm() catches all exceptions and clamps to [0, 1] so a flaky judge never crashes a rollout.
- Adds use_llm_judge and judge_model kwargs.

workflow.py changes:
- New reward_mode kwarg with validation (raises ValueError on unknown modes, including the now-dropped 'root_only').
- use_llm_judge is derived from reward_mode -- never set independently.
- Sequence emission no longer filters out sub-agent trajectories; all agents are emitted with their own reward from _reward_for_agent(), which picks root vs own based on reward_mode.

Tests:
- test_env.py: 10 tests (judge enabled/disabled, parse failures, network failures, clamping, model override, synth path unchanged).
- test_workflow.py: 12 tests (default mode, validation, two-mode tree matrix). All LLM calls are mocked -- no API key required to run.

Backwards-compatible defaults: yaml without reward_mode gets team_credit, which corresponds to "broadcast root reward to all agents". Note this differs from v5's root-only training (sub-agents were filtered out); root_only is no longer a supported mode.
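The selector logic described above can be summarized in a few lines. A sketch with hypothetical function names and signatures (the source's _reward_for_agent() may differ):

```python
VALID_REWARD_MODES = ("team_credit", "per_agent_judge")


def derive_use_llm_judge(reward_mode: str = "team_credit") -> bool:
    """Validate reward_mode and derive the judge flag from it."""
    if reward_mode not in VALID_REWARD_MODES:
        raise ValueError(
            f"unknown reward_mode {reward_mode!r}; expected one of {VALID_REWARD_MODES}"
        )
    return reward_mode == "per_agent_judge"


def reward_for_agent(
    reward_mode: str, is_root: bool, root_reward: float, own_reward: float
) -> float:
    """Pick the reward attached to one emitted sequence."""
    if is_root or reward_mode == "team_credit":
        return root_reward
    # per_agent_judge: the judge score is clamped so a bad reply cannot blow up.
    return min(1.0, max(0.0, own_reward))
```

Deriving the flag from the mode, rather than exposing both, removes the inconsistent combination of a judge mode with the judge disabled.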

…raEnv

Two new utilities shared by the DeepDive pipeline:
- search.py: minimal httpx client for the CMU RAG server with a 256-concurrency semaphore and 3-attempt exponential backoff. Backoff sleeps release the semaphore so degraded server periods don't thundering-herd onto starved slots.
- checklist.py: ChecklistGrader, an ai-rubric-style RubricChecklistFast port. A single LLM call generates 3-5 atomic criteria from the goal and scores them in one response, returning a holistic overall_score in [0, 1]. System prompt is verbatim from ai_rubric 0.2.4.
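The "backoff sleeps release the semaphore" behavior comes from scoping the semaphore to the request only. A minimal asyncio sketch with a hypothetical helper name; it catches every exception for brevity, where a real client would catch only transport and 5xx errors:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_backoff(
    request: Callable[[], Awaitable[T]],
    semaphore: asyncio.Semaphore,
    attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Retry `request`, holding a semaphore slot only while it is in flight."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            async with semaphore:  # slot is released before the sleep below
                return await request()
        except Exception:
            if attempt == attempts - 1:
                raise
        # Sleeping outside the `async with` frees the slot for other callers.
        await asyncio.sleep(base_delay_s * 2**attempt)
    raise AssertionError("unreachable")
```

Had the sleep been inside the `async with`, every backing-off caller would keep a slot occupied and healthy requests would queue behind idle ones.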

Port of platoon's DeepDive RL recipe to AstraFlow.
- workflow_cls "deepdive_recursive": recursive web-research agent loop with <action type="search|spawn|finish">{JSON}</action> format.
- env.py: search (CMU RAG), spawn (sub-agent), finish actions. Reward routing: root task -> binary-success LLM judge against ground truth (verbatim platoon rubric); sub-agent task -> ChecklistGrader.
- Workflow stamps sample-weighted group_reward_mean over root rewards and group_reward_std=1.0 on every emitted sequence, matching platoon's mean-only centering (no std normalization).
- reward_mode selector (team_credit | per_agent_judge) for credit assignment experiments.
- Recipe: qwen3-4b-recursive with bs=256, lr=3e-6, total 500 steps, filter_zero_adv re-enabled, delegation_lambda=0.

- dnd_process_response: type-aware rule-based scoring ported faithfully from platoon. int->int uses 0.75^|gap| partial credit; str->str is exact-match after strip().lower(); list->list is Jaccard overlap. \boxed{...} extraction with parse_confidence label.
- Fix env.py routing: sub-agent task ids inherit the parent's dataset prefix and were incorrectly hitting the rule-based grader. Now any id containing "/sub_" routes to the LLM judge regardless of prefix.
- Add qwen3-4b-recursive-real recipe targeting the D&D split.
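The three scoring rules are easy to state in code. A sketch under two labeled assumptions: mismatched types score 0, and two empty lists count as a perfect match. The function name is hypothetical and \boxed{...} extraction is left out.

```python
from typing import Union

Answer = Union[int, str, list]


def score_answer(pred: Answer, truth: Answer) -> float:
    """Type-aware partial credit for a parsed answer."""
    if isinstance(pred, int) and isinstance(truth, int):
        return 0.75 ** abs(pred - truth)  # decays with the numeric gap
    if isinstance(pred, str) and isinstance(truth, str):
        return float(pred.strip().lower() == truth.strip().lower())
    if isinstance(pred, list) and isinstance(truth, list):
        a, b = set(pred), set(truth)
        union = a | b
        # Jaccard overlap; assumption: empty vs empty is a match.
        return len(a & b) / len(union) if union else 1.0
    return 0.0  # assumption: type mismatch earns no credit
```

For example, an integer answer that is off by two earns 0.75 squared, so near misses still carry some signal.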

The producer was unconditionally overwriting group_reward_mean and group_reward_std on every emitted sequence, blocking workflows from publishing their own baseline. Now the producer only fills these fields when the workflow has not already stamped them. Motivation: recursive agents emit a variable number of sequences per prompt (root + N sub-agents). Letting the producer compute group stats over all sequences sequence-weights the baseline, pulling it toward samples that happened to spawn more sub-agents. The DeepDive workflow now stamps a sample-weighted mean over root rewards and std=1.0.
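A toy calculation shows how the two baselines diverge. It assumes the team-credit case where every sequence in an episode carries the root reward; each episode is represented as (root_reward, number_of_emitted_sequences), and the function names are illustrative.

```python
Episode = tuple[float, int]  # (root_reward, sequences emitted for this prompt sample)


def sequence_weighted_mean(episodes: list[Episode]) -> float:
    """Baseline over all emitted sequences: big trees count more."""
    total_sequences = sum(n for _, n in episodes)
    return sum(reward * n for reward, n in episodes) / total_sequences


def sample_weighted_mean(episodes: list[Episode]) -> float:
    """Baseline over root rewards only: one vote per sample."""
    return sum(reward for reward, _ in episodes) / len(episodes)


# One success that spawned 3 sub-agents (4 sequences), one failure with none.
episodes = [(1.0, 4), (0.0, 1)]
biased = sequence_weighted_mean(episodes)   # pulled toward the larger tree
unbiased = sample_weighted_mean(episodes)   # each sample counts once
```

With these two samples the sequence-weighted baseline is 0.8 while the sample-weighted one is 0.5, so the first would understate the advantage of the successful episode.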

Include the question and ground_truth in episode dump files so rollouts can be inspected without cross-referencing the dataset. Bump dump_prob 0.05 -> 0.25 (train + eval) for denser sampling during debugging (~64 dumps/step at bs=256). Add qwen3-4b-recursive-v7 recipe variant (trial_name suffixed -v7).

…rs 5 transformers>=5 makes apply_chat_template(tokenize=True) return a BatchEncoding (a Mapping) instead of a flat list[int]. Workflows that did list(apply_chat_template(...)) then got the dict keys (['input_ids','attention_mask']), which were sent to the inference engine and rejected with HTTP 400, breaking every agent/recursive/multi-agent recipe on a transformers-5 env. Add a shared apply_chat_template_to_ids() helper in hf_utils that defaults tokenize=True, forwards enable_thinking with a TypeError fallback, and extracts input_ids when a Mapping is returned. Route every workflow that builds token ids through it; the recursive textcraft/oolong/deepdive workflows keep an equivalent inline guard. rlvr already passed return_dict=False and is unaffected.
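A helper with the behavior described above might look like this sketch. The signature is a guess for illustration, not the one in hf_utils; it relies only on the tokenizer exposing `apply_chat_template`.

```python
from collections.abc import Mapping
from typing import Any, Optional


def apply_chat_template_to_ids(
    tokenizer: Any,
    messages: list[dict],
    enable_thinking: Optional[bool] = None,
    **kwargs: Any,
) -> list[int]:
    """Return a flat list of token ids regardless of the transformers version."""
    kwargs.setdefault("tokenize", True)
    if enable_thinking is not None:
        try:
            out = tokenizer.apply_chat_template(
                messages, enable_thinking=enable_thinking, **kwargs
            )
        except TypeError:
            # This tokenizer does not accept enable_thinking; retry without it.
            out = tokenizer.apply_chat_template(messages, **kwargs)
    else:
        out = tokenizer.apply_chat_template(messages, **kwargs)
    if isinstance(out, Mapping):
        # transformers 5 returns a BatchEncoding; list(out) would yield its keys.
        out = out["input_ids"]
    return list(out)
```

The `Mapping` check is what prevents the key names from being sent to the inference engine as if they were token ids.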

At 2048 the recursive agent tree fans out to ~16k concurrent generate requests, saturating the 4-GPU SGLang so no episode ever completes and the trainer hangs at step 0 waiting for data. 512 keeps the live agent count ~2.4k and lets episodes finish; validated across a full 105-step run and a clean 5-recipe smoke test on sglang 0.5.12.

Oolong is not supported in the latest version. Remove the oolong_recursive workflow (env, tasks, eval_helpers, tests, jsonl data), the oolong_success reward stub, the oolong dataset loaders, and both examples/oolong recipes. Drop the two oolong import lines from the workflow package __init__ and a stale "Oolong" mention in the deepdive recipe comments. The workflow and reward registries load cleanly without oolong; no references remain in source.

Drop examples/deepdive (qwen3-4b-recursive, -v7): DeepDive needs a search server we don't currently run. The deepdive workflow, reward, and dataset code stay in place and registered, ready to use again once a search backend is available. Also drop the experimental TextCraft variants qwen3-4b-recursive-{gen4k,lr5e6}, keeping the base qwen3-4b-recursive (the recipe actually in use).

Update the package version (astraflow/version.py and train_worker/version.py) and all version references in the docs and READMEs from 0.1.0 to 0.1.1: docs Sphinx version, sidebar badge, docs index title, and the astraflowai/astraflow image tag in docker/README and the installation guide. Add a v0.1.1 News entry to README (keeping the v0.1.0 record). Matches the already-published astraflowai/astraflow:v0.1.1 image (CUDA 13 / SGLang 0.5.12); no Docker rebuild required.

Spawn solution

Move examples/textcraft -> examples/textcraft-recursive-agent (the recipe is now examples/textcraft-recursive-agent/qwen3-4b-recursive) and update the path references in the recipe scripts' usage comments and the textcraft-recursive doc. Directory depth is unchanged, so the scripts' repo-root resolution is unaffected. The workflow code (astraflow/core/workflow/impl/textcraft) and the experiment_name/ trial_name identifiers are intentionally left as-is.

Add docs/web/examples/textcraft-recursive-8agent-episode.txt — a real rollout dump from the textcraft recursive-agent recipe where the root agent spawns 7 sub-agents (all succeed, reward 1.0) to craft the dye and material intermediates in parallel. Placed under docs/web so the animation pages can fetch it; the ROOT/SUB depth/parent markers and spawn actions provide the agent tree and timing for visualization.

The prebuilt `transformer-engine[pytorch]` wheels link libcublas.so.12 and fail to load on the CUDA 13 base image (and a CUDA 13 host install) with `ImportError: libcublas.so.12: cannot open shared object file`. The astraflow v0.1.1 stack is torch 2.11+cu130 / CUDA 13, so the wheel path is broken. Build TE from source (release_v2.13) against the CUDA 13 toolkit instead, with `nvidia-mathdx==25.6.0` supplying the build-time cuBLASDx / cuDNN frontend headers (mirrors slime's CUDA-13 recipe). Apply the same fix in docker/Dockerfile.sglang.megatron and the optional Megatron step in the installation guide. Verified: image astraflowai/astraflow:v0.1.1-megatron builds and `import transformer_engine.pytorch` succeeds on CUDA 13 (was the libcublas.so.12 ImportError before); apex and MegatronEngine also import.

The astraflowai/astraflow:v0.1.1.megatron tag (Transformer Engine + apex, built from Dockerfile.sglang.megatron) was published but never referenced in the docs. Map both images to their training backend so users pick the right one: v0.1.1 for the FSDP backend (default), v0.1.1.megatron for the Megatron-LM backend (TP/PP/EP, MoE). - docker/README.md: list both tags in the pull and run sections - installation.md: split Option B by backend; cross-link from Step 5 - qwen3-8b-megatron-delta/README.md: note the recipe needs the .megatron image

Two pre-merge correctness fixes for v0.1.1: - weight_manager: add a buffer-overflow guard in _offload_megatron_hf, mirroring _copy_all_gather. Without it, an export/metadata size disagreement at inactive-buffer index 0 silently spills into the active half the sender is shipping, corrupting weights with no error. The guard raises at the write site instead. - textcraft: default depth_level_weighting to False. The 1/(depth+1) weighting on raw reward gives sub-agents backwards credit; the shipped recipe already disabled it, but the in-code default would silently apply it to any recipe that omits the flag.
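The overflow guard amounts to a bounds check at the write site of a double buffer. A sketch using a `bytearray` split into two halves; the function name and buffer representation are assumptions, not the code in weight_manager:

```python
def write_into_half(
    buffer: bytearray, half_index: int, offset: int, payload: bytes
) -> int:
    """Copy payload into one half of a double buffer and return the next offset.

    Raises instead of spilling into the other half, which the sender may be
    shipping at the same time.
    """
    half_size = len(buffer) // 2
    if half_index not in (0, 1):
        raise ValueError("half_index must be 0 or 1")
    if offset < 0 or offset + len(payload) > half_size:
        raise RuntimeError(
            f"write of {len(payload)} bytes at offset {offset} "
            f"overflows half of size {half_size}"
        )
    start = half_index * half_size + offset
    buffer[start : start + len(payload)] = payload
    return offset + len(payload)
```

Failing loudly here turns a size disagreement between the export and the metadata into an immediate error rather than corrupted weights.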

Add a runnable README for the qwen3-4b-recursive TextCraft recipe and flesh out the docs recipe page with a spawn animation, validation-accuracy curve, and a reference to the Recursive Agent Optimization paper (arXiv:2605.06639). Correct the settings table to match experiment.yaml (lr, batch size, max_staleness, total_train_steps, depth_level_weighting) and link the example from the examples index. Also remove the math-spawn recipe page from the docs and its toctree entry.

Relocate the qwen3-4b-recursive README up to examples/textcraft-recursive-agent/ so it's the landing page for the recipe family, and fix the asset paths for the new depth.

Add a news entry linking the TextCraft recursive-agent recipe docs, and fix the v0.1.1 release date to 2026/06.

fix(megatron): build Transformer Engine from source for CUDA 13

WWWjiahui and others added 30 commits May 28, 2026 22:31

Merge pull request #5 from WWWjiahui/chore/bump-sglang-0.5.12

b1bf6de

chore: bump sglang 0.5.5.post1 -> 0.5.12.post1 (FSDP path)

Merge pull request #9 from jsw-zorro/feat/megatron-weight-sync-dev

23fe945

Feat/megatron weight sync dev

feat: add TextCraft recursive-agent recipe variants (gen4k, lr5e6)

bd5e94e

Merge pull request #12 from Infini-AI-Lab/spawn-solution

c1e9f52

Spawn solution

haizhongzheng and others added 9 commits June 3, 2026 10:43

docs: move textcraft recipe README to parent example folder

7baeaab

docs: add textcraft recipe README at parent example folder

0844c6e

docs: announce dynamic recursive-agent recipe in README news

6d24d31

Merge pull request #14 from jsw-zorro/fix/megatron-cuda13-te-build

1f11b72

fix(megatron): build Transformer Engine from source for CUDA 13

haizhongzheng merged commit 28a69f6 into main Jun 5, 2026
0 of 2 checks passed

haizhongzheng merged 39 commits into main from dev
