AstraFlow v0.1.1: Megatron backend, offline math, recursive/spawn agents #15

Closed
haizhongzheng wants to merge 39 commits into main from dev

@haizhongzheng

Member

Summary

Rolls up v0.1.1 onto main: a Megatron-LM training backend, offline math
datasets, new multi-agent workflows, and a toolchain bump.

Highlights

  • Megatron-LM backend with HF-space weight sync (PP/EP/VPP) and a
    direct-DMA offload path (~23× faster).
  • Toolchain: CUDA 13, SGLang 0.5.12, transformers 5.
  • Workflows: spawn-sub-agents for math RL; recursive-agent workflows
    (TextCraft, Oolong, DeepDive) with Qwen3-4B recipes; offline math datasets
    and recipes.
  • astraEnv: LLM-as-judge library, reward_mode selector, RAG search
    client, rubric grader.
  • Docs: TextCraft recursive-agent and offline-math recipe pages, Megatron
    weight-sync page, CUDA 13 install steps.

~860 of the changed files are bundled TextCraft (Minecraft) recipe JSONs.

WWWjiahui and others added 30 commits May 28, 2026 22:31
Upgrade the inference/runtime stack to the latest sglang and the
dependency versions it requires, validated end-to-end on the FSDP
backend (qwen3-1.7b math example, 2x L40).

Version pins (pyproject.toml, docs, Docker):
- sglang 0.5.5.post1 -> 0.5.12.post1
- torch 2.8.0 -> 2.11.0; torch_memory_saver 0.0.9 -> 0.0.9.post1
- transformers 4.57.1 -> 5.6.1 (sglang pins ==5.6.0, which has a
  flash-attention s_aux=None crash for non-sink models; 5.6.1 is the
  upstream patch release. Forced via [tool.uv] override-dependencies,
  which requires uv >= 0.10 -- documented in installation.md)
- peft -> >=0.18.0 (required by transformers 5.x)
- CUDA base image 12.9.1 -> 13.0.0

sglang 0.5.12 API compatibility:
- remove LoRAAbortReleasePatch (the abort-path lora_registry.release()
  it added is now fixed upstream; keeping it would double-release)
- remove enable_ep_moe from SGLangConfig (field dropped from ServerArgs)
- kernel package rename sgl_kernel -> sglang_kernel in the installation
  validator

transformers 5.x / sglang 0.5.12 runtime fixes (surfaced by the run):
- rlvr workflow: apply_chat_template now returns a BatchEncoding; pass
  return_dict=False to get the flat list[int] the rollout path expects
- fsdp apply_fsdp2: model._no_split_modules is a set in transformers 5.x;
  coerce to list before indexing
- raas free-port range capped at 55535 so sglang's derived gRPC port
  (port + 10000) stays <= 65535

Scope: FSDP backend only. Megatron / VL paths are intentionally not
covered here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
chore: bump sglang 0.5.5.post1 -> 0.5.12.post1 (FSDP path)
sglang 0.5.12's /health round-trips through the scheduler, which stays
saturated for ~30-40s during the initial unchunked prefill of ~2048
requests/engine. The old 3-strike / 30s watchdog (5s probe timeout)
hard-exited a busy-but-alive engine before the first rollout batch
completed, hanging the rollout pipeline at step 0.

Raise the /health probe timeout 5s -> 20s so a slow-but-alive endpoint
isn't marked failed, and the failure budget 3 -> 5 strikes. A crashed
engine refuses connections instantly, so real-death detection stays
~50s (worst case ~100s) while the prefill ramp is tolerated. Verified:
math and code qwen3-8b-m2po-delta recipes train through the ramp with
zero watchdog strikes.
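
A minimal sketch of the strike budget, for illustration only. `HealthWatchdog` and `record` are hypothetical names, not AstraFlow's API, and the sketch assumes strikes are consecutive and that one successful probe resets the count. The 20s probe timeout would be applied by the HTTP client and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class HealthWatchdog:
    """Counts consecutive failed /health probes (hypothetical helper)."""

    max_strikes: int = 5  # failure budget from the commit message
    strikes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True once the budget is exhausted."""
        if probe_ok:
            self.strikes = 0  # assumption: any success resets the budget
            return False
        self.strikes += 1
        return self.strikes >= self.max_strikes


# A slow-but-alive engine: two failed probes, then a success, then one more failure.
watchdog = HealthWatchdog()
results = [watchdog.record(ok) for ok in (False, False, True, False)]

# A crashed engine: every probe fails, so the fifth strike trips the watchdog.
dead = HealthWatchdog()
dead_results = [dead.record(False) for _ in range(5)]
```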
…ution

Two from-scratch install blockers with the sglang 0.5.12 / torch 2.11
stack:

- sglang 0.5.12 depends on flash-attn-4>=4.0.0b9 (a pre-release pulled in
  as a dependency), so resolution fails unless pre-releases are allowed.
  Add prerelease = "allow" to [tool.uv] so `uv pip install -e ".[sglang]"`
  resolves on both the conda and Docker paths.

- flash-attn 2.8.3 builds from source; nvcc writes GBs of intermediates to
  $TMPDIR. When $TMPDIR is a small/NFS-quota'd home the build fails with
  "nvFatbin error: empty input" / "Disk quota exceeded" from truncated
  temps. Document setting CUDA_HOME and a roomy TMPDIR, switch the sglang
  step to the project-extra form, and clarify flash-attn (FA2, trainer) vs
  flash-attn-4 (pulled in by sglang).
sglang declares an unbounded dependency on "kernels", so uv resolved the
latest release (0.15), but transformers 5.6.1 only supports kernels<0.13: its
hub_kernels module constructs LayerRepository() without a revision/version,
which kernels 0.15 rejects, so `import sglang` crashes with "Either a revision
or a version must be specified." Pin to the range transformers 5.6.1 expects (0.12.x).
Verified on a from-scratch env: kernels resolves to 0.12.3 and the math
recipe trains.
Add export_hf_named_params: a streaming generator that reconstructs the
global model from Megatron's TP/PP/EP/ETP/VPP layout and yields HF-named,
HF-layout CPU tensors one at a time (OOM-safe for large / MoE models).
The gather + mcore->HF conversion is delegated to mbridge's export_weights
(the same bridge the engine already uses to load/save); this module adds
the consumer concerns: CPU move, byte-bounded bucketing, and a
metadata-only path for transfer-buffer sizing.
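
The byte-bounded bucketing idea can be sketched in plain Python. This is an illustrative stand-in, not the module's code: `bucket_by_bytes` is a hypothetical name, and sizes are passed as plain integers rather than read from tensors.

```python
from typing import Iterable, Iterator, List, Tuple

NamedSize = Tuple[str, int]  # (parameter name, size in bytes)


def bucket_by_bytes(items: Iterable[NamedSize], max_bytes: int) -> Iterator[List[NamedSize]]:
    """Yield lists of items whose total size stays within max_bytes.

    An item larger than max_bytes is emitted alone in its own bucket.
    """
    bucket: List[NamedSize] = []
    used = 0
    for name, nbytes in items:
        # Close the current bucket before it would exceed the bound.
        if bucket and used + nbytes > max_bytes:
            yield bucket
            bucket, used = [], 0
        bucket.append((name, nbytes))
        used += nbytes
    if bucket:
        yield bucket


sizes = [
    ("embed.weight", 600),
    ("layer0.q_proj.weight", 300),
    ("layer0.k_proj.weight", 300),
    ("lm_head.weight", 1200),  # larger than the bound: gets its own bucket
]
buckets = list(bucket_by_bytes(sizes, max_bytes=1000))
bucket_names = [[name for name, _ in b] for b in buckets]
```

Because it is a generator, only one bucket is held at a time, which is the property the streaming export relies on.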

This is the foundation for correct sparse weight sync under full Megatron
parallelism. The design (the "delta is computed in HF byte space"
invariant) is documented in docs/en/architecture/megatron-weight-sync.md.

The Megatron backend needs two extra compiled deps beyond the base install
(megatron-core / mbridge are already there): Transformer Engine (fused
LayerNorm + sequence parallelism) and apex (optional fused LayerNorm/Adam).
These are kept out of the default image: a separate
docker/Dockerfile.sglang.megatron layers them on top of Dockerfile.sglang,
and installation.md gains an optional "Step 5: Install the Megatron training
backend" under Option A. The FSDP backend and inference are unaffected.

Validated (exact bf16 match vs the HF reference checkpoint):
- Qwen3-0.6B     TP=2            310 tensors, 0 mismatch
- Qwen3-0.6B     PP=2            311 tensors, 0 mismatch
- Qwen3-0.6B     TP=2 PP=2       311 tensors, 0 mismatch
- Qwen3-30B-A3B  TP=2 EP=2 PP=2  18867 tensors, 0 mismatch
Replace the TP-only shard-direct weight transfer with the HF-export path:

- MegatronEngine.export_hf_named_params() / get_hf_weight_metadata() stream
  gathered HF tensors via mbridge (handles TP/PP/EP/ETP/VPP). The previous
  PP>1 / EP>1 NotImplementedError guards are removed.
- WeightManager gains "megatron_hf_meta" mode: the transfer buffer is sized
  for the full HF model and offload() streams HF tensors into the inactive
  half on the writer rank, while the gather collectives run on all ranks in
  lockstep. The sender receives megatron_metadata=None and runs the plain
  full/delta path used by FSDP. Because the buffer now holds HF-layout
  bytes, the sparse delta is computed in HF space and is correct under any
  parallelism — fixing the latent corruption where the delta was computed
  in mcore layout but applied by the receiver in HF layout.
- ppo_trainer wires the generator + HF metadata through.
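
A toy sketch of why the delta has to be computed in the layout the receiver applies it in. It works on raw bytes rather than tensors, and `compute_delta` / `apply_delta` are hypothetical names; the real sender's delta format is not shown in this PR.

```python
from typing import List, Tuple

Delta = List[Tuple[int, int]]  # (byte offset, new byte value)


def compute_delta(old: bytes, new: bytes) -> Delta:
    """List every byte position whose value changed between two same-size buffers."""
    if len(old) != len(new):
        raise ValueError("buffers must have the same length")
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


def apply_delta(buf: bytearray, delta: Delta) -> None:
    """Patch buf in place; offsets are only meaningful in the layout they were computed in."""
    for offset, value in delta:
        buf[offset] = value


old_hf = bytes([1, 2, 3, 4, 5, 6, 7, 8])
new_hf = bytes([1, 2, 9, 4, 5, 6, 7, 0])
delta = compute_delta(old_hf, new_hf)

# The receiver holds the previous weights in the same (HF) layout, so the patch is exact.
receiver = bytearray(old_hf)
apply_delta(receiver, delta)
```

If the offsets were computed against a differently ordered buffer (for example a sharded training layout), the same patch would land on the wrong bytes, which is the corruption the commit describes.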

The legacy CPU shard-reassembly in the sender agent is now unused for
Megatron (kept only for the deprecated megatron_metadata path).

Validated (buffer roundtrip == HF reference, bit-exact):
- Qwen3-0.6B TP=2       310 tensors, 0 mismatch, 1.19 GB
- Qwen3-0.6B TP=2 PP=2  311 tensors, 0 mismatch, 1.50 GB
Add examples/math/qwen3-8b-megatron-delta — the FSDP qwen3-8b-m2po-delta
recipe with the trainer engine switched to the Megatron backend
(backend: megatron, tensor_parallel_size: 4). Identical data, algorithm,
and weight-transfer path, so it doubles as a clean FSDP-vs-Megatron A/B.

End-to-end validation (single 8-GPU node, 4 RaaS + 4 trainer TP=4, delta
TCP weight sync, DeepScaleR/M2PO):
- Qwen3-8B (this recipe): 25 steps, 0 errors; weight_transfer/delta_sparsity
  ~0.92 (delta computed in HF space); task_reward/avg rose 0.535 (first half)
  -> 0.585 (last half), recent steps 0.61-0.64. Per-step weight offload 0.59s.
- Qwen3-30B-A3B MoE (TP2/PP3/EP2 trainer on 6 GPUs + SGLang TP2 on 2 GPUs):
  21 steps, 0 errors; full MoE export (18867 tensors, 61 GB) gathered across
  TP/PP/EP each step; task_reward/avg ~0.64 -> 0.66 (recent steps 0.70-0.77).
The Megatron HF-export offload materialized each gathered tensor in
pageable host memory via .to("cpu") before copying it into the pinned
shared-memory transfer buffer — a ~1 GB/s bounce that cost ~13s/step for
an 8B model on the RL critical path.

Now the engine yields the gathered tensors on GPU (export_hf_named_params
to_cpu=False) and WeightManager copies each tensor's uint8 view directly
into the pinned buffer slice (non_blocking=True), fenced by a single
cuda.synchronize() before the cross-rank barrier. The pinned buffer is
already cudaHostRegister'd, so this hits the PCIe DMA engine.

Copying through uint8 views on both sides keeps the copy alignment-free
(robust to mixed-dtype models) and byte-identical for contiguous sources.
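
A CPU-only analogy for the byte-view copy, using the standard library instead of torch. It shows only the "reinterpret as unsigned bytes, then copy" step; pinned memory, `non_blocking=True`, and the CUDA synchronize are not modeled. `copy_as_bytes` is a hypothetical helper.

```python
from array import array


def copy_as_bytes(src: array, dst: bytearray, offset: int) -> int:
    """Copy src's raw bytes into dst at offset; return the next free offset."""
    view = memoryview(src).cast("B")  # reinterpret as unsigned bytes, no copy
    end = offset + view.nbytes
    if end > len(dst):
        raise ValueError("destination buffer too small")
    dst[offset:end] = view
    return end


floats = array("f", [1.0, 2.0, 3.0])  # typically 4-byte elements
shorts = array("h", [7, 8, 9])        # typically 2-byte elements
buffer = bytearray(floats.itemsize * len(floats) + shorts.itemsize * len(shorts))

# Mixed element sizes pack back to back because both sides are plain bytes.
cursor = copy_as_bytes(floats, buffer, 0)
split = cursor
cursor = copy_as_bytes(shorts, buffer, cursor)
```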

Measured (Qwen3-8B, TP=4, 16.38 GB):
  pageable (old): 12.6s  (1.3 GB/s)
  direct DMA:      0.56s  (29.3 GB/s)   ~23x

Byte-equivalence verified (new buffer == old pageable path == HF
reference, bit-exact) across TP=2, PP=2, TP=2/PP=2, and MoE TP=2/EP=2/PP=2
(Qwen3-30B-A3B, 61 GB). Adds tests/test_direct_dma_offload.py (equivalence)
and tests/bench_offload_dma.py (throughput).
The hardcoded sglang inference defaults assume Hopper and crash on
non-Hopper GPUs (verified on L40 / Ada sm_89), while the identical
package stack runs on H100. Two Hopper-only kernel paths were forced
regardless of hardware:

- attention_backend="fa3": FlashAttention-3 is Hopper-only; on Ada/Ampere
  it fails CUDA-graph capture ("scheduler_metadata must have shape").
- flashinfer 0.6.x CuTe-DSL RMSNorm: no Ada/Ampere kernel, JITs into an
  incompatible nvidia-cutlass-dsl and crashes (GPUModuleOp TypeError).

Make both arch-aware so one image/env runs on both:

- SGLangConfig.attention_backend defaults to None, which omits the
  --attention-backend flag and lets sglang auto-select per GPU (fa3 on
  Hopper, an Ada/Ampere-compatible backend below sm_90).
- raas/entrypoint.py sets FLASHINFER_USE_CUDA_NORM=1 on non-Hopper GPUs
  before sglang/flashinfer import, selecting flashinfer's CUDA-JIT norm.
  Detection uses NVML (no CUDA context in the launcher) and respects an
  existing env override.
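
A sketch of the env-override logic, assuming the compute capability has already been read (for example via NVML) and that "Hopper" means compute-capability major version 9. `apply_norm_override` is a hypothetical name; only the `FLASHINFER_USE_CUDA_NORM` variable comes from the commit message.

```python
from typing import Dict, MutableMapping, Tuple

HOPPER_MAJOR = 9  # sm_90


def apply_norm_override(env: MutableMapping[str, str], compute_capability: Tuple[int, int]) -> None:
    """Select the CUDA-JIT norm on non-Hopper GPUs unless the user already chose."""
    major, _minor = compute_capability
    if major != HOPPER_MAJOR:
        # setdefault keeps a value that was already exported.
        env.setdefault("FLASHINFER_USE_CUDA_NORM", "1")


ada_env: Dict[str, str] = {}
apply_norm_override(ada_env, (8, 9))     # L40 / Ada

hopper_env: Dict[str, str] = {}
apply_norm_override(hopper_env, (9, 0))  # H100

user_env = {"FLASHINFER_USE_CUDA_NORM": "0"}
apply_norm_override(user_env, (8, 9))    # existing override wins
```

In a launcher the mapping would be `os.environ`, and the call would have to run before sglang or flashinfer is imported.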

Hopper behavior is unchanged (fa3 + CuTe norm). Recipes and YAMLs are
untouched. Verified end-to-end on L40 with the qwen3-1.7b-m2po-2gpus-delta
example (sglang init, CUDA-graph capture, live generation).
Training co-locates trainer + RaaS + SGLang in one container and drives
many concurrent rollouts, which surfaces two docker run requirements that
the docs were inconsistent or silent about:

- shm-size: docker/README.md still showed 16g, which causes
  "[Errno 28] No space left on device" when RaaS stages weights under
  /dev/shm. Bump it to 512g to match the install guide, with a note.
- nofile: the container default soft limit (1024) is too low for the
  reward worker process pool and fails with "[Errno 24] Too many open
  files". Add --ulimit nofile=65536:65536 to the run commands in both
  docker/README.md and docs/en/get-started/installation.md, with a note.
Introduces ``spawn_rlvr``, a single-shot tool-call workflow where the main
agent may emit one ``<spawn>{"tasks": [...]}</spawn>`` block per trajectory
to dispatch up to 4 sub-agents in parallel against the same RaaS pool.
Sub-agent outputs are spliced back into main's context inside a
``<spawn_result>`` block and the main agent continues to a final answer.

Training scheme: one trajectory per episode contains 1 main + N sub-agent
sequences, all sharing the team reward (math_verify on main's final
answer).  No model_ids tagging — single-trainer regime.  Mirrors
solve_and_check.py's multi-sequence-shared-reward precedent.

Implementation notes:
- Phase-1 generates freely (no SGLang string-stop; SGLang runs with
  --skip-tokenizer-init and crashes on string-based stop matching).
  The workflow detects <spawn>...</spawn> post-hoc via regex and
  truncates phase-1 tokens at the close-tag boundary.
- Over-spec'd payloads (>4 tasks) are silently capped to 4; malformed
  JSON degrades the trajectory to vanilla single-turn RLVR.
- Rollout dumps under {fileroot}/rollout_dumps/{version}/{qid}.txt
  contain decoded phase-1, per-sub-agent task+output, and phase-2 text
  for sanity checking.  dump_prob configurable per workflow.
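
The post-hoc spawn detection can be sketched as follows. This operates on decoded text only (the workflow truncates tokens, which is not shown), `parse_spawn` is a hypothetical name, and treating an empty task list as "no spawn" is an assumption.

```python
import json
import re
from typing import List, Optional

MAX_SUBAGENTS = 4
SPAWN_RE = re.compile(r"<spawn>(.*?)</spawn>", re.DOTALL)


def parse_spawn(text: str) -> Optional[List[str]]:
    """Return up to MAX_SUBAGENTS task strings, or None if there is no valid spawn block."""
    match = SPAWN_RE.search(text)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed JSON: caller falls back to single-turn RLVR
    tasks = payload.get("tasks") if isinstance(payload, dict) else None
    if not isinstance(tasks, list) or not tasks:
        return None
    return [str(t) for t in tasks[:MAX_SUBAGENTS]]  # silently cap over-spec'd payloads


ok = parse_spawn('plan <spawn>{"tasks": ["factor n", "check parity"]}</spawn> ...')
capped = parse_spawn('<spawn>{"tasks": ["a", "b", "c", "d", "e", "f"]}</spawn>')
broken = parse_spawn('<spawn>{"tasks": ["a", </spawn>')
absent = parse_spawn("no tool call here")
```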

Recipe: examples/math/spawn/qwen3-8b-spawn/ — Qwen3-8B, 8 GPUs, M2PO,
ctx 16k, offline math datasets.  Main max_new_tokens=3000, sub-agent
max_new_tokens=1500 (max aggregated injection 4*1500=6000) so phase-1
+ aggregated sub-results + phase-2 fits in the 16k SGLang window.

Smoke-tested on 8x H100 (3 train steps + eval-at-start on the full
4768-item eval suite).  Untrained Qwen3-8B emits valid spawn payloads
~40% of the time without SFT.  Over 100 train steps, overall eval rose
from 40.6% (v0) to 46.3% (v75), surpassing the vanilla rlvr baseline
(44.3%) by 2.0 percentage points, which suggests the team-reward gradient
on shared trajectories is productive.
Port of platoon's TextCraft RL setup. Adds a new workflow class
`recursive_agent` that lets a root agent spawn 1-4 sub-agents in
parallel via asyncio.gather, each inheriting the parent's inventory
by reference. Trees are bounded by max_depth=3, max_breadth=4.
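
A minimal sketch of the bounded fan-out with a shared inventory. It assumes the root is at depth 0 and that no agent runs at depth >= max_depth; the `plan` dict and `run_agent` are hypothetical stand-ins for the model-driven workflow.

```python
import asyncio
from typing import Dict, List

MAX_DEPTH = 3
MAX_BREADTH = 4


async def run_agent(goal: str, plan: Dict[str, List[str]], inventory: Dict[str, int], depth: int = 0) -> None:
    """Run one agent: delegate sub-goals to children in parallel, then 'craft' the goal."""
    subgoals = plan.get(goal, [])[:MAX_BREADTH]
    if subgoals and depth + 1 < MAX_DEPTH:
        # Children receive the same dict object, so their writes are visible to the parent.
        await asyncio.gather(*(run_agent(g, plan, inventory, depth + 1) for g in subgoals))
    inventory[goal] = inventory.get(goal, 0) + 1


plan = {"torch": ["stick", "coal"], "stick": ["planks"]}
inventory: Dict[str, int] = {}
asyncio.run(run_agent("torch", plan, inventory))
```

Sharing one mutable dict keeps the sketch short, but it also means sibling agents can observe each other's partial progress, which matches the "by reference" behavior described above.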

- workflow.py: ParsedAction dispatcher (get_info / view_inventory /
  craft / spawn / finish), per-agent BudgetTracker, trajectory dump
  format with full message logs for debugging.
- env.py: stateful TextCraftEnv with forkable inventory aliasing
  (sub-agents and root share one mutable dict); binary all-or-nothing
  evaluate() against task.misc["target_items"].
- recipe_loader.py + recipes/: bundled Minecraft recipe DB (~860
  recipes) so no HF download is needed.
- tasks.py + bundled textcraft_{train,val}.jsonl: 1000 train / 100
  val tasks synthesized from the recipe DB; deterministic seed.
- dataflow/dataset/textcraft.py: dataset loaders for RL training
  and eval splits.
- reward/textcraft_success.py: stub registered for parity; actual
  reward comes from env.evaluate().
- examples/textcraft/qwen3-4b-recursive/: full recipe (yaml +
  scripts) for Qwen3-4B-Instruct-2507 with M2PO, FSDP, SGLang,
  ctx32k, TCP weight transfer.
- docs/recipes/textcraft-recursive.md: design overview.

Also ignores claude-doc/ in .gitignore.
Ports platoon's OOLONG recursive-agent design (arxiv 2605.06639) to
AstraFlow as a new workflow (oolong_recursive) with reward fn
(oolong_success), HF dataset loader, and Qwen3-4B-Instruct-2507
recipe under examples/oolong/qwen3-4b-recursive/.

Sub-agent grading is currently rule-based for oolong-synth and a
placeholder (score=0) for oolong-real until an LLM judge is wired.
Two sweep variants of qwen3-4b-recursive:
- gen4k: per-turn max_new_tokens 1024 -> 4096 for longer agent thinking
- lr5e6: lr 3e-6 -> 5e-6 (earlier exploration)

Recipe-only additions, no library code changes.
Two public functions:
- judge(system, user, ...) -> str: posts a (system, user) pair to
  Fireworks and returns the raw assistant content. Retries on 429/5xx
  with exponential backoff (3 attempts). Falls back to reasoning_content
  when content is empty (handles gpt-oss-120b's reasoning-model quirk).
- extract_json(text) -> dict: parses JSON out of an LLM response,
  tolerating ```json``` and plain ``` fence wrapping.
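
One way the fence-tolerant parsing could look. This is a sketch under assumptions, not the library's implementation: the function is named `extract_json_sketch` to make that explicit, and it takes the outermost `{...}` span as the object.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example can live inside a fenced block
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_sketch(text: str) -> dict:
    """Parse the first JSON object in text, with or without a code fence around it."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(candidate[start : end + 1])


fenced = extract_json_sketch(FENCE + 'json\n{"score": 1}\n' + FENCE)
plain_fence = extract_json_sketch(FENCE + '\n{"score": 0}\n' + FENCE)
bare = extract_json_sketch('Verdict: {"score": 0.5} as requested.')
```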

Default model: accounts/fireworks/models/gpt-oss-120b (2s/call avg, vs
4s for deepseek-v4-pro and 8s for kimi-k2p6 on the same Fireworks
account). max_tokens default 2048 to give reasoning models enough
headroom.

Each caller writes its own system prompt and parses what it expects --
no central rubric registry, no JudgeRewardEnv mixin, no caching, no
budget gate. Matches platoon's pattern.

Includes:
- test_judge.py: 7 unit tests for extract_json + API-key guard, plus
  one live end-to-end test skipped without FIREWORKS_API_KEY.
- judge_example.py: runnable script with 9 calibration cases, prints
  full input/output for each, supports --user / --system / --model
  flags for custom cases.

See claude-doc/minimal-llm-judge-plan.md for design rationale.
Two user-facing reward systems, selected by reward_mode in workflow_spec:

  team_credit (default)
    All agents share the root's rule-based reward. No LLM judge calls.
    Cheap, simple, every agent gets some signal.

  per_agent_judge
    Root keeps its rule-based reward; each sub-agent is scored by an
    LLM judge (astraEnv.judge) on its own (goal, output). True per-agent
    credit assignment at the cost of API calls per sub-agent.

env.py changes:
- evaluate() is now async; sub-agent branch routes to the LLM judge
  when use_llm_judge=True, else returns the legacy 0.0 placeholder.
- _grade_subagent_with_llm() catches all exceptions and clamps to
  [0, 1] so a flaky judge never crashes a rollout.
- Adds use_llm_judge and judge_model kwargs.

workflow.py changes:
- New reward_mode kwarg with validation (raises ValueError on unknown
  modes, including the now-dropped 'root_only').
- use_llm_judge is derived from reward_mode -- never set independently.
- Sequence emission no longer filters out sub-agent trajectories; all
  agents are emitted with their own reward from _reward_for_agent(),
  which picks root vs own based on reward_mode.
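
The selection rule can be sketched as below. The function names and signatures are hypothetical (the PR's `_reward_for_agent` is not shown); only the two mode names and the rejection of 'root_only' come from the commit message.

```python
VALID_REWARD_MODES = ("team_credit", "per_agent_judge")


def validate_reward_mode(mode: str) -> str:
    """Reject anything outside the two supported modes, including the dropped 'root_only'."""
    if mode not in VALID_REWARD_MODES:
        raise ValueError(f"unknown reward_mode {mode!r}; expected one of {VALID_REWARD_MODES}")
    return mode


def reward_for_agent(mode: str, is_root: bool, root_reward: float, own_reward: float) -> float:
    """Pick the reward an agent's sequence is trained on."""
    validate_reward_mode(mode)
    if is_root or mode == "team_credit":
        return root_reward  # root always keeps its rule-based reward
    # per_agent_judge: the sub-agent keeps its judge score, clamped to [0, 1]
    return min(1.0, max(0.0, own_reward))


team = reward_for_agent("team_credit", is_root=False, root_reward=1.0, own_reward=0.2)
judged = reward_for_agent("per_agent_judge", is_root=False, root_reward=1.0, own_reward=0.2)
root = reward_for_agent("per_agent_judge", is_root=True, root_reward=1.0, own_reward=0.2)
```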

Tests:
- test_env.py: 10 tests (judge enabled/disabled, parse failures,
  network failures, clamping, model override, synth path unchanged).
- test_workflow.py: 12 tests (default mode, validation, two-mode tree
  matrix). All LLM calls are mocked -- no API key required to run.

Backwards-compatible defaults: yaml without reward_mode gets
team_credit, which corresponds to "broadcast root reward to all
agents". Note this differs from v5's root-only training (sub-agents
were filtered out); root_only is no longer a supported mode.
…raEnv

Two new utilities shared by the DeepDive pipeline:

- search.py: minimal httpx client for the CMU RAG server with a
  256-concurrency semaphore and 3-attempt exponential backoff. Backoff
  sleeps release the semaphore so degraded server periods don't
  thundering-herd onto starved slots.

- checklist.py: ChecklistGrader, an ai-rubric-style RubricChecklistFast
  port. A single LLM call generates 3-5 atomic criteria from the goal and
  scores them in one response, returning a holistic overall_score in
  [0, 1]. System prompt is verbatim from ai_rubric 0.2.4.
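
For the search client, the "release the semaphore while backing off" pattern can be sketched with asyncio alone. `call_with_backoff` and `flaky` are hypothetical; a real client would retry only on transport or 5xx errors rather than on every exception.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_backoff(
    fn: Callable[[], Awaitable[T]],
    semaphore: asyncio.Semaphore,
    attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Retry fn with exponential backoff, holding a slot only while a request is in flight."""
    for attempt in range(attempts):
        try:
            async with semaphore:
                return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise
        # The slot was released when the `async with` block exited,
        # so this sleep does not starve other callers.
        await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be >= 1")


calls = {"n": 0}


async def flaky() -> str:
    """Fail twice, then succeed (stand-in for a degraded server)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server busy")
    return "ok"


async def main() -> str:
    semaphore = asyncio.Semaphore(256)
    return await call_with_backoff(flaky, semaphore, base_delay_s=0.01)


result = asyncio.run(main())
```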
Port of platoon's DeepDive RL recipe to AstraFlow.

- workflow_cls "deepdive_recursive": recursive web-research agent loop
  with <action type="search|spawn|finish">{JSON}</action> format.
- env.py: search (CMU RAG), spawn (sub-agent), finish actions. Reward
  routing: root task -> binary-success LLM judge against ground truth
  (verbatim platoon rubric); sub-agent task -> ChecklistGrader.
- Workflow stamps sample-weighted group_reward_mean over root rewards
  and group_reward_std=1.0 on every emitted sequence, matching
  platoon's mean-only centering (no std normalization).
- reward_mode selector (team_credit | per_agent_judge) for credit
  assignment experiments.
- Recipe: qwen3-4b-recursive with bs=256, lr=3e-6, total 500 steps,
  filter_zero_adv re-enabled, delegation_lambda=0.
- dnd_process_response: type-aware rule-based scoring ported faithfully
  from platoon. int->int uses 0.75^|gap| partial credit; str->str is
  exact-match after strip().lower(); list->list is Jaccard overlap.
  \boxed{...} extraction with parse_confidence label.
- Fix env.py routing: sub-agent task ids inherit the parent's dataset
  prefix and were incorrectly hitting the rule-based grader. Now any
  id containing "/sub_" routes to the LLM judge regardless of prefix.
- Add qwen3-4b-recursive-real recipe targeting the D&D split.
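
The type-aware scoring rules listed above for dnd_process_response can be sketched as a single function. `score_answer` is a hypothetical name; returning 0.0 on a type mismatch and 1.0 for two empty lists are assumptions, and \boxed{...} extraction is not shown.

```python
from typing import Any


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def score_answer(pred: Any, truth: Any) -> float:
    """Score a parsed prediction against ground truth by answer type."""
    if _is_int(pred) and _is_int(truth):
        return 0.75 ** abs(pred - truth)  # partial credit decays with the gap
    if isinstance(pred, str) and isinstance(truth, str):
        return 1.0 if pred.strip().lower() == truth.strip().lower() else 0.0
    if isinstance(pred, list) and isinstance(truth, list):
        a, b = set(pred), set(truth)
        union = a | b
        return len(a & b) / len(union) if union else 1.0  # Jaccard overlap
    return 0.0  # assumption: mismatched types earn nothing


exact_int = score_answer(5, 5)
near_int = score_answer(3, 5)
text = score_answer("  Elf ", "elf")
overlap = score_answer(["sword", "shield"], ["shield", "bow"])
```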
The producer was unconditionally overwriting group_reward_mean and
group_reward_std on every emitted sequence, blocking workflows from
publishing their own baseline. Now the producer only fills these fields
when the workflow has not already stamped them.

Motivation: recursive agents emit a variable number of sequences per
prompt (root + N sub-agents). Letting the producer compute group stats
over all sequences sequence-weights the baseline, pulling it toward
samples that happened to spawn more sub-agents. The DeepDive workflow
now stamps a sample-weighted mean over root rewards and std=1.0.
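
A small numeric sketch of the difference between the two baselines, plus the "fill only if unset" producer behavior. The episode dicts and helper names are hypothetical; the sketch assumes every sequence in an episode carries the root's reward.

```python
from typing import Dict, List

episodes: List[Dict[str, float]] = [
    {"root_reward": 1.0, "num_sequences": 5},  # root + 4 sub-agents
    {"root_reward": 0.0, "num_sequences": 1},  # root only
]


def sample_weighted_mean(eps: List[Dict[str, float]]) -> float:
    """Each episode counts once, regardless of how many sequences it emitted."""
    return sum(e["root_reward"] for e in eps) / len(eps)


def sequence_weighted_mean(eps: List[Dict[str, float]]) -> float:
    """Each emitted sequence counts once, so large trees dominate the baseline."""
    total = sum(e["num_sequences"] for e in eps)
    return sum(e["root_reward"] * e["num_sequences"] for e in eps) / total


def fill_group_stats(seq: Dict[str, float], mean: float, std: float) -> None:
    """Producer-side fill: keep values the workflow already stamped."""
    seq.setdefault("group_reward_mean", mean)
    seq.setdefault("group_reward_std", std)


by_sample = sample_weighted_mean(episodes)
by_sequence = sequence_weighted_mean(episodes)

stamped = {"group_reward_mean": by_sample, "group_reward_std": 1.0}
fill_group_stats(stamped, mean=by_sequence, std=0.4)  # no effect: already stamped

unstamped: Dict[str, float] = {}
fill_group_stats(unstamped, mean=by_sequence, std=0.4)
```

Here the episode that spawned four sub-agents pulls the sequence-weighted baseline to 5/6, while the sample-weighted baseline stays at 0.5.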
Include the question and ground_truth in episode dump files so rollouts
can be inspected without cross-referencing the dataset.

Bump dump_prob 0.05 -> 0.25 (train + eval) for denser sampling during
debugging (~64 dumps/step at bs=256).

Add qwen3-4b-recursive-v7 recipe variant (trial_name suffixed -v7).
…rs 5

transformers>=5 makes apply_chat_template(tokenize=True) return a
BatchEncoding (a Mapping) instead of a flat list[int]. Workflows that did
list(apply_chat_template(...)) then got the dict keys
(['input_ids','attention_mask']), which were sent to the inference engine
and rejected with HTTP 400, breaking every agent/recursive/multi-agent
recipe on a transformers-5 env.

Add a shared apply_chat_template_to_ids() helper in hf_utils that defaults
tokenize=True, forwards enable_thinking with a TypeError fallback, and
extracts input_ids when a Mapping is returned. Route every workflow that
builds token ids through it; the recursive textcraft/oolong/deepdive
workflows keep an equivalent inline guard. rlvr already passed
return_dict=False and is unaffected.
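
The failure and the guard can be reproduced without transformers, using a plain dict in place of BatchEncoding. `to_token_ids` is a hypothetical reduction of the helper described above; the real apply_chat_template_to_ids also handles tokenize and enable_thinking, which are omitted here.

```python
from collections.abc import Mapping
from typing import Any, List


def to_token_ids(encoded: Any) -> List[int]:
    """Return a flat list of token ids from either a Mapping or a plain sequence."""
    if isinstance(encoded, Mapping):
        encoded = encoded["input_ids"]
    return list(encoded)


# Stand-in for what a transformers-5 tokenizer returns (a Mapping).
new_style = {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]}
# Stand-in for the older flat return value.
old_style = [101, 2023, 102]

buggy = list(new_style)            # iterating a Mapping yields its keys
fixed_new = to_token_ids(new_style)
fixed_old = to_token_ids(old_style)
```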
At 2048 the recursive agent tree fans out to ~16k concurrent generate
requests, saturating the 4-GPU SGLang so no episode ever completes and the
trainer hangs at step 0 waiting for data. 512 keeps the live agent count
~2.4k and lets episodes finish; validated across a full 105-step run and a
clean 5-recipe smoke test on sglang 0.5.12.
Oolong is not supported in the latest version. Remove the
oolong_recursive workflow (env, tasks, eval_helpers, tests, jsonl data),
the oolong_success reward stub, the oolong dataset loaders, and both
examples/oolong recipes. Drop the two oolong import lines from the
workflow package __init__ and a stale "Oolong" mention in the deepdive
recipe comments.

The workflow and reward registries load cleanly without oolong; no
references remain in source.
Drop examples/deepdive (qwen3-4b-recursive, -v7): DeepDive needs a search
server we don't currently run. The deepdive workflow, reward, and dataset
code stay in place and registered, ready to use again once a search
backend is available.

Also drop the experimental TextCraft variants
qwen3-4b-recursive-{gen4k,lr5e6}, keeping the base qwen3-4b-recursive
(the recipe actually in use).
Update the package version (astraflow/version.py and
train_worker/version.py) and all version references in the docs and
READMEs from 0.1.0 to 0.1.1: docs Sphinx version, sidebar badge, docs
index title, and the astraflowai/astraflow image tag in docker/README
and the installation guide. Add a v0.1.1 News entry to README (keeping
the v0.1.0 record).

Matches the already-published astraflowai/astraflow:v0.1.1 image
(CUDA 13 / SGLang 0.5.12); no Docker rebuild required.
Move examples/textcraft -> examples/textcraft-recursive-agent (the recipe
is now examples/textcraft-recursive-agent/qwen3-4b-recursive) and update
the path references in the recipe scripts' usage comments and the
textcraft-recursive doc. Directory depth is unchanged, so the scripts'
repo-root resolution is unaffected. The workflow code
(astraflow/core/workflow/impl/textcraft) and the experiment_name/
trial_name identifiers are intentionally left as-is.
haizhongzheng and others added 9 commits June 3, 2026 10:43
Add docs/web/examples/textcraft-recursive-8agent-episode.txt — a real
rollout dump from the textcraft recursive-agent recipe where the root
agent spawns 7 sub-agents (all succeed, reward 1.0) to craft the dye and
material intermediates in parallel. Placed under docs/web so the
animation pages can fetch it; the ROOT/SUB depth/parent markers and
spawn actions provide the agent tree and timing for visualization.
The prebuilt `transformer-engine[pytorch]` wheels link libcublas.so.12 and
fail to load on the CUDA 13 base image (and a CUDA 13 host install) with
`ImportError: libcublas.so.12: cannot open shared object file`. The astraflow
v0.1.1 stack is torch 2.11+cu130 / CUDA 13, so the wheel path is broken.

Build TE from source (release_v2.13) against the CUDA 13 toolkit instead,
with `nvidia-mathdx==25.6.0` supplying the build-time cuBLASDx / cuDNN
frontend headers (mirrors slime's CUDA-13 recipe). Apply the same fix in
docker/Dockerfile.sglang.megatron and the optional Megatron step in the
installation guide.

Verified: image astraflowai/astraflow:v0.1.1-megatron builds and
`import transformer_engine.pytorch` succeeds on CUDA 13 (was the
libcublas.so.12 ImportError before); apex and MegatronEngine also import.
The astraflowai/astraflow:v0.1.1.megatron tag (Transformer Engine + apex,
built from Dockerfile.sglang.megatron) was published but never referenced
in the docs. Map both images to their training backend so users pick the
right one: v0.1.1 for the FSDP backend (default), v0.1.1.megatron for the
Megatron-LM backend (TP/PP/EP, MoE).

- docker/README.md: list both tags in the pull and run sections
- installation.md: split Option B by backend; cross-link from Step 5
- qwen3-8b-megatron-delta/README.md: note the recipe needs the .megatron image
Two pre-merge correctness fixes for v0.1.1:

- weight_manager: add a buffer-overflow guard in _offload_megatron_hf,
  mirroring _copy_all_gather. Without it, an export/metadata size
  disagreement at inactive-buffer index 0 silently spills into the active
  half the sender is shipping, corrupting weights with no error. The guard
  raises at the write site instead.

- textcraft: default depth_level_weighting to False. The 1/(depth+1)
  weighting on raw reward gives sub-agents backwards credit; the shipped
  recipe already disabled it, but the in-code default would silently apply
  it to any recipe that omits the flag.
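
A toy model of the overflow guard on a two-half transfer buffer. `DoubleBuffer` is a hypothetical stand-in for the WeightManager buffer, using a bytearray instead of pinned shared memory; it only shows where the check sits relative to the write.

```python
class DoubleBuffer:
    """Two equal halves in one allocation (hypothetical stand-in)."""

    def __init__(self, half_size: int) -> None:
        self.half_size = half_size
        self.data = bytearray(2 * half_size)

    def write(self, half: int, offset: int, payload: bytes) -> int:
        """Write payload into the given half; return the next offset within that half."""
        end = offset + len(payload)
        if end > self.half_size:
            # Without this check, a write into half 0 would run on into half 1.
            raise ValueError(
                f"write of {len(payload)} bytes at offset {offset} exceeds half size {self.half_size}"
            )
        base = half * self.half_size
        self.data[base + offset : base + end] = payload
        return end


buf = DoubleBuffer(half_size=8)
buf.write(half=1, offset=0, payload=b"ACTIVE!!")  # the half the sender is shipping
next_offset = buf.write(half=0, offset=0, payload=b"abcdef")

try:
    buf.write(half=0, offset=next_offset, payload=b"ghij")  # 6 + 4 > 8
    overflowed = False
except ValueError:
    overflowed = True
```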
Add a runnable README for the qwen3-4b-recursive TextCraft recipe and flesh
out the docs recipe page with a spawn animation, validation-accuracy curve,
and a reference to the Recursive Agent Optimization paper
(arXiv:2605.06639). Correct the settings table to match experiment.yaml
(lr, batch size, max_staleness, total_train_steps, depth_level_weighting)
and link the example from the examples index.

Also remove the math-spawn recipe page from the docs and its toctree entry.
Relocate the qwen3-4b-recursive README up to examples/textcraft-recursive-agent/
so it's the landing page for the recipe family, and fix the asset paths for
the new depth.
Add a news entry linking the TextCraft recursive-agent recipe docs, and fix
the v0.1.1 release date to 2026/06.
fix(megatron): build Transformer Engine from source for CUDA 13

3 participants