[WIP] Profiling, board configs, weight streaming + memory-footprint (reference) by booth-algo · Pull Request #58 · AICrossSim/PLENA_Simulator

booth-algo · 2026-05-28T00:09:33Z

Rebased onto current main and reduced to the work that is unique to this branch. The board-config YAMLs + profiler board-loader that originally lived here have landed separately via #87, so those commits are dropped to avoid duplication; the isa_analysis → asm_profiler rename is also dropped (main standardised on isa_analysis.py, which #87 extended).

What remains, re-implemented against current main:

Weight-streaming memory models (`a856568`)

lib/memory/streaming.rs adds two models, HostStream (direct host→SRAM streaming that bypasses DDR3 for weights) and LayerSwapping (capacity-limited DDR3 that swaps layer weights in on demand), plus a JSON WeightManifest. They are selected with a new --memory-model {hbm,layer-swap,host-stream} CLI flag and configured through --weight-manifest, --ddr3-capacity, and --host-bandwidth.
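To illustrate the LayerSwapping idea, here is a minimal Python sketch of a capacity-limited store that loads layer weights on first use. It is not the Rust implementation: the manifest schema, the `LayerSwappingSketch` class, and the least-recently-used eviction policy are all assumptions made for the example, and an oversized layer raises here purely to keep the sketch short.

```python
import json
from collections import OrderedDict

# Hypothetical manifest schema: a list of layers with a name and a size in bytes.
MANIFEST_JSON = (
    '{"layers": ['
    '{"name": "attn0", "bytes": 600}, '
    '{"name": "ffn0", "bytes": 500}, '
    '{"name": "attn1", "bytes": 600}]}'
)


def load_manifest(text):
    """Parse the (assumed) manifest into a name -> bytes mapping."""
    return {layer["name"]: layer["bytes"] for layer in json.loads(text)["layers"]}


class LayerSwappingSketch:
    """Capacity-limited store that swaps a layer's weights in on demand.

    Eviction is least-recently-used; that policy is an assumption.
    """

    def __init__(self, manifest, capacity_bytes):
        self.manifest = manifest
        self.capacity_bytes = capacity_bytes
        self.resident = OrderedDict()  # layer name -> bytes, oldest first
        self.swap_count = 0
        self.swap_bytes = 0

    def ensure_resident(self, name):
        """Return True if the layer had to be swapped in, False if already resident."""
        size = self.manifest[name]
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return False
        if size > self.capacity_bytes:
            # A single layer that cannot fit at all; this sketch simply raises.
            raise ValueError(f"layer {name} ({size} B) exceeds capacity")
        # Evict the oldest layers until the new one fits.
        while sum(self.resident.values()) + size > self.capacity_bytes:
            self.resident.popitem(last=False)
        self.resident[name] = size
        self.swap_count += 1
        self.swap_bytes += size
        return True


manifest = load_manifest(MANIFEST_JSON)
model = LayerSwappingSketch(manifest, capacity_bytes=1200)
for layer_name in ["attn0", "ffn0", "attn1", "attn0"]:
    model.ensure_resident(layer_name)
```

With a 1200-byte capacity, every access in this sequence misses, so the sketch records four swaps; raising the capacity to 1700 bytes would keep all three layers resident and reduce that to three.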

The original wiring lived in the old monolithic main.rs; the emulator has since been refactored (dispatch/Accelerator moved into src/accelerator/, main.rs is now ~28 lines), so the integration was re-implemented in runner.rs: the backing store is preloaded up front, then Arc<dyn ErasedMemoryModel> is built from a match on opts.memory_model. The default --memory-model hbm path is byte-for-byte identical to before (verified: HBM byte-statistics line, Latency, and vram_dump.bin all unchanged), and cargo build --release is clean with no warnings. Streaming-model statistics aren't yet surfaced through the erased trait; that is left as a follow-up.

Profiler DDR3 latency (`b30ad94`)

SimulatorCycleModel gains prefetch_v/prefetch_m/store_v_cycles (default 0); cycle_model_from_board reads them from the board memory: section, so --board nexys_a7 now charges DDR3 H_PREFETCH/H_STORE latency (V=25, M=400, store=25 cyc). The defaults are 0, so the plena_settings.toml path and every existing caller keep the prior async behaviour unchanged.
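As a rough illustration of the default-0 behaviour, the sketch below builds a cycle model from an already-parsed board config (a plain dict standing in for the YAML). The `*_sketch` names, the key names under `memory`, and the `memory_cycles` helper are hypothetical; only the three field names and the nexys_a7 values come from the description above.

```python
from dataclasses import dataclass


@dataclass
class SimulatorCycleModelSketch:
    # Memory-latency fields named in the PR; 0 means "not charged" (async).
    prefetch_v_cycles: int = 0
    prefetch_m_cycles: int = 0
    store_v_cycles: int = 0


def cycle_model_from_board_sketch(board):
    """Build a cycle model from a parsed board config (a plain dict here).

    The key names under "memory" are hypothetical. A missing section or key
    falls back to 0, which preserves the uncharged behaviour.
    """
    memory = board.get("memory", {})
    return SimulatorCycleModelSketch(
        prefetch_v_cycles=int(memory.get("prefetch_v_cycles", 0)),
        prefetch_m_cycles=int(memory.get("prefetch_m_cycles", 0)),
        store_v_cycles=int(memory.get("store_v_cycles", 0)),
    )


def memory_cycles(model, n_prefetch_v, n_prefetch_m, n_store_v):
    """Total memory cycles charged for a given mix of memory operations."""
    return (
        n_prefetch_v * model.prefetch_v_cycles
        + n_prefetch_m * model.prefetch_m_cycles
        + n_store_v * model.store_v_cycles
    )


# Values quoted for nexys_a7: V=25, M=400, store=25 cycles.
nexys_a7 = {
    "memory": {
        "prefetch_v_cycles": 25,
        "prefetch_m_cycles": 400,
        "store_v_cycles": 25,
    }
}
charged = memory_cycles(cycle_model_from_board_sketch(nexys_a7), 4, 2, 3)
uncharged = memory_cycles(cycle_model_from_board_sketch({}), 4, 2, 3)
```

For four vector prefetches, two matrix prefetches, and three stores, the nexys_a7 values charge 4·25 + 2·400 + 3·25 = 975 cycles, while a config with no memory section charges 0.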

Docs (SMOLVLM2_ISA_PROFILE.md, memory-footprint-and-streaming.md) are included with the streaming commit.

…in the ISA profiler (#87)

* feat(profiler): per-board cycle models - load board_configs YAMLs in isa_analysis.py

Adds board_configs/nexys_a7.yaml (Artix-7 XC7A200T / Nexys Video) and v80.yaml (Alveo V80), plus load_board_config / cycle_model_from_board / a --board CLI in the ISA profiler, so a program's cycle cost can be scored against a specific FPGA's per-op latencies instead of plena_settings.toml.

Only the board's compute cost model (the latency: section) is consumed; async memory timing (H_PREFETCH/H_STORE) stays uncharged, matching the behaviour simulator's async memory model.

Extracted from the WIP profiling branch (sim PR #58). Deliberately excludes that branch's docs, the Rust DDR3/streaming memory model (lib/memory/streaming.rs, cli.rs, main.rs) and its serde Cargo deps, and the stale ATEN_UNROLL compare-harness edits that predate the ATEN_OPS_UNROLL rename.

Verified: the profiler runs against a real decoder ASM for both nexys_a7 and v80.

* style: ruff format isa_analysis.py (wrap --board argparse line)

…pping) wired into --memory-model

Adds lib/memory/streaming.rs (HostStream = direct host->SRAM streaming bypassing DDR3 for weights; LayerSwapping = capacity-limited DDR3 that swaps layer weights on demand; WeightManifest parsed from JSON), and re-implements HBM-model instantiation in runner.rs (the emulator was refactored since this work began: dispatch/Accelerator moved out of the monolithic main.rs into src/accelerator/, so the original main.rs wiring no longer applied).

runner.rs now preloads the backing store up front and builds Arc<dyn ErasedMemoryModel> via match opts.memory_model {hbm,layer-swap,host-stream}. The default --memory-model hbm path is byte-for-byte identical (verified: HBM stats, Latency, vram_dump unchanged); cargo build --release clean. Streaming-model statistics are not yet surfaced through the erased trait (follow-up).

…fig memory cycles

SimulatorCycleModel gains prefetch_v/prefetch_m/store_v_cycles (default 0); cycle_model_from_board reads them from the board memory: section, so --board nexys_a7 now charges DDR3 latency (V=25, M=400, store=25 cyc). Defaults are 0, so the plena_settings.toml path and every existing caller keep the prior async behaviour unchanged.

…ryModel

Adds MemoryModel::statistics_summary (default None) forwarded via ErasedMemoryModel::box_statistics_summary, overridden by WithStats (the existing HBM byte-stats line), HostStream (weight/activation stream counts), and LayerSwapping (swap count/bytes/time).

runner.rs now reports stats uniformly for any --memory-model via the trait, dropping the 'statistics unavailable for streaming models' fallback; the concrete hbm_concrete handle is kept only for the DEBUG-only HBM dump (data read-back isn't on the trait).

Default --memory-model hbm output is byte-for-byte unchanged (verified: Bytes read 49152 / written 0 / 5.22e9 bytes/sec, Latency 9410ns); cargo build + cargo fmt clean.
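The default-None pattern can be sketched in Python as a base method that concrete models override. The class names and summary strings below are invented for illustration and do not reproduce the real output format. The wrapper also forwards its inner model's summary, which is the behaviour the later review fix describes.

```python
class MemoryModel:
    """Base model: by default there are no statistics to report."""

    def statistics_summary(self):
        return None


class HostStreamSketch(MemoryModel):
    """Stand-in for a streaming model that counts its own events."""

    def __init__(self):
        self.weight_streams = 0
        self.activation_streams = 0

    def statistics_summary(self):
        return (
            f"host-stream: {self.weight_streams} weight / "
            f"{self.activation_streams} activation streams"
        )


class WithStatsSketch(MemoryModel):
    """Wrapper that tracks byte counts and appends the inner model's summary."""

    def __init__(self, inner):
        self.inner = inner
        self.bytes_read = 0
        self.bytes_written = 0

    def statistics_summary(self):
        own = f"bytes read {self.bytes_read} / written {self.bytes_written}"
        inner = self.inner.statistics_summary()
        # A plain inner model returns None, so its line is simply omitted.
        return own if inner is None else f"{own}\n{inner}"


def report(model):
    """Uniform reporting path: works for any model, with or without stats."""
    summary = model.statistics_summary()
    return summary if summary is not None else ""


plain = WithStatsSketch(MemoryModel())
plain.bytes_read = 49152

stream = HostStreamSketch()
stream.weight_streams = 3
wrapped = WithStatsSketch(stream)
```

Because the caller only ever goes through `statistics_summary`, it needs no special case per model: `report(plain)` yields one line and `report(wrapped)` yields two.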

Resolves the confirmed correctness + cleanup findings from the PR #58 review:

(1) WithStats::statistics_summary now appends the inner model's summary so HostStream/LayerSwapping stats actually surface (were shadowed by the outer WithStats); the HBM path is unchanged (inner returns None).
(2) HostStream's inner is now wrapped in WithTiming(ddr3) so activation reads incur DDR3 latency (matches the doc; LayerSwap already did this).
(3+6) --host-bandwidth / --ddr3-capacity are now non-Option with a value_parser that rejects 0 (was a divide-by-zero panic via Some(0) slipping past unwrap_or; also removes the dead Option/unwrap_or).
(4+8) bytes->nanos extracted to transfer_nanos() with a u128 intermediate, fixing u64 overflow for layers larger than ~18 GB and de-duplicating the two call sites.
(5) ensure_resident now warns when a single layer's weights exceed DDR3 capacity (restores the dropped guard).
(7) clarifying comment that the DDR4-2400 preset approximates the board's DDR3-1600.

Default --memory-model hbm output byte-for-byte unchanged; cargo build + fmt clean; --host-bandwidth 0 now rejected at the CLI.
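The ~18 GB figure in (4+8) follows from the arithmetic: bytes × 10⁹ exceeds 2⁶⁴ once bytes passes roughly 1.84 × 10¹⁰. The Python sketch below emulates the 64-bit product by masking, which assumes wrapping overflow (an assumption about the old behaviour), and uses Python's unbounded integers as a stand-in for the u128 intermediate. The function bodies are illustrative and not copied from the Rust code.

```python
U64_MASK = (1 << 64) - 1
NANOS_PER_SEC = 1_000_000_000


def transfer_nanos_u64(num_bytes, bytes_per_sec):
    """Emulate computing bytes * 1e9 in 64 bits (wraps on overflow)."""
    return ((num_bytes * NANOS_PER_SEC) & U64_MASK) // bytes_per_sec


def transfer_nanos(num_bytes, bytes_per_sec):
    """Wide-intermediate version; Python ints stand in for u128 here."""
    if bytes_per_sec == 0:
        # The PR rejects 0 at the CLI; this guard is only for the sketch.
        raise ValueError("bandwidth must be non-zero")
    return num_bytes * NANOS_PER_SEC // bytes_per_sec


# 20 GB over a 1 GB/s link should take 20 s.
layer_bytes = 20_000_000_000
link_bytes_per_sec = 1_000_000_000

correct_ns = transfer_nanos(layer_bytes, link_bytes_per_sec)
wrapped_ns = transfer_nanos_u64(layer_bytes, link_bytes_per_sec)

# Largest byte count whose product with 1e9 still fits in 64 bits.
largest_safe_bytes = U64_MASK // NANOS_PER_SEC
```

Here `correct_ns` is 20 000 000 000 ns, while the emulated 64-bit path wraps to about 1.55 s, which would silently understate the transfer time for a large layer.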

booth-algo · 2026-06-04T00:55:16Z

Superseded by #90 — the same weight-streaming + DDR3-profiler work rebased cleanly onto main, plus the board memory-regime model (CapacityModel: resident / kv-swap / weight-stream) and the weight-manifest generator. Closing this WIP reference branch in favour of #90.

booth-algo marked this pull request as draft May 28, 2026 08:03

booth-algo changed the title ~~ASM profiler, board configs, and weight streaming memory models~~ [WIP] Profiling, board configs, weight streaming + memory-footprint (reference) Jun 2, 2026

This was referenced Jun 2, 2026

docs: memory footprint & host-link streaming analysis #78 (Closed)

feat(profiler): per-board cycle models (Artix-7 + V80 board configs) in the ISA profiler #87 (Merged)

booth-algo mentioned this pull request Jun 3, 2026

chore: bump PLENA_Compiler to merged main (#57 im2col, #58 causal-mask, #59 FFN roll) #88 (Merged)

booth-algo added a commit that referenced this pull request Jun 3, 2026

chore: re-pin PLENA_Compiler to merged main (74ff7a5 -> ebdba9e: #57 im2col, #58 causal, #59 ffn)

0d3704d

booth-algo added a commit that referenced this pull request Jun 3, 2026

chore: re-pin PLENA_Compiler to merged main (74ff7a5 -> ebdba9e: #57 im2col, #58 causal, #59 ffn) (#88)

e2e9d04

booth-algo force-pushed the kev/profile-and-opt branch from 9bb716d to b30ad94 Compare June 3, 2026 17:03

booth-algo added 2 commits June 3, 2026 18:18

booth-algo force-pushed the kev/profile-and-opt branch from b30ad94 to 6761f55 Compare June 3, 2026 17:18

booth-algo added 2 commits June 3, 2026 22:38

booth-algo mentioned this pull request Jun 4, 2026

Board-aware weight-streaming + capacity-regime memory models + weight-manifest generator #90 (Open)

booth-algo closed this Jun 4, 2026

booth-algo wants to merge 4 commits into main from kev/profile-and-opt
