Board-aware weight-streaming + capacity-regime memory models + weight-manifest generator by booth-algo · Pull Request #90 · AICrossSim/PLENA_Simulator

booth-algo · 2026-06-04T00:55:00Z

Supersedes #58 (closed). That branch was a long-lived WIP reference; this is the same weight-streaming + profiler work rebased cleanly onto main, plus the board memory-regime model and the weight-manifest generator, carrying only code (working docs stay in doc/_local per #89).

What's here

Weight-streaming memory models (lib/memory/src/streaming.rs, wired through --memory-model): HostStream (weights stream from the host link at a fixed bandwidth) and LayerSwapping (LRU layer residency), with the ErasedMemoryModel plumbing to surface their statistics.
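The LRU layer residency behind LayerSwapping can be illustrated with a small sketch. This is not the Rust code in streaming.rs: the class name `LayerResidency` and its fields are hypothetical, and only the idea (evict the least recently used layer when capacity is exceeded, and count swaps) is taken from the description above.

```python
from collections import OrderedDict


class LayerResidency:
    """Toy LRU model of which layers' weights currently sit in DDR3 (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.resident = OrderedDict()  # layer id -> size in bytes, oldest first
        self.used_bytes = 0
        self.swap_count = 0
        self.swap_bytes = 0

    def ensure_resident(self, layer, size_bytes):
        """Return True if the layer had to be swapped in, False on a hit."""
        if layer in self.resident:
            # Hit: mark the layer as most recently used.
            self.resident.move_to_end(layer)
            return False
        # Miss: evict least recently used layers until the new one fits.
        while self.resident and self.used_bytes + size_bytes > self.capacity_bytes:
            _, evicted_size = self.resident.popitem(last=False)
            self.used_bytes -= evicted_size
        self.resident[layer] = size_bytes
        self.used_bytes += size_bytes
        self.swap_count += 1
        self.swap_bytes += size_bytes
        return True


model = LayerResidency(capacity_bytes=200)
for layer in (0, 1, 0, 2, 1):
    model.ensure_resident(layer, 100)
```

With room for two 100-byte layers, the access pattern 0, 1, 0, 2, 1 misses four times: layer 1 is evicted to make room for layer 2, and then layer 0 is evicted when layer 1 returns.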

DDR3 profiler latency: the ISA profiler charges H_PREFETCH / H_STORE latency from each board config's DDR3 memory-cycle counts.

Per-board configs: nexys_a7.yaml → nexys_video.yaml (Artix-7 XC7A200T, 512 MB DDR3, USB-only host link — no PCIe) and a new custom_a7.yaml (same fabric and DDR3 but a PCIe Gen2 x4 host link, ~1.5 GB/s sustained). Both carry memory.capacity_bytes.

Capacity-aware memory model (CapacityModel): three DDR-residency regimes chosen from model footprint vs board capacity — resident (everything fits, pure DDR timing), kv-swap (weights pinned, KV/activations LRU-swapped from host on miss, no recompute), and weight-stream (weights stream from the host link, KV/activations resident). Exposed as --memory-model {auto,resident,kv-swap,weight-stream}; auto picks the regime from the manifest footprint vs --ddr3-capacity / the manifest's ddr_capacity_bytes. The default hbm model is byte-identical to before.
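A minimal sketch of how auto might pick a regime is shown here. The thresholds are an assumption inferred from the regime descriptions (everything fits, only the weights fit, not even the weights fit); the real choose_regime() may differ, and the footprint figures are illustrative values loosely based on the clm60m validation run.

```python
def choose_regime(footprint_by_kind, capacity_bytes):
    """Pick a DDR-residency regime from per-kind footprints (assumed rule)."""
    weights = footprint_by_kind.get("weight", 0)
    total = sum(footprint_by_kind.values())
    if total <= capacity_bytes:
        return "resident"       # everything fits in DDR
    if weights <= capacity_bytes:
        return "kv-swap"        # pin weights, swap KV/activations from host
    return "weight-stream"      # weights do not fit, stream them from host


# Illustrative footprint in bytes; the activation figure is an assumption.
footprint = {"weight": 4_259_840, "kv": 884_736, "activation": 999_424}

regime_512mib = choose_regime(footprint, 512 * 1024 * 1024)
regime_5mb = choose_regime(footprint, 5_000_000)
regime_1mb = choose_regime(footprint, 1_000_000)
```

Under this rule a 512 MiB board keeps the whole footprint resident, which agrees with the auto row in the validation table.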

Weight-manifest generator (testbench/make_weight_manifest.py + a dump hook in run_model.py): turns a real compile's HBM region maps into the kind-tagged WeightManifest the capacity model consumes, classifying each region as weight / KV / activation by the plena_frontend.py naming conventions and sizing it in MXFP8 bytes (hbm_sizes → tensor_layouts → address-delta), with overlap clamping and a uniform-layer full-model footprint estimate.

Validation

clm60m (1 decoder layer, mlen=256, custom_a7 PCIe 1.5 GB/s): the generated manifest is 9 weight / 4 KV / 6 activation regions, all sized from authoritative hbm_sizes. Running the emulator directly on the real compiled program:

| `--memory-model` | resident (B) | KV swapped (B) | weight streamed (B) | latency (ns) |
|---|---|---|---|---|
| auto (→ resident) | 6,144,000 | 0 | 0 | 988,444 |
| weight-stream | 1,884,160 | 0 | 4,259,840 | 5,611,832 |
| kv-swap | 4,259,840 | 884,736 | 0 | 1,568,646 |

Conservation holds (total HBM read = 6,144,000 across all three), auto reads the real footprint and picks resident, and the latency ordering is monotonic (resident < kv-swap < weight-stream).

Companion

The generator's KV sizing is exact when paired with AICrossSim/PLENA_Compiler#60, which emits per-region hbm_sizes. Without it the generator falls back to tensor_layouts (weights/activations exact) plus address-deltas for KV. The PLENA_Compiler submodule pin will bump once that merges.

…pping) wired into --memory-model Adds lib/memory/streaming.rs (HostStream = direct host->SRAM streaming bypassing DDR3 for weights; LayerSwapping = capacity-limited DDR3 that swaps layer weights on demand; WeightManifest parsed from JSON), and re-implements HBM-model instantiation in runner.rs (the emulator was refactored since this work began: dispatch/Accelerator moved out of the monolithic main.rs into src/accelerator/, so the original main.rs wiring no longer applied). runner.rs now preloads the backing store up front and builds Arc<dyn ErasedMemoryModel> via match opts.memory_model {hbm,layer-swap,host-stream}. The default --memory-model hbm path is byte-for-byte identical (verified: HBM stats, Latency, vram_dump unchanged); cargo build --release clean. Streaming-model statistics are not yet surfaced through the erased trait (follow-up).

…fig memory cycles SimulatorCycleModel gains prefetch_v/prefetch_m/store_v_cycles (default 0); cycle_model_from_board reads them from the board memory: section, so --board nexys_a7 now charges DDR3 latency (V=25, M=400, store=25 cyc). Defaults are 0, so the plena_settings.toml path and every existing caller keep the prior async behaviour unchanged.
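As a rough illustration of the charging rule, the sketch below multiplies per-opcode counts by the board's memory cycle costs. The opcode names `H_PREFETCH_V`, `H_PREFETCH_M`, and `H_STORE_V` and the function name are hypothetical; only the cycle values (V=25, M=400, store=25) and the default of 0 come from the commit message.

```python
def memory_latency_cycles(op_counts, cycle_costs):
    """Sum DDR3 cycles charged for prefetch/store ops; unknown ops cost 0."""
    return sum(count * cycle_costs.get(op, 0) for op, count in op_counts.items())


# Cycle costs quoted for the board config; opcode names are assumptions.
board_costs = {"H_PREFETCH_V": 25, "H_PREFETCH_M": 400, "H_STORE_V": 25}
# With no board memory section the costs default to 0 (prior async behaviour).
default_costs = {"H_PREFETCH_V": 0, "H_PREFETCH_M": 0, "H_STORE_V": 0}

# Hypothetical instruction mix for illustration only.
op_counts = {"H_PREFETCH_V": 4, "H_PREFETCH_M": 2, "H_STORE_V": 1}

charged = memory_latency_cycles(op_counts, board_costs)      # 4*25 + 2*400 + 1*25
uncharged = memory_latency_cycles(op_counts, default_costs)
```

The zero defaults are what keep the plena_settings.toml path unchanged: every prefetch and store contributes nothing unless a board config supplies the costs.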

…ryModel Adds MemoryModel::statistics_summary (default None) forwarded via ErasedMemoryModel::box_statistics_summary, overridden by WithStats (the existing HBM byte-stats line), HostStream (weight/activation stream counts), and LayerSwapping (swap count/bytes/time). runner.rs now reports stats uniformly for any --memory-model via the trait, dropping the 'statistics unavailable for streaming models' fallback; the concrete hbm_concrete handle is kept only for the DEBUG-only HBM dump (data read-back isn't on the trait). Default --memory-model hbm output is byte-for-byte unchanged (verified: Bytes read 49152 / written 0 / 5.22e9 bytes/sec, Latency 9410ns); cargo build + cargo fmt clean.

Resolves the confirmed correctness + cleanup findings from the PR #58 review:
(1) WithStats::statistics_summary now appends the inner model's summary so HostStream/LayerSwapping stats actually surface (they were shadowed by the outer WithStats); the HBM path is unchanged (inner returns None).
(2) HostStream's inner is now wrapped in WithTiming(ddr3) so activation reads incur DDR3 latency (matches the doc; LayerSwap already did this).
(3+6) --host-bandwidth / --ddr3-capacity are now non-Option with a value_parser that rejects 0 (was a divide-by-zero panic via Some(0) slipping past unwrap_or; also removes the dead Option/unwrap_or).
(4+8) bytes->nanos extracted to transfer_nanos() with a u128 intermediate, fixing u64 overflow for layers larger than ~18 GB and de-duplicating the two call sites.
(5) ensure_resident now warns when a single layer's weights exceed DDR3 capacity (restores the dropped guard).
(7) Clarifying comment that the DDR4-2400 preset approximates the board's DDR3-1600.
Default --memory-model hbm output byte-for-byte unchanged; cargo build + fmt clean; --host-bandwidth 0 now rejected at the CLI.
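The ~18 GB figure follows from the arithmetic: bytes × 10⁹ must fit in a u64, and 2⁶⁴ / 10⁹ ≈ 18.4 × 10⁹ bytes. The sketch below reproduces that bound and the bytes-to-nanoseconds conversion in Python, whose integers are arbitrary precision and so play the role of the u128 intermediate. Rounding down and raising on zero bandwidth are assumptions; the Rust transfer_nanos() may handle both differently.

```python
U64_MAX = (1 << 64) - 1
NANOS_PER_SEC = 1_000_000_000


def overflows_u64(num_bytes):
    """True if num_bytes * 1e9 would not fit in a u64 intermediate."""
    return num_bytes * NANOS_PER_SEC > U64_MAX


def transfer_nanos(num_bytes, bandwidth_bytes_per_sec):
    """Nanoseconds to move num_bytes at the given bandwidth (floor, assumed)."""
    if bandwidth_bytes_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes * NANOS_PER_SEC // bandwidth_bytes_per_sec


# Largest byte count whose product with 1e9 still fits in a u64.
largest_safe = U64_MAX // NANOS_PER_SEC

# Streaming the 4,259,840 weight bytes from the validation run over each
# board's host link (bandwidths taken from the board configs).
pcie_ns = transfer_nanos(4_259_840, 1_500_000_000)
usb_ns = transfer_nanos(4_259_840, 4_750_000)
```

This covers only the raw host-link transfer time, not the emulator's full latency, which also includes DDR3 timing and compute.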

…a7 (PCIe Gen2 x4) The board mislabelled nexys_a7 is actually the Nexys Video (XC7A200T, USB-only via FT2232H, no PCIe) — renamed, and its usb2_bandwidth_mbps replaced with a structured host_link {type: usb2, bandwidth_bytes_per_sec: 4750000}. Added custom_a7.yaml: same XC7A200T/512MB DDR3 but host_link {type: pcie_gen2_x4, bandwidth_bytes_per_sec: 1.5e9} (2.0 GB/s theoretical, ~1.5 GB/s sustained). Both gain capacity_bytes=512MiB for the capacity-aware memory model. Updated --board default to nexys_video and the latency preset nexys_a7_150mhz -> nexys_video_150mhz. Profiler verified on both boards; ruff clean.

… weight-stream regimes Adds RegionKind {Weight,Kv,Activation} (serde default Weight, back-compat with the weight-only manifest) + ddr_capacity_bytes to WeightManifest, and a CapacityModel<T> that models the three DDR-fit regimes: Resident (footprint <= capacity: everything in DDR, no host transfer), KvSwap (weights pinned in DDR; KV/activations LRU-swapped to host on capacity pressure and reloaded on access, no recompute), and WeightStream (weights streamed from host, HostStream semantics). choose_regime() auto-selects from footprint_by_kind vs capacity; --memory-model gains {auto,resident,kv-swap,weight-stream}. Capacity = --ddr3-capacity (or manifest ddr_capacity_bytes); host bandwidth = --host-bandwidth (board host_link). Stats surface via WithStats/FIX-1. hbm/layer-swap/host-stream unchanged; default hbm byte-for-byte identical (9410ns). Verified regime ordering resident(13454ns) < kv-swap(47457ns) < weight-stream(849689ns).

…iles make_weight_manifest.py turns a compile's hbm_addrs / hbm_sizes / tensor_layouts into the capacity-model WeightManifest (weight / kv / activation regions, sized in MXFP8 bytes, plus the board's DDR capacity), so the resident / kv-swap / weight-stream regimes run on a real workload instead of a synthetic manifest. run_model.py dumps the HBM region maps next to the ISA (covers --compile-only and full runs). Classifies by the real plena_frontend.py name conventions (decoder W_*, vision V_W_/V_B_/V_LN1_/V_LN2_/V_POST_LN_/V_PATCH_*/V_CONNECTOR_* weights; K_stored_/V_stored_ KV; everything else activation), with size precedence hbm_sizes -> tensor_layouts(prod*1.125) -> address-delta, overlap clamping, and a uniform-layer full-model footprint estimate. Validated on clm60m (1 layer, mlen=256): 9 weight / 4 kv / 6 activation, all sized from authoritative hbm_sizes; emulator regimes fire correctly on real addresses (weight-stream streams 4.26 MB of weight reads; kv-swap swaps 884 KB of real KV; auto -> resident).
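The classification and size-precedence rules can be sketched as follows. This is not make_weight_manifest.py itself: the function names, the treatment of the listed names as string prefixes, the dict shapes of `hbm_sizes`, `tensor_layouts`, and `hbm_addrs`, and the fallback for the last region are all assumptions made for illustration.

```python
import math

# Name conventions quoted in the commit message, treated here as prefixes.
WEIGHT_PREFIXES = ("W_", "V_W_", "V_B_", "V_LN1_", "V_LN2_",
                   "V_POST_LN_", "V_PATCH_", "V_CONNECTOR_")
KV_PREFIXES = ("K_stored_", "V_stored_")
MXFP8_BYTES_PER_ELEMENT = 1.125  # factor quoted for tensor_layouts sizing


def classify(name):
    """Tag a region as weight / kv / activation from its name."""
    if name.startswith(KV_PREFIXES):
        return "kv"
    if name.startswith(WEIGHT_PREFIXES):
        return "weight"
    return "activation"


def region_size(name, hbm_sizes, tensor_layouts, hbm_addrs):
    """Size precedence: hbm_sizes, then tensor_layouts, then address-delta."""
    if name in hbm_sizes:
        return hbm_sizes[name], "hbm_sizes"
    if name in tensor_layouts:
        elements = math.prod(tensor_layouts[name])
        return math.ceil(elements * MXFP8_BYTES_PER_ELEMENT), "tensor_layouts"
    # Address-delta: distance to the next region that starts above this one.
    base = hbm_addrs[name]
    higher = [addr for addr in hbm_addrs.values() if addr > base]
    if higher:
        return min(higher) - base, "address-delta"
    return 0, "unknown"  # last region with no other size source (assumed)


# Hypothetical region maps for illustration only.
hbm_addrs = {"W_q_0": 0, "K_stored_0": 73_728, "hidden": 100_000}
hbm_sizes = {"hidden": 4_096}
tensor_layouts = {"W_q_0": [256, 256]}

sizes = {n: region_size(n, hbm_sizes, tensor_layouts, hbm_addrs) for n in hbm_addrs}
kinds = {n: classify(n) for n in hbm_addrs}
```

Checking the KV prefixes first keeps `V_stored_` regions from ever being compared against the vision `V_` weight prefixes, although none of the listed weight prefixes would match them anyway.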

These are AI-authored working references, not shipped docs; move them out of tracking into gitignored doc/_local/ (consistent with #89) so this PR carries only code.

Add --board to run_model.py: load the board YAML and write its latency: block into the per-build plena_settings.toml -- DC_EN from the board's dc_lib_en flag, and each per-op cycle cost into the column DC_EN selects (dc_lib_en when enabled, else dc_lib_dis). Previously run_model.py passed dc_en=None, so the emulator ran on the DC-library defaults (DC_EN=1) for every build regardless of board; now it runs on the board's actual per-op latencies. The board's memory: cycle params still feed the analytic profiler, not the emulator's memory model.
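The column selection described above could look roughly like this. The board dictionary layout, the function name, and the output shape are hypothetical; the real board YAML schema and plena_settings.toml structure are not shown in this PR text. The cycle values are the DC_LIB_DIS depths quoted in the RTL-alignment commit.

```python
def latency_settings(board):
    """Build per-op latency settings, writing each cost into the DC_EN column."""
    dc_en = bool(board.get("dc_lib_en", False))
    column = "dc_lib_en" if dc_en else "dc_lib_dis"
    settings = {"DC_EN": int(dc_en)}
    for op, cycles in board["latency"].items():
        # Only the column that DC_EN selects is written for each op.
        settings[op] = {column: cycles}
    return settings


# Hypothetical in-memory form of a board config (schema assumed).
board = {
    "dc_lib_en": False,
    "latency": {"vector_add": 9, "vector_mul": 7, "vector_exp": 15},
}

settings = latency_settings(board)
```

Deriving DC_EN from the board flag, rather than leaving it unset, is what stops every build from silently running on the DC-library defaults.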

… RTL Update nexys_video + custom_a7 latency: blocks to the RTL DC_LIB_DIS pipeline depths (configuration.svh): systolic_processing_overhead 0->8, vector_add 2->9, vector_mul 5->7, vector_exp 6->15, vector_reci 7->8, vector_sum 20->30. Corrects the emulator/profiler under-counting of vector FP latency (softmax/LayerNorm exp + sum-reduce) and the previously zero systolic fill/drain. Per-line provenance comments cite the RTL block.

booth-algo added 8 commits June 4, 2026 01:52

chore: keep streaming/profile working docs local-only (doc/_local)

b8318c6


booth-algo mentioned this pull request Jun 4, 2026

[WIP] Profiling, board configs, weight streaming + memory-footprint (reference) #58

Closed

booth-algo added 2 commits June 4, 2026 15:53

booth-algo wants to merge 10 commits into main from feat/board-memory-regimes
