Board-aware weight-streaming + capacity-regime memory models + weight-manifest generator #90
Open
booth-algo wants to merge 10 commits into
…pping) wired into --memory-model
Adds lib/memory/streaming.rs (HostStream = direct host->SRAM streaming bypassing DDR3 for weights; LayerSwapping = capacity-limited DDR3 that swaps layer weights on demand; WeightManifest parsed from JSON), and re-implements HBM-model instantiation in runner.rs (the emulator was refactored since this work began: dispatch/Accelerator moved out of the monolithic main.rs into src/accelerator/, so the original main.rs wiring no longer applied). runner.rs now preloads the backing store up front and builds Arc<dyn ErasedMemoryModel> via match opts.memory_model {hbm,layer-swap,host-stream}. The default --memory-model hbm path is byte-for-byte identical (verified: HBM stats, Latency, vram_dump unchanged); cargo build --release clean. Streaming-model statistics are not yet surfaced through the erased trait (follow-up).
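The LayerSwapping behaviour (capacity-limited DDR3 that swaps layer weights on demand) can be pictured with a small LRU sketch. This is illustrative Python, not the Rust in streaming.rs; the class name `LayerResidency`, its fields, and the exact eviction loop are assumptions made for the example.

```python
from collections import OrderedDict


class LayerResidency:
    """Toy LRU model of which layers' weights are resident in DDR3 (hypothetical)."""

    def __init__(self, capacity_bytes):
        if capacity_bytes <= 0:
            raise ValueError("capacity_bytes must be positive")
        self.capacity_bytes = capacity_bytes
        self.resident = OrderedDict()  # layer index -> size in bytes, oldest first
        self.swap_count = 0
        self.swap_bytes = 0

    def ensure_resident(self, layer, size_bytes):
        """Return True on a hit, False if the layer had to be swapped in."""
        if layer in self.resident:
            self.resident.move_to_end(layer)  # mark as most recently used
            return True
        # Evict least recently used layers until the new layer fits.
        while self.resident and sum(self.resident.values()) + size_bytes > self.capacity_bytes:
            self.resident.popitem(last=False)
        self.resident[layer] = size_bytes
        self.swap_count += 1
        self.swap_bytes += size_bytes
        return False
```

A layer larger than the whole capacity still gets inserted after evicting everything else in this sketch; the real model's handling of that case is not shown here.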
…fig memory cycles
SimulatorCycleModel gains prefetch_v/prefetch_m/store_v_cycles (default 0); cycle_model_from_board reads them from the board memory: section, so --board nexys_a7 now charges DDR3 latency (V=25, M=400, store=25 cyc). Defaults are 0, so the plena_settings.toml path and every existing caller keep the prior async behaviour unchanged.
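A minimal sketch of the zero-default behaviour, assuming the board config is already parsed into a dict and that the memory: keys are named prefetch_v_cycles, prefetch_m_cycles and store_v_cycles (the key names and the reduced field set are assumptions, not the actual class):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatorCycleModel:
    # Zero defaults preserve the prior behaviour for callers that pass nothing.
    prefetch_v_cycles: int = 0
    prefetch_m_cycles: int = 0
    store_v_cycles: int = 0


def cycle_model_from_board(board):
    """Build the cycle model from a parsed board config (a plain dict here)."""
    memory = board.get("memory") or {}  # a missing section falls back to zeros
    return SimulatorCycleModel(
        prefetch_v_cycles=int(memory.get("prefetch_v_cycles", 0)),
        prefetch_m_cycles=int(memory.get("prefetch_m_cycles", 0)),
        store_v_cycles=int(memory.get("store_v_cycles", 0)),
    )
```

With the values quoted above, a board whose memory: section holds 25 / 400 / 25 yields those cycle counts, while a board with no memory: section yields all zeros.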
…ryModel
Adds MemoryModel::statistics_summary (default None) forwarded via ErasedMemoryModel::box_statistics_summary, overridden by WithStats (the existing HBM byte-stats line), HostStream (weight/activation stream counts), and LayerSwapping (swap count/bytes/time). runner.rs now reports stats uniformly for any --memory-model via the trait, dropping the 'statistics unavailable for streaming models' fallback; the concrete hbm_concrete handle is kept only for the DEBUG-only HBM dump (data read-back isn't on the trait). Default --memory-model hbm output is byte-for-byte unchanged (verified: Bytes read 49152 / written 0 / 5.22e9 bytes/sec, Latency 9410ns); cargo build + cargo fmt clean.
Resolves the confirmed correctness + cleanup findings from the PR #58 review: (1) WithStats::statistics_summary now appends the inner model's summary so HostStream/LayerSwapping stats actually surface (were shadowed by the outer WithStats); the HBM path is unchanged (inner returns None). (2) HostStream's inner is now wrapped in WithTiming(ddr3) so activation reads incur DDR3 latency (matches the doc; LayerSwap already did this). (3+6) --host-bandwidth / --ddr3-capacity are now non-Option with a value_parser that rejects 0 (was a divide-by-zero panic via Some(0) slipping past unwrap_or; also removes the dead Option/unwrap_or). (4+8) bytes->nanos extracted to transfer_nanos() with a u128 intermediate, fixing u64 overflow for layers larger than ~18 GB and de-duplicating the two call sites. (5) ensure_resident now warns when a single layer's weights exceed DDR3 capacity (restores the dropped guard). (7) clarifying comment that the DDR4-2400 preset approximates the board's DDR3-1600. Default --memory-model hbm output byte-for-byte unchanged; cargo build + fmt clean; --host-bandwidth 0 now rejected at the CLI.
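The ~18 GB figure in (4+8) follows from the intermediate product: bytes * 1e9 exceeds u64::MAX once bytes passes 2^64 / 1e9, about 18.4e9. The sketch below reproduces that arithmetic in Python, whose unbounded integers stand in for the u128 intermediate; the floor rounding and the saturation at u64::MAX are assumptions, not taken from the Rust source.

```python
U64_MAX = 2**64 - 1
NANOS_PER_SEC = 1_000_000_000


def transfer_nanos(num_bytes, bandwidth_bytes_per_sec):
    """Nanoseconds to move num_bytes at the given bandwidth (floor division)."""
    if bandwidth_bytes_per_sec <= 0:
        raise ValueError("bandwidth must be positive")  # mirrors the CLI rejecting 0
    wide = num_bytes * NANOS_PER_SEC  # wide intermediate, cannot wrap
    return min(wide // bandwidth_bytes_per_sec, U64_MAX)  # saturate (assumed)


def overflows_u64(num_bytes):
    """True if num_bytes * 1e9 would not fit in a u64 intermediate."""
    return num_bytes * NANOS_PER_SEC > U64_MAX
```

For example, `overflows_u64(18_000_000_000)` is false and `overflows_u64(19_000_000_000)` is true, which brackets the threshold.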
…a7 (PCIe Gen2 x4)
The board mislabelled nexys_a7 is actually the Nexys Video (XC7A200T, USB-only via FT2232H, no PCIe) — renamed, and its usb2_bandwidth_mbps replaced with a structured host_link {type: usb2, bandwidth_bytes_per_sec: 4750000}. Added custom_a7.yaml: same XC7A200T/512MB DDR3 but host_link {type: pcie_gen2_x4, bandwidth_bytes_per_sec: 1.5e9} (2.0 GB/s theoretical, ~1.5 GB/s sustained). Both gain capacity_bytes=512MiB for the capacity-aware memory model. Updated --board default to nexys_video and the latency preset nexys_a7_150mhz -> nexys_video_150mhz. Profiler verified on both boards; ruff clean.
… weight-stream regimes
Adds RegionKind {Weight,Kv,Activation} (serde default Weight, back-compat with the weight-only manifest) + ddr_capacity_bytes to WeightManifest, and a CapacityModel<T> that models the three DDR-fit regimes: Resident (footprint <= capacity: everything in DDR, no host transfer), KvSwap (weights pinned in DDR, KV/activations LRU-swapped to host on capacity pressure, reload-on-access no recompute), WeightStream (weights streamed from host, HostStream semantics). choose_regime() auto-selects from footprint_by_kind vs capacity; --memory-model gains {auto,resident,kv-swap,weight-stream}. Capacity = --ddr3-capacity (or manifest ddr_capacity_bytes); host bandwidth = --host-bandwidth (board host_link). Stats surface via WithStats/FIX-1. hbm/layer-swap/host-stream unchanged; default hbm byte-for-byte identical (9410ns). Verified regime ordering resident(13454ns) < kv-swap(47457ns) < weight-stream(849689ns).
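One plausible reading of the auto-selection, sketched in Python rather than the Rust implementation. The Resident condition (footprint <= capacity) is stated above; the KvSwap threshold (weights alone fit in capacity) is an assumption inferred from "weights pinned in DDR", and the dict keys are hypothetical.

```python
from enum import Enum


class Regime(Enum):
    RESIDENT = "resident"
    KV_SWAP = "kv-swap"
    WEIGHT_STREAM = "weight-stream"


def choose_regime(footprint_by_kind, capacity_bytes):
    """Pick a regime from per-kind footprints (bytes) and DDR capacity (bytes)."""
    weight = footprint_by_kind.get("weight", 0)
    total = sum(footprint_by_kind.values())
    if total <= capacity_bytes:
        return Regime.RESIDENT  # everything fits in DDR
    if weight <= capacity_bytes:
        return Regime.KV_SWAP  # weights can be pinned; KV/activations swap (assumed rule)
    return Regime.WEIGHT_STREAM  # weights alone exceed DDR, so stream them from the host
```

With a 512 MiB capacity, a footprint of 400 MiB weights plus 200 MiB KV would land in kv-swap under this rule, and 600 MiB of weights would land in weight-stream.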
…iles
make_weight_manifest.py turns a compile's hbm_addrs / hbm_sizes / tensor_layouts into the capacity-model WeightManifest (weight / kv / activation regions, sized in MXFP8 bytes, plus the board's DDR capacity), so the resident / kv-swap / weight-stream regimes run on a real workload instead of a synthetic manifest. run_model.py dumps the HBM region maps next to the ISA (covers --compile-only and full runs). Classifies by the real plena_frontend.py name conventions (decoder W_*, vision V_W_/V_B_/V_LN1_/V_LN2_/V_POST_LN_/V_PATCH_*/V_CONNECTOR_* weights; K_stored_/V_stored_ KV; everything else activation), with size precedence hbm_sizes -> tensor_layouts(prod*1.125) -> address-delta, overlap clamping, and a uniform-layer full-model footprint estimate. Validated on clm60m (1 layer, mlen=256): 9 weight / 4 kv / 6 activation, all sized from authoritative hbm_sizes; emulator regimes fire correctly on real addresses (weight-stream streams 4.26 MB of weight reads; kv-swap swaps 884 KB of real KV; auto -> resident).
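The classification and size-precedence rules can be sketched as follows. This is not the code in make_weight_manifest.py: the function names, the assumption that tensor_layouts maps a region name to a shape tuple, and the round-up of fractional bytes are all illustrative, and overlap clamping is left out.

```python
import math

# Prefixes taken from the naming conventions listed above.
WEIGHT_PREFIXES = ("W_", "V_W_", "V_B_", "V_LN1_", "V_LN2_",
                   "V_POST_LN_", "V_PATCH_", "V_CONNECTOR_")
KV_PREFIXES = ("K_stored_", "V_stored_")
MXFP8_BYTES_PER_ELEMENT = 1.125  # the prod*1.125 factor quoted above


def classify_region(name):
    """Tag a region as 'weight', 'kv', or 'activation' from its name."""
    if name.startswith(KV_PREFIXES):
        return "kv"
    if name.startswith(WEIGHT_PREFIXES):
        return "weight"
    return "activation"  # everything else


def region_size(name, addr, next_addr, hbm_sizes, tensor_layouts):
    """Return (size_bytes, source) using hbm_sizes -> tensor_layouts -> address-delta."""
    if name in hbm_sizes:
        return hbm_sizes[name], "hbm_sizes"
    if name in tensor_layouts:
        elements = math.prod(tensor_layouts[name])  # shape tuple (assumed layout)
        return math.ceil(elements * MXFP8_BYTES_PER_ELEMENT), "tensor_layouts"
    if next_addr is not None:
        return max(next_addr - addr, 0), "address-delta"
    raise ValueError(f"no size information for region {name!r}")
```

A 256 x 256 tensor sized via the tensor_layouts path comes out at 73,728 bytes under this sketch.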
These are AI-authored working references, not shipped docs; move them out of tracking into gitignored doc/_local/ (consistent with #89) so this PR carries only code.
Add --board to run_model.py: load the board YAML and write its latency: block into the per-build plena_settings.toml -- DC_EN from the board's dc_lib_en flag, and each per-op cycle cost into the column DC_EN selects (dc_lib_en when enabled, else dc_lib_dis). Previously run_model.py passed dc_en=None, so the emulator ran on the DC-library defaults (DC_EN=1) for every build regardless of board; now it runs on the board's actual per-op latencies. The board's memory: cycle params still feed the analytic profiler, not the emulator's memory model.
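The column selection can be sketched as below, assuming a parsed board dict with a top-level dc_lib_en flag and a latency mapping of op name to cycle count. The nested output shape is a stand-in for the real plena_settings.toml layout, which is not reproduced here.

```python
def latency_settings(board):
    """Map a board's latency block into the column that DC_EN selects."""
    dc_en = bool(board.get("dc_lib_en", False))
    column = "dc_lib_en" if dc_en else "dc_lib_dis"
    settings = {"DC_EN": int(dc_en)}
    for op, cycles in (board.get("latency") or {}).items():
        # Only the selected column is written; the other keeps its default.
        settings[op] = {column: int(cycles)}
    return settings
```

For a board with dc_lib_en disabled and vector_add = 9, this yields DC_EN = 0 and writes 9 into the dc_lib_dis column for vector_add.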
… RTL
Update nexys_video + custom_a7 latency: blocks to the RTL DC_LIB_DIS pipeline depths (configuration.svh): systolic_processing_overhead 0->8, vector_add 2->9, vector_mul 5->7, vector_exp 6->15, vector_reci 7->8, vector_sum 20->30. Corrects the emulator/profiler under-counting of vector FP latency (softmax/LayerNorm exp + sum-reduce) and the previously zero systolic fill/drain. Per-line provenance comments cite the RTL block.
Supersedes #58 (closed). That branch was a long-lived WIP reference; this is the same weight-streaming + profiler work rebased cleanly onto main, plus the board memory-regime model and the weight-manifest generator, carrying only code (working docs stay in doc/_local per #89).

What's here

Weight-streaming memory models (lib/memory/src/streaming.rs, wired through --memory-model): HostStream (weights stream from the host link at a fixed bandwidth) and LayerSwapping (LRU layer residency), with the ErasedMemoryModel plumbing to surface their statistics.

DDR3 profiler latency: the ISA profiler charges H_PREFETCH / H_STORE latency from each board config's DDR3 memory-cycle counts.

Per-board configs: nexys_a7.yaml → nexys_video.yaml (Artix-7 XC7A200T, 512 MB DDR3, USB-only host link, no PCIe) and a new custom_a7.yaml (same fabric and DDR3 but a PCIe Gen2 x4 host link, ~1.5 GB/s sustained). Both carry memory.capacity_bytes.

Capacity-aware memory model (CapacityModel): three DDR-residency regimes chosen from model footprint vs board capacity: resident (everything fits, pure DDR timing), kv-swap (weights pinned, KV/activations LRU-swapped from host on miss, no recompute), and weight-stream (weights stream from the host link, KV/activations resident). Exposed as --memory-model {auto,resident,kv-swap,weight-stream}; auto picks the regime from the manifest footprint vs --ddr3-capacity / the manifest's ddr_capacity_bytes. The default hbm model is byte-identical to before.

Weight-manifest generator (testbench/make_weight_manifest.py + a dump hook in run_model.py): turns a real compile's HBM region maps into the kind-tagged WeightManifest the capacity model consumes, classifying each region as weight / KV / activation by the plena_frontend.py naming conventions and sizing it in MXFP8 bytes (hbm_sizes → tensor_layouts → address-delta), with overlap clamping and a uniform-layer full-model footprint estimate.

Validation
clm60m (1 decoder layer, mlen=256, custom_a7 PCIe 1.5 GB/s): the generated manifest is 9 weight / 4 KV / 6 activation, all sized from authoritative sizes. Running the emulator directly on the real compiled program:

[Table: results per --memory-model]

Conservation holds (total HBM read = 6,144,000 across all three), auto reads the real footprint and picks resident, and the latency ordering is monotonic (resident < kv-swap < weight-stream).

Companion

The generator's KV sizing is exact when paired with AICrossSim/PLENA_Compiler#60, which emits per-region hbm_sizes. Without it the generator falls back to tensor_layouts (weights/activations exact) plus address-deltas for KV. The PLENA_Compiler submodule pin will bump once that merges.