[WIP] Profiling, board configs, weight streaming + memory-footprint (reference) #58
Closed
booth-algo wants to merge 4 commits into
This was referenced Jun 2, 2026
booth-algo added a commit that referenced this pull request on Jun 3, 2026
…in the ISA profiler (#87)

* feat(profiler): per-board cycle models - load board_configs YAMLs in isa_analysis.py

Adds board_configs/nexys_a7.yaml (Artix-7 XC7A200T / Nexys Video) and v80.yaml (Alveo V80), plus load_board_config / cycle_model_from_board / a --board CLI in the ISA profiler, so a program's cycle cost can be scored against a specific FPGA's per-op latencies instead of plena_settings.toml. Only the board's compute cost model (the latency: section) is consumed; async memory timing (H_PREFETCH/H_STORE) stays uncharged, matching the behaviour simulator's async memory model.

Extracted from the WIP profiling branch (sim PR #58); it deliberately excludes that branch's docs, the Rust DDR3/streaming memory model (lib/memory/streaming.rs, cli.rs, main.rs) and its serde Cargo deps, and the stale ATEN_UNROLL compare-harness edits that predate the ATEN_OPS_UNROLL rename.

Verified: the profiler runs against a real decoder ASM for both nexys_a7 and v80.

* style: ruff format isa_analysis.py (wrap --board argparse line)
booth-algo added a commit that referenced this pull request on Jun 3, 2026
booth-algo added a commit that referenced this pull request on Jun 3, 2026
9bb716d to b30ad94
…pping) wired into --memory-model
Adds lib/memory/streaming.rs (HostStream = direct host->SRAM streaming bypassing DDR3 for weights; LayerSwapping = capacity-limited DDR3 that swaps layer weights on demand; WeightManifest parsed from JSON), and re-implements HBM-model instantiation in runner.rs (the emulator was refactored since this work began: dispatch/Accelerator moved out of the monolithic main.rs into src/accelerator/, so the original main.rs wiring no longer applied). runner.rs now preloads the backing store up front and builds Arc<dyn ErasedMemoryModel> via match opts.memory_model {hbm,layer-swap,host-stream}. The default --memory-model hbm path is byte-for-byte identical (verified: HBM stats, Latency, vram_dump unchanged); cargo build --release clean. Streaming-model statistics are not yet surfaced through the erased trait (follow-up).
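The selection step described above (a `match` on the chosen memory model that yields one type-erased handle) can be sketched roughly as follows. This is only an illustration: `MemoryModelKind`, `ErasedModel`, `build_model`, and the struct fields are hypothetical stand-ins, not the emulator's actual types, and the byte and bytes-per-second units are assumptions.

```rust
use std::str::FromStr;
use std::sync::Arc;

/// Hypothetical enum mirroring the three --memory-model values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryModelKind {
    Hbm,
    LayerSwap,
    HostStream,
}

impl FromStr for MemoryModelKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "hbm" => Ok(Self::Hbm),
            "layer-swap" => Ok(Self::LayerSwap),
            "host-stream" => Ok(Self::HostStream),
            other => Err(format!("unknown memory model: {}", other)),
        }
    }
}

/// Stand-in for a type-erased memory-model trait (assumed shape).
pub trait ErasedModel {
    fn describe(&self) -> String;
}

pub struct Hbm;
pub struct LayerSwapping {
    pub ddr3_capacity_bytes: u64,
}
pub struct HostStream {
    pub host_bandwidth_bytes_per_sec: u64,
}

impl ErasedModel for Hbm {
    fn describe(&self) -> String {
        "hbm".to_string()
    }
}

impl ErasedModel for LayerSwapping {
    fn describe(&self) -> String {
        format!("layer-swap (DDR3 capacity {} bytes)", self.ddr3_capacity_bytes)
    }
}

impl ErasedModel for HostStream {
    fn describe(&self) -> String {
        format!("host-stream ({} bytes/sec)", self.host_bandwidth_bytes_per_sec)
    }
}

/// Each arm builds a different concrete type; the declared return type
/// coerces all of them to the same `Arc<dyn ErasedModel>`.
pub fn build_model(
    kind: MemoryModelKind,
    ddr3_capacity_bytes: u64,
    host_bandwidth_bytes_per_sec: u64,
) -> Arc<dyn ErasedModel> {
    match kind {
        MemoryModelKind::Hbm => Arc::new(Hbm),
        MemoryModelKind::LayerSwap => Arc::new(LayerSwapping { ddr3_capacity_bytes }),
        MemoryModelKind::HostStream => Arc::new(HostStream {
            host_bandwidth_bytes_per_sec,
        }),
    }
}
```

Because the rest of the runner only sees the erased handle, the default `hbm` arm can stay untouched while new models are added as further arms.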
…fig memory cycles

SimulatorCycleModel gains prefetch_v/prefetch_m/store_v_cycles (default 0); cycle_model_from_board reads them from the board memory: section, so --board nexys_a7 now charges DDR3 latency (V=25, M=400, store=25 cyc). Defaults are 0, so the plena_settings.toml path and every existing caller keep the prior async behaviour unchanged.
b30ad94 to 6761f55
…ryModel

Adds MemoryModel::statistics_summary (default None) forwarded via ErasedMemoryModel::box_statistics_summary, overridden by WithStats (the existing HBM byte-stats line), HostStream (weight/activation stream counts), and LayerSwapping (swap count/bytes/time). runner.rs now reports stats uniformly for any --memory-model via the trait, dropping the 'statistics unavailable for streaming models' fallback; the concrete hbm_concrete handle is kept only for the DEBUG-only HBM dump (data read-back isn't on the trait). Default --memory-model hbm output is byte-for-byte unchanged (verified: Bytes read 49152 / written 0 / 5.22e9 bytes/sec, Latency 9410ns); cargo build + cargo fmt clean.
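A minimal sketch of the pattern this commit message describes: a trait method with a default of `None` that individual models override, so a caller can report statistics through a trait object without knowing the concrete type. The `Option<String>` return type, the struct names, and the `report` helper are assumptions made for illustration; the real signatures may differ.

```rust
/// Assumed shape of the trait; only the default-`None` method is shown.
pub trait MemoryModel {
    /// Models with nothing to report inherit this default.
    fn statistics_summary(&self) -> Option<String> {
        None
    }
}

/// Hypothetical model that keeps no statistics and uses the default.
pub struct PlainModel;

impl MemoryModel for PlainModel {}

/// Hypothetical streaming model that counts weight and activation streams.
pub struct HostStreamCounters {
    pub weight_streams: u64,
    pub activation_streams: u64,
}

impl MemoryModel for HostStreamCounters {
    fn statistics_summary(&self) -> Option<String> {
        Some(format!(
            "weight streams: {}, activation streams: {}",
            self.weight_streams, self.activation_streams
        ))
    }
}

/// Caller side: one code path for every model, via a trait object.
pub fn report(model: &dyn MemoryModel) -> String {
    model
        .statistics_summary()
        .unwrap_or_else(|| "no statistics".to_string())
}
```

With a default implementation, adding the method does not force every existing model to change, which is consistent with the default `hbm` output staying the same.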
Resolves the confirmed correctness + cleanup findings from the PR #58 review:

(1) WithStats::statistics_summary now appends the inner model's summary so HostStream/LayerSwapping stats actually surface (were shadowed by the outer WithStats); the HBM path is unchanged (inner returns None).
(2) HostStream's inner is now wrapped in WithTiming(ddr3) so activation reads incur DDR3 latency (matches the doc; LayerSwap already did this).
(3+6) --host-bandwidth / --ddr3-capacity are now non-Option with a value_parser that rejects 0 (was a divide-by-zero panic via Some(0) slipping past unwrap_or; also removes the dead Option/unwrap_or).
(4+8) bytes->nanos extracted to transfer_nanos() with a u128 intermediate, fixing u64 overflow for layers larger than ~18 GB and de-duplicating the two call sites.
(5) ensure_resident now warns when a single layer's weights exceed DDR3 capacity (restores the dropped guard).
(7) clarifying comment that the DDR4-2400 preset approximates the board's DDR3-1600.

Default --memory-model hbm output byte-for-byte unchanged; cargo build + fmt clean; --host-bandwidth 0 now rejected at the CLI.
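To illustrate the overflow in item (4+8): multiplying a byte count by 1,000,000,000 in `u64` overflows once the count exceeds roughly 18.4 GB (`u64::MAX / 10^9`), which matches the "~18 GB" figure. The sketch below shows one way a `u128` intermediate avoids this. The signature of `transfer_nanos`, the saturating behaviour, and the `transfer_nanos_naive` comparison helper are assumptions for illustration, not the code from the commit.

```rust
/// Nanoseconds needed to move `bytes` at `bytes_per_sec` (assumed signature).
/// The multiplication is done in u128 so it cannot overflow for any u64 input.
pub fn transfer_nanos(bytes: u64, bytes_per_sec: u64) -> u64 {
    // A zero bandwidth is assumed to be rejected earlier (e.g. at the CLI).
    assert!(bytes_per_sec > 0, "bandwidth must be non-zero");
    let nanos = (bytes as u128) * 1_000_000_000u128 / (bytes_per_sec as u128);
    // Saturate instead of wrapping if the result does not fit in u64.
    nanos.min(u64::MAX as u128) as u64
}

/// Hypothetical u64-only version for comparison: returns None when
/// `bytes * 1_000_000_000` overflows, or when the bandwidth is zero.
pub fn transfer_nanos_naive(bytes: u64, bytes_per_sec: u64) -> Option<u64> {
    bytes
        .checked_mul(1_000_000_000)
        .and_then(|n| n.checked_div(bytes_per_sec))
}
```

For example, a 20 GB layer at 1 GB/s overflows the `u64` product (the naive helper returns `None`), while the widened version returns 20,000,000,000 ns.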
Rebased onto current main and reduced to the work that is unique to this branch. The board-config YAMLs + profiler board-loader that originally lived here have landed separately via #87, so those commits are dropped to avoid duplication; the isa_analysis → asm_profiler rename is also dropped (main standardised on isa_analysis.py, which #87 extended).

What remains, re-implemented against current main:

Weight-streaming memory models (a856568)

lib/memory/streaming.rs: HostStream (direct host→SRAM streaming, bypassing DDR3 for weights) and LayerSwapping (capacity-limited DDR3 that swaps layer weights on demand), plus a JSON WeightManifest. Exposed via a new --memory-model {hbm,layer-swap,host-stream} CLI flag (--weight-manifest, --ddr3-capacity, --host-bandwidth).

The original wiring lived in the old monolithic main.rs; the emulator has since been refactored (dispatch/Accelerator moved into src/accelerator/, main.rs is now ~28 lines), so the integration was re-implemented in runner.rs: the backing store is preloaded up front, then Arc<dyn ErasedMemoryModel> is built from a match opts.memory_model. The default --memory-model hbm path is byte-for-byte identical to before (verified: HBM byte-statistics line, Latency, and vram_dump.bin all unchanged), and cargo build --release is clean with no warnings. Streaming-model statistics aren't yet surfaced through the erased trait (follow-up).

Profiler DDR3 latency (b30ad94)

SimulatorCycleModel gains prefetch_v/prefetch_m/store_v_cycles (default 0); cycle_model_from_board reads them from the board memory: section, so --board nexys_a7 now charges DDR3 H_PREFETCH/H_STORE latency (V=25, M=400, store=25 cyc). The defaults are 0, so the plena_settings.toml path and every existing caller keep the prior async behaviour unchanged.

Docs (SMOLVLM2_ISA_PROFILE.md, memory-footprint-and-streaming.md) are included with the streaming commit.