[WIP] Profiling, board configs, weight streaming + memory-footprint (reference)#58

Closed
booth-algo wants to merge 4 commits into
mainfrom
kev/profile-and-opt

@booth-algo booth-algo commented May 28, 2026

Collaborator

Rebased onto current main and reduced to the work that is unique to this branch. The board-config YAMLs + profiler board-loader that originally lived here have landed separately via #87, so those commits are dropped to avoid duplication; the isa_analysis → asm_profiler rename is also dropped (main standardised on isa_analysis.py, which #87 extended).

What remains, re-implemented against current main:

Weight-streaming memory models (a856568)

lib/memory/streaming.rs: HostStream (direct host→SRAM streaming, bypassing DDR3 for weights) and LayerSwapping (capacity-limited DDR3 that swaps layer weights on demand), plus a JSON WeightManifest. Exposed via a new --memory-model {hbm,layer-swap,host-stream} CLI flag, with supporting flags --weight-manifest, --ddr3-capacity, and --host-bandwidth.

The original wiring lived in the old monolithic main.rs. The emulator has since been refactored (dispatch/Accelerator moved into src/accelerator/, main.rs is now ~28 lines), so the integration was re-implemented in runner.rs: the backing store is preloaded up front, then an Arc<dyn ErasedMemoryModel> is built from a match on opts.memory_model. The default --memory-model hbm path is byte-for-byte identical to before (verified: the HBM byte-statistics line, Latency, and vram_dump.bin are all unchanged), and cargo build --release is clean with no warnings. Streaming-model statistics aren't yet surfaced through the erased trait; that is left as a follow-up.
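The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the code in runner.rs: `MemoryModelKind`, `Opts`, `build_memory_model`, the `describe` method, and the struct fields are all hypothetical names, and the real `ErasedMemoryModel` trait has a different (larger) surface. Only the three flag spellings and the `Arc<dyn ErasedMemoryModel>` shape are taken from the description.

```rust
use std::sync::Arc;

/// Value of `--memory-model`; the three spellings come from the PR text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MemoryModelKind {
    Hbm,
    LayerSwap,
    HostStream,
}

impl MemoryModelKind {
    /// Parse the flag value, rejecting anything outside the three known models.
    pub fn parse(s: &str) -> Result<Self, String> {
        match s {
            "hbm" => Ok(MemoryModelKind::Hbm),
            "layer-swap" => Ok(MemoryModelKind::LayerSwap),
            "host-stream" => Ok(MemoryModelKind::HostStream),
            other => Err(format!("unknown memory model: {}", other)),
        }
    }
}

/// Hypothetical, much-reduced stand-in for the erased trait.
pub trait ErasedMemoryModel {
    fn describe(&self) -> String;
}

pub struct Hbm;
pub struct LayerSwapping {
    pub ddr3_capacity_bytes: u64,
}
pub struct HostStream {
    pub host_bytes_per_sec: u64,
}

impl ErasedMemoryModel for Hbm {
    fn describe(&self) -> String {
        "hbm".to_string()
    }
}

impl ErasedMemoryModel for LayerSwapping {
    fn describe(&self) -> String {
        format!("layer-swap (DDR3 capacity {} bytes)", self.ddr3_capacity_bytes)
    }
}

impl ErasedMemoryModel for HostStream {
    fn describe(&self) -> String {
        format!("host-stream ({} bytes/sec)", self.host_bytes_per_sec)
    }
}

/// Hypothetical subset of the parsed CLI options.
pub struct Opts {
    pub memory_model: MemoryModelKind,
    pub ddr3_capacity_bytes: u64,
    pub host_bytes_per_sec: u64,
}

/// One match picks the concrete model; callers only ever see the trait object.
pub fn build_memory_model(opts: &Opts) -> Arc<dyn ErasedMemoryModel> {
    match opts.memory_model {
        MemoryModelKind::Hbm => Arc::new(Hbm),
        MemoryModelKind::LayerSwap => Arc::new(LayerSwapping {
            ddr3_capacity_bytes: opts.ddr3_capacity_bytes,
        }),
        MemoryModelKind::HostStream => Arc::new(HostStream {
            host_bytes_per_sec: opts.host_bytes_per_sec,
        }),
    }
}
```

Keeping the choice in a single match means the rest of the runner does not need to know which model is active, which is presumably why the default hbm path could stay byte-for-byte identical.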

Profiler DDR3 latency (b30ad94)

SimulatorCycleModel gains prefetch_v/prefetch_m/store_v_cycles (default 0); cycle_model_from_board reads them from the board memory: section, so --board nexys_a7 now charges DDR3 H_PREFETCH/H_STORE latency (V=25, M=400, store=25 cycles). Because the defaults are 0, the plena_settings.toml path and every existing caller keep the prior async behaviour unchanged.
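The zero-default behaviour can be illustrated with a small sketch. The profiler itself is Python (isa_analysis.py); Rust is used here only to stay consistent with the emulator code. `MemoryCycles`, `from_board_memory`, `charge`, and the three key names are assumptions made for illustration, not the profiler's real API or the board YAML's real schema. The 25/400/25 figures are the nexys_a7 values quoted above.

```rust
use std::collections::HashMap;

/// Per-op memory latencies in cycles. `Default` gives all zeros, which
/// corresponds to the uncharged (async) behaviour.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct MemoryCycles {
    pub prefetch_v_cycles: u64,
    pub prefetch_m_cycles: u64,
    pub store_v_cycles: u64,
}

impl MemoryCycles {
    /// Read latencies from a board's `memory:` section, modelled here as a
    /// plain map. A missing section or missing key falls back to 0.
    pub fn from_board_memory(section: Option<&HashMap<String, u64>>) -> Self {
        let get = |key: &str| section.and_then(|m| m.get(key).copied()).unwrap_or(0);
        MemoryCycles {
            prefetch_v_cycles: get("prefetch_v_cycles"),
            prefetch_m_cycles: get("prefetch_m_cycles"),
            store_v_cycles: get("store_v_cycles"),
        }
    }

    /// Cycles charged for a given count of each memory op.
    pub fn charge(&self, n_prefetch_v: u64, n_prefetch_m: u64, n_store_v: u64) -> u64 {
        n_prefetch_v * self.prefetch_v_cycles
            + n_prefetch_m * self.prefetch_m_cycles
            + n_store_v * self.store_v_cycles
    }
}
```

With the nexys_a7 numbers, two vector prefetches, one matrix prefetch, and one store would cost 2·25 + 400 + 25 = 475 cycles, while a model built without a memory section charges 0 for the same program.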

Docs (SMOLVLM2_ISA_PROFILE.md, memory-footprint-and-streaming.md) are included with the streaming commit.

@booth-algo booth-algo marked this pull request as draft May 28, 2026 08:03
@booth-algo booth-algo changed the title ASM profiler, board configs, and weight streaming memory models [WIP] Profiling, board configs, weight streaming + memory-footprint (reference) Jun 2, 2026
booth-algo added a commit that referenced this pull request Jun 3, 2026
…in the ISA profiler (#87)

* feat(profiler): per-board cycle models — load board_configs YAMLs in isa_analysis.py

Adds board_configs/nexys_a7.yaml (Artix-7 XC7A200T / Nexys Video) and v80.yaml (Alveo V80), plus load_board_config / cycle_model_from_board / a --board CLI in the ISA profiler, so a program's cycle cost can be scored against a specific FPGA's per-op latencies instead of plena_settings.toml. Only the board's compute cost model (the latency: section) is consumed; async memory timing (H_PREFETCH/H_STORE) stays uncharged, matching the behaviour simulator's async memory model. Extracted from the WIP profiling branch (sim PR #58) — deliberately excludes that branch's docs, the Rust DDR3/streaming memory model (lib/memory/streaming.rs, cli.rs, main.rs) and its serde Cargo deps, and the stale ATEN_UNROLL compare-harness edits that predate the ATEN_OPS_UNROLL rename. Verified: the profiler runs against a real decoder ASM for both nexys_a7 and v80.

* style: ruff format isa_analysis.py (wrap --board argparse line)
booth-algo added a commit that referenced this pull request Jun 3, 2026
booth-algo added a commit that referenced this pull request Jun 3, 2026
@booth-algo booth-algo force-pushed the kev/profile-and-opt branch from 9bb716d to b30ad94 Compare June 3, 2026 17:03
…pping) wired into --memory-model

Adds lib/memory/streaming.rs (HostStream = direct host->SRAM streaming bypassing DDR3 for weights; LayerSwapping = capacity-limited DDR3 that swaps layer weights on demand; WeightManifest parsed from JSON), and re-implements HBM-model instantiation in runner.rs (the emulator was refactored since this work began: dispatch/Accelerator moved out of the monolithic main.rs into src/accelerator/, so the original main.rs wiring no longer applied). runner.rs now preloads the backing store up front and builds Arc<dyn ErasedMemoryModel> via match opts.memory_model {hbm,layer-swap,host-stream}. The default --memory-model hbm path is byte-for-byte identical (verified: HBM stats, Latency, vram_dump unchanged); cargo build --release clean. Streaming-model statistics are not yet surfaced through the erased trait (follow-up).
…fig memory cycles

SimulatorCycleModel gains prefetch_v/prefetch_m/store_v_cycles (default 0); cycle_model_from_board reads them from the board memory: section, so --board nexys_a7 now charges DDR3 latency (V=25, M=400, store=25 cyc). Defaults are 0, so the plena_settings.toml path and every existing caller keep the prior async behaviour unchanged.
@booth-algo booth-algo force-pushed the kev/profile-and-opt branch from b30ad94 to 6761f55 Compare June 3, 2026 17:18
…ryModel

Adds MemoryModel::statistics_summary (default None) forwarded via ErasedMemoryModel::box_statistics_summary, overridden by WithStats (the existing HBM byte-stats line), HostStream (weight/activation stream counts), and LayerSwapping (swap count/bytes/time). runner.rs now reports stats uniformly for any --memory-model via the trait, dropping the 'statistics unavailable for streaming models' fallback; the concrete hbm_concrete handle is kept only for the DEBUG-only HBM dump (data read-back isn't on the trait). Default --memory-model hbm output is byte-for-byte unchanged (verified: Bytes read 49152 / written 0 / 5.22e9 bytes/sec, Latency 9410ns); cargo build + cargo fmt clean.
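The "default None, overridden where stats exist" pattern might look roughly like this. It is a reduced sketch: the real `MemoryModel` trait has other methods, and `PlainModel`, `report_lines`, the `HostStream` fields, and the summary wording are hypothetical.

```rust
/// Reduced stand-in for the trait: only the new hook is shown.
pub trait MemoryModel {
    /// Models without statistics inherit this and report nothing.
    fn statistics_summary(&self) -> Option<String> {
        None
    }
}

/// A model that does not override the hook.
pub struct PlainModel;
impl MemoryModel for PlainModel {}

/// Hypothetical counters for the host-streaming model.
pub struct HostStream {
    pub weight_streams: u64,
    pub activation_streams: u64,
}

impl MemoryModel for HostStream {
    fn statistics_summary(&self) -> Option<String> {
        Some(format!(
            "weight streams: {}, activation streams: {}",
            self.weight_streams, self.activation_streams
        ))
    }
}

/// Uniform reporting: collect a line only from models that have one, so no
/// per-model special case or "unavailable" fallback is needed.
pub fn report_lines(models: &[Box<dyn MemoryModel>]) -> Vec<String> {
    models.iter().filter_map(|m| m.statistics_summary()).collect()
}
```

A default method keeps existing implementors source-compatible, which fits the requirement that the hbm output stay unchanged.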
Resolves the confirmed correctness + cleanup findings from the PR #58 review:

* (1) WithStats::statistics_summary now appends the inner model's summary so HostStream/LayerSwapping stats actually surface (were shadowed by the outer WithStats); the HBM path is unchanged (inner returns None).
* (2) HostStream's inner is now wrapped in WithTiming(ddr3) so activation reads incur DDR3 latency (matches the doc; LayerSwapping already did this).
* (3+6) --host-bandwidth / --ddr3-capacity are now non-Option with a value_parser that rejects 0 (was a divide-by-zero panic via Some(0) slipping past unwrap_or; also removes the dead Option/unwrap_or).
* (4+8) bytes->nanos extracted to transfer_nanos() with a u128 intermediate, fixing u64 overflow for layers larger than ~18 GB and de-duplicating the two call sites.
* (5) ensure_resident now warns when a single layer's weights exceed DDR3 capacity (restores the dropped guard).
* (7) clarifying comment that the DDR4-2400 preset approximates the board's DDR3-1600.

Default --memory-model hbm output byte-for-byte unchanged; cargo build + fmt clean; --host-bandwidth 0 now rejected at the CLI.
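Findings (3+6) and (4+8) can be illustrated together. This is a sketch under assumptions, not the branch's code: only the name `transfer_nanos` and the u128 intermediate come from the commit message. The signature, the use of `NonZeroU64`, the saturating cast, and `parse_nonzero` (a plain function standing in for the CLI value parser) are hypothetical.

```rust
use std::num::NonZeroU64;

/// Time to move `bytes` at `bytes_per_sec`, in nanoseconds.
/// The multiplication is done in u128: `bytes * 1_000_000_000` overflows u64
/// once `bytes` exceeds u64::MAX / 1e9, roughly 18.4e9 bytes.
pub fn transfer_nanos(bytes: u64, bytes_per_sec: NonZeroU64) -> u64 {
    let nanos = (bytes as u128 * 1_000_000_000u128) / bytes_per_sec.get() as u128;
    // Saturate rather than wrap if the result itself does not fit in u64.
    nanos.min(u64::MAX as u128) as u64
}

/// Reject 0 when the value is parsed, so the division above cannot see it.
/// `NonZeroU64`'s `FromStr` implementation already refuses "0".
pub fn parse_nonzero(s: &str) -> Result<NonZeroU64, String> {
    s.parse::<NonZeroU64>()
        .map_err(|e| format!("expected a non-zero integer, got {:?}: {}", s, e))
}
```

For example, 20 000 000 000 bytes at 1 000 000 000 bytes/sec is 20 s, i.e. 20 000 000 000 ns. The same product computed directly in u64 would overflow, which is the case the u128 intermediate covers.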
@booth-algo

Collaborator Author

Superseded by #90 — the same weight-streaming + DDR3-profiler work rebased cleanly onto main, plus the board memory-regime model (CapacityModel: resident / kv-swap / weight-stream) and the weight-manifest generator. Closing this WIP reference branch in favour of #90.

@booth-algo booth-algo closed this Jun 4, 2026