
docs: memory footprint & host-link streaming analysis #78

Closed
booth-algo wants to merge 2 commits into main from docs/memory-footprint-streaming


@booth-algo

Collaborator

Adds doc/memory-footprint-and-streaming.md — reasoning notes on:

  • Footprint: SmolVLM2-256M at MXFP8 is ≈ 288 MB of weights plus tens of MB of KV cache, so it fits in the 512 MB of onboard DDR. The "~1 GB" figure seen on GPU is CUDA overhead that PLENA doesn't have.
  • Implication: the HostStream / LayerSwapping models (lib/memory/src/streaming.rs) target the case where weights exceed DDR capacity; for this model, keep everything resident in DDR with no streaming.
  • Instructions: load the program into DDR once; don't stream it per cycle.
  • DDR3 vs host link: a bandwidth table plus the latency comparison (local DDR3 ~15–50 ns vs PCIe ~0.5–1 µs), and the transport-vs-memory framing. For the actual boards: Nexys local DDR3 (~1–3.2 GB/s) ≫ USB2 (~38 MB/s, no PCIe); V80 HBM (819 GB/s) ≫ PCIe.
  • Real bottleneck: DDR bandwidth (~3–4 tok/s on Nexys), not the host link.
  • Validation: run LayerSwapping with capacity = 512 MB and confirm zero swaps. Note that the emulator currently models DDR as 128 GiB, so the limit isn't enforced today.
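
The footprint and bottleneck bullets above reduce to two lines of arithmetic. The following is a minimal sketch, not code from the repository: the function names are hypothetical, MB is taken as 10^6 bytes, and the decode bound assumes every generated token reads all weights from DDR exactly once (a simple roofline-style assumption that ignores KV traffic and compute time).

```rust
/// Returns true when weights plus KV cache fit in onboard DDR.
/// All sizes are in MB (10^6 bytes); the names are illustrative only.
fn fits_in_ddr(weights_mb: f64, kv_mb: f64, ddr_capacity_mb: f64) -> bool {
    weights_mb + kv_mb <= ddr_capacity_mb
}

/// Upper bound on decode rate in tokens per second, assuming each token
/// streams the full weight set from memory once and nothing else limits it.
fn bandwidth_bound_tok_per_s(weights_mb: f64, bandwidth_mb_per_s: f64) -> f64 {
    bandwidth_mb_per_s / weights_mb
}
```

With the 288 MB weight figure and an assumed 64 MB of KV cache, `fits_in_ddr(288.0, 64.0, 512.0)` holds. At an assumed effective ~1 GB/s of DDR3 bandwidth, `bandwidth_bound_tok_per_s(288.0, 1000.0)` comes out at roughly 3.5 tok/s, in line with the ~3–4 tok/s estimate for Nexys.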

Captured from a design discussion; no code changes.
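
The validation step in the list above (run with a 512 MB capacity and expect zero swaps) can be illustrated with a toy model. This is a hypothetical stand-in, not the LayerSwapping API in lib/memory/src/streaming.rs: it assumes FIFO eviction, and the split of the 288 MB weight figure into equal chunks is arbitrary.

```rust
use std::collections::VecDeque;

/// Toy layer-residency model (hypothetical, FIFO eviction).
/// Walks the layers in order for `passes` forward passes and counts how
/// many resident layers had to be evicted to make room.
fn count_evictions(layer_sizes_mb: &[f64], capacity_mb: f64, passes: usize) -> usize {
    let mut resident: VecDeque<usize> = VecDeque::new(); // layer indices, oldest first
    let mut used_mb = 0.0;
    let mut evictions = 0;
    for _ in 0..passes {
        for (idx, &size) in layer_sizes_mb.iter().enumerate() {
            if resident.contains(&idx) {
                continue; // already resident in DDR, no transfer needed
            }
            // Evict the oldest layers until the incoming one fits.
            while used_mb + size > capacity_mb {
                match resident.pop_front() {
                    Some(old) => {
                        used_mb -= layer_sizes_mb[old];
                        evictions += 1;
                    }
                    None => break, // a single layer exceeds capacity
                }
            }
            resident.push_back(idx);
            used_mb += size;
        }
    }
    evictions
}
```

Splitting 288 MB into 32 equal 9 MB chunks (an arbitrary choice), `count_evictions(&vec![9.0; 32], 512.0, 3)` returns 0, while shrinking the capacity to 128 MB forces evictions on every pass. Because the emulator currently models DDR as 128 GiB, the real check only becomes meaningful once the capacity is set explicitly.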

Notes on why SmolVLM2-256M (~288 MB weights at MXFP8 + KV) fits in 512 MB onboard
DDR and needs no weight streaming, plus a DDR3-vs-host-link (USB2/PCIe)
bandwidth and latency comparison for the Nexys Video and V80 targets, and how to
validate the footprint with the LayerSwapping model.

A PCIe 2.0 x4 host link (~1.6–1.8 GB/s effective) reaches parity with the local
single-chip DDR3, so weight streaming (HostStream/LayerSwapping) flips from
hopeless (USB2) to viable; the main value is lifting the 512 MB capacity ceiling.
Adds the board-table row, a §5.1 subsection (numbers, implications, latency
caveat, perf sketch), and a bottom-line note.
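
The "hopeless to viable" claim can be checked with the same back-of-envelope model. This is a rough sketch under stated assumptions, not the perf sketch from §5.1: the function name is hypothetical, every token is assumed to pull the full weight set across the host link, transfers are assumed to overlap perfectly with compute, and latency is ignored.

```rust
/// Upper bound on decode rate (tokens/s) when weights are streamed from the
/// host for every token. The slower of the host link and local DDR sets the
/// pace. Sizes in MB (10^6 bytes), bandwidths in MB/s; names are illustrative.
fn streamed_tok_per_s(weights_mb: f64, link_mb_per_s: f64, ddr_mb_per_s: f64) -> f64 {
    link_mb_per_s.min(ddr_mb_per_s) / weights_mb
}
```

For 288 MB of weights and an assumed ~1 GB/s of local DDR3, a USB2 link at ~38 MB/s gives `streamed_tok_per_s(288.0, 38.0, 1000.0)` ≈ 0.13 tok/s, whereas a PCIe 2.0 x4 link at ~1.6 GB/s is no longer the limiter and the bound returns to the DDR-limited ~3.5 tok/s.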
@booth-algo booth-algo marked this pull request as draft June 1, 2026 13:33
booth-algo added a commit that referenced this pull request Jun 2, 2026
Consolidates the memory-footprint / PCIe-vs-DDR3 streaming reference doc
(previously PR #78) into this combined WIP reference PR.
@booth-algo

Collaborator Author

Consolidated into #58 (combined WIP reference PR) — the memory-footprint doc now lives on kev/profile-and-opt.
