A real, published 2.4-billion-parameter language model answers a question, and the matrix-multiplications that produce the answer happen inside the DRAM cells of a regular DDR4 memory module — via charge-sharing, on real silicon.
▶ Interactive walkthrough of how it works: https://pcdeni.github.io/CaSA/explainer/ (11 stepped scenes from cell to inference loop, every claim sourced to a paper or the code in this repo). Reviewed adversarially before publication; the claim-to-source ledger lives next to the page in docs/explainer/.

▶ Companion: https://pcdeni.github.io/CaSA/explainer/xor-spread.html, on the doubleACT row-spread we found during calibration: a bit-exact copy deposited into address-XOR sibling rows, the MAJ3 self-pollution it causes, and how we engineer around it (or exploit it).
This repository contains the software side of that demonstration:
- scheduler/: casa_sched.c, a discrete-event scheduler that models a DDR4 channel running charge-sharing PIM primitives, with bus contention tracked explicitly. Used to project per-token throughput under different optimization stacks.
- app/: C++ apps that issue charge-sharing primitives (RowClone, MAJ3 via doubleACT, multi-row broadcast) on real DRAM-Bender silicon and run the matmuls of Microsoft's BitNet b1.58-2B-4T using them. Drop these into a DRAM-Bender checkout and run make.
- python/: orchestrator that patches Hugging Face's transformers library to route specific BitNet projection layers to the FPGA-side server while the rest of the model runs on the CPU.
- calibration/: calibrated MAJ3-perfect open-row tuples for one of our test DIMMs. Format documented; you produce your own for new DIMMs.
- docs/: hardware requirements, calibration protocol, scheduler-projection methodology, and the interactive explainer (source in docs/explainer/).
This builds directly on prior research from the
CMU SAFARI group — RowClone, Ambit,
SiMRA-DRAM, Multi-Row-Init, LISA, pLUTo — and the open-source
DRAM-Bender FPGA platform.
We don't re-host either; you clone them yourself and place the C++
apps from app/ into the right path. See app/README.md.
What we are running today is BitNet b1.58-2B-4T with 1 of its 30 transformer layers' matrix-multiplies executing in DRAM (the seven projections of layer 0). The other 29 layers run in PyTorch on the CPU. This is enough to demonstrate the mechanism end-to-end on a real published model — running all 30 layers in DRAM is straightforward engineering, not a science question, and the scheduler projects what that would look like.
The current measurement is dominated by orchestration overhead (per-call PCIe round-trips, per-column weight writes, Python + subprocess). It is not what the silicon can do — it is what our software currently lets the silicon do.
The cycle-level scheduler casa_sched.c exists to project the
bus-bound silicon ceiling: what happens once orchestration
overhead is engineered out and the only real wall is the DDR bus.
Every number below the "MEASURED" row comes from running the
scheduler with the listed flags; the bus utilization the scheduler
reports is printed alongside.
All projections come from casa_sched.c configured to match what our
silicon actually issues per MAJ3: an activation broadcast plus 5
full-row bus_writes (the activation update — doubleACT(10,2)
broadcasts to all 16 open rows, so the activation slots have to be
overwritten individually), a 3-cycle frac discharge, the MAJ3 itself,
and a result read. The scheduler bookkeeping respects every standard
DDR4 timing parameter (tRCD, tRP, tFAW, tCCD, tBurst, tWR, tREFI,
tRFC) and tracks bus and bank utilization explicitly. Bus-bound
projections quoted below are what our current silicon
implementation would achieve at full bus utilization — i.e. with
all the orchestration overhead engineered out. They are not
hypothetical-future projections.
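
For readers new to the primitive, the logical function a MAJ3 computes across three simultaneously opened rows is a bitwise majority. The sketch below models only that Boolean behaviour in Python, with each row packed into an integer. It does not model charge sharing, DDR4 timing, or the C++ apps in app/, and the names `maj3` and `ROW_BITS` are illustrative assumptions rather than identifiers from this repo.

```python
ROW_BITS = 16  # toy row width (assumption); a real DRAM row is far wider


def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows, each packed into an integer."""
    return (a & b) | (b & c) | (a & c)


ALL_ZEROS = 0
ALL_ONES = (1 << ROW_BITS) - 1

a = 0b1100_1010_1111_0000
b = 0b1010_0110_0101_1010

# A constant third row turns majority into AND or OR (the Ambit construction).
assert maj3(a, b, ALL_ZEROS) == a & b
assert maj3(a, b, ALL_ONES) == a | b
```

Because two of the three operands decide every output bit, the operand rows have to be re-established before each new MAJ3, which is why the activation writes listed above are paid per operation.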
| Regime | Per-token | tok/s | Bus % | Source |
|---|---|---|---|---|
| MEASURED today — full 30 BitNet layers, multi-bank, ~30 s/tok dominated by orchestration overhead per MAJ3 | ~30 s | ~0.034 | ~2 % | this hardware, today |
| Bus-bound ceiling — all 30 layers in DRAM, 1 DIMM, current silicon path | 3.0 s | 0.33 | 97 % | casa_sched --dimms 1 |
| + bank-group-parallel bus | 2.4 s | 0.40 | 96 % | ... --bg-parallel |
| + 4 DIMMs in parallel | 0.61 s | 1.57 | 96 % | --dimms 4 --bg-parallel |
| + on-FPGA popcount accumulator (our HDL, ready) | 0.57 s | 1.75 | 95 % | ... --popcount fpga-accum |
| + in-DRAM popcount (vendor RTL change) | 0.52 s | 1.86 | 95 % | ... --popcount dram |
| + LISA cross-subarray bus (vendor RTL change) | 0.51 s | 1.90 | 95 % | ... --lisa |
| ── beyond here is back-of-envelope (write-side write reduction, e.g. a 3-row MAJ primitive vs today's 11-row setup), not casa_sched output ── | | | | |
Three sentences for the story:
- Today's measurement is ~200× slower per MAJ3 than the silicon's bus-bound ceiling, and the entire gap is software: eliminating per-call PCIe round-trips, batching MAJ3s into SoftMC outer loops, and pre-loading weights into DRAM at startup so the runtime never per-column-writes weights again.
- The bus-bound ceiling on existing DRAM is modest: ~2 tok/s with 4 DIMMs even after every realistic optimization, because updating activation rows takes 5 full-row bus_writes per MAJ3 (the doubleACT broadcast primitive distributes to all 16 open rows, so the 5 activation slots must be individually re-written via wrRow). At that point the bus is genuinely full; adding popcount or LISA helps only marginally because they target the bus_read, not the bus_writes that dominate.
- Beyond ~2 tok/s requires a new DRAM primitive, specifically a way to update the activation rows without 11 full-row writes per MAJ3 (e.g. a 3-row MAJ recipe, or selective subset-broadcast that reaches the 16-row open-set via charge-sharing). Those are DRAM-vendor changes outside the scope of this repo, so we don't present headline tok/s numbers for them, only the bus-traffic argument for why they would matter.
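
To make the "only marginally" point concrete, the step-by-step gain of each projected optimization can be read straight off the tok/s column of the table. The following is a small sketch that does that arithmetic; the figures are copied from the table, while the helper name `step_gains` is an assumption introduced here and is not part of casa_sched.

```python
# (label, tok/s) pairs copied from the projection table in this README.
rows = [
    ("1 DIMM bus-bound ceiling", 0.33),
    ("+ bank-group-parallel bus", 0.40),
    ("+ 4 DIMMs in parallel", 1.57),
    ("+ on-FPGA popcount accumulator", 1.75),
    ("+ in-DRAM popcount", 1.86),
    ("+ LISA cross-subarray bus", 1.90),
]


def step_gains(table):
    """Return (label, multiplicative gain over the previous row) pairs."""
    return [
        (label, rate / prev_rate)
        for (_, prev_rate), (label, rate) in zip(table, table[1:])
    ]


gains = step_gains(rows)
for label, gain in gains:
    print(f"{label:32s} x{gain:.2f}")
```

With the table's figures, adding DIMMs is the only step that approaches a 4x gain; each later step adds roughly 11 % or less, consistent with the bus_writes being the dominant cost.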
Output is bit-exact correct on most cells (~22 139 of 22 144 in
one full BitNet layer = 99.98 %). The 5 stray flips come from cells
that pass the calibrated 1000-pattern stability test but flip on
uncalibrated bit-combinations. Ternary models are robust to this by
construction. See docs/METHODOLOGY.md.
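
The yield figure is a plain mismatch count against a CPU reference. As a hedged illustration of that bookkeeping, the sketch below compares two equally long output sequences and reports the bit-exact fraction. The data is synthetic (5 flipped entries out of 22 144, placed at arbitrary positions), and `mismatch_report` is a hypothetical helper, not a function from this repo.

```python
def mismatch_report(reference, measured):
    """Count positions where the device output differs from the reference."""
    if len(reference) != len(measured):
        raise ValueError("outputs must have the same length")
    mismatches = sum(1 for r, m in zip(reference, measured) if r != m)
    return mismatches, 1.0 - mismatches / len(reference)


# Synthetic stand-in for one layer: 22 144 entries, 5 of them flipped.
reference = [0] * 22_144
measured = list(reference)
for idx in (3, 1_000, 5_000, 12_345, 22_000):  # arbitrary positions
    measured[idx] ^= 1

mismatches, exact_fraction = mismatch_report(reference, measured)
print(f"{mismatches} mismatches, {exact_fraction:.2%} bit-exact")
```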
The point of the work is not to beat a GPU on speed. The point is to demonstrate the mechanism on real silicon and put concrete, scheduler-bounded numbers on what would change with two specific DRAM-vendor improvements (in-DRAM popcount, LISA).
# 1. Clone DRAM-Bender (the FPGA controller + bitstream)
git clone https://github.com/CMU-SAFARI/DRAM-Bender
# … bring up the BCU1525 bitstream per DRAM-Bender's docs.
# 2. Drop our C++ apps into DRAM-Bender's apps tree and build.
cp app/*.cpp app/Makefile DRAM-Bender/sources/apps/DSN_AE_APPS/BitNet/
cp calibration/calib_dimm0.txt DRAM-Bender/sources/apps/DSN_AE_APPS/BitNet/
cd DRAM-Bender/sources/apps/DSN_AE_APPS/BitNet
make
# 3. Calibrate a DIMM (only needed once per chip — see docs/CALIBRATION.md).
# The shipped calib_dimm0.txt is for one of our test DIMMs;
# your hardware may need its own characterization.
# 4. Run an end-to-end smoke test:
./bitnet-real-exe 0 calib_dimm0.txt 1
# Expected: bit-exact match on a small ternary x int8 matrix multiply.
# 5. Hook the long-running PIM server into BitNet inference.
# These two paths point the Python orchestrator at the binaries
# you just built. (The python script reads them via env or CLI.)
export PIM_RUNNER=$PWD/bitnet-proj-exe
export PIM_SERVER=$PWD/bitnet-proj-server
export BITNET_CACHE=~/bitnet_weights # any HF cache dir
cd <repo>/python
pip install transformers==4.52 torch
python3 run_bitnet_pim.py \
--max-tokens 8 --projs all --bank "0,1,2,3" \
--prompt "What is the capital of Hungary? Answer in one sentence."
# Expected output ends with "Budapest" after ~4 minutes.

See python/README.md for argument details and how the
pim_substitute swap works internally.
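
As a rough picture of what routing selected projections elsewhere can look like, here is a minimal pure-Python sketch that rebinds `forward` on named layers so their calls go to an offload function. It is an assumption-laden illustration, not the repo's pim_substitute code: `ToyProjection`, `substitute`, `fake_server`, and the layer names are all hypothetical, and the "server" here simply reuses the CPU result instead of talking to an FPGA.

```python
from typing import Callable, Dict, Iterable, List

Vector = List[int]


class ToyProjection:
    """Stand-in for a projection layer that normally runs on the CPU."""

    def __init__(self, weights: List[List[int]]):
        self.weights = weights  # ternary entries in {-1, 0, +1}

    def forward(self, x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


def substitute(layers: Dict[str, ToyProjection],
               names: Iterable[str],
               offload: Callable[[str, Vector], Vector]) -> None:
    """Rebind forward() on the named layers so calls go to `offload`."""
    for name in names:
        # Bind `name` as a default argument; a bare closure would only
        # ever see the loop's final value.
        layers[name].forward = lambda x, _name=name: offload(_name, x)


layers = {
    "q_proj": ToyProjection([[1, 0, -1], [0, 1, 1]]),
    "k_proj": ToyProjection([[-1, -1, 0], [1, 0, 0]]),
}

# Keep the original bound methods so the fake server can produce answers.
reference = {name: layer.forward for name, layer in layers.items()}
calls: List[str] = []


def fake_server(name: str, x: Vector) -> Vector:
    """Hypothetical offload target; records the call, returns the CPU result."""
    calls.append(name)
    return reference[name](x)


substitute(layers, ["q_proj"], fake_server)

x = [3, -2, 5]
print(layers["q_proj"].forward(x), layers["k_proj"].forward(x), calls)
```

Only the layers named in the call are rerouted; everything else keeps its original CPU path, which mirrors the split described above between offloaded projections and the rest of the model.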
- A Xilinx Alveo U200 / BCU1525 (or compatible) FPGA card flashed with the DRAM-Bender bitstream.
- One or more DDR4 1333 MT/s DIMMs in the FPGA's DIMM slots (you'll need to characterize them).
- A host with PCIe-attached FPGA, the Xilinx XDMA driver loaded, and the SoftMC API available.
Full details in docs/HARDWARE.md.
- Per-cell yield: a small fraction of cells (we measured
~5/22 144 in one full layer = 0.02 %) flip on input bit-patterns
the calibration didn't exhaust. BitNet is robust enough to absorb
this, and most outputs land bit-exact. The per-bank yield is
run-to-run nondeterministic; see docs/METHODOLOGY.md.
- Multi-bank divergence: parallelizing across multiple banks uses multiple calibrated tuples, each with its own flaky-cell pattern. Output stays sensible but is not bit-exact across runs. For deterministic demos, pin to one bank.
- Multi-DIMM scaling is not yet integrated: characterization on the additional DIMMs (1, 2, 3) is in progress at the time of release. The scheduler projections that include 4-DIMM parallelism assume the calibration completes; the pure single-DIMM numbers do not.
- The simulator was written before the silicon implementation.
Its hardcoded charge-sharing latencies were patched against
measured DIMM 0 values; numbers shift by <2% because the
projections are bus-bound, not MAJ3-bound. See
docs/METHODOLOGY.md.
- CMU SAFARI Group (Onur Mutlu et al.) — RowClone, Ambit, SiMRA-DRAM, Multi-Row-Init, LISA, pLUTo. Without their decade of characterization papers and open-source toolkits, none of this is possible on existing silicon.
- Microsoft Research — BitNet b1.58-2B-4T, an open-weight 2.4-billion-parameter ternary language model.
- Hugging Face: transformers library (we test against v4.52).
- The communities behind Manim, Piper TTS, matplotlib, and ffmpeg for the video-production tooling used in the presentation (sources for those live in the private prototype repository, not here).
MIT — see LICENSE. Upstream components remain under their own licenses (DRAM-Bender, SiMRA-DRAM, BitNet, transformers, …); we don't ship them.