Hardware accelerator for multiplying a 4×8 matrix with an 8×4 matrix to produce a 4×4 output matrix.
The project includes SystemVerilog RTL, MATLAB/Python reference simulation & verification, and Yosys-based ASIC synthesis with standard-cell library comparisons.
- SystemVerilog RTL: modular architecture (control, input register file, MAC units, RAM interface, output logic)
- Verification:
  - RTL simulation (Vivado / EDA Playground)
  - Testbench with automated checking against MATLAB-generated expected results
  - Additional Python/MATLAB simulation utilities (reference model + data generation)
- ASIC synthesis (Yosys):
  - Multi-run synthesis scripts
  - Gate-level netlists for different optimization modes
  - Area/cell breakdown and reports
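The reference-model and automated-checking items above can be pictured with a small Python sketch. It is only an illustration: the function names `matmul_4x8_8x4` and `check_against_expected`, the plain-list matrix representation, and the operand values are assumptions, not taken from the project's actual utilities.

```python
def matmul_4x8_8x4(a, b):
    """Multiply a 4x8 matrix by an 8x4 matrix; returns a 4x4 matrix as a list of rows."""
    assert len(a) == 4 and all(len(row) == 8 for row in a)
    assert len(b) == 8 and all(len(row) == 4 for row in b)
    return [[sum(a[i][k] * b[k][j] for k in range(8)) for j in range(4)]
            for i in range(4)]


def check_against_expected(actual, expected):
    """Return (row, col, actual, expected) for every mismatch; an empty list means pass."""
    return [(i, j, actual[i][j], expected[i][j])
            for i in range(4) for j in range(4)
            if actual[i][j] != expected[i][j]]


# Hypothetical operands, chosen only so the result is easy to verify by hand.
A = [[1] * 8 for _ in range(4)]                      # 4x8, all ones
B = [[j + 1 for j in range(4)] for _ in range(8)]    # 8x4, every row is 1, 2, 3, 4
C = matmul_4x8_8x4(A, B)                             # every row is 8, 16, 24, 32
```

A testbench flow would compare the values read back from the simulated design against such a model and report the mismatch list.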
The accelerator supports three operations:
Reset:
- Set `rst = 1`
- Reset stops ongoing work but does not clear RAM
Multiply and store:
- Ensure the accelerator is idle (or reset)
- Set `ram_slot` (0–31) to select the destination slot
- Assert `start = 1`
- Provide the input matrix column-wise on `in_data` during the `start` cycle and the next 31 cycles
- Computation starts automatically; `finish = 1` indicates completion
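To illustrate the column-wise input order, the sketch below flattens a matrix into the 32-value sequence that a testbench would drive on `in_data`, one value per cycle. The helper name `column_major_stream` is hypothetical, and the 4×8 shape of the streamed operand is an assumption; the description above only fixes the count of 32 values.

```python
def column_major_stream(matrix):
    """Flatten a matrix (list of rows) column by column into a single list."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


# Hypothetical 4x8 operand whose entries encode their own position (row * 8 + col).
m = [[r * 8 + c for c in range(8)] for r in range(4)]
stream = column_major_stream(m)

# stream[0] would be presented during the start cycle,
# stream[1] through stream[31] during the following 31 cycles.
```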
Read:
- Ensure the accelerator is idle
- Set `ram_slot` to select the stored matrix
- Set `start = 0`
- Assert `read = 1`
Readout details:
- Output values are returned column-wise
- Each value takes 2 cycles:
  - LSB half first, then MSB half
- Full readout duration: 32 cycles (16 values × 2 cycles)
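The readout format can be sketched in Python as follows: each 18-bit value is split into two 9-bit halves (LSB half first), and 32 such transfers are reassembled into a 4×4 matrix in column-wise order. The names `split_value` and `reassemble` are hypothetical, and the values are treated as raw unsigned bit patterns; how the design interprets the sign is not covered here.

```python
HALF_BITS = 9
HALF_MASK = (1 << HALF_BITS) - 1  # 0x1FF


def split_value(value):
    """Split an 18-bit value into (lsb_half, msb_half), i.e. in transfer order."""
    assert 0 <= value < (1 << 18)
    return value & HALF_MASK, (value >> HALF_BITS) & HALF_MASK


def reassemble(transfers):
    """Rebuild a 4x4 matrix (list of rows) from 32 nine-bit transfers."""
    assert len(transfers) == 32
    # Pair up consecutive transfers: even index is the LSB half, odd index the MSB half.
    values = [transfers[2 * i] | (transfers[2 * i + 1] << HALF_BITS)
              for i in range(16)]
    # Values arrive column-wise, so value index = col * 4 + row.
    return [[values[c * 4 + r] for c in range(4)] for r in range(4)]


# Round trip on a hypothetical result matrix.
result = [[1000 * r + c for c in range(4)] for r in range(4)]
transfers = []
for c in range(4):
    for r in range(4):
        transfers.extend(split_value(result[r][c]))
recovered = reassemble(transfers)
```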
Key modules:
- `top_file.sv`: top-level integration
- `calc_asmd.sv`: control unit (ASMD)
- `ireg.sv`: input register file (32 × 8-bit, 1W + 4R)
- `mul.sv`: multipliers
- `mac_unit.sv`: accumulation stage
- `ram_mux.sv`: single-port RAM write multiplexing
- `output_logic.sv`: splits 18-bit values into 2×9-bit transfers
- `rom.sv`: coefficient/constant storage
- `RM_IHPSG13_1P_512x32_c2_bm_bist.v`: SRAM macro
- `RM_IHPSG13_1P_core_behavioral_bm_bist.v`: behavioral SRAM model
Cycle breakdown (per multiply):
- Load input: 32 cycles
- Compute + accumulate: 32 cycles
- Write to RAM: 16 cycles
- Optional readout: 32 cycles
Total (compute only): 80 cycles
Total (compute + read): 112 cycles
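The totals follow directly from summing the phases. The sketch below reproduces that arithmetic and converts a cycle count to time for a given clock; the function names are hypothetical, and the 100 MHz figure in the usage note is an arbitrary example, not a clock frequency reported for this design.

```python
LOAD_CYCLES = 32      # load input
COMPUTE_CYCLES = 32   # compute + accumulate
WRITE_CYCLES = 16     # write to RAM
READ_CYCLES = 32      # optional readout


def total_cycles(with_readout=False):
    """Cycle count for one multiply, optionally including the readout phase."""
    total = LOAD_CYCLES + COMPUTE_CYCLES + WRITE_CYCLES
    return total + READ_CYCLES if with_readout else total


def latency_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock frequency in MHz."""
    return cycles / clock_mhz
```

For instance, `latency_us(total_cycles(), 100)` gives the compute-only latency at an assumed 100 MHz clock.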
Synthesis was evaluated across:
- optimization modes: speed / balanced / area
- standard-cell libraries representing slow / typical / fast corners
The design is memory-dominated (the SRAM macro contributes the majority of the total area), so library choice has a stronger impact than logic optimization flags.
Artifacts typically included:
- `multirun.ys`
- `netlist_speed.v`, `netlist_balanced.v`, `netlist_area.v`
- synthesis figures/reports under `figures/` and/or `synth/`
MIT — see LICENSE.