86x faster arithmetic acceleration through optimized custom hardware and DMA pipeline
This project demonstrates high-performance FPGA design on Nios II using Custom Instructions, a Modular Scatter-Gather DMA (mSGDMA), and an Avalon Streaming (Avalon-ST) pipeline, reaching up to an 86x speedup over a pure software implementation.
For the detailed implementation journey, design decisions, and technical deep dives, see:
- Nios II & DMA Acceleration Guide
- Burst Master Optimization
- Stream Processor Pipeline
- Dynamic PLL Reconfiguration
- Project Roadmap (TODO)
Hardware-accelerated arithmetic unit integrated directly into Nios II CPU pipeline.
Optimization Highlights:
- Target Operation: (A × B) / 400
- Traditional Approach: Hardware divider → setup-time violations at 50MHz
- Our Solution: Shift-Add approximation (A × 5243) >> 21
- Mathematical accuracy: 99.998% (the constant 5243/2^21 deviates from 1/400 by about 0.0023%)
- Zero timing violations even at high frequency
- Massive cycle reduction vs. software division
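The multiply-and-shift replacement for division by 400 can be checked in software before committing it to RTL. The sketch below is a minimal Python model, assuming unsigned integer operands; the names `approx_div400` and `first_mismatch` are hypothetical and do not come from the project sources.

```python
DIVISOR = 400
MULTIPLIER = 5243  # constant from the README: round(2**21 / 400)
SHIFT = 21


def approx_div400(x: int) -> int:
    """Approximate x // 400 with one multiply and one right shift."""
    return (x * MULTIPLIER) >> SHIFT


def first_mismatch(limit: int):
    """Return the smallest x < limit where the approximation differs from x // 400."""
    for x in range(limit):
        if approx_div400(x) != x // DIVISOR:
            return x
    return None


if __name__ == "__main__":
    print("first mismatch below 100000:", first_mismatch(100_000))
```

In this model the two quotients agree for every input below 43,999. From there the rounded-up constant starts to overshoot by one just below multiples of 400, so whether the approximation is acceptable depends on the operand range.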
Parameterizable N-stage pipeline with robust backpressure handling.
Architecture:
Stage 0: Input Capture & Endian Swap
↓
Stage 1: Coefficient Multiplication (Input × Coeff)
↓
Stage 2: Division Approximation & Final Endian Swap
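The three stages can be mirrored by a small software reference model, which is handy as a golden model in a testbench. This is a sketch under assumptions, not the project's RTL: it assumes 32-bit words, a coefficient passed in as a parameter, and a result truncated to 32 bits. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def bswap32(word: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((word & MASK32).to_bytes(4, "big"), "little")


def stage0_capture(word: int) -> int:
    """Stage 0: capture the input word and undo the DMA byte order."""
    return bswap32(word)


def stage1_multiply(value: int, coeff: int) -> int:
    """Stage 1: multiply the input by the coefficient."""
    return value * coeff


def stage2_divide(product: int) -> int:
    """Stage 2: approximate product // 400, then swap bytes back for the DMA."""
    quotient = (product * 5243) >> 21
    return bswap32(quotient)  # bswap32 truncates to 32 bits (assumption)


def stream_processor_model(words, coeff):
    """Apply all three stages to each word of a stream."""
    return [stage2_divide(stage1_multiply(stage0_capture(w), coeff)) for w in words]
```

For example, `stream_processor_model([bswap32(800)], 200)` models one word with the value 800 arriving in swapped byte order and a coefficient of 200.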
Design Features:
- Valid-Ready Handshake: Industry-standard Avalon-ST backpressure
- Automatic Byte Swapping: Resolves mSGDMA endianness mismatch
- Reusable Template: pipe_template.v for future projects
- Timing Closure: Maintains high throughput while meeting 50MHz+ timing
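The valid-ready behaviour listed above can be explored with a cycle-level software model before writing any RTL. The sketch below is an illustrative Python simulation, not the project's `pipe_template.v`: it assumes each stage is a single register that loads when it is empty or when its downstream neighbour is ready, and it uses a randomly stalling sink. The function name and parameters are hypothetical.

```python
import random


def simulate_pipeline(items, stages=3, stall_prob=0.5, seed=0):
    """Push items through `stages` registers into a sink that stalls at random.

    Returns (received_items, cycle_count). stall_prob must be below 1.0,
    otherwise the sink never accepts data and the loop does not terminate.
    """
    items = list(items)
    rng = random.Random(seed)
    valid = [False] * stages
    data = [None] * stages
    pending = list(items)
    received = []
    cycles = 0
    while len(received) < len(items):
        sink_ready = rng.random() >= stall_prob
        # Combinational ready chain, resolved from the sink back to the source.
        ready = [False] * stages
        downstream = sink_ready
        for i in reversed(range(stages)):
            ready[i] = downstream or not valid[i]
            downstream = ready[i]
        # The sink takes the last stage's word only when valid and ready are both high.
        if valid[-1] and sink_ready:
            received.append(data[-1])
        # Registered update: a stage loads new data only when it is ready.
        new_valid, new_data = valid[:], data[:]
        for i in range(stages):
            if ready[i]:
                if i == 0:
                    new_valid[0] = bool(pending)
                    new_data[0] = pending.pop(0) if pending else None
                else:
                    new_valid[i], new_data[i] = valid[i - 1], data[i - 1]
        valid, data = new_valid, new_data
        cycles += 1
    return received, cycles
```

With `stall_prob=0.0` the model delivers one word per cycle after the pipeline fills. With stalls enabled, every word should still arrive exactly once and in order, which is the property the handshake is meant to guarantee.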
Disaggregated mSGDMA architecture with inline computation.
Benefits:
- Zero CPU Load: Calculations happen during DMA transfer
- Memory Efficiency: Direct memory-to-memory with transformation
- Flexible Structure: Separate Dispatcher, Read Master, Write Master
Benchmarks on Nios II @ 50MHz with 1000-element array processing:
| Mode | Description | Performance vs. Software |
|---|---|---|
| Bypass | DMA copy only | 7.59x faster than CPU memcpy |
| Full Acceleration | DMA + Pipeline computation | 86.14x faster than software division |
Real Numbers:
- Software computation: ~860ms
- DMA + Hardware: ~10ms
- Result: 86x speedup
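As a quick sanity check, the headline figure follows directly from the two timings. The sketch below takes the rounded values (~860ms and ~10ms) at face value and also converts them to clock cycles per element at 50MHz; the helper names are hypothetical, and the measured 86.14x in the table comes from unrounded timings.

```python
CLOCK_HZ = 50_000_000  # Nios II clock from the benchmark setup
ELEMENTS = 1000        # array length used in the benchmark


def speedup(baseline_ms: float, accelerated_ms: float) -> float:
    """Ratio of baseline time to accelerated time."""
    return baseline_ms / accelerated_ms


def cycles_per_element(elapsed_ms: float, clock_hz: int = CLOCK_HZ,
                       elements: int = ELEMENTS) -> float:
    """Average clock cycles spent per array element."""
    return elapsed_ms * clock_hz / (1000 * elements)


if __name__ == "__main__":
    print("speedup:", speedup(860, 10))
    print("software cycles/element:", cycles_per_element(860))
    print("hardware cycles/element:", cycles_per_element(10))
```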
Professional hardware verification using Cocotb and pytest.
- Python-based testbenches for flexible test scenarios
- Automated waveform generation (VCD/FST)
- Pytest integration for CI/CD compatibility
- Isolated build directories per module
- Behavioral models for Altera IP (altsyncram)
cd tests/cocotb
pytest test_runner.py -v
# Output:
# test_runner.py::test_cocotb_modules[my_custom_slave] PASSED [50%]
# test_runner.py::test_cocotb_modules[stream_processor] PASSED [100%]
# ==================== 2 passed in 0.81s ====================

# GTKWave
gtkwave tests/cocotb/sim_build/stream_processor/dump.vcd
# Or use VS Code extension: Surfer

quartus_project/
├── RTL/
│   ├── stream_processor.v           # 3-Stage Pipeline Accelerator
│   ├── pipe_template.v              # Reusable N-Stage Template
│   ├── my_multi_calc.v              # Custom Instruction Unit
│   ├── my_slave.v                   # Avalon-MM Slave w/ DPRAM
│   └── top_module.v                 # System Integration
│
├── ip/
│   └── dpram.v                      # Dual-Port RAM (1KB)
│
├── software/
│   └── cust_inst_app/
│       └── main.c                   # Benchmark & Test Application
│
├── tests/cocotb/
│   ├── test_runner.py               # Pytest Runner
│   ├── tb_my_slave.py               # Avalon-MM Testbench
│   ├── tb_stream_processor_avs.py   # Pipeline Testbench
│   └── sim_models/
│       └── altsyncram.v             # Behavioral Model
│
├── custom_inst_qsys.qsys            # Platform Designer System
├── doc/
│   ├── burst_master.md              # Burst Master Documentation
│   ├── history.md                   # Detailed Implementation Guide (EN)
│   ├── history_kor.md               # Detailed Implementation Guide (KR)
│   ├── nios.md                      # Nios II Implementation Details
│   ├── pll.md                       # PLL Reconfiguration Details
│   ├── README_kor.md                # Korean README
│   └── TODO.md                      # Project TODO List
└── README.md                        # Main English README
- Intel Quartus Prime (20.1 or later)
- Nios II EDS
- DE10-Nano Board (or Cyclone V FPGA)
- Python 3.8+ with Cocotb (for verification)
# Open Quartus project
quartus_sh --tcl_eval project_open custom_inst.qpf
# Compile (or use Quartus GUI: Processing → Start Compilation)
quartus_sh --flow compile custom_inst

cd software/cust_inst_app
nios2-app-generate-makefile --bsp-dir ../cust_inst_bsp
make

# Via Quartus Programmer or command line
quartus_pgm -c 1 -m JTAG -o "p;output_files/custom_inst.sof"

nios2-terminal # Connect to UART
# Then from Nios II shell:
./software/cust_inst_app/cust_inst_app.elf

Problem: Hardware divider couldn't meet 50MHz timing.
Solution: Mathematical transformation using fixed-point approximation:
1/400 ≈ 5243/2^21
Error: about 0.0023% (5243 × 400 = 2,097,200 versus 2^21 = 2,097,152)
Result: Zero timing violations
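The choice of constant and shift width in the fixed-point approximation above can be reproduced with a few lines of Python. This is an illustrative sketch, assuming the multiplier is chosen by rounding 2^shift / divisor to the nearest integer; the helper name `reciprocal_constant` is hypothetical.

```python
def reciprocal_constant(divisor: int, shift: int):
    """Return (multiplier, relative_error) for x / divisor ~ (x * multiplier) >> shift."""
    multiplier = round((1 << shift) / divisor)
    relative_error = abs(multiplier * divisor / (1 << shift) - 1.0)
    return multiplier, relative_error


if __name__ == "__main__":
    # Compare a few shift widths for a divisor of 400.
    for shift in (16, 18, 21, 24):
        multiplier, error = reciprocal_constant(400, shift)
        print(f"shift={shift:2d}  multiplier={multiplier:6d}  error={error:.4%}")
```

For a divisor of 400 and a shift of 21 this yields the multiplier 5243 used in the design. Wider shifts reduce the error of the constant but need a wider multiplier, which is the trade-off to weigh against timing.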
Problem: mSGDMA "First Symbol In High-Order Bits" reversed byte order.
Solution: Automatic byte-swapping at pipeline input/output:
assign swapped = {original[7:0], original[15:8],
                  original[23:16], original[31:24]};

Problem: Data loss when downstream stalls.
Solution: Cascaded Valid-Ready handshake through all stages:
always @(posedge clk) begin
    if (pipe_ready[N] || !pipe_valid[N])
        stage_data[N] <= stage_data[N-1];
end

If you're new to FPGA or Nios II development, check out:
- history.md - Complete design journey with rationale
- pipe_template.v - Reusable pipeline template with detailed comments
- Cocotb Tests - See tests/cocotb/ for verification examples
Contributions are welcome! Areas of interest:
- Additional test cases for edge scenarios
- Support for other FPGA boards
- Enhanced pipeline configurations
- Documentation improvements
MIT License - See LICENSE for details
- Intel FPGA University Program
- Cocotb open-source verification framework
- VS Code Surfer waveform viewer



