Nios II Custom Instruction & DMA Acceleration Project

86x faster arithmetic through optimized custom hardware and a DMA pipeline

This project demonstrates high-performance FPGA design using Custom Instructions, a Modular Scatter-Gather DMA (mSGDMA), and an Avalon Streaming (Avalon-ST) pipeline to achieve up to an 86x speedup over pure software implementations on Nios II.

📚 Documentation

For the detailed implementation journey, design decisions, and a technical deep dive:

📖 Supplemental Docs

Read this in other languages

🇰🇷 한국어 (Korean)

✨ Key Features

1. Custom Instruction Unit

Hardware-accelerated arithmetic unit integrated directly into the Nios II CPU pipeline.

Optimization Highlights:

Target Operation: (A × B) / 400
Traditional Approach: Hardware divider → setup-time violations at 50 MHz
Our Solution: Shift-Add approximation (A × B × 5243) >> 21
- Mathematical accuracy: about 99.998% (5243/2^21 overestimates 1/400 by roughly 0.0023%)
- No timing violations at the 50 MHz target clock
- Massive cycle reduction vs. software division
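
The approximation can be checked against exact integer division with a small software model. The sketch below is not the project's RTL: it assumes unsigned operands and that the full product A × B is scaled by the constant, and the function names are illustrative.

```python
# Software model of the divide-by-400 approximation (illustrative, not the RTL).
MAGIC = 5243      # round(2**21 / 400)
SHIFT = 21


def approx_div400(a: int, b: int) -> int:
    """Approximate (a * b) // 400 with one multiply and one right shift."""
    return (a * b * MAGIC) >> SHIFT


def exact_div400(a: int, b: int) -> int:
    """Reference result using true integer division."""
    return (a * b) // 400


def relative_constant_error() -> float:
    """Relative error of MAGIC / 2**SHIFT with respect to 1/400."""
    return (MAGIC / 2**SHIFT) * 400 - 1.0


if __name__ == "__main__":
    print(approx_div400(1000, 400), exact_div400(1000, 400))
    print(f"constant error: {relative_constant_error():.6%}")
```

Because the constant is slightly larger than 1/400, the model never undershoots the exact quotient. In this model the first mismatch appears at a product of 43,999, where the approximation returns 110 and exact division returns 109, so whether the one-count overshoot matters depends on the operand range the hardware actually sees.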

2. 3-Stage Streaming Pipeline Processor

Parameterizable N-stage pipeline with robust backpressure handling.

Architecture:

Stage 0: Input Capture & Endian Swap
   ↓
Stage 1: Coefficient Multiplication (Input × Coeff)
   ↓
Stage 2: Division Approximation & Final Endian Swap
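
The three stages above can be written as an untimed reference model, which is handy as a golden model when checking simulation results. This is a sketch under assumptions not stated in the README: 32-bit unsigned words, a result truncated to 32 bits, and a coefficient passed in as a plain argument. All function names are hypothetical.

```python
# Behavioral (untimed) model of the three pipeline stages; illustrative only.
MASK32 = 0xFFFFFFFF


def bswap32(word: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((word & MASK32).to_bytes(4, "big"), "little")


def stage0_capture(raw: int) -> int:
    """Stage 0: capture the incoming word and undo the DMA byte order."""
    return bswap32(raw)


def stage1_multiply(value: int, coeff: int) -> int:
    """Stage 1: multiply the input by the coefficient."""
    return value * coeff


def stage2_divide(product: int) -> int:
    """Stage 2: approximate /400, truncate to 32 bits, swap bytes back."""
    return bswap32(((product * 5243) >> 21) & MASK32)


def process_word(raw: int, coeff: int) -> int:
    """Run one word through all three stages in order."""
    return stage2_divide(stage1_multiply(stage0_capture(raw), coeff))
```

For example, a stream word carrying the value 1000 in swapped byte order (`0xE8030000`) with a coefficient of 400 comes back as `0xE8030000`, since 1000 × 400 / 400 = 1000.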

Design Features:

Valid-Ready Handshake: Industry-standard Avalon-ST backpressure
Automatic Byte Swapping: Resolves mSGDMA endianness mismatch
Reusable Template: pipe_template.v for future projects
Timing Closure: Maintains high throughput while meeting 50MHz+ timing

3. Modular Scatter-Gather DMA Integration

Disaggregated mSGDMA architecture with inline computation.

Benefits:

Zero CPU Load: Calculations happen during DMA transfer
Memory Efficiency: Direct memory-to-memory with transformation
Flexible Structure: Separate Dispatcher, Read Master, Write Master

🏗️ System Architecture

🚀 Performance Results

Benchmarks on Nios II @ 50MHz with 1000-element array processing:

Mode	Description	Performance vs. Software
Bypass	DMA copy only	7.59x faster than CPU memcpy
Full Acceleration	DMA + Pipeline computation	86.14x faster than software division

Real Numbers:

Software computation: ~860ms
DMA + Hardware: ~10ms
Result: 86x speedup 🚀

🧪 Verification Environment

Professional hardware verification using Cocotb and pytest.

Features

✅ Python-based testbenches for flexible test scenarios
✅ Automated waveform generation (VCD/FST)
✅ Pytest integration for CI/CD compatibility
✅ Isolated build directories per module
✅ Behavioral models for Altera IP (altsyncram)

Quick Test

cd tests/cocotb
pytest test_runner.py -v

# Output:
# test_runner.py::test_cocotb_modules[my_custom_slave] PASSED    [50%]
# test_runner.py::test_cocotb_modules[stream_processor] PASSED   [100%]
# ==================== 2 passed in 0.81s ====================

View Waveforms

# GTKWave
gtkwave tests/cocotb/sim_build/stream_processor/dump.vcd

# Or use VS Code extension: Surfer

📂 Project Structure

quartus_project/
├── RTL/
│   ├── stream_processor.v     # 3-Stage Pipeline Accelerator
│   ├── pipe_template.v        # Reusable N-Stage Template
│   ├── my_multi_calc.v        # Custom Instruction Unit
│   ├── my_slave.v             # Avalon-MM Slave w/ DPRAM
│   └── top_module.v           # System Integration
│
├── ip/
│   └── dpram.v                # Dual-Port RAM (1KB)
│
├── software/
│   └── cust_inst_app/
│       └── main.c             # Benchmark & Test Application
│
├── tests/cocotb/
│   ├── test_runner.py         # Pytest Runner
│   ├── tb_my_slave.py         # Avalon-MM Testbench
│   ├── tb_stream_processor_avs.py  # Pipeline Testbench
│   └── sim_models/
│       └── altsyncram.v       # Behavioral Model
│
├── custom_inst_qsys.qsys      # Platform Designer System
├── doc/
│   ├── burst_master.md        # Burst Master Documentation
│   ├── history.md             # Detailed Implementation Guide (EN)
│   ├── history_kor.md         # Detailed Implementation Guide (KR)
│   ├── nios.md                # Nios II Implementation Details
│   ├── pll.md                 # PLL Reconfiguration Details
│   ├── README_kor.md          # Korean README
│   └── TODO.md                # Project TODO List
└── README.md                  # Main English README

🛠️ Quick Start

Prerequisites

Intel Quartus Prime (20.1 or later)
Nios II EDS
DE10-Nano Board (or Cyclone V FPGA)
Python 3.8+ with Cocotb (for verification)

Build FPGA Hardware

# Open Quartus project
quartus_sh --tcl_eval project_open custom_inst.qpf

# Compile (or use Quartus GUI: Processing → Start Compilation)
quartus_sh --flow compile custom_inst

Build Software

cd software/cust_inst_app
nios2-app-generate-makefile --bsp-dir ../cust_inst_bsp
make

Program FPGA

# Via Quartus Programmer or command line
quartus_pgm -c 1 -m JTAG -o "p;output_files/custom_inst.sof"

Run Application

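# From the Nios II Command Shell: download the ELF over JTAG and start the CPU
nios2-download -g software/cust_inst_app/cust_inst_app.elf
# Then open the JTAG UART console to see the program output
nios2-terminal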

🔬 Technical Highlights

Challenge 1: Timing Violations

Problem: Hardware divider couldn't meet 50MHz timing.

Solution: Mathematical transformation using fixed-point approximation:

1/400 ≈ 5243/2^21
Error: about 0.0023% (5243/2^21 is slightly larger than 1/400)
Result: Zero timing violations
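
For reference, the constant and its relative error can be worked out by hand. The derivation assumes only the 21-bit shift stated above and rounding to the nearest integer.

```latex
\[
\frac{2^{21}}{400} = \frac{2097152}{400} = 5242.88
\quad\Rightarrow\quad
\operatorname{round}(5242.88) = 5243
\]
\[
\frac{5243}{2^{21}} \cdot 400 - 1
= \frac{2097200 - 2097152}{2097152}
= \frac{48}{2097152}
\approx 2.29 \times 10^{-5} \approx 0.0023\%
\]
```

The error is positive, so the approximation always errs on the high side of the exact quotient.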

Challenge 2: Endianness Mismatch

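Problem: With "First Symbol In High-Order Bits" enabled, the mSGDMA streaming interface delivers each word with its bytes in the reverse order from what the little-endian Nios II stores in memory.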

Solution: Automatic byte-swapping at pipeline input/output:

assign swapped = {original[7:0], original[15:8], 
                  original[23:16], original[31:24]};
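
A software equivalent of this swap is useful when preparing test vectors or checking DMA output on the host. The sketch below mirrors the concatenation with shifts and masks and cross-checks it against the standard `struct` module; the function names are illustrative.

```python
import struct


def swap_bytes32(original: int) -> int:
    """Mirror of the Verilog concatenation for a 32-bit word."""
    return (
        ((original & 0x000000FF) << 24)    # original[7:0]   -> swapped[31:24]
        | ((original & 0x0000FF00) << 8)   # original[15:8]  -> swapped[23:16]
        | ((original & 0x00FF0000) >> 8)   # original[23:16] -> swapped[15:8]
        | ((original & 0xFF000000) >> 24)  # original[31:24] -> swapped[7:0]
    )


def swap_via_struct(original: int) -> int:
    """Same result using struct: pack big-endian, unpack little-endian."""
    return struct.unpack("<I", struct.pack(">I", original))[0]
```

Applying the swap twice returns the original word, which is why the same logic can sit at both the pipeline input and output.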

Challenge 3: Pipeline Backpressure

Problem: Data loss when downstream stalls.

Solution: Cascaded Valid-Ready handshake through all stages:

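// Stage N loads new data only when it is empty or its downstream stage is ready,
// so a stalled stage holds its value instead of being overwritten.
always @(posedge clk) begin
    if (pipe_ready[N] || !pipe_valid[N])
        stage_data[N] <= stage_data[N-1];
end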
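
The same rule can be exercised in a cycle-based software model to confirm that no words are dropped or reordered when the sink stalls. This is a simplified sketch, not the project's testbench: the class and function names are hypothetical, an empty stage is represented by `None`, and ready is assumed to propagate combinationally from the sink back to the source.

```python
from typing import List, Optional, Tuple


class Pipeline:
    """Cycle-based model of an N-stage valid/ready pipeline (illustrative)."""

    def __init__(self, stages: int) -> None:
        # None marks an empty stage (valid = 0); any other value is valid data.
        self.regs: List[Optional[int]] = [None] * stages

    def step(self, src: Optional[int], sink_ready: bool) -> Tuple[bool, Optional[int]]:
        """Advance one clock. Returns (input_accepted, output_word_or_None)."""
        n = len(self.regs)
        # ready[i] is True when stage i may load new data this cycle:
        # either its downstream is ready or the stage itself is empty.
        ready = [False] * (n + 1)
        ready[n] = sink_ready
        for i in reversed(range(n)):
            ready[i] = ready[i + 1] or self.regs[i] is None
        # The sink takes the last stage's word only when it is ready.
        out = self.regs[-1] if sink_ready else None
        # All registers update together from the current state.
        nxt = list(self.regs)
        for i in range(n):
            if ready[i]:
                nxt[i] = src if i == 0 else self.regs[i - 1]
        self.regs = nxt
        return (ready[0] and src is not None), out


def run(words: List[int], stall_pattern: List[bool], stages: int = 3) -> List[int]:
    """Push words through the pipeline while the sink stalls per stall_pattern.

    stall_pattern must contain at least one False, or the sink never drains.
    """
    pipe = Pipeline(stages)
    pending = list(words)
    received: List[int] = []
    cycle = 0
    while len(received) < len(words):
        sink_ready = not stall_pattern[cycle % len(stall_pattern)]
        src = pending[0] if pending else None
        accepted, out = pipe.step(src, sink_ready)
        if accepted:
            pending.pop(0)
        if out is not None:
            received.append(out)
        cycle += 1
    return received
```

Calling `run(list(range(10)), [True, True, False])` stalls the sink two cycles out of every three and still returns the ten words in order. Removing the `ready[i]` check in `step` makes stalled stages get overwritten, which reproduces the data-loss problem described above.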

📖 Learning Resources

If you're new to FPGA or Nios II development, check out:

history.md - Complete design journey with rationale
pipe_template.v - Reusable pipeline template with detailed comments
Cocotb Tests - See tests/cocotb/ for verification examples

🤝 Contributing

Contributions are welcome! Areas of interest:

Additional test cases for edge scenarios
Support for other FPGA boards
Enhanced pipeline configurations
Documentation improvements

📄 License

MIT License - See LICENSE for details

🙏 Acknowledgments

Intel FPGA University Program
Cocotb open-source verification framework
VS Code Surfer waveform viewer
