A complete guide to Nios II hardware acceleration: from software implementation to DMA & SIMD optimization. Includes detailed documentation and a Cocotb verification environment.

bccha/quartus_project


Nios II Custom Instruction & DMA Acceleration Project

License: MIT FPGA Nios II

86x faster arithmetic through optimized custom hardware and a DMA pipeline

This project demonstrates high-performance FPGA design using Custom Instructions, a Modular Scatter-Gather DMA (mSGDMA), and an Avalon Streaming (Avalon-ST) pipeline, reaching up to 86x speedup over a pure software implementation on Nios II.

📚 Documentation

For the detailed implementation journey, design decisions, and a technical deep dive:

📖 Supplemental Docs

Read this in other languages


✨ Key Features

1. Custom Instruction Unit

Hardware-accelerated arithmetic unit integrated directly into the Nios II CPU pipeline.

Optimization Highlights:

  • Target Operation: (A × B) / 400
  • Traditional Approach: Hardware divider → Setup Time Violations at 50MHz
  • Our Solution: Shift-Add approximation ((A × B) × 5243) >> 21
    • Mathematical accuracy: 99.998% (the constant 5243/2^21 overestimates 1/400 by about 0.0023%)
    • Zero timing violations at the 50MHz system clock
    • Massive cycle reduction vs. software division
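
As a rough illustration of where the constant comes from, the sketch below rounds 2^21 / 400 to the nearest integer and measures how far the resulting ratio is from the exact reciprocal. It is a software model under stated assumptions, not the project's RTL: operands are treated as unsigned integers, and the name `approx_div400` is hypothetical.

```python
SHIFT = 21
DIVISOR = 400

# Nearest integer to 2**21 / 400 (= 5242.88)
MAGIC = round((1 << SHIFT) / DIVISOR)

# Relative error of MAGIC / 2**21 against the exact reciprocal 1/400
rel_error = (MAGIC * DIVISOR - (1 << SHIFT)) / (1 << SHIFT)


def approx_div400(product: int) -> int:
    """Approximate product // 400 with one multiply and one right shift.

    Hypothetical helper for illustration; assumes an unsigned product.
    """
    return (product * MAGIC) >> SHIFT


if __name__ == "__main__":
    print(MAGIC)                      # 5243
    print(f"{rel_error * 100:.4f}%")  # 0.0023%
    print(approx_div400(1000 * 400))  # 1000
```

Because the constant is rounded up, the approximation can only overshoot the exact quotient, never undershoot it.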

2. 3-Stage Streaming Pipeline Processor

Parameterizable N-stage pipeline with robust backpressure handling.

Architecture:

Figure: Pipeline Architecture

Stage 0: Input Capture & Endian Swap
   ↓
Stage 1: Coefficient Multiplication (Input × Coeff)
   ↓
Stage 2: Division Approximation & Final Endian Swap

Design Features:

  • Valid-Ready Handshake: Industry-standard Avalon-ST backpressure
  • Automatic Byte Swapping: Resolves mSGDMA endianness mismatch
  • Reusable Template: pipe_template.v for future projects
  • Timing Closure: Maintains high throughput while meeting 50MHz+ timing
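
A small software reference model can make the three stages concrete, for example as a golden model in a testbench. The sketch below is an assumption-laden illustration rather than a transcription of stream_processor.v: it treats data as unsigned 32-bit words, truncates the result to 32 bits, and the names `swap32`, `stream_processor_model`, and the `coeff` parameter are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MAGIC, SHIFT = 5243, 21  # 1/400 approximation quoted in this README


def swap32(word: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((word & MASK32).to_bytes(4, "little"), "big")


def stream_processor_model(word_in: int, coeff: int) -> int:
    """Untimed reference for one word passing through all three stages."""
    s0 = swap32(word_in)                   # Stage 0: capture + endian swap
    s1 = s0 * coeff                        # Stage 1: coefficient multiply
    s2 = ((s1 * MAGIC) >> SHIFT) & MASK32  # Stage 2: approximate divide by 400
    return swap32(s2)                      # final endian swap back to bus order


if __name__ == "__main__":
    # A byte-reversed 1000 with a coefficient of 400 should come back
    # as a byte-reversed 1000.
    result = stream_processor_model(swap32(1000), 400)
    print(hex(result))  # 0xe8030000
```

The model ignores timing entirely; it only states what value should eventually appear at the output for a given input word.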

Figure: DPRAM Architecture

3. Modular Scatter-Gather DMA Integration

Disaggregated mSGDMA architecture with inline computation.

Benefits:

  • Zero CPU Load: Calculations happen during DMA transfer
  • Memory Efficiency: Direct memory-to-memory with transformation
  • Flexible Structure: Separate Dispatcher, Read Master, Write Master

🏗️ System Architecture

Figure: System Architecture

🚀 Performance Results

Figure: Performance Comparison

Benchmarks on Nios II @ 50MHz with 1000-element array processing:

| Mode | Description | Performance vs. Software |
|---|---|---|
| Bypass | DMA copy only | 7.59x faster than CPU memcpy |
| Full Acceleration | DMA + Pipeline computation | 86.14x faster than software division |

Real Numbers:

  • Software computation: ~860ms
  • DMA + Hardware: ~10ms
  • Result: 86x speedup 🚀
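
To put the rounded timings in terms of clock cycles, the sketch below converts them at 50MHz. It assumes each timing covers a single pass over the 1000 elements, which this README does not state (the benchmark may repeat the loop), so the per-element figures are indicative only. The function name is hypothetical.

```python
CLOCK_HZ = 50_000_000
ELEMENTS = 1000


def cycles_per_element(milliseconds: int) -> int:
    """Clock cycles spent per element, assuming one pass over the array."""
    total_cycles = milliseconds * (CLOCK_HZ // 1000)
    return total_cycles // ELEMENTS


software = cycles_per_element(860)  # rounded software timing
hardware = cycles_per_element(10)   # rounded DMA + hardware timing

if __name__ == "__main__":
    print(software, hardware, software / hardware)  # 43000 500 86.0
```

The ratio of the rounded timings is 86.0; the 86.14x in the table comes from the unrounded measurements.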

🧪 Verification Environment

Professional hardware verification using Cocotb and pytest.

Features

  • ✅ Python-based testbenches for flexible test scenarios
  • ✅ Automated waveform generation (VCD/FST)
  • ✅ Pytest integration for CI/CD compatibility
  • ✅ Isolated build directories per module
  • ✅ Behavioral models for Altera IP (altsyncram)

Quick Test

cd tests/cocotb
pytest test_runner.py -v

# Output:
# test_runner.py::test_cocotb_modules[my_custom_slave] PASSED    [50%]
# test_runner.py::test_cocotb_modules[stream_processor] PASSED   [100%]
# ==================== 2 passed in 0.81s ====================

View Waveforms

# GTKWave
gtkwave tests/cocotb/sim_build/stream_processor/dump.vcd

# Or use VS Code extension: Surfer

📂 Project Structure

quartus_project/
├── RTL/
│   ├── stream_processor.v     # 3-Stage Pipeline Accelerator
│   ├── pipe_template.v        # Reusable N-Stage Template
│   ├── my_multi_calc.v        # Custom Instruction Unit
│   ├── my_slave.v             # Avalon-MM Slave w/ DPRAM
│   └── top_module.v           # System Integration
│
├── ip/
│   └── dpram.v                # Dual-Port RAM (1KB)
│
├── software/
│   └── cust_inst_app/
│       └── main.c             # Benchmark & Test Application
│
├── tests/cocotb/
│   ├── test_runner.py         # Pytest Runner
│   ├── tb_my_slave.py         # Avalon-MM Testbench
│   ├── tb_stream_processor_avs.py  # Pipeline Testbench
│   └── sim_models/
│       └── altsyncram.v       # Behavioral Model
│
├── custom_inst_qsys.qsys      # Platform Designer System
├── doc/
│   ├── burst_master.md        # Burst Master Documentation
│   ├── history.md             # Detailed Implementation Guide (EN)
│   ├── history_kor.md         # Detailed Implementation Guide (KR)
│   ├── nios.md                # Nios II Implementation Details
│   ├── pll.md                 # PLL Reconfiguration Details
│   ├── README_kor.md          # Korean README
│   └── TODO.md                # Project TODO List
└── README.md                  # Main English README

🛠️ Quick Start

Prerequisites

  • Intel Quartus Prime (20.1 or later)
  • Nios II EDS
  • DE10-Nano Board (or Cyclone V FPGA)
  • Python 3.8+ with Cocotb (for verification)

Build FPGA Hardware

# Open Quartus project
quartus_sh --tcl_eval project_open custom_inst.qpf

# Compile (or use Quartus GUI: Processing → Start Compilation)
quartus_sh --flow compile custom_inst

Build Software

cd software/cust_inst_app
nios2-app-generate-makefile --bsp-dir ../cust_inst_bsp
make

Program FPGA

# Via Quartus Programmer or command line
quartus_pgm -c 1 -m JTAG -o "p;output_files/custom_inst.sof"

Run Application

# Download the ELF to the board over JTAG and start it
nios2-download -g software/cust_inst_app/cust_inst_app.elf
# Open the JTAG UART console to see the program output
nios2-terminal

🔬 Technical Highlights

Challenge 1: Timing Violations

Problem: Hardware divider couldn't meet 50MHz timing.

Solution: Mathematical transformation using fixed-point approximation:

1/400 ≈ 5243/2^21
Error: about 0.0023% (5243 × 400 exceeds 2^21 by 48)
Result: Zero timing violations
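
The small error in the constant means the shifted result eventually differs from exact integer division. The sketch below sweeps a range of products to find where that first happens. It assumes unsigned operands and that the hardware truncates exactly as a right shift does; the helper names are hypothetical and the input range is chosen only for illustration.

```python
MAGIC, SHIFT = 5243, 21


def approx_div400(product: int) -> int:
    """Multiply-and-shift stand-in for product // 400."""
    return (product * MAGIC) >> SHIFT


def first_mismatch(limit: int):
    """Smallest product below `limit` where the approximation is not exact."""
    for x in range(limit):
        if approx_div400(x) != x // 400:
            return x
    return None


def max_abs_error(limit: int) -> int:
    """Largest absolute difference from exact division below `limit`."""
    return max(abs(approx_div400(x) - x // 400) for x in range(limit))


if __name__ == "__main__":
    print(first_mismatch(100_000))  # 43999
    print(max_abs_error(100_000))   # 1
```

Under these assumptions the result is bit-exact for products below 43999 and is at most one count high across the swept range, which may or may not matter depending on the application's tolerance.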

Challenge 2: Endianness Mismatch

Problem: mSGDMA "First Symbol In High-Order Bits" reversed byte order.

Solution: Automatic byte-swapping at pipeline input/output:

assign swapped = {original[7:0], original[15:8], 
                  original[23:16], original[31:24]};
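
For use in a Python testbench or reference model, the same byte reversal can be written with masks and shifts. This is a minimal sketch, and `swap32` is a hypothetical helper name rather than something taken from the repository.

```python
def swap32(word: int) -> int:
    """Reverse the four bytes of a 32-bit word, mirroring the Verilog concatenation."""
    return (
        ((word & 0x000000FF) << 24)
        | ((word & 0x0000FF00) << 8)
        | ((word & 0x00FF0000) >> 8)
        | ((word & 0xFF000000) >> 24)
    )


if __name__ == "__main__":
    print(hex(swap32(0x11223344)))          # 0x44332211
    print(hex(swap32(1000)))                # 0xe8030000
    print(swap32(swap32(1000)) == 1000)     # True
```

The second line shows why the swap matters: without it, the value 1000 would reach the multiplier as 0xE8030000. Applying the swap twice restores the original word, which is why the same operation works at both the pipeline input and output.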

Challenge 3: Pipeline Backpressure

Problem: Data loss when downstream stalls.

Solution: Cascaded Valid-Ready handshake through all stages:

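// Stage N loads new data only when it is empty or its consumer is ready;
// otherwise it holds its value, so nothing is overwritten during a stall.
always @(posedge clk) begin
    if (pipe_ready[N] || !pipe_valid[N])
        stage_data[N] <= stage_data[N-1];
end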
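
The effect of this rule can be checked with a cycle-based software model. The sketch below is an illustration under assumptions, not a port of pipe_template.v: it models each stage as a valid flag plus a data register, derives the ready chain from the sink backwards, and stalls the sink at random. All names (`simulate`, `stall_prob`, and so on) are hypothetical.

```python
import random


def simulate(words, stages=3, stall_prob=0.5, seed=0):
    """Push words through a valid/ready pipeline with a randomly stalling sink."""
    rng = random.Random(seed)
    valid = [False] * stages
    data = [None] * stages
    pending = list(words)
    received = []

    while pending or any(valid):
        sink_ready = rng.random() >= stall_prob

        # ready[i]: stage i may load this cycle, i.e. it is empty or its
        # consumer is ready. Computed from the sink backwards.
        ready = [False] * stages
        downstream_ready = sink_ready
        for i in reversed(range(stages)):
            ready[i] = downstream_ready or not valid[i]
            downstream_ready = ready[i]

        # The sink takes the last stage's word when it is valid and accepted.
        if valid[-1] and sink_ready:
            received.append(data[-1])

        # Clock edge: update last stage first so each stage sees its
        # predecessor's old value, like non-blocking assignments.
        for i in reversed(range(stages)):
            if ready[i]:
                if i == 0:
                    valid[0] = bool(pending)
                    data[0] = pending.pop(0) if pending else None
                else:
                    valid[i], data[i] = valid[i - 1], data[i - 1]
    return received


if __name__ == "__main__":
    sent = list(range(20))
    print(simulate(sent) == sent)  # True
```

With `stall_prob` below 1.0 the loop terminates and every word arrives once and in order. A sink that never becomes ready would stall the model forever, as it would the hardware.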

📖 Learning Resources

If you're new to FPGA or Nios II development, check out:

  1. history.md - Complete design journey with rationale
  2. pipe_template.v - Reusable pipeline template with detailed comments
  3. Cocotb Tests - See tests/cocotb/ for verification examples

🀝 Contributing

Contributions are welcome! Areas of interest:

  • Additional test cases for edge scenarios
  • Support for other FPGA boards
  • Enhanced pipeline configurations
  • Documentation improvements

📄 License

MIT License - See LICENSE for details


🙏 Acknowledgments

  • Intel FPGA University Program
  • Cocotb open-source verification framework
  • VS Code Surfer waveform viewer
