Date: 2026-02-06
Project Path: c:/Workspace/quartus_project
This document records the development of a Nios II-based SoC on an Intel FPGA. The project focuses on two key hardware components: a memory-mapped Custom Slave interface and a high-performance Custom Instruction unit for arithmetic acceleration. It is more than a simple changelog: it is an in-depth technical record of why each part was designed the way it was.
In the process of integrating a Dual-Port RAM (DPRAM) inside the slave module, we faced a common Verilog error: "Output port must be connected to a structural net expression".
This error occurred because the readdata output port was declared as a reg while also being connected directly to the output of the internally instantiated dpram module. In Verilog, the output port of a module instance must drive a net (wire), not a variable (reg).
To solve this problem, we changed readdata to a wire type so it could be directly connected. Additionally, the Avalon-MM protocol requires explicit read latency management. Since our BlockRAM read takes 1 clock cycle, we implemented a synchronous readdatavalid signal that becomes 1 exactly 1 cycle after the read request.
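The one-cycle read latency can be illustrated with a small software model. The following C sketch is not part of the project sources: slave_model, slave_clock, and the depth of 256 words are hypothetical names and values chosen for illustration, and the model assumes the RAM registers its output on every clock edge.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLAVE_DEPTH 256u  /* assumed depth, for illustration only */

/* Hypothetical software model of the slave's registered outputs. */
typedef struct {
    uint32_t mem[SLAVE_DEPTH];
    uint32_t readdata;       /* models the RAM's registered q output */
    bool     readdatavalid;  /* models the flop: readdatavalid <= read */
} slave_model;

/* Advance the model by one rising clock edge.
   Both outputs are captured from this edge's inputs, so the master
   sees them during the cycle that follows the read request. */
static void slave_clock(slave_model *s, bool read, uint32_t address)
{
    s->readdata = s->mem[address % SLAVE_DEPTH];
    s->readdatavalid = read;
}
```

Calling slave_clock once with read asserted leaves readdatavalid set together with the addressed word; a following call with read deasserted clears it again, which is the fixed one-cycle latency the master relies on.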
Implementation Code (RTL/my_slave.v):
module my_custom_slave (
// ... ports ...
output wire [31:0] readdata, // Changed from reg to wire
output reg readdatavalid // Added for Avalon-MM latency processing
);
// Direct connection to DPRAM instance
dpram dpram_inst (
.clock(clk),
.rdaddress(address),
// ...
.q(readdata) // dpram drives this wire directly
);
// Synchronous Valid generation (1 cycle delay)
always @(posedge clk or negedge reset_n) begin
if (!reset_n) begin
readdatavalid <= 1'b0;
end else begin
readdatavalid <= read; // Passed with a 1-cycle delay
end
end
endmodule
To give the custom slave practical functionality, we embedded a Dual-Port RAM (DPRAM). Unlike flip-flop-based registers, a DPRAM provides high-density storage efficiently by using the FPGA's dedicated memory blocks (M10K/M9K).
- Why Dual-Port? Because it allows simultaneous access from two different ports. In a more complex scenario, one port is connected to the Nios II processor (via this Avalon slave), while the other port independently collects high-speed data from sensors or hardware logic.
- "Structural" Connection Cautions: As in the error resolution described earlier, the dpram module is a structural entity. Its q (output) port drives a wire, and this wire is connected directly to the Avalon interface's readdata bus. Important: do not declare the signal connected to the q output as a register (reg). The dpram instance already drives that signal structurally, so also assigning it from an always block in the same module causes a compilation error.
(Figure: Internal structure of the Custom Slave with integrated DPRAM)
An important detail often overlooked during implementation is how to align the CPU address with the RAM address.
- Conflict Point:
  - Nios II (Master): Uses Byte Addressing. When reading consecutive 32-bit integers, the address increases as 0x00, 0x04, 0x08, 0x0C.
  - DPRAM (Internal): Uses Word Indexing. Slots are numbered 0, 1, 2, 3, and the RAM expects those indices as addresses.
- Solution (Qsys Setting): In Platform Designer, we selected Address Units: WORDS in the Avalon-MM Pipeline Slave settings.
  - Operation Principle: The system interconnect automatically shifts the master's byte address 2 bits to the right (Address >> 2) and passes it to the module's address input.
  - Result: When the CPU reads 0x04 (byte address 4), my_slave.v receives 1 as the address input. Therefore, the address input can be connected directly to the DPRAM's rdaddress port without separate bit slicing (e.g., address[9:2]) in the Verilog code.
Standard Nios II processors do not have a hardware floating-point unit by default, and integer division is computationally very expensive (takes many cycles). We needed a way to extremely quickly process a specific arithmetic operation of multiplying two numbers and then dividing by 400.
Hardware dividers consume significant logic resources and timing budget. Instead of using a raw divider, we adopted a mathematical approximation method using bit shift and multiplication (Shift-Add).
Mathematical Principle:
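We want to calculate $R = (A \times B) / 400$ without a hardware divider. Since $1/400 \approx K / 2^{Q}$ with $K = 5243$ and $Q = 21$, the division becomes one multiplication and one right shift:
$$R \approx \big((A \times B) \times 5243\big) \gg 21$$
The relative error of this coefficient is only about 0.0023% ($5243 / 5242.88 - 1$), which is almost negligible for our application.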
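As a sanity check of this approximation, a host-side C sketch such as the following can compare the multiply-and-shift result with a true division. The function names are hypothetical, and the sketch assumes the product A * B stays below 2^51 so that multiplying by 5243 does not overflow 64 bits.

```c
#include <stdint.h>

/* Fixed-point reciprocal of 400: K / 2^Q with K = 5243, Q = 21. */
#define DIV400_K 5243u
#define DIV400_Q 21u

/* Approximates (a * b) / 400 with one multiply and one shift.
   Assumption: a * b < 2^51, so product * K fits in 64 bits. */
static uint32_t div400_approx(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * b;  /* full-width product */
    return (uint32_t)((product * DIV400_K) >> DIV400_Q);
}

/* Reference result using a real (truncating) division. */
static uint32_t div400_exact(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) / 400u);
}
```

For example, div400_approx(400, 800) returns 800, matching the exact result. Because 5243 / 2^21 is slightly larger than 1/400, the approximation can come out one higher than truncating division when the exact quotient has a fractional part close to 1: a product of 44399 gives 111 instead of 110.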
Implementation Code (RTL/my_multi_calc.v):
always @(posedge clk or posedge reset) begin
if (reset) begin
mult_stage <= 0;
result <= 0;
end
else if (clk_en) begin
// [Cycle 1] Hardware Multiplication
mult_stage <= 64'd1 * dataa * datab;
// [Cycle 2] Optimization: Use Shift-Add instead of divider (K=5243, Q=21)
// Logic: (val * 5243) >> 21
result <= (mult_stage * 64'd5243) >> 21;
end
end
The initial implementation failed with a Setup Time Violation (timing error). The cause of the problem and its solution are analyzed below.
In hardware, the division (/) operation has a much deeper combinational logic depth compared to addition or multiplication.
- We attempted to process 32-bit division within a single clock (1 Cycle).
- The signal propagation delay (Data Path Delay) through the division circuit exceeded the clock period (e.g., 20 ns at 50 MHz).
- This caused a Setup Time Violation where the data did not arrive at the register on time.
The DSP block (Multiplier) inside the FPGA is very fast, but the divider is slow. Therefore, we solved the Timing issue by converting the division into multiplication and shift operations.
- Before: Result = (A * B) / 400 (Using divider -> slow, Timing Error)
- After: Result = ((A * B) * 5243) >> 21 (Multiplier + shift -> fast, Timing Pass)
Through this change, we significantly shortened the Critical Path, resolved the Timing Violation, and were able to secure an operation speed of 50MHz or higher.
Finally, these modules are merged into one in the top-level entity. The custom_inst_qsys system generated by Platform Designer acts as the brain, and our custom HDL modules perform the role of muscles.
Core Integration Code (RTL/top_module.v):
// Qsys system instantiation
custom_inst_qsys u0 (
.clk_clk (CLOCK_50),
.reset_reset_n (RST),
// ... Avalon-MM signal connection ...
.mmio_exp_readdata (w_readdata),
.mmio_exp_readdatavalid (w_readdatavalid), // Connected to our slave
// ...
);
// Custom slave instantiation
my_custom_slave s1 (
.clk(CLOCK_50),
.readdata(w_readdata), // Feedback to Qsys
.readdatavalid(w_readdatavalid)
// ...
);
To integrate the accelerator (my_multi_calc.v) into the Nios II system, follow these steps in Platform Designer.
Step 1: Create New Component
Step 2: Add Files
Step 3 & 4: Interface and Timing Settings
- Interface Type: Select Custom Instruction Slave.
- Timing: Explicitly set Multicycle (2 or 3 cycles) considering the pipeline depth. Combinatorial should not be used.
- Note: Since the hardware logic is 2 steps, 2 or 3 is appropriate.
(Please refer to the setting on the right side of the image below)
Step 5: Completion
- Click Finish to save the component (cust_cal).
- Add the new component to the Qsys system.
Although the Nios II processor is versatile, using it to copy large buffers (e.g., from main memory to a hardware accelerator) is inefficient: every word costs CPU cycles for an ldw / stw pair, which becomes a bottleneck.
To solve this, we integrated an Altera Scatter-Gather DMA Controller into the Qsys system, so that the hardware handles bulk data transfers on its own while the CPU performs other tasks.
This system is designed to seamlessly move processing data:
- Source (On-Chip Memory):
  - Holds raw input data (e.g., array of operands for calculation).
  - Mapped as a slave in Qsys.
- Transfer Engine (SG-DMA):
  - Operates in Memory-to-Memory mode.
  - Reads from On-Chip Memory and writes to the custom slave.
  - Supports "Scatter-Gather" through descriptors, so it can process non-contiguous memory blocks in one operation if necessary.
- Destination (Custom Slave / DPRAM):
  - Receives the data stream through the Avalon-MM Slave interface.
  - Stores it in the internal DPRAM so that custom logic or other masters can access it.
(Figure: Qsys system view showing connections between Nios II, On-Chip Memory, SG-DMA, and custom slave)
Beyond simple memory copying, performing a calculation while the data moves ((Data * A) / 400) is not possible with a single Memory-to-Memory DMA, because our stream_processor has to be inserted in the middle of the data stream.
For this, we introduced the Modular SGDMA architecture, which separates (disaggregates) the DMA by function so that its parts can be connected flexibly.
- Existing (Standard SGDMA): Read Master and Write Master are tied together internally. (For simple copy purposes)
- New (Simplified Modular SGDMA): Consists of 2 independent master components (each master has its own internal Dispatcher/Descriptor logic).
  - mSGDMA Read Master: Reads data from memory and sends it to the Avalon-ST Source, receiving descriptors directly from Nios II.
  - mSGDMA Write Master: Receives data via the Avalon-ST Sink and writes it to memory, receiving descriptors directly from Nios II.
We completed the streaming pipeline by configuring it in Platform Designer (Qsys) as follows:
- Add Components:
  - Modular SGDMA Read Master: Connect the descriptor_slave port to the Nios II data master, the memory-mapped master to source memory, and the streaming source (Data Source) to the stream processor.
  - Modular SGDMA Write Master: Connect the descriptor_slave port to the Nios II data master, the memory-mapped master to destination memory, and the streaming sink (Data Sink) to the stream processor.
- Stream Processor Connection (Core):
  - Read Master.Source connects to Stream Processor.Sink
  - Stream Processor.Source connects to Write Master.Sink
  - As a result, data read from memory must pass through our hardware logic before it can be written back to memory.
The core of the Stream Processor (stream_processor.v) design using the Avalon-Streaming interface is flow control (Backpressure).
However deep the pipeline gets, whether each stage transfers data (its enable) follows a simple but powerful rule determined by a combination of three factors:
- Current Valid (s1_valid): Do I have data?
- Next Valid (s2_valid): Is the next stage full?
- Output Ready (aso_ready): Can the final output go out?
The following code shows how the Ready signal is chained from back to front in the actual 3-stage pipeline (Stage 0, 1, 2).
// 1. Ready state of Stage 2 (last stage):
// "When there is no data" OR "When it can be taken from the next stage (mSGDMA)"
assign pipe_ready[2] = (!pipe_valid[2]) || aso_ready;
// 2. Ready state of Stage 1 (middle stage):
// "When there is no data" OR "When the next stage (Stage 2) is empty or ready to take"
assign pipe_ready[1] = (!pipe_valid[1]) || pipe_ready[2];
// 3. Ready state of Stage 0 (first stage):
// "When there is no data" OR "When the next stage (Stage 1) is empty or ready to take"
assign pipe_ready[0] = (!pipe_valid[0]) || pipe_ready[1];
// Finally, notify the very front end (Sink)
assign asi_ready = pipe_ready[0];
This expression handles all of the following scenarios:
| Scenario | State Description | Action (Ready) | Result |
|---|---|---|---|
| 1. Empty state | pipe_valid[i]=0 | Ready | Since the slot is empty, it immediately accepts data from the previous stage. |
| 2. Flowing state | pipe_valid[i]=1, pipe_valid[i+1]=0 | Ready | Since the current data can be pushed to the next slot, it accepts new data. |
| 3. Full state | All pipe_valid=1 | Depends on aso_ready | If the output drains (aso_ready=1), everything advances by one slot like dominoes. If the output is blocked, everything stalls. |
Furthermore, generalizing this with a generate statement gives code that applies to any number of stages:
genvar i;
generate
    // Boundary (assumes pipe_ready is declared [STAGES:0]): the slot after the last stage is the downstream sink
    assign pipe_ready[STAGES] = aso_ready;
    for (i = 0; i < STAGES; i = i + 1) begin : gen_handshake
        // Condition for the current stage to accept data (Ready):
        // "I am currently empty (!pipe_valid[i])" OR "The next stage can take mine (pipe_ready[i+1])"
        assign pipe_ready[i] = !pipe_valid[i] || pipe_ready[i+1];
    end
endgenerate
This structure extends with the same rule no matter how many pipeline stages are added, and it is a widely used valid/ready handshake that preserves correct data flow without requiring a FIFO.
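The ready chain can also be checked in software before touching the RTL. The C sketch below is a hypothetical host-side model (compute_ready is not a project function); it evaluates the same rule from the last stage back to the first, treating aso_ready as the ready of the slot after the final stage.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fills pipe_ready[0..stages-1] from the valid bits and the downstream
   ready (aso_ready), walking from the last stage back to the first.
   Returns asi_ready, i.e. the ready of stage 0. */
static bool compute_ready(const bool *pipe_valid, size_t stages,
                          bool aso_ready, bool *pipe_ready)
{
    bool next_ready = aso_ready;  /* boundary: the downstream sink */
    for (size_t i = stages; i-- > 0; ) {
        /* Ready if this stage is empty OR the next stage can take its data. */
        pipe_ready[i] = !pipe_valid[i] || next_ready;
        next_ready = pipe_ready[i];
    }
    return next_ready;
}
```

With three stages, an all-empty pipeline is ready regardless of aso_ready, a full pipeline is ready only while aso_ready is high, and a pipeline whose last stage is empty still accepts data because the bubble propagates backward through the chain.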
Hardware can only be as good as the software that drives it. We implemented a C application (main.c) to control the DMA and benchmark the performance of the custom accelerator.
Before moving on to complex DMA, it is essential to understand how the C code "talks" with our custom slave hardware.
When you generate the system in Qsys and then generate a BSP (Board Support Package), the BSP generator produces a system.h file. This file contains the base addresses of all modules.
- Target: MMIO_0_BASE (the base address of our "my_custom_slave" component).
To access hardware registers, use the I/O macros provided by the Altera HAL (io.h). Choosing the wrong macro can hit the wrong address or cause misaligned accesses.
| Macro | Argument | Description | Addressing Mode |
|---|---|---|---|
| IOWR | (BASE, REG_NUM, DATA) | Writes 32-bit data to the register. | Word offset (BASE + REG_NUM * 4) |
| IORD | (BASE, REG_NUM) | Reads 32-bit data from the register. | Word offset (BASE + REG_NUM * 4) |
| IOWR_32DIRECT | (BASE, OFFSET, DATA) | Writes 32-bit data to a specific byte address. | Byte offset (BASE + OFFSET) |
| IORD_32DIRECT | (BASE, OFFSET) | Reads 32-bit data from a specific byte address. | Byte offset (BASE + OFFSET) |
| IOWR_16DIRECT | (BASE, OFFSET, DATA) | Writes 16-bit data. | Byte offset (BASE + OFFSET) |
| IOWR_8DIRECT | (BASE, OFFSET, DATA) | Writes 8-bit data. | Byte offset (BASE + OFFSET) |
IOWR vs IOWR_32DIRECT: which one should be used?
- Use IOWR (Recommended): When the component uses slave address alignment in "Word" units, like our project. If you pass the index (0, 1, 2...), the macro automatically multiplies it by 4.
- Use IOWR_32DIRECT: When accessing raw memory, or a component using "Byte" address alignment where the byte address (e.g., 0, 4, 8...) must be controlled explicitly.
Because the hardware is set to Word Alignment, the index i in the Nios II software matches the i-th row of the DPRAM exactly.
- Software: IOWR(MMIO_0_BASE, 5, val) -> CPU outputs byte address Base + 20 (0x14).
- Interconnect: Detects that it is a "Word Aligned" slave and shifts the address: 20 >> 2 = 5.
- Hardware: The slave receives address 5. The DPRAM writes the data to slot 5.
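The three steps above can be traced with plain arithmetic. The following C sketch is a host-side model only: iowr_byte_address and slave_word_address are hypothetical helper names, and the base address used in any call is an arbitrary example value rather than the real MMIO_0_BASE from system.h.

```c
#include <stdint.h>

/* Byte address the CPU issues for IOWR(base, regnum, data):
   the macro scales the register index by 4 (32-bit registers). */
static uint32_t iowr_byte_address(uint32_t base, uint32_t regnum)
{
    return base + regnum * 4u;
}

/* Word address seen by a slave configured with Address Units: WORDS:
   the interconnect removes the base and drops the two byte-lane bits. */
static uint32_t slave_word_address(uint32_t base, uint32_t byte_address)
{
    return (byte_address - base) >> 2;
}
```

Feeding the output of the first helper into the second returns the original register index, which is why index i in software lands on DPRAM row i.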
Code Example (main.c):
#include "io.h"
#include "system.h"
// Simple R/W Test
for (int i = 0; i != 256; ++i) {
// Write: index 'i' is mapped 1:1 with DPRAM address 'i'
IOWR(MMIO_0_BASE, i, 0x1000 + i);
}
for (int i = 0; i != 256; ++i) {
// Read: data verification
int read_val = IORD(MMIO_0_BASE, i);
// ...
}
A trap commonly encountered in Nios II DMA systems is the Data Cache Coherency problem. The CPU has a data cache, but the DMA engine reads physical memory (RAM) directly.
If we write data as src_data[i] = ... and start DMA immediately, the data may still stay inside the CPU cache instead of RAM. Then DMA will copy the previous garbage value in RAM.
Solution: Before starting the transfer, the data cache must be explicitly flushed to be written into RAM.
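#include <sys/alt_cache.h>
#include "altera_msgdma.h" // alt_msgdma_* driver API
#include "system.h" // MMIO_0_BASE, DMA_ONCHIP_DP_CSR_NAME
void start_dma_transfer() {
// 1. Data preparation
for(int i=0; i<256; i++) src_data[i] = i * 400;
// [Essential] Flush the cache to RAM so that DMA sees correct data
alt_dcache_flush(src_data, sizeof(src_data));
alt_msgdma_dev *dma_dev = alt_msgdma_open(DMA_ONCHIP_DP_CSR_NAME);
if (dma_dev == NULL) return; // Device not found: nothing to transfer with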
// 2. Create Descriptor
alt_msgdma_standard_descriptor descriptor;
alt_msgdma_construct_standard_mm_to_mm_descriptor(
dma_dev,
&descriptor,
(alt_u32 *)src_data, // Source (Array in RAM)
(alt_u32 *)MMIO_0_BASE, // Destination (Custom slave base address)
sizeof(src_data), // Length
0
);
// 3. Start DMA (Async)
alt_msgdma_standard_descriptor_async_transfer(dma_dev, &descriptor);
}
To prove the value of custom instructions, we measured the execution time of the hardware accelerator and of a pure software implementation using a high-resolution timestamp timer.
Measurement Code:
#include "system.h"
#include "sys/alt_timestamp.h"
// ... inside main() ...
if (alt_timestamp_start() < 0) {
printf("Error: Timestamp timer not defined in BSP.\n");
return -1;
}
// Hardware Measurement (Custom Instruction)
time_start = alt_timestamp();
for (int i = 990; i != 1024; ++i) {
for (int j = 390; j != 400; ++j) {
// New instruction: Multi-cycle multiplication & division
result = (int)ALT_CI_CUST_CAL_0(i, j);
sum += result;
}
}
time_hw = alt_timestamp() - time_start;
// Software Measurement (Standard Operators)
time_start = alt_timestamp();
for (int i = 990; i != 1024; ++i) {
for (int j = 390; j != 400; ++j) {
result = i * j / 400; // Software division is slow
sum += result;
}
}
time_sw = alt_timestamp() - time_start;
printf("HW Cycles: %llu\n", (unsigned long long)time_hw);
printf("SW Cycles: %llu\n", (unsigned long long)time_sw);
if (time_hw > 0) { // Guard the divisor, not the dividend
printf("Speedup: %.2fx faster!\n", (float)time_sw / (float)time_hw);
}
This setup provides solid data on the speedup that can be gained by moving heavy arithmetic operations into logic.
Results of performing tests on actual hardware are as follows:
(Figure: Nios II console output screen - You can see that HW Cycles are significantly less than SW Cycles)
As the result image above shows, the hardware path (HW Cycles) using the Custom Instruction took far fewer cycles than the software path (SW Cycles) for the same computation, which confirms a clear acceleration effect.
The actual implementation process of Modular SGDMA, which inserts operation logic in the middle of the data stream beyond simple DMA copying, was not as smooth as the theory. We summarize the "struggling" parts encountered during the development process and the lessons learned accordingly.
- Problem: The standard alt_msgdma_open() call kept returning NULL.
- Cause: The altera_msgdma HAL driver provided by Intel expects a "Standard mSGDMA" configuration in which Dispatcher + Read Master + Write Master are bundled as one complete package. Because we placed the Read/Write Masters independently to insert our processing logic, the software could not recognize them as one integrated DMA device.
- Solution: We gave up on the high-level HAL API and issued commands directly to the CSR/Descriptor Slave of each Master using IOWR macros. The interface became rougher, but it gave us full control of the hardware.
- Problem: The DMA transfer was reported as complete (BUSY=0), but the result memory was still 0 (Act=0) or held stale data.
- Cause: The operation mode of the Read/Write Dispatcher had been left at its default, Memory-to-Memory.
- Lesson: In a separated architecture, the Read Master must be explicitly set to Memory-to-Stream mode and the Write Master to Stream-to-Memory mode. If the mode is wrong, the data stream handshake does not complete normally and the pipeline stalls.
- Problem: In stream_processor.v, where the processing logic lives, the division stage was bypassed entirely.
- Cause: We implemented the pipeline stages as an array such as reg [63:0] stage_data[0:1], and with certain synthesis tool versions the array-based control logic appeared to be optimized away or left unconnected.
- Solution: We rewrote the logic with individually named registers such as s0_data and s1_data. This made each pipeline stage explicit to the synthesizer and restored reliable operation.
- Problem: During testing, if an error occurred or the run was forcibly terminated, the DMA often did not respond on the next execution.
- Solution: Added a software reset sequence for each Master at the beginning of the main.c test function. This confirmed once again that keeping the hardware in a predictable state from software is at the core of embedded programming.
// Reset Read/Write Masters to a clean state
IOWR_ALTERA_MSGDMA_CSR_CONTROL(DMA_READ_BASE, ALTERA_MSGDMA_CSR_RESET_MASK);
IOWR_ALTERA_MSGDMA_CSR_CONTROL(DMA_WRITE_BASE, ALTERA_MSGDMA_CSR_RESET_MASK);
- Challenge: A small rounding difference can occur between the integer accelerator (Shift-Add) and the CPU's floating-point division result.
- Solution: In the verification code, instead of failing on actual != expected, we introduced a tolerance (allowable error) and accept results that satisfy abs(actual - expected) <= 1. We learned that when designing hardware, defining the allowable range of a result is as important as pursuing its exactness.
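A tolerance check of this kind might look like the sketch below. The helper name within_tolerance is hypothetical, and unsigned operands are an assumption made here so that the difference can be computed without relying on abs() of a possibly overflowing signed subtraction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when actual and expected differ by at most tol. */
static bool within_tolerance(uint32_t actual, uint32_t expected, uint32_t tol)
{
    /* Subtract the smaller from the larger to avoid unsigned wrap-around. */
    uint32_t diff = (actual > expected) ? (actual - expected)
                                        : (expected - actual);
    return diff <= tol;
}
```

A verification loop can then count a mismatch only when within_tolerance(actual, expected, 1) is false, so an off-by-one rounding difference passes while larger deviations are still reported.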
Warning
★★★★★ Very Important: Read if you don't want to see the magic of data being mixed! ★★★★★
- Symptom:
  - Input data 0x00000190 (400) was written, but the hardware saw it byte-reversed as 0x90010000.
  - Bypass mode works normally, but an absurdly large value comes out once the data goes through the arithmetic.
- Cause:
  - The "First Symbol In High-Order Bits" option was enabled in the Avalon-ST Sink settings of Qsys (Platform Designer).
  - Nios II (Little-Endian): The first byte must land in the lowest position (LSB).
  - Option On (Big-Endian): Sends the first byte to the highest position (MSB).
(Figure: First Symbol In High-Order Bits setting of Avalon-ST Sink)
- Solution:
  - If possible, uncheck the option in the Avalon-ST Sink settings.
  - Caution: In cases like mSGDMA where the option is fixed (grayed out) and cannot be turned off, the byte order must be reversed manually (Byte Swap) at the hardware (RTL) input/output stage.
last_asi_data <= {asi_data[7:0], asi_data[15:8], ...}
- Result: Confirmed normal output of (400 * 800) / 400 = 800 after the RTL modification!
- Bypass mode performance: 7.59x faster than CPU.
- Operation mode performance: 86.14x faster than CPU! (Division operation included)
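The byte reversal described in the symptom is easy to reproduce on a host machine. The following C sketch is only an illustration of the 32-bit swap (byte_swap32 is a hypothetical name, not the RTL itself), which can help when predicting what the hardware will see for a given input word.

```c
#include <stdint.h>

/* Reverses the byte order of a 32-bit word (byte 0 <-> byte 3, byte 1 <-> byte 2). */
static uint32_t byte_swap32(uint32_t w)
{
    return ((w & 0x000000FFu) << 24) |
           ((w & 0x0000FF00u) << 8)  |
           ((w & 0x00FF0000u) >> 8)  |
           ((w & 0xFF000000u) >> 24);
}
```

Applying it to 0x00000190 yields 0x90010000, the value observed in hardware, and applying it twice restores the original word, which is why swapping at both the input and the output stage leaves the result in CPU byte order.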
The mathematical background of the "Shift-Add" method we used to implement division / 400 is Fixed-Point Arithmetic.
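To process a real number (decimal) $F$ with integer-only hardware, we scale it by a power of two and store it as an integer coefficient:
$$K = \mathrm{round}(F \times 2^{Q})$$
- $F$ : Target real number to express (in our case $1/400 = 0.0025$ )
- $Q$ : Number of bits used below the decimal point (Q-Factor). Precision increases as it gets larger, but so does the bit width.
- $K$ : Integer coefficient actually multiplied in hardware.
We chose $Q = 21$, which gives $F \times 2^{Q} = 2^{21} / 400 = 5242.88$.
Rounding this value gives $K = 5243$.
At this time, the value actually multiplied is $K / 2^{Q} = 5243 / 2097152 \approx 0.00250006$, slightly larger than the exact $0.0025$.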
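To explore other choices of $Q$, the coefficient can be computed programmatically. The C sketch below is a hypothetical host-side helper (neither function exists in the project); it assumes a non-zero divisor and a shift amount below 63, and uses integer rounding to the nearest value.

```c
#include <stdint.h>

/* K = round(2^q / divisor), computed in integers.
   Assumptions: divisor != 0 and q < 63. */
static uint64_t reciprocal_coeff(uint32_t divisor, unsigned q)
{
    uint64_t scale = (uint64_t)1 << q;
    return (scale + divisor / 2u) / divisor;  /* add half the divisor to round */
}

/* Relative error of K / 2^q against the exact 1 / divisor,
   in parts per million (positive when K was rounded up). */
static double coeff_error_ppm(uint32_t divisor, unsigned q)
{
    double k = (double)reciprocal_coeff(divisor, q);
    double scale = (double)((uint64_t)1 << q);
    double approx = k / scale;
    double exact = 1.0 / (double)divisor;
    return (approx - exact) / exact * 1e6;
}
```

For a divisor of 400, reciprocal_coeff returns 5243 at Q = 21 and 1311 at Q = 19, and coeff_error_ppm shows that the larger Q gives the smaller error, at the cost of a wider multiplier.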
As the final step of the project, we refactored the pipeline, which was composed of 1 stage, into an extensible N-Stage Pipeline structure.
- Securing Timing Margin: If all operations (endian conversion, wide multiplication, shift) are packed into one cycle, timing is likely to fail as the clock frequency rises. We secured margin by splitting the work into 3 stages.
- Backpressure Processing: We implemented the Valid-Ready handshake rule so that when the downstream mSGDMA stalls, each stage stalls in turn and no data is lost.
- Stage 0: Input data capture and preprocessing (Byte Swap).
- Stage 1: Coefficient multiplication (Input * Coeff).
- Stage 2: Reciprocal multiplication (* 5243) and shift (>> 21), followed by final endian restoration.
This structure is an industrial-level RTL design method that can operate stably even at higher clock speeds while maintaining data throughput.
Through this project, we confirmed how powerful performance (86x speedup) can be achieved when Custom Instruction, mSGDMA, and RTL Optimization are combined.
Beyond simply producing "working code," the project is meaningful because it tackled key challenges of practical FPGA design head-on: resolving endianness problems, optimizing fixed-point arithmetic, and designing an N-stage pipeline.
We hope this record will serve as a useful guide for our future selves and for colleagues who will maintain this system.


