A 6-stage pipelined RISC-V processor core designed for FPGA deployment, featuring an integrated UART controller for system programming and debugging. Achieved 263.7 CoreMark and 91.0 DMIPS at 100 MHz.
- Features
- Clock Domain and Reset
- Memory Specifications
- Performance Characteristics
- Supported Instructions
- CSR Registers
- UART Subsystem
- Software & Tools
- Directory Structure
- Toolchain Requirements
runtime/— Common Bare-Metal Runtimeapps/firmware/— Custom Test Firmwareapps/coremark/— CoreMark Benchmarkapps/dhrystone/— Dhrystone Benchmarkapps/riscv-tests/— Official RISC-V Test Suiteprogrammer/— Host-Side UART Programming Tool- Typical Workflow
- Simulation
- Directory Structure
- License
- Base ISA: RV32I (RISC-V 32-bit Integer Base Instruction Set)
- Extensions:
- M: Integer Multiplication and Division
- Zicsr: Control and Status Register (CSR) Instructions
- Zifencei: Instruction-Fetch Fence (FENCE.I)
- Pipeline: 6-stage (Fetch, Decode, Execute, Memory 1, Memory 2, Writeback)
- MEM1: Memory Access & Store Align
- MEM2: Data Ready (FPGA BRAM Latency) & Load Data Extend
- Branch Prediction:
- BHT (Branch History Table): 256 entries. Uses a 2-bit Saturating Counter (Strongly/Weakly Taken/Not Taken) to predict conditional branches. Implemented as LUTRAM.
- BTB (Branch Target Buffer): 256 entries. Stores Valid bit, Entry Type (Branch, Jump, Return), Tag, and Target Address. Implemented as LUTRAM.
- RAS (Return Address Stack): 32 entries. Push on JAL/JALR with call hint (
CFHINT_CALL); pop on JALR with return hint (CFHINT_RET). Supports nested function calls. - Prediction Logic: For branches,
pred_taken = bht_taken && btb_hit; for JAL/JALR,pred_taken = btb_hit. No pipeline stalls or flushes on a correct prediction.
- Branch Resolution:
- Branch Unit: Resolution & validation in EX stage. Registers inputs for timing optimization (1-cycle latency).
- Recovery: 3-cycle penalty on misprediction (Flush ID/EX/MEM1, redirect PC)
- Hazard Handling:
- RAW hazards resolved by forwarding from MEM1/MEM2/WB to EX
- Store-Data hazards resolved by forwarding from WB to MEM1
- Load-Use hazards resolved by 2-cycle pipeline stall + WB to EX forwarding
- Branch Misprediction resolved by pipeline flush
- Trap/Exception Support:
- ECALL, EBREAK, MRET
- Illegal Instruction
- Instruction/Data Address Misalign
- Instruction/Data Access Fault
- Hardware Multiplier & Divisor:
- Multiplier: Implemented using inferred DSP blocks (via
(* use_dsp = "yes" *)synthesis attribute). - Divisor: Implemented using a custom iterative shift-subtract logic (2x loop unrolled, 16 cycles for 32-bit division).
- Multiplier: Implemented using inferred DSP blocks (via
- Board: ALINX AX7Z020B (Zynq-7020)
- Input Clock: 50 MHz (onboard oscillator)
- Internal Clock: 100 MHz (via MMCM)
| Signal | Type | Description |
|---|---|---|
clk |
Input | System clock |
rstn_push |
Input | Active-low asynchronous reset (Button, synchronized with 2-FF + debounce) |
uart_rx |
Input | UART Receive Data |
uart_tx |
Output | UART Transmit Data |
rstn_led |
Output | Reset status LED (active when reset asserted) |
start_led |
Output | CPU run status LED (active when CPU running) |
- Hard Reset: Physical button or MMCM lock.
- Soft Reset: Controlled via UART commands during programming.
- Initialization: PC resets to
0x00000000, pipeline flushes.
| Region | Base Address | Size | Description |
|---|---|---|---|
| ROM | 0x00000000 |
128 KB | Program ROM (.text, .rodata, .data load image) |
| RAM | 0x00020000 |
128 KB | Data RAM (.data, .bss, stack) |
| Stack Top | 0x00040000 |
— | Top of RAM (stack grows downward) |
0xFFFF0000 |
— | MMIO: UART TX output (write) | |
| INPUT | 0xFFFF0004 |
— | MMIO: UART RX input (read) |
- Size: 128 KB (32768 words)
- Width: 32-bit
- Access: Read-only from the CPU; writable via UART programmer before
RUN - Contents: Full firmware image. Instruction fetches,
.rodataloads, and.datainitialization reads come from ROM.
- Size: 128 KB (32768 words)
- Width: 32-bit
- Access: Read/Write (byte-enable write strobe)
- Initialization:
crt0.Scopies.datafrom ROM to RAM and clears.bssbeforemain(). - IO Mapping:
- Writes to
0xFFFF_0000(MMIO_PRINT_ADDR) withdata[8] == 0→RES_PRINTvia UART TX. - Writes to
0xFFFF_0000withdata[8] == 1→RES_EXITwithdata[7:0]as exit code. - Reads from
0xFFFF_0004(MMIO_INPUT_ADDR) → returnsinput_dataif RX FIFO non-empty, else0xFFFFFFFF. Polled bygetchar()in syscalls.
- Writes to
- Access Fault: Stores to ROM and accesses outside ROM/RAM/MMIO trigger a Data Access Fault trap.
Measured on Zynq-7020 FPGA running at 100 MHz (RV32IM, -O2, ITERATIONS=3000).
| Metric | Value |
|---|---|
| CoreMark Score | 263.7 |
| CoreMark/MHz | 2.64 |
Reference measurement on Zynq-7020 FPGA running at 100 MHz (RV32IM, -O2, NUMBER_OF_RUNS=100000).
The current simulation sanity configuration uses NUMBER_OF_RUNS=10.
| Metric | Value |
|---|---|
| Dhrystones/Second | 160000 |
| DMIPS | 91.0 |
| DMIPS/MHz | 0.91 |
DMIPS = Dhrystones/Second ÷ 1,757 (VAX 11/780 reference = 1 DMIPS)
Multiplier
- Inferred DSP48 blocks via
(* use_dsp = "yes" *)synthesis attribute. - 3-stage pipeline: result available 3 cycles after issue, causing a 3-cycle stall (
MUL_COUNT = 3).
Divisor
- Algorithm: Shift-compare-subtract. Handles signed/unsigned division and remainder, with special-case handling for divide-by-zero and signed overflow.
- 2x loop unrolled: computes 2 quotient bits per clock cycle combinatorially.
- 32-bit division takes 17 cycles total (1 setup + 16 compute), causing a 17-cycle stall (
DIV_COUNT = 17,SHIFT_COUNT = 16).
- Ideal CPI: 1.0
- Actual CPI: Depends on program characteristics (typically 1.1–1.5 due to hazards)
| Hazard Type | Penalty (Cycles) | Detection Stage | Notes |
|---|---|---|---|
| Data Hazard (RAW) | 0 | EX | Forwarding from MEM1/MEM2/WB to EX. Loads are excluded from MEM1/MEM2 path. |
| Store-Data Hazard | 0 | MEM1 | Forwarding from WB to MEM1 store data register |
| Load-Use Hazard | 2 | ID | Stall while load is in EX (id_ex) and again in MEM1 (id_mem1); 2 bubbles total; result forwarded from WB→EX |
| Branch Prediction Hit | 0 | IF | Zero penalty (Seamless execution) |
| Branch Prediction Miss | 3 | EX | Flush ID/EX/MEM1 stages; redirect PC to correct target |
| Multiplication Stall | 3 | EX | Pipeline stall during multiplication |
| Division Stall | 17 | EX | Pipeline stall during division |
| Trap/Flush Type | Penalty (Cycles) | Processing Stage | Notes |
|---|---|---|---|
| All Traps | 3 | MEM1 | Flush ID/EX/MEM1/MEM2 stages, redirect to mtvec |
| MRET | 3 | MEM1 | Flush ID/EX/MEM1/MEM2 stages, restore PC from mepc |
| FENCE.I | 3 | MEM1 | Flush ID/EX/MEM1/MEM2 stages, instruction memory sync |
- Arithmetic: ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU
- Immediate Arithmetic: ADDI, ANDI, ORI, XORI, SLLI, SRLI, SRAI, SLTI, SLTIU
- Load: LB, LH, LW, LBU, LHU
- Store: SB, SH, SW
- Branch: BEQ, BNE, BLT, BGE, BLTU, BGEU
- Jump: JAL, JALR
- Upper Immediate: LUI, AUIPC
- System: ECALL, EBREAK, MRET, WFI (Hint, NOP)
- Multiplication: MUL, MULH, MULHSU, MULHU
- Division: DIV, DIVU, REM, REMU
- Zicsr: CSRRW, CSRRS, CSRRC, CSRRWI, CSRRSI, CSRRCI
- Zifencei: FENCE.I (Instruction cache/pipeline flush)
| CSR Address | Name | Description |
|---|---|---|
| 0x300 | mstatus | Machine status register (MIE, MPIE bits) |
| 0x304 | mie | Machine interrupt-enable register |
| 0x305 | mtvec | Machine trap-handler base address (default: 0x400) |
| 0x340 | mscratch | Machine scratch register for trap handlers |
| 0x341 | mepc | Machine exception program counter |
| 0x342 | mcause | Machine trap cause |
| 0x343 | mtval | Machine trap value (bad address or instruction) |
| 0x344 | mip | Machine interrupt-pending register |
| 0xB00 | mcycle | Cycle counter (lower 32 bits), auto-incremented every clock cycle |
| 0xB80 | mcycleh | Cycle counter (upper 32 bits) |
| 0xB02 | minstret | Instructions-retired counter (lower 32 bits), incremented per retired instruction |
| 0xB82 | minstreth | Instructions-retired counter (upper 32 bits) |
| CSR Address | Name | Description |
|---|---|---|
| 0xF14 | mhartid | Hardware thread ID (hart ID = 0) |
| 0xC00 | cycle | Alias for mcycle (lower 32 bits) |
| 0xC80 | cycleh | Alias for mcycleh (upper 32 bits) |
| 0xC02 | instret | Alias for minstret (lower 32 bits) |
| 0xC82 | instreth | Alias for minstreth (upper 32 bits) |
Integrated UART controller for communication and system control.
- System Programmer: Loads compiled programs (
.hex) into Program ROM. - Standard Output: MMIO-based character output;
putchar/printfwrite characters toMMIO_PRINT_ADDR, sent to host asRES_PRINTpackets. - Standard Input: MMIO-based character input; host sends bytes via
CMD_INPUT, stored inINPUT_FIFO;getchar/scanfpollMMIO_INPUT_ADDRuntil a byte is available.
| Parameter | Value |
|---|---|
| Baud Rate | 115200 bps |
| Data Bits | 8 bits |
| Stop Bits | 1 bit |
| Parity | None |
| Oversampling | 16x |
FIFO Sizes
| FIFO | Size | Description |
|---|---|---|
INPUT_FIFO |
64 entries | CPU input buffer: stores bytes received via CMD_INPUT, read by CPU via MMIO_INPUT_ADDR. Flushed on CMD_RESET. |
PRINT_FIFO |
2048 entries | CPU output buffer: queues RES_BOOT/RES_PRINT/RES_EXIT packets for UART TX. Flushed on CMD_RESET. |
RX Packet (Host → FPGA)
| START | CMD | LEN | PAYLOAD | CHECKSUM |
|---|---|---|---|---|
| 0xA5 | 1B | 1B | 0~252B | 1B |
- Payload:
CMD_WRITE— Address (4B) + Data (nB), little-endian. - Checksum: Sum of all bytes (START through last PAYLOAD byte), truncated to 8 bits.
TX Packet (FPGA → Host)
| START | RES | LEN | PAYLOAD | CHECKSUM |
|---|---|---|---|---|
| 0xA5 | 1B | 1B | 0~1B | 1B |
| Command | Code | Description | Payload |
|---|---|---|---|
| CMD_RESET | 0x01 | Halt and reset CPU | None |
| CMD_WRITE | 0x02 | Write to memory | Addr(4B) + Data |
| CMD_RUN | 0x03 | Start execution | None |
| CMD_INPUT | 0x04 | Send stdin data to CPU RX FIFO | Data (nB) |
| Response | Code | LEN | Payload | Description |
|---|---|---|---|---|
| RES_ACK | 0x06 | 0 | None | Command acknowledged |
| RES_NAK | 0x15 | 0 | None | Command rejected |
| RES_BOOT | 0x80 | 0 | None | CPU started (sent once on CMD_RUN) |
| RES_EXIT | 0x81 | 1 | Exit code (1B) | Program exited via _exit() |
| RES_PRINT | 0x82 | 1 | Character (1B) | MMIO character output |
| RES_OVERFLOW | 0x83 | 1 | Drop count (1B) | TX FIFO overflow: N output events were dropped (max 255) |
- Host sends
CMD_RUN. The programmer automatically issuesCMD_RESETfirst. - CPU begins execution and immediately sends
RES_BOOT(hardware-generated viaboot_flag). - Host enters CPU Console; keyboard input becomes active (line-buffered, sent as
CMD_INPUT). - User program output follows as
RES_PRINTpackets;getchar()/scanf()in the program pollMMIO_INPUT_ADDR. - On
_exit(code), CPU sendsRES_EXITwith the exit code. Console exits.
Note: If
PRINT_FIFOoverflows, existing entries are preserved and transmitted first. Dropped entries are counted and reported asRES_OVERFLOWafter the FIFO drains. Console exits on receivingRES_OVERFLOW.
Note: The programmer tool is Windows-specific due to Win32 API usage (
CreateFileA,ReadFile,WriteFile, etc.).
Software/
├── runtime/ # Common bare-metal runtime (shared by all apps)
│ ├── crt0.S # Startup code: sp init, data copy, BSS clear, main call, trap handler
│ ├── linker.ld # Linker script: ROM @ 0x0, RAM @ 0x20000, stack @ 0x40000
│ ├── syscalls.c # MMIO syscalls: putchar, printf, sprintf, _exit, memcpy, etc.
│ └── common.mk # Shared Makefile rules (compile, link, hex generation)
├── apps/ # Application source code
│ ├── firmware/ # Custom test firmware
│ ├── coremark/ # CoreMark benchmark
│ ├── dhrystone/ # Dhrystone benchmark
│ └── riscv-tests/ # Official RISC-V test suite (submodule)
├── build/ # Compiled output (per-app subdirectories)
│ ├── firmware/ # firmware.hex, firmware.elf, *.o
│ ├── coremark/ # coremark.hex, coremark.elf, *.o
│ ├── dhrystone/ # dhrystone.hex, dhrystone.elf, *.o
│ └── riscv-tests/ # add.hex, lw.hex, mul.hex, ...
└── programmer/ # Host-side UART programming tool (Windows)
- Compiler:
riscv-none-elf-gcc - Architecture:
rv32im_zicsr_zifencei - ABI:
ilp32 - Compiler Flags:
-O2 -nostdlib -nostartfiles -ffreestanding
Shared startup and syscall code used by all applications.
| File | Description |
|---|---|
crt0.S |
Startup code: initializes sp, copies .data from ROM to RAM, clears .bss, installs trap handler, calls main then _exit |
linker.ld |
Linker script: ROM @ 0x00000000, RAM @ 0x00020000, stack top @ 0x00040000 |
syscalls.c |
Bare-metal syscall and standard library implementation (MMIO-based I/O, string utilities, timer functions) |
common.mk |
Shared build rules: compile .c/.S, link .elf, generate .hex via objcopy |
I/O
| Function | Signature | Description |
|---|---|---|
getchar |
int getchar(void) |
Polls MMIO_INPUT_ADDR until a byte is available, returns it as int |
putchar |
int putchar(int c) |
Writes a character to MMIO_PRINT_ADDR (UART TX) |
printf |
int printf(const char *fmt, ...) |
Formatted output to UART TX; supports %c, %s, %d, %i, %u, %x, %o, %p, %f, %l*, %* (width from arg), width, 0/- padding |
vprintf |
int vprintf(const char *fmt, va_list ap) |
va_list variant of printf; used internally by ee_printf in CoreMark |
sprintf |
int sprintf(char *str, const char *fmt, ...) |
Formatted output into a string buffer; same specifiers as printf |
scanf |
int scanf(const char *fmt, ...) |
Reads input via getchar(); supports %c, %s, %d, %i, %u, %x, width for %s; skips whitespace before numeric/string fields |
_exit |
void _exit(int code) |
Writes `0x100 |
String & Memory
| Function | Signature | Description |
|---|---|---|
memcpy |
void *memcpy(void *dest, const void *src, size_t len) |
Copies len bytes from src to dest |
memset |
void *memset(void *dest, int byte, size_t len) |
Fills len bytes of dest with byte |
strlen |
size_t strlen(const char *s) |
Returns the length of string s |
strcmp |
int strcmp(const char *s1, const char *s2) |
Lexicographic comparison of s1 and s2 |
strcpy |
char *strcpy(char *dest, const char *src) |
Copies string src into dest |
Timer
| Function | Signature | Description |
|---|---|---|
get_cycle |
uint64_t get_cycle(void) |
Returns the 64-bit mcycle counter value |
time_us |
uint32_t time_us(void) |
Returns elapsed microseconds based on mcycle (100 MHz) |
time_ms |
uint32_t time_ms(void) |
Returns elapsed milliseconds based on mcycle (100 MHz) |
A bare-metal test program written in C.
| File | Description |
|---|---|
main.c |
User Program |
Makefile |
Sets APP_NAME, APP_SRCS, includes ../../runtime/common.mk |
cd Software/apps/firmware
make clean && makeRuns the EEMBC CoreMark benchmark (v1.0) on bare-metal RISC-V.
| File | Description |
|---|---|
core_portme.h |
Platform configuration: HAS_FLOAT=1, 64-bit cycle counter (rdcycle/rdcycleh), MEM_STATIC |
core_portme.c |
Timing (start_time/stop_time/get_time/time_in_secs), seed variables, ee_printf via vprintf |
Makefile |
Sets APP_NAME, APP_SRCS, ITERATIONS=3000, -DPERFORMANCE_RUN=1 |
cd Software/apps/coremark
make clean && makeCoreMark requires ≥ 10 seconds of continuous execution for a valid result. At 263.7 CoreMark/s,
ITERATIONS=3000gives ~11.4 seconds.
Runs the classic Dhrystone 2.2 synthetic integer benchmark.
| File | Description |
|---|---|
dhrystone.h |
Merged header: HZ=100000000, CLOCK_TYPE="mcycle", NUMBER_OF_RUNS=10 for simulation sanity runs; Start_Timer/Stop_Timer use mcycle CSR |
dhrystone.c |
Dhrystone auxiliary procedures (Proc_6 |
dhrystone_main.c |
Dhrystone main benchmark loop and result output |
util.h |
Benchmark utilities: setStats declaration, encoding.h include guard |
encoding.h |
CSR access macros: read_csr, write_csr |
Makefile |
Sets APP_NAME, APP_SRCS, EXTRA_CFLAGS, includes ../../runtime/common.mk |
Dhrystone timing uses the mcycle hardware counter (100 MHz):
Start_Timer() → Begin_Time = (long)read_csr(mcycle);
Stop_Timer() → End_Time = (long)read_csr(mcycle);cd Software/apps/dhrystone
make clean && makeRISC-V ISA compliance tests (git submodule).
cd Software/apps/riscv-tests
make TEST=add run
# Build all tests
make all-hex
# Output: Software/build/riscv-tests/*.hexA Windows-only interactive command-line tool that communicates with the FPGA over UART.
| File | Description |
|---|---|
programmer.c |
COM port setup, menu-driven command dispatch (RESET/BUILD/WRITE/RUN), make build integration, .hex parsing and chunked ROM upload via CMD_WRITE. RUN automatically issues CMD_RESET first to flush FIFOs. |
programmer.h |
Constants (CHUNK_SIZE=62), command enum (CMD_RESET/WRITE/RUN/INPUT) |
serial_port.c |
Win32 serial port: open/close/read/write, frame receiver, CPU console; line-buffered keyboard input sent as CMD_INPUT after RES_BOOT; Ctrl+D exits; exits on RES_EXIT or RES_OVERFLOW |
serial_port.h |
Protocol constants (START_FLAG=0xA5, TIMEOUT_MS=20000), response enum (RES_ACK/NAK/BOOT/EXIT/PRINT/OVERFLOW) |
1. RESET Reset System
2. BUILD Build Program Image (invokes make, optional immediate write)
3. WRITE Write Program Image
4. RUN Run Program & CPU Monitor
5. EXIT End Program
On BUILD, after a successful make, the tool asks:
Write Program Now? (Y/N):
If Y, it immediately uploads the freshly built .hex without re-selecting the image.
| Constant | Value | Description |
|---|---|---|
START_FLAG |
0xA5 |
Packet start byte |
CHUNK_SIZE |
62 |
Max words per CMD_WRITE packet (payload = 4 + 62×4 = 252 bytes) |
TIMEOUT_MS |
20000 |
Per-field receive timeout (20 seconds) |
| Parameter | Value |
|---|---|
| Baud Rate | 115200 |
| Byte Size | 8 |
| Stop Bits | 1 |
| Parity | None |
| Read Interval Timeout | 50 ms |
| Read Total Timeout | 50 ms + 10 ms × bytes |
| Write Total Timeout | 50 ms + 10 ms × bytes |
cd Software/programmer
gcc programmer.c serial_port.c -o programmer.execd Software/programmer
programmer.exe- Enter COM port number (e.g.,
3forCOM3) RESET— Halt CPU, flush TX/RX FIFOsBUILD— Build selected app, then optionally write immediatelyWRITE— Upload selected.hexto ROMRUN— Auto-reset, start CPU, enter CPU Console (Ctrl+D to exit)
This project uses Vivado Simulator (XSim) for functional verification.
The CPU testbench (RTL/Testbench/cpu_testbench.sv) prints program output and simulation events ([BOOT], [PASS], [FAIL: exit=N], [TIMEOUT]) to the simulation log.
- Add
RTL/Testbench/cpu_testbench.svto the project as a Simulation Source and setcpu_testbenchas the Top Module. - Add
Simulation/waveform.wcfgto the simulation sources for pre-configured signal views. Includes grouped signals for each pipeline stage (IF, ID, EX, MEM1, MEM2, WB) and debug interfaces. - Run Behavioral Simulation.
Runs the full XSim flow (compile → elaborate → simulate) without opening Vivado GUI.
Simulation\simulate.bat [app] # app defaults to firmwareLoads Software/build/<app>/<app>.hex. Logs are written to xvlog.log, elaborate.log, and simulate.log.
The UART testbench (RTL/Testbench/uart_testbench.sv) verifies the UART controller in isolation, covering command parsing, ROM write, input FIFO, MMIO frame output, overflow behavior, and frame checksums.
Simulation\uart_simulate.batRTL/: Top-level FPGA module (riscv_cpu_fpga.sv)Core/: SystemVerilog source code for the RISC-V CPU CoreUART/: Source code for the UART controller and PHYTestbench/: Simulation testbench files
Simulation/: Waveform configuration files for Vivado SimulatorSoftware/: Firmware, benchmarks, and host programming toolruntime/: Common bare-metal runtime (shared by all apps)apps/: Application source code (firmware, coremark, dhrystone, riscv-tests)build/: Compiled output per appprogrammer/: Host-side UART programming tool (Windows)
Constraints/: FPGA constraint files (.xdc)
See LICENSE file for details.
