| 1 | +# WarpForth |
| 2 | + |
| 3 | +An MLIR-based Forth compiler for programming GPU kernels. WarpForth defines a custom MLIR dialect for Forth stack operations and lowers through a pipeline of passes to PTX assembly. |
| 4 | + |
| 5 | +## Dependencies |
| 6 | + |
| 7 | +- LLVM/MLIR |
| 8 | +- CMake |
| 9 | +- C++17 compiler |
| 10 | +- CUDA toolkit (for GPU execution) |
| 11 | +- [uv](https://github.com/astral-sh/uv) (for Python test tooling) |
| 12 | + |
| 13 | +## Building |
| 14 | + |
| 15 | +```bash |
| 16 | +# Configure |
| 17 | +cmake -B build -G Ninja \ |
| 18 | + -DMLIR_DIR=/path/to/llvm/lib/cmake/mlir \ |
| 19 | + -DLLVM_DIR=/path/to/llvm/lib/cmake/llvm |
| 20 | + |
| 21 | +# Build |
| 22 | +cmake --build build |
| 23 | +``` |
| 24 | + |
| 25 | +## Quick Start |
| 26 | + |
| 27 | +Save the following naive integer matrix-multiply kernel as `matmul.forth` (M=2, N=3, K=4, one thread per output element): |
| 28 | + |
| 29 | +```forth |
| 30 | +\! kernel main |
| 31 | +\! param A i64[8] |
| 32 | +\! param B i64[12] |
| 33 | +\! param C i64[6] |
| 34 | + |
| 35 | +\ One thread computes C[row, col] where gid = row*N + col. |
| 36 | +GLOBAL-ID |
| 37 | +DUP 3 / |
| 38 | +SWAP 3 MOD |
| 39 | +0 |
| 40 | +4 0 DO |
| 41 | + 2 PICK |
| 42 | + I SWAP 4 * + |
| 43 | + CELLS A + @ |
| 44 | + I 3 * 3 PICK + CELLS B + @ |
| 45 | + * + |
| 46 | +LOOP |
| 47 | +2 PICK 3 * 2 PICK + |
| 48 | +CELLS C + ! |
| 49 | +``` |
| 50 | + |
| 51 | +Compile to PTX: |
| 52 | + |
| 53 | +```bash |
| 54 | +./build/bin/warpforthc matmul.forth -o matmul.ptx |
| 55 | +``` |
| 56 | + |
| 57 | +Test on a GPU (A is 2x4 row-major, B is 4x3 row-major, C is 2x3 output): |
| 58 | + |
| 59 | +```bash |
| 60 | +./build/bin/warpforth-runner matmul.ptx \ |
| 61 | + --param 'i64[]:1,2,3,4,5,6,7,8' \ |
| 62 | + --param 'i64[]:1,2,3,4,5,6,7,8,9,10,11,12' \ |
| 63 | + --param 'i64[]:0,0,0,0,0,0' \ |
| 64 | + --grid 6,1,1 --block 1,1,1 \ |
| 65 | + --output-param 2 --output-count 6 |
| 66 | +``` |
| 67 | + |
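To check the runner's output, it helps to have a host-side reference. The sketch below mirrors the kernel's indexing in plain Python (`gid = row*N + col`, row-major `A`, `B`, and `C`) and loops over the six thread IDs in place of the GPU grid. The function names `kernel_thread` and `run_grid` are illustrative and not part of WarpForth.

```python
# Host-side reference for the Quick Start kernel (illustrative, not part of WarpForth).
M, N, K = 2, 3, 4  # A is MxK, B is KxN, C is MxN, all row-major


def kernel_thread(gid, A, B, C):
    """Compute one output element, as a single GPU thread would."""
    row, col = gid // N, gid % N       # GLOBAL-ID DUP 3 / SWAP 3 MOD
    acc = 0
    for i in range(K):                 # 4 0 DO ... LOOP
        acc += A[row * K + i] * B[i * N + col]
    C[row * N + col] = acc             # CELLS C + !


def run_grid(A, B):
    """Emulate a grid of M*N single-thread blocks by looping over gid."""
    C = [0] * (M * N)
    for gid in range(M * N):
        kernel_thread(gid, A, B, C)
    return C


if __name__ == "__main__":
    A = list(range(1, 9))    # 1..8, the first --param above
    B = list(range(1, 13))   # 1..12, the second --param above
    print(run_grid(A, B))    # [70, 80, 90, 158, 184, 210]
```

If the kernel and runner behave as described, the six values read back from parameter 2 should match this list.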
| 68 | +## Toolchain |
| 69 | + |
| 70 | +| Tool | Description | |
| 71 | +|------|-------------| |
| 72 | +| `warpforthc` | Compiles Forth source to PTX | |
| 73 | +| `warpforth-translate` | Translates Forth source to MLIR, and MLIR to PTX assembly | |
| 74 | +| `warpforth-opt` | Runs individual MLIR passes or the entire pipeline | |
| 75 | +| `warpforth-runner` | Executes PTX kernels on a GPU for testing | |
| 76 | + |
| 77 | +These tools can be composed for debugging or inspecting intermediate stages: |
| 78 | + |
| 79 | +```bash |
| 80 | +./build/bin/warpforth-translate --forth-to-mlir kernel.forth | \ |
| 81 | + ./build/bin/warpforth-opt --warpforth-pipeline | \ |
| 82 | + ./build/bin/warpforth-translate --mlir-to-ptx |
| 83 | +``` |
| 84 | + |
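When inspecting intermediate stages, it can be convenient to capture each stage's output instead of piping straight through. The following is a minimal Python sketch of that idea. It uses only the tool names and flags from the pipeline above, and it assumes the tools read from stdin when no input file is given, as the shell pipeline implies. The helper names `stage_commands` and `run_stages` are hypothetical.

```python
# Hypothetical helper for capturing each pipeline stage; not shipped with WarpForth.
import subprocess
from pathlib import Path


def stage_commands(source, build_dir="build"):
    """Return the argv list for each stage of the pipeline shown above."""
    bin_dir = Path(build_dir) / "bin"
    return [
        [str(bin_dir / "warpforth-translate"), "--forth-to-mlir", str(source)],
        [str(bin_dir / "warpforth-opt"), "--warpforth-pipeline"],
        [str(bin_dir / "warpforth-translate"), "--mlir-to-ptx"],
    ]


def run_stages(source, build_dir="build"):
    """Run the stages in order, feeding each stage's stdout to the next.

    Returns a list of three strings: the forth-dialect MLIR, the lowered
    MLIR, and the PTX. Raises CalledProcessError if any stage fails.
    """
    outputs = []
    data = None  # the first stage reads the source file, not stdin
    for cmd in stage_commands(source, build_dir):
        result = subprocess.run(cmd, input=data, capture_output=True, check=True)
        data = result.stdout
        outputs.append(data.decode())
    return outputs
```

Calling `run_stages("kernel.forth")` requires a completed build; each returned string can then be written to a file or diffed between compiler revisions.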
| 85 | +## Language Reference |
| 86 | + |
| 87 | +WarpForth supports stack operations, integer and float arithmetic, control flow, global and shared memory access, reduced-width memory types, user-defined words with local variables, and GPU-specific operations. |
| 88 | + |
| 89 | +See [docs/language.md](docs/language.md) for the full language reference. |
| 90 | + |
| 91 | +## Architecture |
| 92 | + |
| 93 | +WarpForth compiles Forth through a series of MLIR dialect lowerings, each replacing higher-level abstractions with lower-level ones until the program is expressed entirely in LLVM IR and can be handed to the NVPTX backend. |
| 94 | + |
| 95 | +| Stage | Pass | Description | |
| 96 | +|-------|-------------|-------------| |
| 97 | +| **Parsing** | `warpforth-translate --forth-to-mlir` | Parses Forth source into the `forth` dialect. The kernel is represented as a series of stack ops on an abstract `!forth.stack` type. | |
| 98 | +| **Stack lowering** | `warpforth-opt --convert-forth-to-memref` | The abstract `!forth.stack` type is materialized as a `memref<256xi64>` buffer and `index` pair. Stack ops become explicit loads, stores, and pointer arithmetic. | |
| 99 | +| **GPU wrapping** | `warpforth-opt --convert-forth-to-gpu` | Functions are wrapped in a `gpu.module`, the kernel entry point is marked as a `gpu.kernel`, and GPU intrinsic words are lowered to `gpu` ops. | |
| 100 | +| **NVVM/LLVM lowering** | Standard MLIR passes | GPU→NVVM, math→LLVM intrinsics and NVVM→LLVM. | |
| 101 | +| **Code generation** | `warpforth-translate --mlir-to-ptx` | The GPU module is serialized to PTX assembly via LLVM's NVPTX backend. | |
| 102 | + |
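The stack-lowering stage is easier to picture with a small model. The Python sketch below treats the stack as a fixed 256-slot buffer plus an index, and implements a few stack words as explicit loads and stores on that buffer. It is a conceptual model only: the convention that the index points at the top element, the starting value of `-1`, and the `Stack` class itself are assumptions, and the code the compiler actually emits may differ. It also skips 64-bit wraparound and bounds checks.

```python
# Conceptual model of a buffer-plus-index stack; not the compiler's actual output.
STACK_SIZE = 256  # matches the memref<256xi64> buffer in the table above


class Stack:
    def __init__(self):
        self.buf = [0] * STACK_SIZE
        self.sp = -1  # assumed convention: index of the top element, -1 when empty

    def push(self, value):
        self.sp += 1
        self.buf[self.sp] = value        # store

    def pop(self):
        value = self.buf[self.sp]        # load
        self.sp -= 1
        return value

    def dup(self):
        self.push(self.buf[self.sp])     # load top, store one slot above

    def swap(self):
        top, below = self.buf[self.sp], self.buf[self.sp - 1]
        self.buf[self.sp], self.buf[self.sp - 1] = below, top

    def pick(self, n):
        self.push(self.buf[self.sp - n])  # 0 PICK behaves like DUP

    def binop(self, fn):
        b = self.pop()
        a = self.pop()
        self.push(fn(a, b))


# The Quick Start prologue, GLOBAL-ID DUP 3 / SWAP 3 MOD, with gid = 4:
s = Stack()
s.push(4)                          # GLOBAL-ID
s.dup()
s.push(3)
s.binop(lambda a, b: a // b)       # /
s.swap()
s.push(3)
s.binop(lambda a, b: a % b)        # MOD
print(s.buf[: s.sp + 1])           # [1, 1]  (row, col)
```

In this model every stack word reduces to index arithmetic and buffer accesses, which is the property the lowering described above relies on.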
| 103 | +## Demo |
| 104 | + |
| 105 | +The `demo/` directory contains a GPT-2 text generation demo that routes scaled dot-product attention through a WarpForth-compiled kernel. See [demo/README.md](demo/README.md) for setup instructions. |
| 106 | + |
| 107 | +## Testing |
| 108 | + |
| 109 | +```bash |
| 110 | +# Run the LIT test suite |
| 111 | +cmake --build build --target check-warpforth |
| 112 | + |
| 113 | +# Run end-to-end GPU tests (requires Vast.ai API key) |
| 114 | +VASTAI_API_KEY=xxx uv run pytest -v -m gpu |
| 115 | +``` |