WarpForth is an MLIR-based compiler for the Forth programming language targeting GPU kernels. It implements a custom MLIR dialect for Forth stack operations and converts them to executable PTX.
# Build from root directory
cmake --build build
# Format code
cmake --build build --target format
# Run tests (requires: uv sync)
cmake --build build --target check-warpforth
Requires MLIR/LLVM with MLIR_DIR and LLVM_DIR configured in CMake.
Dialect definition:
- include/warpforth/Dialect/Forth/ForthOps.td - Add new operations here
- include/warpforth/Dialect/Forth/ForthDialect.td - Type definitions
Implementation:
- lib/Conversion/ForthToMemRef/ForthToMemRef.cpp - Stack to MemRef conversion patterns
- lib/Conversion/ForthToGPU/ForthToGPU.cpp - GPU conversion logic
- lib/Translation/ForthToMLIR/ForthToMLIR.cpp - Forth parser and translator
Tools:
- tools/warpforth-translate/warpforth-translate.cpp - Translation tool entry point
- tools/warpforth-opt/warpforth-opt.cpp - Optimization tool entry point
- tools/warpforth-runner/warpforth-runner.cpp - PTX execution tool for GPU kernels
# Forth to MLIR
./build/bin/warpforth-translate --forth-to-mlir test/example.forth
# Run conversion passes
./build/bin/warpforth-opt --convert-forth-to-memref input.mlir
./build/bin/warpforth-opt --convert-forth-to-gpu input.mlir
# Full pipeline to PTX
./build/bin/warpforth-translate --forth-to-mlir test/example.forth | \
./build/bin/warpforth-opt --warpforth-pipeline | \
./build/bin/warpforth-translate --mlir-to-ptx > kernel.ptx
# Execute PTX on GPU
./warpforth-runner kernel.ptx --param i64[]:1,2,3 --param i64:42 --output-param 0 --output-count 3
Define in include/warpforth/Dialect/Forth/ForthOps.td:
def Forth_NewOp : Forth_Op<"opname", [Pure]> {
  let summary = "Brief description";
  let description = [{ Stack effect: ( input -- output ) }];
  let arguments = (ins Forth_StackType:$input_stack);
  let results = (outs Forth_StackType:$output_stack);
  let assemblyFormat = [{
    $input_stack attr-dict `:` type($input_stack) `->` type($output_stack)
  }];
}
Then add the corresponding conversion pattern in lib/Conversion/ForthToMemRef/ForthToMemRef.cpp.
End-to-end GPU execution tests live in gpu_test/. They compile Forth kernels locally, rent a GPU on Vast.ai, and verify output.
# Run GPU tests
VASTAI_API_KEY=xxx uv run pytest -v -m gpu
# Lint and format Python code
uv run ruff check gpu_test/
uv run ruff format gpu_test/
- Stack Type: !forth.stack - untyped stack, programmer ensures type safety
- Operations: All take stack as input and produce stack as output (except forth.stack)
- Supported Words: literals (integer 42 and float 3.14), DUP DROP SWAP OVER ROT NIP TUCK PICK ROLL, + - * / MOD, F+ F- F* F/ (float arithmetic), FEXP FSQRT FLOG FABS FNEG (float math intrinsics), FMAX FMIN (float min/max), AND OR XOR NOT LSHIFT RSHIFT, = < > <> <= >= 0=, F= F< F> F<> F<= F>= (float comparison), S>F F>S (int/float conversion), @ ! (global memory), F@ F! (float global memory), S@ S! (shared memory), SF@ SF! (float shared memory), I8@ I8! SI8@ SI8! (i8 memory), I16@ I16! SI16@ SI16! (i16 memory), I32@ I32! SI32@ SI32! (i32 memory), HF@ HF! SHF@ SHF! (f16 memory), BF@ BF! SBF@ SBF! (bf16 memory), F32@ F32! SF32@ SF32! (f32 memory), CELLS, IF ELSE THEN, BEGIN UNTIL, BEGIN WHILE REPEAT, DO LOOP +LOOP I J K, LEAVE UNLOOP EXIT, { a b -- } (local variables in word definitions), TID-X/Y/Z BID-X/Y/Z BDIM-X/Y/Z GDIM-X/Y/Z GLOBAL-ID (GPU indexing), BARRIER (thread block synchronization).
- Float Literals: Numbers containing . or e/E are parsed as f64 (e.g. 3.14, -2.0, 1.0e-5, 1e3). Stored on the stack as i64 bit patterns; F-prefixed words perform bitcast before/after operations.
- Kernel Parameters: Declared in the \! header. \! kernel <name> is required and must appear first. \! param <name> i64[<N>] becomes a memref<Nxi64> argument; \! param <name> i64 becomes an i64 argument. \! param <name> f64[<N>] becomes a memref<Nxf64> argument; \! param <name> f64 becomes an f64 argument (bitcast to i64 when pushed to stack). Using a param name in code emits forth.param_ref (arrays push address; scalars push value).
- Shared Memory: \! shared <name> i64[<N>] or \! shared <name> f64[<N>] declares GPU shared (workgroup) memory. Emits a tagged memref.alloca at kernel entry; ForthToGPU converts it to a gpu.func workgroup attribution. Using the shared name in code pushes its base address onto the stack. Use S@/S! for i64 or SF@/SF! for f64 shared accesses. Cannot be referenced inside word definitions.
- Reduced-Width Memory: I8@ I16@ I32@ load a narrow integer, sign-extend to i64. I8! I16! I32! truncate i64 to narrow integer, store. HF@ BF@ F32@ load a narrow float, extend to f64, bitcast to i64. HF! BF! F32! bitcast i64 to f64, truncate to narrow float, store. S-prefixed variants (SI8@, SHF!, etc.) use shared memory (address space 3).
- Conversion: !forth.stack → memref<256xi64> with explicit stack pointer
- GPU: Functions wrapped in gpu.module, main gets gpu.kernel attribute, configured with bare pointers for NVVM conversion
- Local Variables: { a b c -- } at the start of a word definition binds read-only locals. Pops values from the stack in reverse name order (c, b, a) using forth.pop, stores SSA values. Referencing a local emits forth.push_value. SSA values from the entry block dominate all control flow, so locals work across IF/ELSE/THEN, loops, etc. On GPU, locals map directly to registers.
- User-defined Words: Modeled as func.func with signature (!forth.stack) -> !forth.stack, called via func.call
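The Float Literals and Conversion notes above describe a 256-cell i64 stack with an explicit stack pointer, where floats travel as bit patterns. The following Python model is only an illustration of that idea, not the compiler's code: the names CellStack, f64_to_cell, cell_to_f64 and f_add are hypothetical, the bounds checks are added for the sketch, and the stack-pointer convention (sp counts occupied cells) is an assumption.

```python
import struct

STACK_DEPTH = 256  # mirrors the memref<256xi64> shape mentioned above


class CellStack:
    """Fixed-size stack of signed 64-bit cells with an explicit stack pointer."""

    def __init__(self) -> None:
        self.cells = [0] * STACK_DEPTH
        self.sp = 0  # assumption: sp is the number of occupied cells

    def push(self, cell: int) -> None:
        # Bounds checks exist only in this model; the lowering may omit them.
        if self.sp >= STACK_DEPTH:
            raise OverflowError("stack overflow")
        self.cells[self.sp] = cell
        self.sp += 1

    def pop(self) -> int:
        if self.sp == 0:
            raise IndexError("stack underflow")
        self.sp -= 1
        return self.cells[self.sp]


def f64_to_cell(value: float) -> int:
    """Bitcast an f64 to a signed 64-bit integer (same 8 bytes, new type)."""
    (cell,) = struct.unpack("<q", struct.pack("<d", value))
    return cell


def cell_to_f64(cell: int) -> float:
    """Bitcast a signed 64-bit integer back to an f64."""
    (value,) = struct.unpack("<d", struct.pack("<q", cell))
    return value


def f_add(stack: CellStack) -> None:
    """Model of F+ ( a b -- a+b ): bitcast in, add as floats, bitcast out."""
    b = cell_to_f64(stack.pop())
    a = cell_to_f64(stack.pop())
    stack.push(f64_to_cell(a + b))
```

Because the cells are untyped, nothing stops an integer word from being applied to a float bit pattern; that is the type-safety burden the Stack Type note places on the programmer.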
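The Reduced-Width Memory note above says narrow integer stores truncate and narrow integer loads sign-extend. A small Python sketch of that arithmetic follows; the dict-based memory and the function names are hypothetical stand-ins chosen for illustration, and only the integer words (I8, I16, I32) are modeled.

```python
def truncate_to_width(value: int, bits: int) -> int:
    """Keep only the low `bits` bits, as a narrow store would."""
    return value & ((1 << bits) - 1)


def sign_extend(raw: int, bits: int) -> int:
    """Widen an unsigned `bits`-wide pattern to a signed Python int."""
    sign_bit = 1 << (bits - 1)
    return (raw ^ sign_bit) - sign_bit


def store_narrow(memory: dict, addr: int, value: int, bits: int) -> None:
    """Model of I8! / I16! / I32!: truncate the i64 cell, then store."""
    memory[addr] = truncate_to_width(value, bits)


def load_narrow(memory: dict, addr: int, bits: int) -> int:
    """Model of I8@ / I16@ / I32@: load the narrow value, sign-extend to i64."""
    return sign_extend(memory[addr], bits)
```

One consequence worth keeping in mind: a value that does not fit the narrow width does not round-trip. Storing 200 with an 8-bit store and loading it back yields -56 in this model.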
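The Kernel Parameters and Shared Memory notes above define how header lines map to MLIR types. The Python sketch below restates those rules in executable form. It is not the parser in ForthToMLIR.cpp: parse_header and its return shape are hypothetical, and the error handling (including rejecting scalar shared declarations) is an assumption made for the sketch.

```python
import re

# One header line: \! <kind> <name> [<elem>[<N>]]
HEADER_LINE = re.compile(
    r"^\\!\s+(kernel|param|shared)\s+(\S+)(?:\s+(i64|f64)(?:\[(\d+)\])?)?\s*$"
)


def parse_header(source: str):
    """Return (kernel_name, params, shared), where params and shared are
    lists of (name, mlir_type) pairs in declaration order."""
    kernel = None
    params, shared = [], []
    for line in source.splitlines():
        line = line.strip()
        if not line.startswith("\\!"):
            continue  # ordinary Forth code, not part of the header
        match = HEADER_LINE.match(line)
        if match is None:
            raise ValueError(f"malformed header line: {line!r}")
        kind, name, elem, count = match.groups()
        if kind == "kernel":
            if kernel is not None or elem is not None:
                raise ValueError(f"bad kernel declaration: {line!r}")
            kernel = name
            continue
        if kernel is None:
            raise ValueError("kernel declaration must appear first")
        if elem is None:
            raise ValueError(f"missing type in: {line!r}")
        if kind == "shared" and count is None:
            raise ValueError("shared memory must be an array")  # assumption
        # Arrays become memref<NxT>; scalars keep their element type.
        mlir_type = f"memref<{count}x{elem}>" if count else elem
        (params if kind == "param" else shared).append((name, mlir_type))
    if kernel is None:
        raise ValueError("missing kernel declaration")
    return kernel, params, shared
```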
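The Local Variables note above states that { a b c -- } pops in reverse name order, so the last name receives the top of the stack. A minimal Python illustration of that binding order, with bind_locals as a hypothetical helper and a plain list standing in for the stack:

```python
def bind_locals(stack: list, names: list) -> dict:
    """Pop one value per name, last name first, as { a b c -- } does."""
    bound = {}
    for name in reversed(names):
        if not stack:
            raise IndexError("stack underflow while binding locals")
        bound[name] = stack.pop()  # the top of the stack goes to the last name
    return bound
```

With the stack holding 10 20 30 (30 on top), binding a, b, c gives a=10, b=20, c=30, which matches the left-to-right reading of the stack-effect comment.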
- Follow LLVM/MLIR naming (CamelCase)
- Document stack effects as ( input -- output )
- C++17 required
- Use clang-format (config in .clang-format)
- Use context7 MCP for MLIR API documentation: query with /websites/mlir_llvm for MLIR dialects, operations, types, and conversion patterns