docs: add project README and language reference

tetsuo-cpp · tetsuo-cpp · commit 83c30d00aa8d · 2026-02-24T01:55:49.000+11:00
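
Reviewer note: the Quick Start kernel added to README.md below can be sanity-checked on the host before touching a GPU. The following is a minimal sketch, not part of WarpForth: `reference_matmul` is a hypothetical helper that replays the kernel's per-thread index arithmetic in plain Python, assuming the row-major layouts and the `--param` inputs given in the README. The values it prints are what the kernel would be expected to write to `C` under those assumptions; how `warpforth-runner` formats its output is not assumed here.

```python
# Host-side reference for the README's Quick Start kernel (hypothetical helper,
# not part of WarpForth). Each outer loop iteration mirrors one GPU thread.
M, N, K = 2, 3, 4


def reference_matmul(a, b):
    """Return row-major C = A x B using the kernel's per-thread index math."""
    c = [0] * (M * N)
    for gid in range(M * N):          # GLOBAL-ID, one thread per output element
        row, col = gid // N, gid % N  # DUP 3 /   SWAP 3 MOD
        acc = 0
        for i in range(K):            # 4 0 DO ... LOOP
            acc += a[row * K + i] * b[i * N + col]
        c[row * N + col] = acc        # CELLS C + !
    return c


A = [1, 2, 3, 4, 5, 6, 7, 8]                 # 2x4, row-major
B = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # 4x3, row-major
C = reference_matmul(A, B)
print(C)  # [70, 80, 90, 158, 184, 210]
```

Comparing this list against the six values read back through `--output-param 2 --output-count 6` gives a quick check that the stack juggling (`2 PICK`, `3 PICK`) addresses the intended elements.
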
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -87,7 +87,7 @@ uv run ruff format gpu_test/
 
 - **Stack Type**: `!forth.stack` - untyped stack, programmer ensures type safety
 - **Operations**: All take stack as input and produce stack as output (except `forth.stack`)
-- **Supported Words**: literals (integer `42` and float `3.14`), `DUP DROP SWAP OVER ROT NIP TUCK PICK ROLL`, `+ - * / MOD`, `F+ F- F* F/` (float arithmetic), `FEXP FSQRT FLOG FABS FNEG` (float math intrinsics), `FMAX FMIN` (float min/max), `AND OR XOR NOT LSHIFT RSHIFT`, `= < > <> <= >= 0=`, `F= F< F> F<> F<= F>=` (float comparison), `S>F F>S` (int/float conversion), `@ !` (global memory), `F@ F!` (float global memory), `S@ S!` (shared memory), `SF@ SF!` (float shared memory), `I8@ I8! SI8@ SI8!` (i8 memory), `I16@ I16! SI16@ SI16!` (i16 memory), `I32@ I32! SI32@ SI32!` (i32 memory), `HF@ HF! SHF@ SHF!` (f16 memory), `BF@ BF! SBF@ SBF!` (bf16 memory), `F32@ F32! SF32@ SF32!` (f32 memory), `CELLS`, `IF ELSE THEN`, `BEGIN UNTIL`, `BEGIN WHILE REPEAT`, `DO LOOP +LOOP I J K`, `LEAVE UNLOOP EXIT`, `{ a b -- }` (local variables in word definitions), `TID-X/Y/Z BID-X/Y/Z BDIM-X/Y/Z GDIM-X/Y/Z GLOBAL-ID` (GPU indexing).
+- **Supported Words**: literals (integer `42` and float `3.14`), `DUP DROP SWAP OVER ROT NIP TUCK PICK ROLL`, `+ - * / MOD`, `F+ F- F* F/` (float arithmetic), `FEXP FSQRT FLOG FABS FNEG` (float math intrinsics), `FMAX FMIN` (float min/max), `AND OR XOR NOT LSHIFT RSHIFT`, `= < > <> <= >= 0=`, `F= F< F> F<> F<= F>=` (float comparison), `S>F F>S` (int/float conversion), `@ !` (global memory), `F@ F!` (float global memory), `S@ S!` (shared memory), `SF@ SF!` (float shared memory), `I8@ I8! SI8@ SI8!` (i8 memory), `I16@ I16! SI16@ SI16!` (i16 memory), `I32@ I32! SI32@ SI32!` (i32 memory), `HF@ HF! SHF@ SHF!` (f16 memory), `BF@ BF! SBF@ SBF!` (bf16 memory), `F32@ F32! SF32@ SF32!` (f32 memory), `CELLS`, `IF ELSE THEN`, `BEGIN UNTIL`, `BEGIN WHILE REPEAT`, `DO LOOP +LOOP I J K`, `LEAVE UNLOOP EXIT`, `{ a b -- }` (local variables in word definitions), `TID-X/Y/Z BID-X/Y/Z BDIM-X/Y/Z GDIM-X/Y/Z GLOBAL-ID` (GPU indexing), `BARRIER` (thread block synchronization).
 - **Float Literals**: Numbers containing `.` or `e`/`E` are parsed as f64 (e.g. `3.14`, `-2.0`, `1.0e-5`, `1e3`). Stored on the stack as i64 bit patterns; F-prefixed words perform bitcast before/after operations.
 - **Kernel Parameters**: Declared in the `\!` header. `\! kernel <name>` is required and must appear first. `\! param <name> i64[<N>]` becomes a `memref<Nxi64>` argument; `\! param <name> i64` becomes an `i64` argument. `\! param <name> f64[<N>]` becomes a `memref<Nxf64>` argument; `\! param <name> f64` becomes an `f64` argument (bitcast to i64 when pushed to stack). Using a param name in code emits `forth.param_ref` (arrays push address; scalars push value).
 - **Shared Memory**: `\! shared <name> i64[<N>]` or `\! shared <name> f64[<N>]` declares GPU shared (workgroup) memory. Emits a tagged `memref.alloca` at kernel entry; ForthToGPU converts it to a `gpu.func` workgroup attribution. Using the shared name in code pushes its base address onto the stack. Use `S@`/`S!` for i64 or `SF@`/`SF!` for f64 shared accesses. Cannot be referenced inside word definitions.
diff --git a/README.md b/README.md
@@ -0,0 +1,115 @@
+# WarpForth
+
+An MLIR-based Forth compiler for programming GPU kernels. WarpForth defines a custom MLIR dialect for Forth stack operations and lowers it through a pipeline of passes to PTX assembly.
+
+## Dependencies
+
+- LLVM/MLIR
+- CMake
+- C++17 compiler
+- CUDA toolkit (for GPU execution)
+- [uv](https://github.com/astral-sh/uv) (for Python test tooling)
+
+## Building
+
+```bash
+# Configure
+cmake -B build -G Ninja \
+  -DMLIR_DIR=/path/to/llvm/lib/cmake/mlir \
+  -DLLVM_DIR=/path/to/llvm/lib/cmake/llvm
+
+# Build
+cmake --build build
+```
+
+## Quick Start
+
+Save the following naive integer matrix-multiply kernel as `matmul.forth` (M=2, N=3, K=4; one thread per output element):
+
+```forth
+\! kernel main
+\! param A i64[8]
+\! param B i64[12]
+\! param C i64[6]
+
+\ One thread computes C[row, col] where gid = row*N + col.
+GLOBAL-ID
+DUP 3 /
+SWAP 3 MOD
+0
+4 0 DO
+  2 PICK
+  I SWAP 4 * +
+  CELLS A + @
+  I 3 * 3 PICK + CELLS B + @
+  * +
+LOOP
+2 PICK 3 * 2 PICK +
+CELLS C + !
+```
+
+Compile to PTX:
+
+```bash
+./build/bin/warpforthc matmul.forth -o matmul.ptx
+```
+
+Test on a GPU (A is 2x4 row-major, B is 4x3 row-major, C is 2x3 output):
+
+```bash
+./build/bin/warpforth-runner matmul.ptx \
+  --param 'i64[]:1,2,3,4,5,6,7,8' \
+  --param 'i64[]:1,2,3,4,5,6,7,8,9,10,11,12' \
+  --param 'i64[]:0,0,0,0,0,0' \
+  --grid 6,1,1 --block 1,1,1 \
+  --output-param 2 --output-count 6
+```
+
+## Toolchain
+
+| Tool | Description |
+|------|-------------|
+| `warpforthc` | Compiles Forth source to PTX |
+| `warpforth-translate` | Translates Forth source to MLIR, and MLIR to PTX assembly |
+| `warpforth-opt` | Runs individual MLIR passes or the entire pipeline |
+| `warpforth-runner` | Executes PTX kernels on a GPU for testing |
+
+These tools can be composed for debugging or inspecting intermediate stages:
+
+```bash
+./build/bin/warpforth-translate --forth-to-mlir kernel.forth | \
+  ./build/bin/warpforth-opt --warpforth-pipeline | \
+  ./build/bin/warpforth-translate --mlir-to-ptx
+```
+
+## Language Reference
+
+WarpForth supports stack operations, integer and float arithmetic, control flow, global and shared memory access, reduced-width memory types, user-defined words with local variables, and GPU-specific operations.
+
+See [docs/language.md](docs/language.md) for the full language reference.
+
+## Architecture
+
+WarpForth compiles Forth through a series of MLIR dialect lowerings, each replacing higher-level abstractions with lower-level ones until the program is expressed entirely in LLVM IR and can be handed to the NVPTX backend.
+
+| Stage | Tool / Pass | Description |
+|-------|-------------|-------------|
+| **Parsing** | `warpforth-translate --forth-to-mlir` | Parses Forth source into the `forth` dialect. The kernel is represented as a series of stack ops on an abstract `!forth.stack` type. |
+| **Stack lowering** | `warpforth-opt --convert-forth-to-memref` | The abstract `!forth.stack` type is materialized as a `memref<256xi64>` buffer and `index` pair. Stack ops become explicit loads, stores, and pointer arithmetic. |
+| **GPU wrapping** | `warpforth-opt --convert-forth-to-gpu` | Functions are wrapped in a `gpu.module`, the kernel entry point is marked as a `gpu.kernel`, and GPU intrinsic words are lowered to `gpu` ops. |
+| **NVVM/LLVM lowering** | Standard MLIR passes | GPU→NVVM, math→LLVM intrinsics and NVVM→LLVM. |
+| **Code generation** | `warpforth-translate --mlir-to-ptx` | The GPU module is serialized to PTX assembly via LLVM's NVPTX backend. |
+
+## Demo
+
+The `demo/` directory contains a GPT-2 text generation demo that routes scaled dot-product attention through a WarpForth-compiled kernel. See [demo/README.md](demo/README.md) for setup instructions.
+
+## Testing
+
+```bash
+# Run the LIT test suite
+cmake --build build --target check-warpforth
+
+# Run end-to-end GPU tests (requires Vast.ai API key)
+VASTAI_API_KEY=xxx uv run pytest -v -m gpu
+```
diff --git a/demo/README.md b/demo/README.md
@@ -18,7 +18,7 @@ A pre-compiled `attention.ptx` is included in this directory.
 ## Step 2: Upload to GPU Instance
 
 ```bash
-scp -r demo/ demo/gpt2_generate.py root@HOST:/workspace
+scp -r demo/ root@HOST:/workspace
 ```
 
 ## Step 3: Install Dependencies (Remote)
@@ -30,7 +30,7 @@ pip install pycuda transformers
 ## Step 4: Generate Text (Remote)
 
 ```bash
-python /workspace/gpt2_generate.py --ptx /workspace/attention.ptx --prompt "The meaning of life is"
+python /workspace/demo/gpt2_generate.py --ptx /workspace/demo/attention.ptx --prompt "The meaning of life is"
 ```
 
 | Flag | Default | Description |
diff --git a/docs/language.md b/docs/language.md