Running log of the types of changes needed to get the original repo code building and running against the current Mojo toolchain. This should generalise to other directories/projects.
- Channels
  - Use Modular nightly + conda-forge:
    channels = ["https://conda.modular.com/max-nightly", "conda-forge"]
- Platform
  - Target Apple Silicon:
    platforms = ["osx-arm64"]
- Mojo version pin
  - Constrain to the nightly toolchain in use:
    mojo = ">=0.26.2.0.dev2026012705,<0.27"
- Python version
  - Align to:
    python = "3.11.*"
- Legacy modular dependency
  - The original `7-point-stencil/Mojo/pixi.toml` in the upstream repo pins a lower-level toolchain via `modular = ">=25.5.0.dev2025070105,<26"`.
  - I keep that nested Pixi file for historical context, but the Apple Silicon work here is driven from the top-level Pixi workspace and its `mojo` dependency.
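Taken together, the top-level Pixi configuration looks roughly like this (a sketch; the table names follow the current Pixi schema and may be `[project]` rather than `[workspace]` on older Pixi versions):

```toml
[workspace]
channels = ["https://conda.modular.com/max-nightly", "conda-forge"]
platforms = ["osx-arm64"]

[dependencies]
mojo = ">=0.26.2.0.dev2026012705,<0.27"
python = "3.11.*"
```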
Expose each workload as a Pixi task:
[tasks]
test-gpu = "mojo src/test_gpu.mojo"
stencil-bench = "mojo 7-point-stencil/Mojo/laplacian.mojo"
babel-bench = "mojo babelStream/Mojo/babelStream.mojo"
bude-bench = "mojo miniBUDE/Mojo/miniBUDE.mojo"
hf-bench = "mojo hartree-fock/Mojo/hartree-fock.mojo"

Pattern: task name → mojo <relative-path-to-main.mojo>.
Centralised text replacements for API changes:
- `from sys import sizeof` → `from sys.info import size_of`
- `sizeof[` → `size_of[`
- `from gpu.index import block_dim, block_idx, thread_idx` → `from gpu import thread_idx, block_idx, block_dim`

Applied to:
- 7-point-stencil/Mojo
- babelStream/Mojo
- miniBUDE/Mojo
- hartree-fock/Mojo
Pattern: maintain a repo-level “bulk replacements” file to upgrade older Mojo code to newer APIs.
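One way to drive such bulk replacements is a small shell helper (a sketch, not the repo's actual mechanism; the directory names are taken from the list above, and the portable `sed -i.bak` form works on both GNU and BSD sed):

```shell
#!/bin/sh
# Apply the mechanical API renames to every .mojo file in the
# benchmark directories, editing in place.
apply_renames() {
  for dir in 7-point-stencil/Mojo babelStream/Mojo miniBUDE/Mojo hartree-fock/Mojo; do
    [ -d "$dir" ] || continue
    find "$dir" -name '*.mojo' | while read -r f; do
      sed -i.bak \
        -e 's/from sys import sizeof/from sys.info import size_of/g' \
        -e 's/sizeof\[/size_of[/g' \
        -e 's/from gpu\.index import block_dim, block_idx, thread_idx/from gpu import thread_idx, block_idx, block_dim/g' \
        "$f" && rm -f "$f.bak"
    done
  done
}

apply_renames
```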
Status: pixi run stencil-bench (GPU) and pixi run stencil-cpu-bench (CPU) both compile and run successfully on Mojo 0.26.x after the changes below.
Issue:
- `total_elapsed: Float64 = 0.0`, but `elapsed = end - start` is inferred as an integer-like type (`UInt` from `monotonic()`), so `total_elapsed += elapsed` fails with a type mismatch.
Fix pattern:
- Explicitly cast timing deltas when mixing with floats:
elapsed = end - start
bw_gbs: Float64 = Float64(datasize) / Float64(elapsed)
total_elapsed += Float64(elapsed)

General rule:
- Treat `monotonic()` deltas as non-float by default.
- Whenever they are used with `Float32`/`Float64` accumulators or divisions, wrap them with an explicit `Float32(...)`/`Float64(...)`.
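The rule as a minimal, self-contained sketch (assuming `monotonic()` from the `time` module returns integer nanoseconds, as on current Mojo; `datasize` and the loop body are illustrative):

```mojo
from time import monotonic

fn main():
    # Illustrative number; in the benchmark, datasize comes from the
    # theoretical fetch/write accounting.
    var datasize: Int = 1024 * 1024
    var total_elapsed: Float64 = 0.0
    for _ in range(10):
        var start = monotonic()
        # ... timed work goes here ...
        var end = monotonic()
        var elapsed = end - start          # integer-like (UInt nanoseconds)
        total_elapsed += Float64(elapsed)  # explicit cast before accumulating
    # Once the deltas are cast, the bandwidth maths is all-Float64.
    var bw_gbs: Float64 = Float64(datasize) / total_elapsed
    print(bw_gbs)
```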
Context:
- The original stencil Laplacian targeted double precision; on the Apple Silicon + Metal stack, compiling the `Float64` kernel currently triggers an internal GPU compiler error when creating the compute pipeline state.
Changes:
- Standardise the GPU path in `7-point-stencil/Mojo/laplacian.mojo` to:
comptime precision = Float32
comptime dtype = DType.float32
- Ensure the initialisation kernel `test_function_kernel` uses a full 3D grid/block configuration consistent with the Laplacian kernel (instead of hard-coding a 1D launch).

Result:
- `pixi run stencil-bench -- --csv` runs reliably on the M1 Pro GPU and emits CSV with the header:
backend,GPU,precision,L,blk_x,blk_y,blk_z,BW_GBs
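As an illustration of the 3D-launch change, a sketch of the launch call (block sizes and the names `ctx`, `u_buf`, `nx`/`ny`/`nz` are illustrative; assumes the current `enqueue_function` `grid_dim`/`block_dim` keyword style):

```mojo
from math import ceildiv

comptime BLK_X = 8
comptime BLK_Y = 8
comptime BLK_Z = 4

# Derive a full 3D grid from the problem size, and use the same shape
# for the init kernel as for the Laplacian kernel.
var grid = (ceildiv(nx, BLK_X), ceildiv(ny, BLK_Y), ceildiv(nz, BLK_Z))
ctx.enqueue_function[test_function_kernel](
    u_buf, nx, ny, nz,
    grid_dim=grid,
    block_dim=(BLK_X, BLK_Y, BLK_Z),
)
```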
Context:
- Add a CPU-only baseline that mirrors the Laplacian’s maths and datasize accounting, so GPU and CPU bandwidth numbers can be compared directly.
New file / task:
- `benchmarks/stencil_cpu.mojo`
- Pixi task:
stencil-cpu-bench = "mojo benchmarks/stencil_cpu.mojo"
Key implementation details:
- Modern Mojo syntax and semantics:
  - Replace `alias precision = Float32` with `comptime precision = Float32`.
  - Replace legacy `let` declarations with `var` and fix indentation so statements start at the beginning of a line.
  - Use `def main()` so argument parsing via `__int__()` can raise as needed.
- Data model:
  - Allocate `List[precision]` buffers with an explicit `capacity` and fill them via `append`.
  - Compute grid spacings and Laplacian coefficients in Float32, matching the GPU kernel.
  - Use the same theoretical datasize formula as the GPU code for fair bandwidth comparison: `theoretical_fetch_size + theoretical_write_size` based on `(nx, ny, nz)` and `size_of[precision]()`.
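The allocation pattern as a small sketch (the helper name is illustrative):

```mojo
comptime precision = Float32

fn make_filled(n: Int, value: precision) -> List[precision]:
    # Reserve capacity up front, then append so the length matches n
    # (capacity alone does not set the length).
    var buf = List[precision](capacity=n)
    for _ in range(n):
        buf.append(value)
    return buf
```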
- Timing and output:
  - Measure `monotonic()` deltas, accumulate them as `Float64`, and explicitly cast where needed.
  - Default configuration mirrors the GPU run: `L = 512`, `num_iter = 1000`, `precision = Float32`.
  - When `--csv` is passed, print rows with the same header schema as the GPU benchmark: `backend,GPU,precision,L,blk_x,blk_y,blk_z,BW_GBs`. For CPU runs this is currently `backend = "cpu"`, `GPU = "CPU"`, `precision = "float32"`, `blk_x = blk_y = blk_z = 1`.
Practical note:
- On an M1 Pro, the full `L = 512`, `num_iter = 1000` CPU configuration takes on the order of minutes (~200 seconds). That's acceptable for a one-off baseline, but for quick iteration it's better to reduce `L` and/or `--iter`.
Symptoms:
- Errors like "expression must be mutable in assignment" when writing through:
  - `UnsafePointer[Scalar[dtype]]` (`a[i] = initA`, `b[i] = ...`, etc.),
  - shared buffers from `stack_allocation[...]()`,
  - pointer-based buffers such as `sums[block_idx.x] = tb_sum[local_tid]`.
Change pattern:
- Ensure that:
- Pointer parameters are declared mutable in their type (or replaced by mutable, higher-level device containers).
- Shared-memory allocations and their element types are considered mutable.
- In newer Mojo, mutability is explicit and enforced at the type level for pointer-like access.
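One possible shape of the fix, assuming the current `UnsafePointer` parameterisation carries mutability as a `mut` parameter (the exact `mut`/origin spelling should be checked against the stdlib docs; the kernel and names here are illustrative):

```mojo
from gpu import thread_idx, block_idx, block_dim

comptime dtype = DType.float32

# Writes through `dst` require the pointer *type* to be mutable;
# with implicit-mutability pointer types this was previously unchecked.
fn init_kernel(
    dst: UnsafePointer[Scalar[dtype], mut=True],
    value: Scalar[dtype],
    n: Int,
):
    var i = thread_idx.x + block_idx.x * block_dim.x
    if i < n:
        dst[i] = value
```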
Symptoms:
- "no matching method in call to `enqueue_function`"
- Notes about the candidate parameter `Ts` being `DevicePassable`, while our kernel types use `UnsafePointer[...]` parameters.
Change pattern (one of):
- Refactor kernels to use device-passable arguments:
  - Replace raw pointers (`UnsafePointer[...]`) in kernel signatures with GPU containers expected by the current API (e.g. `Buffer[...]`, `LayoutTensor[...]`, or similar).
  - Launch with `ctx.enqueue_function[...]` using those containers directly.
- Or use _unchecked APIs for low-level kernels:
  - If the runtime offers `enqueue_function_unchecked[...]` that accepts pointer-based kernels, switch calls accordingly when we intentionally bypass type checks.
General rule:
- Old code that:
  - takes `UnsafePointer` parameters, and
  - is launched via `enqueue_function`
- may need to either:
  - adopt device-safe argument types, or
  - move to an "unchecked" launch API that's intended for such low-level kernels.
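A sketch of the device-passable direction, loosely following the `LayoutTensor` launch pattern from Modular's GPU examples (the layout, sizes, and names are illustrative, and exact constructor signatures may differ on the current nightly):

```mojo
from gpu import thread_idx, block_idx, block_dim
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

comptime dtype = DType.float32
comptime SIZE = 1024
comptime layout = Layout.row_major(SIZE)

# The kernel takes a LayoutTensor instead of an UnsafePointer,
# so its argument type satisfies DevicePassable.
fn scale_kernel(t: LayoutTensor[mut=True, dtype, layout], factor: Scalar[dtype]):
    var i = thread_idx.x + block_idx.x * block_dim.x
    if i < SIZE:
        t[i] = t[i] * factor

def main():
    var ctx = DeviceContext()
    var buf = ctx.enqueue_create_buffer[dtype](SIZE)
    var t = LayoutTensor[mut=True, dtype, layout](buf.unsafe_ptr())
    # The checked launch accepts the tensor directly.
    ctx.enqueue_function[scale_kernel](
        t, Scalar[dtype](2.0),
        grid_dim=(SIZE // 256), block_dim=(256),
    )
    ctx.synchronize()
```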
Symptom:
- `List[Int64](v1, v2, v3, v4, v5)` fails with:
  - "expected at most 0 positional arguments, got 5"
  - plus notes about missing keyword-only arguments (`capacity`, `length`, `fill`, `__list_literal__`, etc.).
Change pattern:
- Use a 0-arg constructor plus `append` (or whatever new literal form the stdlib documents), e.g.:
var kernel_data = List[Int64]()
kernel_data.append(2 * SIZE * size_of[Scalar[dtype]]())
kernel_data.append(2 * SIZE * size_of[Scalar[dtype]]())
kernel_data.append(3 * SIZE * size_of[Scalar[dtype]]())
kernel_data.append(3 * SIZE * size_of[Scalar[dtype]]())
kernel_data.append(2 * SIZE * size_of[Scalar[dtype]]())

General rule:
- Do not rely on multi-positional-argument `List[T](...)` constructors.
- Prefer explicit construction and population (`append`, literal helpers, or factory functions) that match the current API.
Pattern:
- Timing data stored in a NumPy array:
kernel_timings = np.zeros(Python.tuple(5, num_iter), dtype="float32")
- Writes:
kernel_timings[i][k] = Float32(end - start)
Rule:
- Same as in the Laplacian code: always cast timing deltas from `monotonic()` into the intended float type before storing into numeric arrays/accumulators.
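The same discipline, shown in plain Python/NumPy terms (Python casts implicitly on assignment, whereas Mojo will not, so the explicit `Float32(...)` cast on the Mojo side is load-bearing; the timed work here is a stand-in):

```python
import time

import numpy as np

num_iter = 3
kernel_timings = np.zeros((5, num_iter), dtype="float32")

for k in range(num_iter):
    start = time.monotonic_ns()  # integer nanoseconds, like Mojo's monotonic()
    _ = sum(range(100_000))      # stand-in for the timed kernel
    end = time.monotonic_ns()
    # Cast the integer delta into the array's float type before storing,
    # mirroring kernel_timings[i][k] = Float32(end - start) in the Mojo code.
    kernel_timings[0][k] = np.float32(end - start)

print(kernel_timings.dtype)  # float32
```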
We have not yet updated babelStream to compile on Mojo 0.26.x, because there are two competing approaches:
- Low-level pointer-preserving approach
  - Keep the existing `UnsafePointer[...]`-based kernels.
  - Update them to the new pointer mutability / address-space rules and find the correct way to write through pointers on device.
  - Use the appropriate `DeviceContext` launch API for pointer-based kernels (e.g. an `enqueue_function_unchecked`-style API if recommended for this use case).
- Higher-level, device-passable refactor
  - Rewrite kernels to operate on `DevicePassable` containers (e.g. `LayoutTensor` or other buffer abstractions) instead of raw pointers.
  - Call `ctx.enqueue_function[...]` with those containers, mirroring working patterns in other benchmarks.
At this point, we have only documented the issues and options; no code changes have been applied yet to avoid committing to the wrong pattern without confirming the intended 0.26.x GPU idioms.
- API surface drift:
  - `sys.sizeof` → `sys.info.size_of`
  - GPU index imports reorganised (`gpu.index` → flat `gpu` imports).
  - `DeviceContext.enqueue_function` now expects `DevicePassable` types rather than arbitrary pointer-based kernels.
- Stricter type system / mutability:
  - Mutability of pointers and shared-memory buffers must be explicit and consistent with writes.
  - Numeric operations across integer-like time deltas and floats require explicit casts.
- Standard library changes:
  - `List` constructor semantics changed; it no longer supports "Python-style" multi-element positional construction.
- Tooling standardisation:
  - Pixi workspace config centralises:
    - channels,
    - platforms,
    - Mojo version pin,
    - Python version,
    - standard task names/paths.
  - A single "bulk replacements" config is used to apply systematic mechanical edits across all benchmark directories.