Question
Could you clarify the expected optimization contract for mlx_closure_compile(..., shapeless=true) in the C API?
Specifically: should C API users expect it to perform graph optimization / kernel fusion comparable to Python mx.compile(..., shapeless=True) for a closure body, or should it be treated primarily as a compiled callable/cache boundary where fusion behavior must be validated per op family?
Why I am asking
In AX Engine we tested a few opt-in prefill experiments with MLX 0.31.x / mlx-c 0.6.x. The closure bodies were correct and engaged, but they did not produce measurable throughput improvement:
| Probe scope | Result |
| --- | --- |
| one quantized_matmul inside mlx_closure_compile(shapeless=true) | about -0.6% prefill, within noise |
| dense FFN body: quantized_matmul + activation + multiply + quantized_matmul | about -0.5% prefill, within noise |
| Q+K mlx_fast_rope pair | about -0.5% to +0.5% prefill, within noise |
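For context on what "within noise" means here, the following is a minimal sketch of one way to make that call, not the exact AX Engine harness. The function names, the `k=2.0` threshold, and the tokens/s samples are all hypothetical and chosen for illustration only.

```python
import statistics


def relative_change(baseline, candidate):
    """Percent change in mean throughput of candidate relative to baseline."""
    base_mean = statistics.mean(baseline)
    return 100.0 * (statistics.mean(candidate) - base_mean) / base_mean


def noise_band(baseline):
    """Run-to-run spread of the baseline: sample stdev as a percent of its mean."""
    return 100.0 * statistics.stdev(baseline) / statistics.mean(baseline)


def within_noise(baseline, candidate, k=2.0):
    """True if the observed change is no larger than k baseline stdevs (k is an assumption)."""
    return abs(relative_change(baseline, candidate)) <= k * noise_band(baseline)


# Hypothetical prefill throughput samples in tokens/s, NOT real measurements.
baseline = [1000.0, 1010.0, 990.0, 1005.0, 995.0]
compiled = [995.0, 1002.0, 988.0, 1000.0, 990.0]

print(f"change: {relative_change(baseline, compiled):+.2f}%")
print(f"baseline noise: {noise_band(baseline):.2f}%")
print(f"within noise: {within_noise(baseline, compiled)}")
```

A change of this size is indistinguishable from run-to-run variation under this rule, which is the sense in which the rows above are reported as "within noise".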
This surprised us because the Python MLX compile documentation says compile() can build and optimize the compute graph and fuse certain operations. We may have incorrectly assumed the C API closure compile path has the same practical fusion behavior for these closure bodies.
Related note
I found #104, but that issue appears to be about a shapeless=True correctness/stale-result case involving reductions. This issue is only asking about expected optimization/fusion semantics and what C API users should document or rely on.
What clarification would help
A short answer to any of these would be enough:
- Is mlx_closure_compile(shapeless=true) intended to expose the same graph optimization/fusion behavior as Python mx.compile(shapeless=True)?
- Are there known op families where C API closure compilation should not be expected to fuse, such as quantized_matmul or mlx_fast_rope?
- Is there a supported way to inspect whether a compiled C API closure lowered to fewer Metal dispatches/kernels?
- Should downstream projects avoid documenting C API closure compile as equivalent to Python mx.compile unless they have per-closure benchmark evidence?
Thanks!