Clarify whether mlx_closure_compile(shapeless=true) should fuse C API closure bodies

## Question

Could you clarify the expected optimization contract for `mlx_closure_compile(..., shapeless=true)` in the C API?

Specifically: should C API users expect it to perform graph optimization / kernel fusion comparable to Python `mx.compile(..., shapeless=True)` for a closure body, or should it be treated primarily as a compiled callable/cache boundary where fusion behavior must be validated per op family?

## Why I am asking

In AX Engine we tested a few opt-in prefill experiments with MLX 0.31.x / mlx-c 0.6.x. The closure bodies were correct and engaged, but they did not produce measurable throughput improvement:

| Probe scope | Result |
| --- | ---: |
| one `quantized_matmul` inside `mlx_closure_compile(shapeless=true)` | about -0.6% prefill, within noise |
| dense FFN body: `quantized_matmul + activation + multiply + quantized_matmul` | about -0.5% prefill, within noise |
| Q+K `mlx_fast_rope` pair | about -0.5% to +0.5% prefill, within noise |

This surprised us because the Python MLX compile documentation says `compile()` can build and optimize the compute graph and fuse certain operations. We may have incorrectly assumed the C API closure compile path has the same practical fusion behavior for these closure bodies.
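
For reference, here is a minimal sketch of the kind of "within noise" check meant in the table above. The throughput samples, the helper names, and the two-standard-deviation threshold are illustrative assumptions, not our actual measurements or an MLX-provided method.

```python
import statistics


def relative_change_pct(baseline, candidate):
    """Percent change of the candidate mean relative to the baseline mean."""
    base_mean = statistics.mean(baseline)
    return 100.0 * (statistics.mean(candidate) - base_mean) / base_mean


def noise_band_pct(baseline):
    """Baseline run-to-run spread: two sample std devs as a percent of the mean."""
    return 100.0 * 2.0 * statistics.stdev(baseline) / statistics.mean(baseline)


def within_noise(baseline, candidate):
    """True when the observed change is no larger than the baseline spread."""
    return abs(relative_change_pct(baseline, candidate)) <= noise_band_pct(baseline)


# Hypothetical prefill throughput samples in tokens/s (not real measurements).
baseline_tps = [1000.0, 1012.0, 991.0, 1005.0, 998.0]
compiled_tps = [995.0, 1003.0, 989.0, 1001.0, 993.0]

print(f"change: {relative_change_pct(baseline_tps, compiled_tps):+.2f}%")
print(f"noise band: +/-{noise_band_pct(baseline_tps):.2f}%")
print(f"within noise: {within_noise(baseline_tps, compiled_tps)}")
```

With samples like these, a change of roughly half a percent sits well inside the baseline spread, which is the situation all three probes ended up in.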

## Related note

I found #104, but that issue appears to be about a `shapeless=True` correctness/stale-result case involving reductions. This issue is only asking about expected optimization/fusion semantics and what C API users should document or rely on.

## What clarification would help

A short answer to any of these would be enough:

1. Is `mlx_closure_compile(shapeless=true)` intended to expose the same graph optimization/fusion behavior as Python `mx.compile(shapeless=True)`?
2. Are there known op families where C API closure compilation should not be expected to fuse, such as `quantized_matmul` or `mlx_fast_rope`?
3. Is there a supported way to inspect whether a compiled C API closure lowered to fewer Metal dispatches/kernels?
4. Should downstream projects avoid documenting C API closure compile as equivalent to Python `mx.compile` unless they have per-closure benchmark evidence?
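
To make "per-closure benchmark evidence" in point 4 concrete, here is a rough sketch of an A/B timing harness. The function names and the stand-in workloads are hypothetical. In a real MLX measurement each callable would have to force evaluation of its outputs, since MLX evaluates lazily, and would call the eager and compiled closure bodies instead of the placeholders.

```python
import statistics
import time


def median_seconds(fn, repeats=5, warmup=1):
    """Median wall-clock time of fn() over `repeats` runs, after warmup runs."""
    for _ in range(warmup):
        fn()  # warmup absorbs one-time costs such as tracing or compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline_fn, candidate_fn, repeats=5):
    """Time both callables and report the speedup of candidate over baseline."""
    base = median_seconds(baseline_fn, repeats)
    cand = median_seconds(candidate_fn, repeats)
    return {"baseline_s": base, "candidate_s": cand, "speedup": base / cand}


# Placeholder workloads standing in for the eager and compiled closure bodies.
def eager_body():
    return sum(i * i for i in range(20000))


def compiled_body():
    return sum(i * i for i in range(20000))


result = compare(eager_body, compiled_body)
print(result)
```

A speedup near 1.0 across repeated runs would correspond to the "within noise" rows in the table above.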

Thanks!


Clarify whether mlx_closure_compile(shapeless=true) should fuse C API closure bodies #116
