Clarify whether mlx_closure_compile(shapeless=true) should fuse C API closure bodies #116

@automatosx

Question

Could you clarify the expected optimization contract for mlx_closure_compile(..., shapeless=true) in the C API?

Specifically: should C API users expect it to perform graph optimization / kernel fusion comparable to Python mx.compile(..., shapeless=True) for a closure body, or should it be treated primarily as a compiled callable/cache boundary where fusion behavior must be validated per op family?

Why I am asking

In AX Engine we ran a few opt-in prefill experiments with MLX 0.31.x / mlx-c 0.6.x. The closure bodies produced correct results and the compiled path was engaged, but none of them produced a measurable throughput improvement:

| Probe scope | Result |
| --- | --- |
| one quantized_matmul inside mlx_closure_compile(shapeless=true) | about -0.6% prefill, within noise |
| dense FFN body: quantized_matmul + activation + multiply + quantized_matmul | about -0.5% prefill, within noise |
| Q+K mlx_fast_rope pair | about -0.5% to +0.5% prefill, within noise |
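
For context on what "within noise" means in the results above, here is a minimal sketch of one way such a check could be made reproducible. It is not the harness used in AX Engine. The function names, the sample values, and the choice of "largest deviation from the baseline median" as the noise band are all assumptions made for illustration.

```python
import statistics


def relative_delta_pct(baseline, candidate):
    """Percent change of the candidate median relative to the baseline median."""
    base = statistics.median(baseline)
    cand = statistics.median(candidate)
    return (cand - base) / base * 100.0


def noise_band_pct(baseline):
    """Largest deviation of any baseline run from the baseline median, in percent.

    Assumption: this crude spread measure stands in for run-to-run noise.
    """
    base = statistics.median(baseline)
    return max(abs(x - base) for x in baseline) / base * 100.0


def within_noise(baseline, candidate):
    """True when the median shift is no larger than the baseline's own spread."""
    return abs(relative_delta_pct(baseline, candidate)) <= noise_band_pct(baseline)


# Hypothetical prefill throughput samples in tokens/s, NOT measured data.
baseline_tps = [1000.0, 1008.0, 992.0, 1003.0, 997.0]
compiled_tps = [995.0, 1001.0, 989.0, 998.0, 992.0]

delta = relative_delta_pct(baseline_tps, compiled_tps)
band = noise_band_pct(baseline_tps)
print(f"delta={delta:+.2f}%  noise band=+/-{band:.2f}%  "
      f"within noise={within_noise(baseline_tps, compiled_tps)}")
```

Medians are used rather than means so that a single slow run (for example a cold cache) does not dominate the comparison. A stricter check could replace the noise band with a proper statistical test.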

This surprised us because the Python MLX compile documentation says compile() can build and optimize the compute graph and fuse certain operations. We may have incorrectly assumed the C API closure compile path has the same practical fusion behavior for these closure bodies.

Related note

I found #104, but that issue appears to be about a shapeless=True correctness/stale-result case involving reductions. This issue is only asking about expected optimization/fusion semantics and what C API users should document or rely on.

What clarification would help

A short answer to any of these would be enough:

  1. Is mlx_closure_compile(shapeless=true) intended to expose the same graph optimization/fusion behavior as Python mx.compile(shapeless=True)?
  2. Are there known op families where C API closure compilation should not be expected to fuse, such as quantized_matmul or mlx_fast_rope?
  3. Is there a supported way to inspect whether a compiled C API closure lowered to fewer Metal dispatches/kernels?
  4. Should downstream projects avoid documenting C API closure compile as equivalent to Python mx.compile unless they have per-closure benchmark evidence?

Thanks!
