Fix CPU GatherMM one-row NaNs#3563
Conversation
Avoid the BLAS path for one-row CPU GatherMM calls because it can intermittently produce NaNs after quantized gather warmup on no-Metal aarch64-darwin builds. Add a focused regression covering that warmup sequence. Co-authored-by: Codex <noreply@openai.com>
|
I can't reproduce the error. Could you provide a reproduction. I ran the test in the diff of course and it passes without the PR. |
|
I reproduced the failure locally with the full nixpkgs package build/test path rather than by running the added regression test in isolation. I tested nixpkgs Base This PR at nix build --impure -L --keep-failed \
--argstr rev "$REV" \
--out-link /private/tmp/mlx-result \
-f /private/tmp/mlx-local-src-test.nixwith this local expression: { rev }:
let
pkgs = import <nixpkgs> {};
src = builtins.fetchGit {
url = "<local-path>";
inherit rev;
allRefs = true;
};
in
pkgs.python313Packages.mlx.overridePythonAttrs (old: {
inherit src;
})I think added regression test in isolation is not enough - running only that test can pass on unpatched |
|
I understand that but this doesn't mean that the error is in gather mm unfortunately. It could be an out of bounds write in random number generation or something equivalent. I think unless you can reproduce with static inputs we shouldn't merge this PR. Sorry for that, it's just probably not where the error is. |
|
The fundamental issue appears to be an Accelerate In the failing
MLX trace shows that With temporary local instrumentation in That matches the standalone Accelerate repro. Pure So the fix could be to avoid this Accelerate if that seems like the correct route to take, happy to work on that here or in new pr |
Fix CPU GatherMM one-row NaNs
Fixes #3200.
This fixes an intermittent CPU-only
GatherMMfailure that can produce NaNs after quantized gather warmup on no-Metal aarch64-darwin builds.I reproduced the intermittent NaN locally with a CPU-only build and the quantized test loop now passes 30 consecutive runs.
The CPU
GatherMMpath now handlesM == 1gathered matmuls with a small local one-row helper instead of routing that shape through BLAS. The regularcblas_sgemmpath remains unchanged for largerM.Changes:
gather_mm_one_row(...)for one-row CPU gathered matmuls.M == 1inGatherMM::eval_cpu.Testing:
AI Assistance Disclosure: Codex assisted with debugging, implementation, and PR draft preparation. The changes were reviewed and tested locally by me before submission.
@booxter