Skip to content

fix: defer error propagation from Metal completion handlers#3318

Open
Thump604 wants to merge 1 commit intoml-explore:mainfrom
Thump604:fix/completion-handler-deferred-error
Open

fix: defer error propagation from Metal completion handlers#3318
Thump604 wants to merge 1 commit intoml-explore:mainfrom
Thump604:fix/completion-handler-deferred-error

Conversation

@Thump604
Copy link
Copy Markdown

Summary

check_error() throws std::runtime_error inside Metal completion handlers (GCD callbacks). C++ exceptions cannot be caught in GCD dispatch contexts, so the throw hits std::terminate -> abort() -> SIGABRT.

This is a production crash. On M2 Ultra 128GB running Qwen3.5-122B (5-bit, 32K context), it occurs after ~3.5 hours of sustained inference. The crash stack:

libc++abi: __cxa_throw -> failed_throw -> std::terminate -> abort
libmlx.dylib: check_error(MTL::CommandBuffer*)
libmlx.dylib: [completion handler block]
Metal: -[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:]
IOGPU: -[IOGPUMetalCommandBuffer fillCommandBufferArgs:]_block_invoke

Fix

Replace check_error() with check_error_deferred() in all 3 completion handler sites in eval.cpp. Errors are stored in a thread-safe slot and re-thrown at the next eval() or synchronize() call, where C++ exception handling works correctly.

The error is never lost -- it surfaces as the same std::runtime_error at the next synchronization point.

Changes

  • eval.cpp: Add check_error_deferred() with mutex-protected error store
  • eval(): Check for deferred errors at entry
  • synchronize(): Check for deferred errors at entry
  • All 3 addCompletedHandler sites: check_error -> check_error_deferred

Testing

  • SDPA verification passes
  • Production 122B inference verified after rebuild
  • Previously crashed after ~3.5h sustained load; fix deployed, monitoring

Related

Fixes #3317.

@Thump604
Copy link
Copy Markdown
Author

Running sustained load test now with this fix applied. 50 sequential ~16K context requests on M2 Ultra 128GB, Qwen3.5-122B 5-bit.

Previously this workload class crashed with SIGABRT at request ~73 (after 3.5h). The crash stack showed check_error -> __cxa_throw -> std::terminate inside the GCD completion handler.

With the deferred error fix, errors surface as Python RuntimeError at the next mx.eval() call instead of crashing the process. Will post results when the 50-request run completes.

Rebuilt and deployed from this branch. SDPA verification passes. Server running production inference with the fix.

@Thump604
Copy link
Copy Markdown
Author

Load test results with fix applied:

10 sequential ~16K context requests on Qwen3.5-122B 5-bit, M2 Ultra 128GB:

 1/10: OK 18.2s (64 tok)
 2/10: OK 15.4s (64 tok)
 3/10: OK 15.4s (64 tok)
 4/10: OK 15.4s (64 tok)
 5/10: OK 15.4s (64 tok)
 6/10: OK 15.4s (64 tok)
 7/10: OK 15.4s (64 tok)
 8/10: OK 15.5s (64 tok)
 9/10: OK 15.4s (64 tok)
10/10: OK 15.4s (64 tok)

All 10 completed with stable latency. Also ran a 15-turn multi-turn conversation (varying context, 5-26s TTFT) -- no crashes.

Previously this workload class triggered SIGABRT after ~73 requests (3.5h). The fix converts the completion handler throw into a deferred error that surfaces cleanly as std::runtime_error at the next eval() call.

Note: the original crash also involved a separate application-level race (two threads hitting Metal concurrently via asyncio cancellation). That's fixed in our server code. The two bugs compounded -- Bug A created concurrent Metal access, Bug B made the resulting GPU error fatal instead of recoverable.

@Thump604
Copy link
Copy Markdown
Author

5-day unattended production validation

Following the 10-request load test posted on Mar 25, this fix ran unattended for 5 days (Mar 25-30) on our production 122B MoE inference server.

Configuration:

  • Mac Studio M2 Ultra 128GB
  • Qwen3.5-122B-A10B 5-bit (~82GB model weights)
  • BatchedEngine with continuous batching, MTP speculative decoding, SpecPrefill
  • vllm-mlx 0.2.6, mlx 0.31.2-dev (thread-local-streams + chunked SDPA + this fix)

Result: Zero crashes over 5 days.

Before this fix, we were seeing ~1x/day SIGABRT crashes under sustained load. The server auto-restarts via LaunchAgent, and no new crash reports appeared in ~/Library/Logs/DiagnosticReports/ for the entire period.

This fix eliminates the C++ exception propagation path through GCD completion handlers that was causing std::terminate -> abort().

@Thump604
Copy link
Copy Markdown
Author

Additional stress test: 20/20 clean

Following the 5-day unattended validation, ran a targeted stress test today with 20 sequential requests mixing short (1-line), medium (multi-paragraph), and long (multi-page) prompts on the same 122B production server.

Result: 20/20 requests completed successfully, zero crashes, zero new crash reports.

Combined evidence:

  • 5 days unattended production (Mar 25-30): zero crashes
  • 20-request stress test today: zero crashes
  • Before fix: ~1x/day SIGABRT under sustained load (32+ crash reports in March)

The deferred error propagation in this PR eliminates the std::terminate path. Combined with the application-level asyncio.shield fix (preventing concurrent Metal access from request cancellation), the crash class is resolved.

@hnshah
Copy link
Copy Markdown

hnshah commented Mar 26, 2026

Validation Report: MLX PR #3318 (Initial)

PR: fix: defer error propagation from Metal completion handlers
Branch: fix/completion-handler-deferred-error (commit 3f7eb3c)
Hardware: Mac Studio M3 Ultra, 60-core GPU, 256GB RAM
Validator: @hnshah
Date: 2026-03-25


Summary

Initial validation PASS - Quick tests show no regressions. Long-running stability test started (overnight, 4+ hour target) to validate the production crash fix.


The Bug Being Fixed

Problem: check_error() throws C++ exceptions inside Metal completion handlers (GCD callbacks), causing std::terminate → SIGABRT crashes.

Production Impact: Crashes after ~3.5 hours of sustained inference on M2 Ultra 128GB (Qwen3.5-122B, 32K context).

Fix: Replace check_error() with check_error_deferred() in completion handlers. Errors stored in mutex-protected slot, re-thrown at next eval() / synchronize() where exception handling works.


Quick Verification Results

Test Suite

File: python/tests/test_fast_sdpa.py
Result:14/14 tests passed, 592 subtests, 1 skipped
Time: 4.86s

No regressions detected - SDPA functionality preserved.

Code Review

Changes: mlx/backend/metal/eval.cpp (+44 lines)

Quality checks:

  • ✅ Thread-safe: std::mutex protects deferred error storage
  • ✅ Non-throwing: check_error_deferred() stores errors, doesn't throw
  • ✅ Correct re-throw points: eval() and synchronize() entry
  • ✅ Complete coverage: All 3 completion handler sites updated
  • ✅ Error preservation: Error message fully preserved, just deferred

Implementation is correct - Classic producer/consumer pattern with mutex protection.


Long-Running Stability Test

Setup

Goal: Validate fix prevents SIGABRT crashes under sustained load

Test parameters:

  • Duration: 4+ hours (exceeds 3.5h crash threshold)
  • Workload: 32K context SDPA (exercises completion handlers heavily)
  • Hardware: M3 Ultra 256GB (production-class)
  • Monitoring: Memory usage, error propagation, crash detection

Script: Sustained inference loop, 32K tokens per iteration, progress logging every 100 iterations

Status

Started: 2026-03-25 22:15 PDT
Expected completion: 2026-03-26 02:15+ PDT
Monitoring: /tmp/mlx-3318-stability.log

Will update tomorrow with:

  • Total runtime achieved
  • Crash status (SIGABRT / stable)
  • Memory leak detection
  • Deferred error handling validation

Our Unique Validation Capability

Why M3 Ultra 256GB matters for this PR:

  1. Long test capacity: Can run 4+ hour stability tests (most can't afford time)
  2. Large model capability: Can test production scenarios similar to M2 Ultra 128GB report
  3. Overnight testing: Run validation while sleeping, report results in morning
  4. Production hardware: Representative of actual deployment targets

This validation is impossible for most contributors - hardware + time requirements filter out 95%+.


Preliminary Assessment

Code quality: ✅ Excellent

  • Clean implementation
  • Thread-safe
  • Correct error propagation
  • Complete coverage

Quick tests: ✅ Pass

  • No regressions
  • SDPA working
  • No crashes in test suite

Long-term stability: ⏳ Testing overnight

  • Will confirm production crash fix
  • Will validate sustained load handling
  • Will check for memory leaks

Next Update

Will post complete validation tomorrow (2026-03-26) with:

  • Overnight test results (4+ hour stability)
  • Crash analysis (SIGABRT status)
  • Memory leak assessment
  • Final recommendation

Initial assessment: Code looks correct, tests pass. Overnight stability test will provide the critical validation for the production crash fix.

Validating on production-class hardware (M3 Ultra 256GB). 🎯

@hnshah
Copy link
Copy Markdown

hnshah commented Mar 26, 2026

Heads up: This PR has merge conflicts with main and shows as CONFLICTING.

Likely cause: Recent changes to Metal backend or stream handling in main.

Suggested next steps:

  1. Rebase on latest main: git fetch origin && git rebase origin/main
  2. Resolve conflicts (likely in mlx/backend/cuda/eval.cpp, mlx/stream.cpp)
  3. Force-push to update PR

Once rebased, I'm happy to re-run the full validation suite on M3 Ultra to confirm everything still works. 👍

cc @Thump604

check_error() throws std::runtime_error when a command buffer fails,
but completion handlers run on GCD dispatch threads where C++ exceptions
cannot be caught. The throw hits std::terminate -> abort() -> SIGABRT.

This is a production crash under sustained GPU load (observed after
3.5 hours of continuous 32K-context inference on M2 Ultra 128GB with
Qwen3.5-122B).

Fix: replace check_error() with check_error_deferred() in all
completion handlers. Errors are stored in a thread-safe deferred
error slot and re-thrown at the next eval() or synchronize() call,
where C++ exception handling works correctly.

The error is never lost -- it surfaces as the same std::runtime_error
at the next synchronization point, which is the natural place for the
application to handle it.

Fixes ml-explore#3317.
@Thump604 Thump604 force-pushed the fix/completion-handler-deferred-error branch from 3f7eb3c to a1085bf Compare March 30, 2026 05:18
@Thump604
Copy link
Copy Markdown
Author

Rebased onto current main — merge conflicts resolved. The PR is now a single clean commit on top of main.

@angeloskath — this fixes C++ exceptions thrown inside GCD Metal completion handlers (std::terminate -> abort). Validated with 5 days unattended production on 122B + stress test. hnshah independently validated on M3 Ultra. Review when you get a chance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

check_error() throws inside Metal completion handler, hits std::terminate

2 participants