fix: defer error propagation from Metal completion handlers#3318
fix: defer error propagation from Metal completion handlers#3318Thump604 wants to merge 1 commit intoml-explore:mainfrom
Conversation
|
Running sustained load test now with this fix applied. 50 sequential ~16K context requests on M2 Ultra 128GB, Qwen3.5-122B 5-bit. Previously this workload class crashed with SIGABRT at request ~73 (after 3.5h). The crash stack showed With the deferred error fix, errors surface as Python RuntimeError at the next Rebuilt and deployed from this branch. SDPA verification passes. Server running production inference with the fix. |
|
Load test results with fix applied: 10 sequential ~16K context requests on Qwen3.5-122B 5-bit, M2 Ultra 128GB: All 10 completed with stable latency. Also ran a 15-turn multi-turn conversation (varying context, 5-26s TTFT) -- no crashes. Previously this workload class triggered SIGABRT after ~73 requests (3.5h). The fix converts the completion handler throw into a deferred error that surfaces cleanly as Note: the original crash also involved a separate application-level race (two threads hitting Metal concurrently via asyncio cancellation). That's fixed in our server code. The two bugs compounded -- Bug A created concurrent Metal access, Bug B made the resulting GPU error fatal instead of recoverable. |
5-day unattended production validationFollowing the 10-request load test posted on Mar 25, this fix ran unattended for 5 days (Mar 25-30) on our production 122B MoE inference server. Configuration:
Result: Zero crashes over 5 days. Before this fix, we were seeing ~1x/day SIGABRT crashes under sustained load. The server auto-restarts via LaunchAgent, and no new crash reports appeared in This fix eliminates the C++ exception propagation path through GCD completion handlers that was causing |
Additional stress test: 20/20 cleanFollowing the 5-day unattended validation, ran a targeted stress test today with 20 sequential requests mixing short (1-line), medium (multi-paragraph), and long (multi-page) prompts on the same 122B production server. Result: 20/20 requests completed successfully, zero crashes, zero new crash reports. Combined evidence:
The deferred error propagation in this PR eliminates the |
Validation Report: MLX PR #3318 (Initial)PR: fix: defer error propagation from Metal completion handlers Summary✅ Initial validation PASS - Quick tests show no regressions. Long-running stability test started (overnight, 4+ hour target) to validate the production crash fix. The Bug Being FixedProblem: Production Impact: Crashes after ~3.5 hours of sustained inference on M2 Ultra 128GB (Qwen3.5-122B, 32K context). Fix: Replace Quick Verification ResultsTest SuiteFile: No regressions detected - SDPA functionality preserved. Code ReviewChanges: Quality checks:
Implementation is correct - Classic producer/consumer pattern with mutex protection. Long-Running Stability TestSetupGoal: Validate fix prevents SIGABRT crashes under sustained load Test parameters:
Script: Sustained inference loop, 32K tokens per iteration, progress logging every 100 iterations StatusStarted: 2026-03-25 22:15 PDT Will update tomorrow with:
Our Unique Validation CapabilityWhy M3 Ultra 256GB matters for this PR:
This validation is impossible for most contributors - hardware + time requirements filter out 95%+. Preliminary AssessmentCode quality: ✅ Excellent
Quick tests: ✅ Pass
Long-term stability: ⏳ Testing overnight
Next UpdateWill post complete validation tomorrow (2026-03-26) with:
Initial assessment: Code looks correct, tests pass. Overnight stability test will provide the critical validation for the production crash fix. Validating on production-class hardware (M3 Ultra 256GB). 🎯 |
|
Heads up: This PR has merge conflicts with main and shows as Likely cause: Recent changes to Metal backend or stream handling in main. Suggested next steps:
Once rebased, I'm happy to re-run the full validation suite on M3 Ultra to confirm everything still works. 👍 cc @Thump604 |
check_error() throws std::runtime_error when a command buffer fails, but completion handlers run on GCD dispatch threads where C++ exceptions cannot be caught. The throw hits std::terminate -> abort() -> SIGABRT. This is a production crash under sustained GPU load (observed after 3.5 hours of continuous 32K-context inference on M2 Ultra 128GB with Qwen3.5-122B). Fix: replace check_error() with check_error_deferred() in all completion handlers. Errors are stored in a thread-safe deferred error slot and re-thrown at the next eval() or synchronize() call, where C++ exception handling works correctly. The error is never lost -- it surfaces as the same std::runtime_error at the next synchronization point, which is the natural place for the application to handle it. Fixes ml-explore#3317.
3f7eb3c to
a1085bf
Compare
|
Rebased onto current main — merge conflicts resolved. The PR is now a single clean commit on top of main. @angeloskath — this fixes C++ exceptions thrown inside GCD Metal completion handlers (std::terminate -> abort). Validated with 5 days unattended production on 122B + stress test. hnshah independently validated on M3 Ultra. Review when you get a chance. |
Summary
check_error()throwsstd::runtime_errorinside Metal completion handlers (GCD callbacks). C++ exceptions cannot be caught in GCD dispatch contexts, so the throw hitsstd::terminate->abort()-> SIGABRT.This is a production crash. On M2 Ultra 128GB running Qwen3.5-122B (5-bit, 32K context), it occurs after ~3.5 hours of sustained inference. The crash stack:
Fix
Replace
check_error()withcheck_error_deferred()in all 3 completion handler sites ineval.cpp. Errors are stored in a thread-safe slot and re-thrown at the nexteval()orsynchronize()call, where C++ exception handling works correctly.The error is never lost -- it surfaces as the same
std::runtime_errorat the next synchronization point.Changes
eval.cpp: Addcheck_error_deferred()with mutex-protected error storeeval(): Check for deferred errors at entrysynchronize(): Check for deferred errors at entryaddCompletedHandlersites:check_error->check_error_deferredTesting
Related
Fixes #3317.