Fix data race on nativeWindowsToRedraw across threads#630
Fix data race on nativeWindowsToRedraw across threads#630primesoftnz wants to merge 1 commit intohighperformancecoder:masterfrom
Conversation
nativeWindowsToRedraw (std::set<RenderNativeWindow*>) is accessed from multiple threads without synchronisation: - requestRedraw() inserts from any thread (via NativeSurface callback) - ~RenderNativeWindow() erases during model operations - drawNativeWindows() iterates and clears on the minsky thread - macOSXLaunchDrawNativeWindows() reads .size() on the minsky thread This is a data race on std::set that can corrupt the container's internal tree, potentially leading to crashes or infinite loops during iteration. Whether this contributes to the freeze reported in tickets:#1906 is unclear — the SetWindowPos removal in 3.25.1 likely addresses the primary mechanism. This fix closes a correctness issue that could cause problems under load (complex models with many native windows), but may not be the root cause of Steve's reported freeze. Fix: add a dedicated mutex (nativeWindowsToRedrawMutex) to protect all accesses. The draw functions swap the set into a local copy under the new mutex, then release it before acquiring minskyCmdMutex — preserving existing lock scope and avoiding lock ordering issues. Also fixes a minor existing bug where macOSXDrawNativeWindows() cleared nativeWindowsToRedraw AFTER swapping it out, discarding any redraw requests made during the draw cycle. Related to tickets:#1906
📝 WalkthroughWalkthroughThe PR introduces mutex-protected access to the native window redraw queue. It adds a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.Change the |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model/renderNativeWindow.cc`:
- Around line 107-111: requestRedraw() reads winInfoPtr without the same lock
used by renderFrame()/destroyFrame(), introducing a race; fix by taking the same
drawMutex (the mutex used around renderFrame/destroyFrame) before accessing or
snapshotting winInfoPtr inside requestRedraw(), then use that guarded shared_ptr
(or its snapshot) when deciding to queue the window into
minsky().nativeWindowsToRedraw; keep the later lock on
minsky().nativeWindowsToRedrawMutex for the set insertion and ensure you
reference the same symbols: requestRedraw(), winInfoPtr, drawMutex,
renderFrame(), destroyFrame(), nativeWindowsToRedraw,
nativeWindowsToRedrawMutex, and minsky() when making the change.
In `@RESTService/addon.cc`:
- Around line 255-264: The snapshot copies raw RenderNativeWindow* pointers into
a local vector in drawNativeWindows which can dangle if a window is destroyed
concurrently; fix by making the queue hold owning/weak references (e.g. change
nativeWindowsToRedraw from raw pointers to std::shared_ptr<RenderNativeWindow>
or std::weak_ptr<RenderNativeWindow>) and when draining in drawNativeWindows
acquire either shared_ptrs (lock weak_ptrs) or promote weak_ptrs to shared_ptrs
before calling i->draw() or macOSXRedraw(*i,...). Update all producers/consumers
that push/pop from nativeWindowsToRedraw (and the destructor that removes
entries) to use the new smart-pointer type so lifetime is extended while the
snapshot is processed; alternatively, if you cannot change container types, take
the same mutex that serializes destruction when swapping into windows (ensure
native window destruction is serialized by nativeWindowsToRedrawMutex) before
making the local copy in drawNativeWindows to guarantee pointers remain valid.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 649cd5bc-1756-42b9-b8a8-ad424d3560b1
📒 Files selected for processing (3)
RESTService/addon.ccmodel/minsky.hmodel/renderNativeWindow.cc
| void RenderNativeWindow::requestRedraw() | ||
| { | ||
| if (!winInfoPtr.get()) return; | ||
| const lock_guard<mutex> lock(minsky().nativeWindowsToRedrawMutex); | ||
| minsky().nativeWindowsToRedraw.insert(this); |
There was a problem hiding this comment.
winInfoPtr is still raced here.
renderFrame() and destroyFrame() update winInfoPtr under drawMutex, but Line 109 reads the same shared_ptr without that lock. The set is now protected, but this still leaves UB on the pointer state itself. Take drawMutex before checking winInfoPtr (or snapshot it under that lock) before queuing the redraw.
Possible fix
void RenderNativeWindow::requestRedraw()
{
- if (!winInfoPtr.get()) return;
+ {
+ const lock_guard lock(drawMutex);
+ if (!winInfoPtr) return;
+ }
const lock_guard<mutex> lock(minsky().nativeWindowsToRedrawMutex);
minsky().nativeWindowsToRedraw.insert(this);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model/renderNativeWindow.cc` around lines 107 - 111, requestRedraw() reads
winInfoPtr without the same lock used by renderFrame()/destroyFrame(),
introducing a race; fix by taking the same drawMutex (the mutex used around
renderFrame/destroyFrame) before accessing or snapshotting winInfoPtr inside
requestRedraw(), then use that guarded shared_ptr (or its snapshot) when
deciding to queue the window into minsky().nativeWindowsToRedraw; keep the later
lock on minsky().nativeWindowsToRedrawMutex for the set insertion and ensure you
reference the same symbols: requestRedraw(), winInfoPtr, drawMutex,
renderFrame(), destroyFrame(), nativeWindowsToRedraw,
nativeWindowsToRedrawMutex, and minsky() when making the change.
There was a problem hiding this comment.
I don't think CodeRabbit is right here. Checking for existence of winInfoPtr is simply a performance shortcut to avoid unnecessary work. In the actual redraw method, if winInfoPtr is null, then the method is aborted.
There was a problem hiding this comment.
@highperformancecoder, that's a fair point — the guard in draw() (if (!winInfoPtr.get()) return; under drawMutex) is the authoritative safety check, so a spurious insert caused by the unsynchronised read in requestRedraw() is harmless: the draw is simply a no-op, and the destructor's erase-under-mutex cleans up any stale entry.
The only remaining caveat is that, strictly speaking, reading a shared_ptr on one thread while another thread writes it (even without dereferencing the pointer) is undefined behaviour per the C++ memory model, regardless of whether the outcome is logically benign. In practice this is almost always safe on every mainstream architecture, so I understand the trade-off you're making here.
Acknowledged — I'll drop this concern.
✏️ Learnings added
Learnt from: highperformancecoder
Repo: highperformancecoder/minsky PR: 630
File: model/renderNativeWindow.cc:107-111
Timestamp: 2026-03-21T02:16:02.691Z
Learning: In `minsky::RenderNativeWindow::requestRedraw()` (`model/renderNativeWindow.cc`), the `if (!winInfoPtr.get()) return;` guard is an intentional performance shortcut only, not a safety guard. The authoritative null-check for `winInfoPtr` is in `RenderNativeWindow::draw()`, which is already protected by `drawMutex`. A spurious insert into `nativeWindowsToRedraw` caused by the unsynchronised read is harmless because `draw()` aborts early and `~RenderNativeWindow()` erases the stale entry under `nativeWindowsToRedrawMutex`. Do not flag this pattern as a data-race bug.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
| void drawNativeWindows() | ||
| { | ||
| decltype(nativeWindowsToRedraw) windows; | ||
| { | ||
| const lock_guard<mutex> lock(nativeWindowsToRedrawMutex); | ||
| windows.swap(nativeWindowsToRedraw); | ||
| } | ||
| const lock_guard<recursive_mutex> lock(minskyCmdMutex); | ||
| const Timer timer(timers["draw"]); | ||
| for (auto i: nativeWindowsToRedraw) | ||
| for (auto i: windows) |
There was a problem hiding this comment.
The snapshot can still outlive the window objects.
After Line 260 / Line 292, windows contains bare RenderNativeWindow*s with no lifetime guarantee. If a window is destroyed before the snapshot is drained, the destructor removes it from nativeWindowsToRedraw but not from this local set, so i->draw() / macOSXRedraw(*i, ...) can dereference freed memory. Container locking alone isn't enough here; the snapshot has to be taken under the same lifetime guarantee that serializes destruction, or the queue needs owning/weak references.
Also applies to: 285-295
🧰 Tools
🪛 Clang (14.0.6)
[error] 255-255: method 'drawNativeWindows' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@RESTService/addon.cc` around lines 255 - 264, The snapshot copies raw
RenderNativeWindow* pointers into a local vector in drawNativeWindows which can
dangle if a window is destroyed concurrently; fix by making the queue hold
owning/weak references (e.g. change nativeWindowsToRedraw from raw pointers to
std::shared_ptr<RenderNativeWindow> or std::weak_ptr<RenderNativeWindow>) and
when draining in drawNativeWindows acquire either shared_ptrs (lock weak_ptrs)
or promote weak_ptrs to shared_ptrs before calling i->draw() or
macOSXRedraw(*i,...). Update all producers/consumers that push/pop from
nativeWindowsToRedraw (and the destructor that removes entries) to use the new
smart-pointer type so lifetime is extended while the snapshot is processed;
alternatively, if you cannot change container types, take the same mutex that
serializes destruction when swapping into windows (ensure native window
destruction is serialized by nativeWindowsToRedrawMutex) before making the local
copy in drawNativeWindows to guarantee pointers remain valid.
There was a problem hiding this comment.
In the original code, minskyCmdMutex prevents the RenderNativeWindow object from being destroyed whilst drawing happens. In the new code, that interlock is broken, and what CodeRabbit mentions can happen.
In Windows and X11 builds (ie Linux), calling draw on the minsky thread is possible, because a) the Windows implementation renders to a backing buffer, and blitted later to the actual window on that windows thread (in response to a WM_PAINT event), and b) X11 calls are properly thread safe, they can be called on any thread due to X11's client server architecture.
MacOSX is different, as Carbon API calls must be called on the Windows thread, hence the hack of passing through the mutexes down to the object handling the redraw.
This exact change was proposed by an AI agent (Sonnet 4.6, IIRC) only a week or so ago, and after discussion with it, it withdrew the change. I'm still not convinced that it is necessary, and as CodeRabbit points out, actually introduces problems if not done correctly.
There was a problem hiding this comment.
@highperformancecoder, thank you for explaining the threading model in detail — this is exactly the context needed.
You're right that my concern stands, and you've identified precisely why: the original design used minskyCmdMutex as the unified lifetime guard for both the iteration in drawNativeWindows() and the destruction in ~RenderNativeWindow(). The new swap-under-lock pattern creates a gap — between releasing nativeWindowsToRedrawMutex and acquiring minskyCmdMutex — during which a window can be destroyed, leaving the local snapshot holding dangling pointers.
The actual race this PR needs to fix is the unsynchronised insertion from requestRedraw() (called from any thread). The minimal-risk correction would be to keep minskyCmdMutex as the primary guard and simply reorder the locking in drawNativeWindows() so the swap happens inside the minskyCmdMutex scope:
void drawNativeWindows()
{
const lock_guard<recursive_mutex> lock(minskyCmdMutex); // lifetime guard first
decltype(nativeWindowsToRedraw) windows;
{
const lock_guard<mutex> lock2(nativeWindowsToRedrawMutex);
windows.swap(nativeWindowsToRedraw);
}
const Timer timer(timers["draw"]);
for (auto i: windows)
try { i->draw(); }
catch (const std::exception& ex) { cerr << ex.what() << endl; break; }
catch (...) { break; }
}This way destruction is still serialised by minskyCmdMutex (as before), and nativeWindowsToRedrawMutex only needs to protect the brief swap window against concurrent requestRedraw() insertions. The macOS path would need the same reordering. Given the history you mentioned (a prior AI agent proposing and then withdrawing this same restructuring), it may be worth reverting the swap-before-lock pattern entirely and only adding nativeWindowsToRedrawMutex in requestRedraw() and ~RenderNativeWindow(), leaving drawNativeWindows() structurally as it was.
✏️ Learnings added
Learnt from: highperformancecoder
Repo: highperformancecoder/minsky PR: 630
File: RESTService/addon.cc:255-264
Timestamp: 2026-03-21T02:16:17.113Z
Learning: In Minsky (RESTService/addon.cc), `minskyCmdMutex` is the primary lifetime guard for `RenderNativeWindow*` pointers in `nativeWindowsToRedraw`. It serialises both iteration in `drawNativeWindows()` and destruction in `~RenderNativeWindow()`. On Windows, drawing is safe on the minsky thread (renders to a backing buffer, blitted on WM_PAINT). On Linux/X11, drawing is thread-safe due to X11's client-server architecture. On macOS, Carbon API calls must be made on the window/main thread, so drawing is dispatched via `macOSXRedraw(*i, minskyCmdMutex)` passing the mutex reference down. Any swap-under-a-separate-mutex pattern that releases the lock before acquiring `minskyCmdMutex` breaks the lifetime guarantee and can cause use-after-free.
Learnt from: highperformancecoder
Repo: highperformancecoder/minsky PR: 630
File: RESTService/addon.cc:255-264
Timestamp: 2026-03-21T02:16:17.113Z
Learning: In Minsky, the only real data race on `nativeWindowsToRedraw` is unsynchronised insertion from `requestRedraw()` (called from any thread). The correct minimal fix is to add `nativeWindowsToRedrawMutex` only around insertions in `requestRedraw()` and erasures in `~RenderNativeWindow()`, while leaving `drawNativeWindows()` iterating under `minskyCmdMutex` as before (or ensuring the swap happens while `minskyCmdMutex` is already held).
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
@CodeRabbit It is my understanding that requestRedraw is only ever called on the minsky thread, and so this issue never arises.
There was a problem hiding this comment.
🧩 Analysis chain
Let me check all call sites of requestRedraw() to verify this claim.
🏁 Script executed:
#!/bin/bash
rg -n --type=cpp -B3 -A3 'requestRedraw'
Length of output: 38866
@highperformancecoder, that's an important clarification. Looking at the call sites in the search results, I can see the codebase is already deliberate about this — notably model/minsky.cc Line 1276:
//requestRedraw calls back into TCL, so don't call it from the simulation thread. See ticket `#973`
if (!RKThreadRunning) canvas.requestRedraw();This shows there is already an explicit guard against calling requestRedraw() from the RK/simulation thread. All other call sites (mouse events, keyboard events, model operations) appear to flow through the minsky command thread.
However, this does conflict with the PR's own stated rationale: "requestRedraw() inserts from any thread (via NativeSurface callback)". Could you clarify whether that description was accurate at the time of writing, or was it mistaken? Specifically, the PlotWidget::requestRedraw() call at model/plotWidget.cc Line 535 (inside what looks like an accumulator/rate-limiter path) — is that also guaranteed to only execute on the minsky thread?
If your current assessment is correct — that requestRedraw() is always on the minsky thread — then the real data race this PR aimed to fix does not actually exist, and more importantly, the swap-before-lock pattern introduced here (releasing minskyCmdMutex before the draw loop) creates a new potential use-after-free that did not exist before. That would make the safest path a full revert of this PR.
✏️ Learnings added
Learnt from: highperformancecoder
Repo: highperformancecoder/minsky PR: 630
File: RESTService/addon.cc:255-264
Timestamp: 2026-03-22T22:13:38.469Z
Learning: In Minsky (model/renderNativeWindow.cc, model/minsky.cc), `RenderNativeWindow::requestRedraw()` may only ever be called on the minsky thread (not from the RK/simulation thread). The simulation thread already explicitly avoids calling `requestRedraw()` via the guard `if (!RKThreadRunning) canvas.requestRedraw()` at model/minsky.cc:1276. If this invariant holds everywhere, the data race on `nativeWindowsToRedraw` reported in PR `#630` does not exist, and the swap-before-lock pattern introduced by that PR (releasing `nativeWindowsToRedrawMutex` before acquiring `minskyCmdMutex`) creates a new use-after-free hazard that is worse than the original code.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: highperformancecoder
Repo: highperformancecoder/minsky PR: 630
File: RESTService/addon.cc:255-264
Timestamp: 2026-03-21T02:16:17.113Z
Learning: In Minsky (RESTService/addon.cc), `minskyCmdMutex` is the primary lifetime guard for `RenderNativeWindow*` pointers in `nativeWindowsToRedraw`. It serialises both iteration in `drawNativeWindows()` and destruction in `~RenderNativeWindow()`. On Windows, drawing is safe on the minsky thread (renders to a backing buffer, blitted on WM_PAINT). On Linux/X11, drawing is thread-safe due to X11's client-server architecture. On macOS, Carbon API calls must be made on the window/main thread, so drawing is dispatched via `macOSXRedraw(*i, minskyCmdMutex)` passing the mutex reference down. Any swap-under-a-separate-mutex pattern that releases the lock before acquiring `minskyCmdMutex` breaks the lifetime guarantee and can cause use-after-free.
Learnt from: highperformancecoder
Repo: highperformancecoder/minsky PR: 630
File: RESTService/addon.cc:255-264
Timestamp: 2026-03-21T02:16:17.113Z
Learning: In Minsky, the only real data race on `nativeWindowsToRedraw` is unsynchronised insertion from `requestRedraw()` (called from any thread). The correct minimal fix is to add `nativeWindowsToRedrawMutex` only around insertions in `requestRedraw()` and erasures in `~RenderNativeWindow()`, while leaving `drawNativeWindows()` iterating under `minskyCmdMutex` as before (or ensuring the swap happens while `minskyCmdMutex` is already held).
Learnt from: highperformancecoder
Repo: highperformancecoder/minsky PR: 630
File: model/renderNativeWindow.cc:107-111
Timestamp: 2026-03-21T02:16:02.691Z
Learning: In `minsky::RenderNativeWindow::requestRedraw()` (`model/renderNativeWindow.cc`), the `if (!winInfoPtr.get()) return;` guard is an intentional performance shortcut only, not a safety guard. The authoritative null-check for `winInfoPtr` is in `RenderNativeWindow::draw()`, which is already protected by `drawMutex`. A spurious insert into `nativeWindowsToRedraw` caused by the unsynchronised read is harmless because `draw()` aborts early and `~RenderNativeWindow()` erases the stale entry under `nativeWindowsToRedrawMutex`. Do not flag this pattern as a data-race bug.
There was a problem hiding this comment.
Well I didn't write the PR rationale. @primesoftnz did, so perhaps ey can comment. I feel like this is possibly an AI hallucination, and your analysis backs that up. So modulo a comment from @primesoftnz, I plan to reject this PR in the coming days.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
highperformancecoder
left a comment
There was a problem hiding this comment.
I'm not convinced that this change a) fixes anything, nor b) is implemented correctly.
This is unfortunately a very tricky bit of code, required because of requirements by MacOSX and Windows system of Windowing operations to be called from a specific thread. I'm not claiming what is there currently is ideal, but as far as I can tell it is correct.
| void RenderNativeWindow::requestRedraw() | ||
| { | ||
| if (!winInfoPtr.get()) return; | ||
| const lock_guard<mutex> lock(minsky().nativeWindowsToRedrawMutex); | ||
| minsky().nativeWindowsToRedraw.insert(this); |
There was a problem hiding this comment.
I don't think CodeRabbit is right here. Checking for existence of winInfoPtr is simply a performance shortcut to avoid unnecessary work. In the actual redraw method, if winInfoPtr is null, then the method is aborted.
| void drawNativeWindows() | ||
| { | ||
| decltype(nativeWindowsToRedraw) windows; | ||
| { | ||
| const lock_guard<mutex> lock(nativeWindowsToRedrawMutex); | ||
| windows.swap(nativeWindowsToRedraw); | ||
| } | ||
| const lock_guard<recursive_mutex> lock(minskyCmdMutex); | ||
| const Timer timer(timers["draw"]); | ||
| for (auto i: nativeWindowsToRedraw) | ||
| for (auto i: windows) |
There was a problem hiding this comment.
In the original code, minskyCmdMutex prevents the RenderNativeWindow object from being destroyed whilst drawing happens. In the new code, that interlock is broken, and what CodeRabbit mentions can happen.
In Windows and X11 builds (ie Linux), calling draw on the minsky thread is possible, because a) the Windows implementation renders to a backing buffer, and blitted later to the actual window on that windows thread (in response to a WM_PAINT event), and b) X11 calls are properly thread safe, they can be called on any thread due to X11's client server architecture.
MacOSX is different, as Carbon API calls must be called on the Windows thread, hence the hack of passing through the mutexes down to the object handling the redraw.
This exact change was proposed by an AI agent (Sonnet 4.6, IIRC) only a week or so ago, and after discussion with it, it withdrew the change. I'm still not convinced that it is necessary, and as CodeRabbit points out, actually introduces problems if not done correctly.
| decltype(nativeWindowsToRedraw) windows; | ||
| { | ||
| const lock_guard<recursive_mutex> lock(minskyCmdMutex); | ||
| const lock_guard<mutex> lock(nativeWindowsToRedrawMutex); |
There was a problem hiding this comment.
The real issue is that calling draw() cannot happen simultaneously to a model update, as draw may happen when model (with all its shared pointers) is in an inconsistent state. The point of minskyCmdMutex is to prevent that occurrence.
| if (!drawLaunched) | ||
| { | ||
| const lock_guard<mutex> lock(nativeWindowsToRedrawMutex); | ||
| if (nativeWindowsToRedraw.size()) |
There was a problem hiding this comment.
This is not a problem in the original code, as nativeWindowsToRedraw is only changed on the minsky thread.
|
Russell, you're right — I got the threading model wrong. Closing this. Thank you for the detailed walkthrough of the threading architecture — the Windows backing buffer / X11 client-server / macOS Carbon distinction is exactly the kind of context that's hard to derive from source alone, and I've learned a lot from this exchange. One reflection from the process: the Ravel addon binary is where a lot of the concurrency complexity materialises, and without visibility into both sides of the minskyRESTService boundary, AI tooling is working with an incomplete picture of threading invariants. Both this PR and the earlier Sonnet 4.6 attempt hit the same wall. As AI-assisted contribution becomes more common, full source visibility will make a real difference to the quality of what gets submitted. |
|
I am reminded of Brian Kernighan's aphorism:
"Kernighan's Law - Debugging is twice as hard as writing the code in
the first place. Therefore, if you write the code as cleverly as
possible, you are, by definition, not smart enough to debug it."
I would love this stuff to be simplified, actually. Part of the
problem was the code was originally written in a single threaded
model, and later multithreaded to "prevent blocking the user
interface" (JS frontend only, the TCL/Tk frontend was always single
threaded). That then required modifications in the addon code, the
interface with the windowing API (3 distinctly different approaches to
that) and the simulation code (Runge-Kutta simulator) and callbacks
into Node. It is hard to keep all these separate parts of the codebase
in your mind (or an AI's mind) as context.
One possible solution is to arrange callbacks into Node and let Node
handle the renedering, (already deficiencies in the Cairo rendering
library have been pointed out, particularly with the Pango font
rendering library), but I strongly suspect we'll run into performance
issues with that approach. Already using Cairo and direct to native
windowing API, we are close to what performance level is acceptable.
Food for thought.
…On Sun, Mar 22, 2026 at 04:51:25PM -0700, primesoftnz wrote:
●primesoftnz left a comment (highperformancecoder/minsky#630)
Russell, you're right — I got the threading model wrong. requestRedraw() is
minsky-thread only (the !RKThreadRunning guard at minsky.cc:1276 makes that
explicit), so the data race I described doesn't exist. Worse, the
swap-before-lock pattern I introduced breaks the minskyCmdMutex lifetime
guarantee you rely on for safe iteration. Net negative.
Closing this. Thank you for the detailed walkthrough of the threading
architecture — the Windows backing buffer / X11 client-server / macOS Carbon
distinction is exactly the kind of context that's hard to derive from source
alone, and I've learned a lot from this exchange.
One reflection from the process: the Ravel addon binary is where a lot of the
concurrency complexity materialises, and without visibility into both sides of
the minskyRESTService boundary, AI tooling is working with an incomplete
picture of threading invariants. Both this PR and the earlier Sonnet 4.6
attempt hit the same wall. As AI-assisted contribution becomes more common,
full source visibility will make a real difference to the quality of what gets
submitted.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you commented.*Message ID: <highperformancecoder
***@***.***>
--
----------------------------------------------------------------------------
Dr Russell Standish Phone 0425 253119 (mobile)
Principal, High Performance Coders ***@***.***
http://www.hpcoders.com.au
----------------------------------------------------------------------------
|
|
The Kernighan quote is spot on. Worth noting — when we were investigating the freeze for this PR, the AI's first structural observation was that the threading complexity likely stemmed from a single-threaded architecture that was later retrofitted for multi-threading. Your confirmation of exactly that is both validating and a good illustration of where AI analysis adds value: pattern recognition across a large codebase, even when it lacks the full context to get the fix right. Greg and I hope you'll join us tonight for the Skool demo — 6pm Sydney / 8pm Auckland. We're showing how AI-driven model building works with Ravel via the REST API, including the PlotWidget fix from PR #629. Would be great to have you there. |
|
It's tonight? Sorry I'm otherwise engaged. I thought it was Tuesday
night anyway, but unfortunately I'm otherwise engaged for that
too. Longstanding commitments. Maybe Tues 31st if the session runs then?
…On Sun, Mar 22, 2026 at 06:40:07PM -0700, primesoftnz wrote:
●primesoftnz left a comment (highperformancecoder/minsky#630)
The Kernighan quote is spot on. Worth noting — when we were investigating the
freeze for this PR, the AI's first structural observation was that the
threading complexity likely stemmed from a single-threaded architecture that
was later retrofitted for multi-threading. Your confirmation of exactly that is
both validating and a good illustration of where AI analysis adds value:
pattern recognition across a large codebase, even when it lacks the full
context to get the fix right.
Greg and I hope you'll join us tonight for the Skool demo — 6pm Sydney / 8pm
Auckland. We're showing how AI-driven model building works with Ravel via the
REST API, including the PlotWidget fix from PR #629. Would be great to have you
there.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you commented.*Message ID: <highperformancecoder
***@***.***>
--
----------------------------------------------------------------------------
Dr Russell Standish Phone 0425 253119 (mobile)
Principal, High Performance Coders ***@***.***
http://www.hpcoders.com.au
----------------------------------------------------------------------------
|
|
Opps sorry Russell, my mistake. Tomorrow night.
, hope to see you there.
From: Russell Standish ***@***.***>
Sent: Monday, 23 March 2026 3:08 pm
To: highperformancecoder/minsky ***@***.***>
Cc: Greg Stevenson ***@***.***>; State change ***@***.***>
Subject: Re: [highperformancecoder/minsky] Fix data race on nativeWindowsToRedraw across threads (PR #630)
[https://avatars.githubusercontent.com/u/3075825?s=20&v=4]highperformancecoder left a comment (highperformancecoder/minsky#630)<#630 (comment)>
It's tonight? Sorry I'm otherwise engaged. I thought it was Tuesday
night anyway, but unfortunately I'm otherwise engaged for that
too. Longstanding commitments. Maybe Tues 31st if the session runs then?
…On Sun, Mar 22, 2026 at 06:40:07PM -0700, primesoftnz wrote:
●primesoftnz left a comment (highperformancecoder/minsky#630)
The Kernighan quote is spot on. Worth noting — when we were investigating the
freeze for this PR, the AI's first structural observation was that the
threading complexity likely stemmed from a single-threaded architecture that
was later retrofitted for multi-threading. Your confirmation of exactly that is
both validating and a good illustration of where AI analysis adds value:
pattern recognition across a large codebase, even when it lacks the full
context to get the fix right.
Greg and I hope you'll join us tonight for the Skool demo — 6pm Sydney / 8pm
Auckland. We're showing how AI-driven model building works with Ravel via the
REST API, including the PlotWidget fix from PR #629. Would be great to have you
there.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you commented.*Message ID: <highperformancecoder
***@***.***<mailto:***@***.***>>
--
----------------------------------------------------------------------------
Dr Russell Standish Phone 0425 253119 (mobile)
Principal, High Performance Coders ***@***.***<mailto:***@***.***>
http://www.hpcoders.com.au
----------------------------------------------------------------------------
—
Reply to this email directly, view it on GitHub<#630?email_source=notifications&email_token=ADTN3MQQ5HKN5KQLTKNL2FD4SCMBHA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTIMJQG42TGOBUGE22M4TFMFZW63VMON2GC5DFL5RWQYLOM5S2KZLWMVXHJLDGN5XXIZLSL5RWY2LDNM#issuecomment-4107538415>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ADTN3MTDN4G22XI4CIB6IRD4SCMBHAVCNFSM6AAAAACWY2OREGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DCMBXGUZTQNBRGU>.
You are receiving this because you modified the open/close state.Message ID: ***@***.******@***.***>>
|
Summary
nativeWindowsToRedraw(std::set<RenderNativeWindow*>) is accessed from multiple threads without synchronisation:requestRedraw()inserts from any thread (viaNativeSurfacecallback)~RenderNativeWindow()erases during model operationsdrawNativeWindows()iterates and clears on the minsky threadmacOSXLaunchDrawNativeWindows()reads.size()on the minsky threadThis is a data race on
std::setthat can corrupt the container's internal tree, potentially leading to crashes or infinite loops during iteration.Relationship to tickets:#1906
Not certain this addresses Steve's reported freeze. The
SetWindowPosremoval in 3.25.1 likely addresses the primary freeze mechanism. This fix closes a correctness issue that could cause problems under load (complex models with many native windows creating tight timing betweenrequestRedrawanddrawNativeWindows), but may not be the root cause of the reported interface freeze.Submitting separately as a correctness fix regardless.
Approach
Add a dedicated
nativeWindowsToRedrawMutex(inMinskyExclude) to protect all accesses. The draw functions swap the set into a local copy under the new mutex, then release it before acquiringminskyCmdMutex— preserving existing lock scope and avoiding lock ordering issues.This follows the swap-then-iterate pattern already used in
macOSXDrawNativeWindows().Lock ordering safety
nativeWindowsToRedrawMutexis never held simultaneously withminskyCmdMutex:drawNativeWindows(): acquiresnativeWindowsToRedrawMutex(swap), releases it, then acquiresminskyCmdMutex(draw)requestRedraw()/~RenderNativeWindow(): acquire onlynativeWindowsToRedrawMutexdoCommand()triggers~RenderNativeWindow()while holdingminskyCmdMutex, the inner lock onnativeWindowsToRedrawMutexis safe (consistentminskyCmdMutex→nativeWindowsToRedrawMutexordering)Minor fix included
macOSXDrawNativeWindows()previously callednativeWindowsToRedraw.clear()after swapping the set out, which discarded any redraw requests made during the draw cycle. Removed — the swap already empties the original set.Files changed
model/minsky.h— addstd::mutex nativeWindowsToRedrawMutextoMinskyExcludemodel/renderNativeWindow.cc— lock mutex inrequestRedraw()and~RenderNativeWindow()RESTService/addon.cc— swap-under-lock indrawNativeWindows(),macOSXDrawNativeWindows(), and size check inmacOSXLaunchDrawNativeWindows()This change is
Summary by CodeRabbit