DistArray::lazy_deleter: skip lazy_sync when invoked from fence's do_cleanup #551

Merged
evaleev merged 2 commits into master from evaleev/feature/lazy-deleter-skip-sync-in-do-cleanup
May 21, 2026

@evaleev evaleev commented May 21, 2026

Summary

Use MADNESS's new WorldGopInterface::is_in_do_cleanup() flag to short-circuit the cross-rank lazy_sync handshake when lazy_deleter is called from inside fence_impl's deferred-cleanup phase: delete pimpl directly, decrement cleanup_counter_, return.

Also bumps the MADNESS pin to m-a-d-n-e-s-s/madness#695 to pick up the flag.

Why

lazy_sync's purpose in lazy_deleter is to keep a peer from sending an active message (AM) addressed to a WorldObject after the local rank has deleted it. When lazy_deleter runs inside fence_impl's do_cleanup:

  • fence_impl's global-termination protocol (drain loop + Dijkstra-style nsent/nrecv tree handshake) has already established that no AM is in flight and all ranks are at the same syntactic point.
  • defer_deleter_to_next_fence is, by contract, used collectively, so every rank's deferred list holds the same pimpls at this point and every rank reaches the same delete in lockstep.
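As a rough illustration of the counting idea behind an nsent/nrecv handshake, here is a simplified single-process model. It is not MADNESS's fence_impl: `RankCounts`, `reduce`, and `terminated_at` are hypothetical names, and the rule used (global sent equals global received, and both totals match the previous round) is an assumption about how such a protocol can decide that no AM is in flight.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-rank message counters (not a MADNESS type).
struct RankCounts {
  long nsent;
  long nrecv;
};

// Sums the counters over all ranks, standing in for a tree reduction.
inline RankCounts reduce(const std::vector<RankCounts>& ranks) {
  RankCounts total{0, 0};
  for (const auto& r : ranks) {
    assert(r.nsent >= 0 && r.nrecv >= 0);
    total.nsent += r.nsent;
    total.nrecv += r.nrecv;
  }
  return total;
}

// Each entry of `rounds` is a snapshot of all ranks' counters in one round.
// Returns the index of the first round in which the global totals balance
// AND equal the previous round's totals, or -1 if that never happens.
inline int terminated_at(const std::vector<std::vector<RankCounts>>& rounds) {
  long prev_sent = -1;
  long prev_recv = -1;
  for (std::size_t i = 0; i < rounds.size(); ++i) {
    const RankCounts t = reduce(rounds[i]);
    if (t.nsent == t.nrecv && t.nsent == prev_sent && t.nrecv == prev_recv) {
      return static_cast<int>(i);
    }
    prev_sent = t.nsent;
    prev_recv = t.nrecv;
  }
  return -1;
}
```

Requiring two consecutive matching rounds guards against a snapshot that happens to balance while a message is still in transit. Once such a check passes on every rank, no peer can still be about to send AM to an object on the deferred list.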

The handshake is therefore redundant, and it is actively harmful. The lazy_sync_children task it schedules lands on this world's taskq after the drain loop has exited, so it survives the fence and waits for matching lazy_sync calls that may come arbitrarily later, or not at all if the world is torn down first. The most visible symptom appears with ephemeral worlds (e.g. TA-einsum's per-Hadamard sub-Worlds) that are torn down at function exit or during exception unwind: the stranded task is picked up by some unrelated later fence, which runs delete pimpl against a freed world. ~WorldObject then trips its World::exists(&world) assertion and aborts.
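The failure mode can be modeled in a few lines. This is a toy single-process sketch, not MADNESS code: `Registry`, `schedule_deferred_delete`, and `drain` are hypothetical names, integer ids stand in for worlds, and `exists` only mimics the role of the World::exists(&world) check.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <vector>

// Hypothetical model: the set of live worlds plus a task pool that
// outlives any single world's fence.
struct Registry {
  std::set<int> live_worlds;
  std::vector<std::function<bool()>> global_pool;

  bool exists(int world_id) const { return live_worlds.count(world_id) != 0; }
};

// Models the old path: the delete is deferred to a task instead of running
// during the fence. The task reports whether its world was still alive
// when it finally ran.
inline void schedule_deferred_delete(Registry& r, int world_id) {
  assert(r.exists(world_id));  // the world is alive when the task is queued
  r.global_pool.push_back([&r, world_id] { return r.exists(world_id); });
}

// Models some later, unrelated fence running whatever tasks are pending.
// Returns how many tasks found their world already destroyed.
inline int drain(Registry& r) {
  int stranded = 0;
  for (auto& task : r.global_pool) {
    if (!task()) ++stranded;
  }
  r.global_pool.clear();
  return stranded;
}
```

If a world is erased from `live_worlds` between `schedule_deferred_delete` and `drain`, the task is counted as stranded. In the real code that case is the assertion failure in ~WorldObject rather than a counter.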

The general (non-deferred) path is unchanged: lazy_deleter invoked outside do_cleanup still routes through lazy_sync because we cannot rely on synchronization with peers in that case.

What

  • external/versions.cmake: bump pinned MADNESS to 666765ca6.
  • src/TiledArray/array_impl.h: in lazy_deleter, after the existing world.await(num_live_ds == 0), check world.gop.is_in_do_cleanup(); if set, delete directly.
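The resulting control flow can be sketched with mock types. This is not the code in array_impl.h: `MockWorld`, `MockImpl`, and `mock_lazy_deleter` are hypothetical, the `in_do_cleanup` field stands in for world.gop.is_in_do_cleanup(), the task vector stands in for the lazy_sync handshake, and incrementing the counter on entry is an assumption made so that the decrement described above balances.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a world; not the MADNESS World type.
struct MockWorld {
  bool in_do_cleanup = false;                // models is_in_do_cleanup()
  std::vector<std::function<void()>> taskq;  // models work queued by lazy_sync
};

// Hypothetical stand-in for the array pimpl; tracks live instances.
struct MockImpl {
  explicit MockImpl(int& live) : live_(live) { ++live_; }
  ~MockImpl() { --live_; }
  int& live_;
};

// Models the two paths of the deleter. Returns true if the fast path ran.
inline bool mock_lazy_deleter(MockWorld& world, MockImpl* pimpl,
                              int& cleanup_counter) {
  assert(pimpl != nullptr);
  ++cleanup_counter;  // assumption: one pending cleanup is recorded on entry

  if (world.in_do_cleanup) {
    // Fast path: the fence has already synchronized all ranks, so delete
    // immediately and schedule nothing.
    delete pimpl;
    --cleanup_counter;
    return true;
  }

  // General path: defer the delete to a task, as the handshake would.
  world.taskq.push_back([pimpl, &cleanup_counter] {
    delete pimpl;
    --cleanup_counter;
  });
  return false;
}
```

On the fast path nothing is left on `taskq`, which is the property the change relies on: no task can outlive the world that scheduled it.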

Test plan

  • Compiles cleanly against MADNESS#695.
  • Local downstream MPQC repro that previously aborted in ~WorldObject no longer needs any einsum-side workaround to run end-to-end (verified by toggling).
  • CI green.

Follow-up

ValeevGroup/tiledarray#550 builds on this with a smaller einsum RAII guard and the (outer Contraction, inner Hadamard) view-cell case in cont_engine.

evaleev added 2 commits May 21, 2026 12:21
…cleanup

Use the new MADNESS `WorldGopInterface::is_in_do_cleanup()` flag to
short-circuit the cross-rank `lazy_sync` handshake when `lazy_deleter`
is called from inside `fence_impl`'s deferred-cleanup phase: `delete
pimpl` directly, decrement `cleanup_counter_`, return.

Why it is safe:

- `fence_impl` runs the global-termination protocol before calling
  `deferred_->do_cleanup()`, so all ranks are at the same point with
  no AM in flight.
- `defer_deleter_to_next_fence` is, by contract, used collectively, so
  every rank's deferred list holds the same set of pimpls at this
  point and every rank performs the matching delete in lockstep.
- The `lazy_sync` handshake exists to guarantee that no peer is still
  about to send AM addressed to this object before we delete it; the
  fence already establishes that.

Why it matters: the original `lazy_sync` path enqueues a
`lazy_sync_children` task on this world's taskq *after* the fence's
drain loop has exited. Such tasks survive the fence and are picked up
later by some other fence that drives the global ThreadPool. If the
world is destroyed in the meantime (e.g. einsum's per-Hadamard
sub-Worlds torn down at function exit or during exception unwind),
the stranded task runs `delete pimpl` against a world whose taskq /
gop are already freed; `~WorldObject` then trips its
`World::exists(&world)` assertion and aborts, masking any real error.
The fast path avoids ever scheduling that task.

The general (non-deferred) path is unchanged: `lazy_deleter` invoked
outside `do_cleanup` still goes through `lazy_sync` because we cannot
rely on synchronization with peers in that case.

@evaleev evaleev merged commit 60bb88d into master May 21, 2026
9 checks passed
@evaleev evaleev deleted the evaleev/feature/lazy-deleter-skip-sync-in-do-cleanup branch May 21, 2026 04:16