Summary
While root-causing the #248 Zhong-benchmark velocity catastrophe (now fixed by #242 — it was a broken SphericalShellInternalBoundary Lower/Internal label bug), three smaller but real parallel-correctness issues in Stokes_Constrained (in-saddle multiplier free-slip) were isolated with reliable integral diagnostics. Tracking them here for a focused fix.
Method note: point evaluation (uw.function.evaluate) is unreliable for serial-vs-parallel field comparison here (gave 134%-of-|v| pointwise noise while volume L2 differed 0.4%). Use partition-independent integrals only.
Ruled out: velocity rigid-rotation nullspace (angular momentum ∫r×v dV reproducible to ~1e-5 serial vs np=8); iterative under-convergence (serial & np=8 each bit-identical at tol 1e-7 vs 1e-11 — both fully converge, to different answers).
1. [p,λ] / topography gauge is partition-inconsistent (the nullspace bug)
The combined constant-(pressure, multiplier) gauge (_build_block_gauge_nullspace_vector) is a true velocity null mode, but its level is partition-dependent: mean pressure = 3.333 (serial) vs 2.983 (np=8), 10.5%. Because topography is read from the multiplier, this corrupts topography across ranks unless mean-stripped. set_pressure_gauge("Upper", 0.0) (#250, global BdIntegral callback) pins it consistently → 0.4%, bit-identical on velocity.
Fix: make the gauge partition-reproducible by construction (default-on pin, or enforce topography(reference="mean")); add a parallel regression asserting meanP/topography match serial to ~1e-6. Highest priority — topography correctness, fix already exists.
2. Residual ~0.4% velocity partition-dependence in constraint/augmentation assembly (separate, NOT a nullspace)
With the gauge pinned, velocity L2/|Ut|/|Ub| still differ 0.36/0.37/0.41% serial vs np=8 (np=8 bit-identical with/without the pin). Fully converged ⇒ the assembled operator/RHS differs ~0.4% by partition — locus is the add_constraint_bc / augmented-Lagrangian (augmentation_base=1e4) boundary assembly, not a nullspace. 3D-specific (2D test_1063 is bit-identical to 1e-9; boundary areas match exactly, so it's the v·n / augmentation term at inter-rank seams on the 2-D free-slip surface).
Fix: find and fix the partition-dependent constraint/augmentation assembly; goal = converged velocity partition-independent to round-off.
3. Interior-multiplier knockout is lossy (docstring says "lossless")
_constrain_interior_multipliers_in_section (gated by _reduce_interior_multiplier, default True) pins interior multiplier DOFs. Disabling it changes the serial answer by 2% (L2/Ut) to 5% (Ub) — so the pinned DOFs are not inert. DOF count is partition-identical (5170 at np=1/2/4/8), so not gross loss. (Off makes parallel spread worse — 1.7% at CMB — so it's a distinct issue from #2, consistent with block-error trade-off.)
Fix: determine whether the boundary-trace closure drops DOFs that matter (or "inert interior multiplier" is the wrong model); fix or re-document. Also: monolithic MUMPS LU on the constrained system gives wrong physics (|Ut|=4e-3 vs validated 1e-2) and segfaults at np=8 — capture as known-bad or fix.
Reproduction
~/+Simulations/repro_248_internal_load.py (lifts the test_1064 Zhong setup; reports volume L2, |Ut|/|Ub|, angular momentum, meanP — all partition-independent). Args: <method> <write|check> <solver> <tol> <knockout on|off> <gauge none|pin>. Build/run in the amr-dev pixi env; parallel needs --with-mpi.
Related: #248 (velocity half resolved by #242; topography half open), #244 (constrained slow in parallel).
cc @gthyagi
Underworld development team with AI support from Claude Code
Summary
While root-causing the #248 Zhong-benchmark velocity catastrophe (now fixed by #242 — it was a broken
SphericalShellInternalBoundaryLower/Internallabel bug), three smaller but real parallel-correctness issues inStokes_Constrained(in-saddle multiplier free-slip) were isolated with reliable integral diagnostics. Tracking them here for a focused fix.Method note: point evaluation (
uw.function.evaluate) is unreliable for serial-vs-parallel field comparison here (gave 134%-of-|v| pointwise noise while volume L2 differed 0.4%). Use partition-independent integrals only.Ruled out: velocity rigid-rotation nullspace (angular momentum ∫r×v dV reproducible to ~1e-5 serial vs np=8); iterative under-convergence (serial & np=8 each bit-identical at tol 1e-7 vs 1e-11 — both fully converge, to different answers).
1.
[p,λ]/ topography gauge is partition-inconsistent (the nullspace bug)The combined constant-(pressure, multiplier) gauge (
_build_block_gauge_nullspace_vector) is a true velocity null mode, but its level is partition-dependent: mean pressure = 3.333 (serial) vs 2.983 (np=8), 10.5%. Because topography is read from the multiplier, this corrupts topography across ranks unless mean-stripped.set_pressure_gauge("Upper", 0.0)(#250, globalBdIntegralcallback) pins it consistently → 0.4%, bit-identical on velocity.Fix: make the gauge partition-reproducible by construction (default-on pin, or enforce
topography(reference="mean")); add a parallel regression assertingmeanP/topography match serial to ~1e-6. Highest priority — topography correctness, fix already exists.2. Residual ~0.4% velocity partition-dependence in constraint/augmentation assembly (separate, NOT a nullspace)
With the gauge pinned, velocity L2/|Ut|/|Ub| still differ 0.36/0.37/0.41% serial vs np=8 (np=8 bit-identical with/without the pin). Fully converged ⇒ the assembled operator/RHS differs ~0.4% by partition — locus is the
add_constraint_bc/ augmented-Lagrangian (augmentation_base=1e4) boundary assembly, not a nullspace. 3D-specific (2Dtest_1063is bit-identical to 1e-9; boundary areas match exactly, so it's the v·n / augmentation term at inter-rank seams on the 2-D free-slip surface).Fix: find and fix the partition-dependent constraint/augmentation assembly; goal = converged velocity partition-independent to round-off.
3. Interior-multiplier knockout is lossy (docstring says "lossless")
_constrain_interior_multipliers_in_section(gated by_reduce_interior_multiplier, default True) pins interior multiplier DOFs. Disabling it changes the serial answer by 2% (L2/Ut) to 5% (Ub) — so the pinned DOFs are not inert. DOF count is partition-identical (5170 at np=1/2/4/8), so not gross loss. (Off makes parallel spread worse — 1.7% at CMB — so it's a distinct issue from #2, consistent with block-error trade-off.)Fix: determine whether the boundary-trace closure drops DOFs that matter (or "inert interior multiplier" is the wrong model); fix or re-document. Also: monolithic MUMPS LU on the constrained system gives wrong physics (|Ut|=4e-3 vs validated 1e-2) and segfaults at np=8 — capture as known-bad or fix.
Reproduction
~/+Simulations/repro_248_internal_load.py(lifts thetest_1064Zhong setup; reports volume L2, |Ut|/|Ub|, angular momentum, meanP — all partition-independent). Args:<method> <write|check> <solver> <tol> <knockout on|off> <gauge none|pin>. Build/run in theamr-devpixi env; parallel needs--with-mpi.Related: #248 (velocity half resolved by #242; topography half open), #244 (constrained slow in parallel).
cc @gthyagi
Underworld development team with AI support from Claude Code