fix: bound pipeline cleanup waits#24972
Conversation
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
cb432d0 to
2dc41d9
Compare
XuPeng-SH
left a comment
There was a problem hiding this comment.
Blocking issue: the new timeout/fallback cleanup paths can leak PipelineSpool-owned memory.
In both pkg/sql/colexec/connector/types.go and pkg/sql/colexec/dispatch/types.go, we now nil out ctr.sp even when CloseWithTimeout() returns false or spool close is skipped after falling back to a direct signal. But pkg/container/pSpool/sender.go only frees the cached spool buffers on the success path of CloseWithTimeout() (ps.cache.free()). These spools are created from proc.Mp(), so a timed-out cleanup can now abandon cached batch buffers/off-heap vector memory until session teardown, which turns the hang fix into a cleanup-time memory leak on the same failure path.
Please keep ownership of the spool until its cache is actually freed, or hand timed-out spools to a deferred reaper/background close path instead of dropping the reference immediately.
gouhongshen
left a comment
There was a problem hiding this comment.
Reviewed the bounded pipeline cleanup changes again. I did not find any blocking issues in the cleanup ordering, bounded signal send/drain paths, deferred spool cleanup, or remote receiver notification handling.
aptend
left a comment
There was a problem hiding this comment.
Reviewed again from aptend. No blocking issues found in the bounded cleanup paths.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24962
Follow-up refactor issue: #24971
What this PR does / why we need it:
This is a hotfix for pipeline cleanup hangs observed in #24962. The root problem is that pipeline cleanup can wait indefinitely for terminal signals when the peer sender/receiver is no longer guaranteed to make progress.
This PR:
Merge.Resetcleanup waits for terminal pipeline signals.Connector,Dispatch, and remote notify cleanup.PipelineSpool.Closewith a timeout variant and avoids force-freeing spool cache when receivers may still be late.mo_pipeline_cleanup_event_total{event=...}so abnormal cleanup events remain observable even when logs are rate-limited.The change is intentionally scoped to cleanup/reset paths. Normal long-running query data paths are not bounded by these cleanup-only timeouts.
Validation run locally: