Skip to content

fix: bound pipeline cleanup waits#24972

Open
XuPeng-SH wants to merge 3 commits into
matrixorigin:mainfrom
XuPeng-SH:xupeng/fix-pipeline-cleanup-hang-24962
Open

fix: bound pipeline cleanup waits#24972
XuPeng-SH wants to merge 3 commits into
matrixorigin:mainfrom
XuPeng-SH:xupeng/fix-pipeline-cleanup-hang-24962

Conversation

@XuPeng-SH

@XuPeng-SH XuPeng-SH commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24962

Follow-up refactor issue: #24971

What this PR does / why we need it:

This is a hotfix for pipeline cleanup hangs observed in #24962. The root problem is that pipeline cleanup can wait indefinitely for terminal signals when the peer sender/receiver is no longer guaranteed to make progress.

This PR:

  • Bounds Merge.Reset cleanup waits for terminal pipeline signals.
  • Bounds cleanup-time terminal signal sends in Connector, Dispatch, and remote notify cleanup.
  • Bounds PipelineSpool.Close with a timeout variant and avoids force-freeing spool cache when receivers may still be late.
  • Cleans up sender/receiver linked pipelines by resetting the terminal sender and leaf merge concurrently, while preserving CTE special cleanup gates.
  • Makes remote receiver cleanup error notification non-blocking to avoid cleanup deadlocks.
  • Adds mo_pipeline_cleanup_event_total{event=...} so abnormal cleanup events remain observable even when logs are rate-limited.
  • Adds rate-limited cleanup warnings to avoid log storms under large abnormal cleanup bursts.
  • Sends terminal signals to immediately healthy local dispatch receivers before waiting on blocked local receivers.
  • Adds tests for missing end signals, full channels, cleanup ordering, remote error channel backpressure, local receiver backpressure, and cleanup warning suppression.

The change is intentionally scoped to cleanup/reset paths. Normal long-running query data paths are not bounded by these cleanup-only timeouts.

Validation run locally:

go test ./pkg/vm/process ./pkg/vm/pipeline ./pkg/sql/colexec/dispatch
go test ./pkg/sql/colexec/merge ./pkg/sql/colexec/connector ./pkg/sql/compile ./pkg/util/metric/v2 -run '^$'
make build

@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@XuPeng-SH XuPeng-SH left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issue: the new timeout/fallback cleanup paths can leak PipelineSpool-owned memory.

In both pkg/sql/colexec/connector/types.go and pkg/sql/colexec/dispatch/types.go, we now nil out ctr.sp even when CloseWithTimeout() returns false or spool close is skipped after falling back to a direct signal. But pkg/container/pSpool/sender.go only frees the cached spool buffers on the success path of CloseWithTimeout() (ps.cache.free()). These spools are created from proc.Mp(), so a timed-out cleanup can now abandon cached batch buffers/off-heap vector memory until session teardown, which turns the hang fix into a cleanup-time memory leak on the same failure path.

Please keep ownership of the spool until its cache is actually freed, or hand timed-out spools to a deferred reaper/background close path instead of dropping the reference immediately.

@gouhongshen gouhongshen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the bounded pipeline cleanup changes again. I did not find any blocking issues in the cleanup ordering, bounded signal send/drain paths, deferred spool cleanup, or remote receiver notification handling.

@aptend aptend left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed again from aptend. No blocking issues found in the bounded cleanup paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working kind/test-ci size/XL Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants