
Fix CloudFetch goroutine leak that retains Arrow buffers after Close #357

Merged

vikrantpuppala merged 1 commit into main from fix/356-cloudfetch-goroutine-leak on May 12, 2026

Conversation

@vikrantpuppala
Collaborator

Summary

Fixes #356.

Under high CloudFetch concurrency (≥6 simultaneous downloads), in-flight cloudFetchDownloadTask goroutines could leak when the consumer closed the iterator before draining all results. Each leaked goroutine pinned a downloaded chunk in the Go heap, producing the multi-GiB heap plateau described in the issue that only released on process restart.

Root cause

cloudFetchDownloadTask.Run sends the download result on an unbuffered channel without honoring context cancellation:

cft.resultChan <- cloudFetchDownloadTaskResult{data: bytes.NewReader(buf), ...}

Sequence that triggers the leak:

  1. cloudIPCStreamIterator.Next schedules MaxDownloadThreads (default 10) tasks concurrently.
  2. The consumer dequeues task 1, gets its result, returns.
  3. Tasks 2..N have completed their HTTP reads in parallel and are now blocked on the unbuffered send, each holding its downloaded buffer.
  4. The consumer abandons the iterator (timeout, error, early close, etc.) and calls iterator.Close().
  5. Close calls task.cancel() on each remaining task. But context cancellation does not unblock an in-flight channel send, so the goroutines stay blocked forever, retaining their buffers. The standalone sketch below demonstrates this.
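
To make step 5 concrete, here is a minimal, self-contained sketch (not the driver's code; resultChan just mirrors the field in the snippet above) showing that cancelling a context does nothing for a goroutine that is already parked on a bare channel send:

package main

import (
    "context"
    "fmt"
    "runtime"
    "time"
)

func main() {
    _, cancel := context.WithCancel(context.Background())
    resultChan := make(chan []byte) // unbuffered, like cft.resultChan

    for i := 0; i < 3; i++ {
        go func() {
            buf := make([]byte, 1<<20) // stands in for a downloaded chunk
            resultChan <- buf          // bare send: never consults ctx.Done()
        }()
    }

    <-resultChan // consumer takes one result, then abandons the rest
    cancel()     // analogous to task.cancel() from Close

    time.Sleep(100 * time.Millisecond)
    // Prints 3: main plus the two senders still parked on resultChan,
    // each pinning its 1 MiB buffer until a receive or process exit.
    fmt.Println("goroutines:", runtime.NumGoroutine())
}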

In v1.7.1 (the version the reporter is on) the goroutine had already decoded the bytes into Arrow records before the send, so the leaked memory was Arrow-allocator buffers — matching the stack trace in the issue:

(*cloudFetchDownloadTask).Run.func1
  getArrowRecords → (*ipc.Reader).Next → newRecord → loadArray
    → loadBinary → buffer → (*ipcSource).buffer → NewResizableBuffer
      → (*Buffer).Resize → (*GoAllocator).Allocate

In the current code (v1.11.0) the decode happens later in batchIterator.Next, so the leak is the raw decompressed buf instead — same shape, smaller per-goroutine retention, same plateau pattern.

Fix

Route every channel send through a helper that selects on ctx.Done():

func (cft *cloudFetchDownloadTask) sendResult(result cloudFetchDownloadTaskResult) {
    select {
    case cft.resultChan <- result:
    case <-cft.ctx.Done():
    }
}

cloudIPCStreamIterator.Close already calls task.cancel() for every queued task, so cancellation now correctly drains stuck goroutines and lets their buffers be GC'd.
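
For illustration, the send sites inside Run now go through the helper, roughly like this (a sketch rather than the exact diff: fetchChunk is a hypothetical stand-in for the HTTP fetch and decompression, and the err field on the result type is assumed):

func (cft *cloudFetchDownloadTask) Run() {
    go func() {
        buf, err := cft.fetchChunk(cft.ctx) // hypothetical: HTTP read + decompress
        if err != nil {
            cft.sendResult(cloudFetchDownloadTaskResult{err: err}) // err field assumed
            return
        }
        // Previously a bare cft.resultChan <- result here could block forever
        // after Close. The select in sendResult races the send against
        // cft.ctx.Done(), so task.cancel() unwinds the goroutine and buf
        // becomes garbage-collectable.
        cft.sendResult(cloudFetchDownloadTaskResult{data: bytes.NewReader(buf)})
    }()
}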

Test plan

  • New unit test TestCloudFetchIterator_CloseReleasesInFlightDownloads reproduces the leak: it spawns MaxDownloadThreads concurrent downloads, releases them after the iterator has consumed only the first, then calls Close() and asserts that no cloudFetchDownloadTask.Run goroutines remain (a sketch of that check follows this list).
    • Fails on main (~9 leaked goroutines after Close).
    • Passes with this change.
  • Full go test ./... passes locally.
  • go vet and gofmt clean.
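
The goroutine assertion can be approximated with a stack-dump counter like the one below (a sketch under the assumption that the test matches goroutines by stack substring; the actual test may instead use a leak detector such as go.uber.org/goleak):

import (
    "runtime"
    "strings"
)

// countGoroutinesContaining reports how many live goroutines mention
// substr anywhere in their stack, e.g. "cloudFetchDownloadTask".
func countGoroutinesContaining(substr string) int {
    buf := make([]byte, 1<<20)
    n := runtime.Stack(buf, true) // true = dump every goroutine
    count := 0
    for _, stack := range strings.Split(string(buf[:n]), "\n\n") {
        if strings.Contains(stack, substr) {
            count++
        }
    }
    return count
}

After Close() the count is polled until it reaches zero (cancelled goroutines need a scheduler tick to unwind), for example with testify's require.Eventually.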

Who is affected

Any user with CloudFetch enabled (default since v1.7.0) whose query context can be cancelled or whose result set can be abandoned mid-stream — i.e., basically everyone running large CloudFetch queries with timeouts.
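
For reference, CloudFetch behavior is driven by connector options; a typical setup looks like the following (option names as they appear in databricks-sql-go, shown here as an indicative sketch rather than a complete configuration):

connector, err := dbsql.NewConnector(
    dbsql.WithServerHostname(hostname),
    dbsql.WithHTTPPath(httpPath),
    dbsql.WithAccessToken(token),
    dbsql.WithCloudFetch(true),       // on by default since v1.7.0
    dbsql.WithMaxDownloadThreads(10), // default; the concurrent tasks in the sequence above
)
if err != nil {
    log.Fatal(err)
}
db := sql.OpenDB(connector)
defer db.Close()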

This pull request and its description were written by Isaac.

The cloudFetchDownloadTask goroutine sends its result on an unbuffered
channel without respecting context cancellation. When the iterator is
closed mid-download (timeout, error, or consumer dropping the result
set), in-flight goroutines that have already finished their HTTP read
sit blocked on the send forever, pinning the downloaded buffer in the
Go heap. Under bursts of large CloudFetch queries this manifests as a
multi-GiB heap plateau that only releases on process restart.

Fix: route the channel send through a helper that selects on
ctx.Done(), so cancellation via task.cancel() (already issued from
cloudIPCStreamIterator.Close) drains the goroutine and frees its
buffer.

Closes #356

Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala vikrantpuppala requested a review from gopalldb May 12, 2026 09:17
@vikrantpuppala vikrantpuppala merged commit 2557e6f into main May 12, 2026
3 checks passed
@vikrantpuppala vikrantpuppala deleted the fix/356-cloudfetch-goroutine-leak branch May 12, 2026 17:00


Development

Successfully merging this pull request may close these issues.

Memory issue: Heap retention after CloudFetch burst on v1.7.1