Skip to content

perf(table): dedupe shared manifest reads in orphan cleanup#1133

Open
paveon wants to merge 3 commits into
apache:mainfrom
paveon:perf/orphan-cleanup-manifest-dedupe
Open

perf(table): dedupe shared manifest reads in orphan cleanup#1133
paveon wants to merge 3 commits into
apache:mainfrom
paveon:perf/orphan-cleanup-manifest-dedupe

Conversation

@paveon
Copy link
Copy Markdown
Contributor

@paveon paveon commented May 27, 2026

Fixes #1132

Summary

  • Deduplicate manifest reads in getReferencedFiles using a two-pass approach: collect unique manifests from manifest lists (pass 1), then read entries in parallel via errgroup.SetLimit (pass 2)
  • Run S3 walk and referenced-file collection concurrently via errgroup
  • Fix time.Now() being evaluated per-file inside walk callback instead of once before the walk

Reduces getReferencedFiles I/O from O(snapshots × manifests-per-snapshot) to O(unique manifests).

@paveon paveon requested a review from zeroshade as a code owner May 27, 2026 07:02
Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this one.

Two things before merge:

  1. DeleteOrphanFiles, executeOrphanCleanup, and getReferencedFiles moved to pointer receivers, but PurgeFiles, which also calls getReferencedFiles, stayed as a value receiver. Since none of these mutate t, I’d either revert all three or move PurgeFiles too. Splitting receiver kinds in the public API is awkward to change later.

  2. The _, ok := referencedFiles[file] change in isFileOrphan is actually a correctness fix, not just cleanup. The old map-value form returned false for metadata entries and let them fall through into the prefix-mismatch path. Worth adding a short comment, and ideally a small test for metadata with a host mismatch.

One non-blocker: TotalSizeBytes changed meaning from “all age-filtered scanned files” to “orphans only”. I think the new meaning is better and matches the CLI prompt, but a field comment would make that explicit. Fine as a follow-up.

Once the receiver consistency and _, ok intent are covered, happy to approve.

Comment thread table/orphan_cleanup.go Outdated
}

func (t Table) DeleteOrphanFiles(ctx context.Context, opts ...OrphanCleanupOption) (OrphanCleanupResult, error) {
func (t *Table) DeleteOrphanFiles(ctx context.Context, opts ...OrphanCleanupOption) (OrphanCleanupResult, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeleteOrphanFiles, executeOrphanCleanup, and getReferencedFiles moved to pointer receivers, but PurgeFiles further down still uses a value receiver and calls into getReferencedFiles. None of these methods actually mutate t, so I'd either revert all three back to value receivers, or move PurgeFiles to *Table as well — splitting the method set across receiver kinds in the public API is the kind of subtle thing that's hard to undo later. wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It got flagged by my IDE because some methods already use pointer receivers so I changed it but forgot to change it back after I noticed that most of the public API uses value receivers 😅 Struct Table has methods on both value and pointer receivers. Such usage is not recommended by the Go Documentation.

I reverted the change 👍

Comment thread table/orphan_cleanup.go
if info.IsDir() || !info.ModTime().Before(cutoff) {
return nil
}
scannedFiles = append(scannedFiles, scannedFile{path: path, size: info.Size()})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safe today — goroutine 1 only writes referencedFiles, goroutine 2 only writes scannedFiles, no cross-reads. But the contract that these two are partitioned is invisible to a future maintainer, and there's no mutex or comment to flag it.

A short comment on the errgroup block ("goroutine 1 owns referencedFiles, goroutine 2 owns scannedFiles; no shared writes") would make the next edit safer. wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, added short comment.

Comment thread table/orphan_cleanup.go
normalizedFile := normalizeFilePathWithConfig(file, cfg)

if referencedFiles[file] || referencedFiles[normalizedFile] {
if _, ok := referencedFiles[file]; ok {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a correctness fix hiding inside the refactor, not just a style change. The old referencedFiles[file] || referencedFiles[normalizedFile] returned the map value, which is false for metadata-file entries (manifest lists, manifests, statistics) — so those entries silently fell through into the checkPrefixMismatch branch and could trip PrefixMismatchError on a host mismatch. The comma-ok form correctly treats any key-in-map as referenced, regardless of the bool value.

Worth a one-line comment locking in the intent ("any presence in referencedFiles means referenced; the bool distinguishes data vs metadata for gc.enabled, not membership") and ideally a regression test for the metadata-file-with-host-mismatch case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, didn't notice that 😀 I added both a comment and unit tests.

@paveon paveon requested a review from laskoviymishka May 30, 2026 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Orphan cleanup reads shared manifests redundantly — O(snapshots × manifests) I/O

2 participants