perf(table): dedupe shared manifest reads in orphan cleanup#1133
perf(table): dedupe shared manifest reads in orphan cleanup#1133paveon wants to merge 3 commits into
Conversation
laskoviymishka
left a comment
There was a problem hiding this comment.
Nice work on this one.
Two things before merge:
-
DeleteOrphanFiles,executeOrphanCleanup, andgetReferencedFilesmoved to pointer receivers, butPurgeFiles, which also callsgetReferencedFiles, stayed as a value receiver. Since none of these mutatet, I’d either revert all three or movePurgeFilestoo. Splitting receiver kinds in the public API is awkward to change later. -
The
_, ok := referencedFiles[file]change inisFileOrphanis actually a correctness fix, not just cleanup. The old map-value form returnedfalsefor metadata entries and let them fall through into the prefix-mismatch path. Worth adding a short comment, and ideally a small test for metadata with a host mismatch.
One non-blocker: TotalSizeBytes changed meaning from “all age-filtered scanned files” to “orphans only”. I think the new meaning is better and matches the CLI prompt, but a field comment would make that explicit. Fine as a follow-up.
Once the receiver consistency and _, ok intent are covered, happy to approve.
| } | ||
|
|
||
| func (t Table) DeleteOrphanFiles(ctx context.Context, opts ...OrphanCleanupOption) (OrphanCleanupResult, error) { | ||
| func (t *Table) DeleteOrphanFiles(ctx context.Context, opts ...OrphanCleanupOption) (OrphanCleanupResult, error) { |
There was a problem hiding this comment.
DeleteOrphanFiles, executeOrphanCleanup, and getReferencedFiles moved to pointer receivers, but PurgeFiles further down still uses a value receiver and calls into getReferencedFiles. None of these methods actually mutate t, so I'd either revert all three back to value receivers, or move PurgeFiles to *Table as well — splitting the method set across receiver kinds in the public API is the kind of subtle thing that's hard to undo later. wdyt?
There was a problem hiding this comment.
It got flagged by my IDE because some methods already use pointer receivers so I changed it but forgot to change it back after I noticed that most of the public API uses value receivers 😅 Struct Table has methods on both value and pointer receivers. Such usage is not recommended by the Go Documentation.
I reverted the change 👍
| if info.IsDir() || !info.ModTime().Before(cutoff) { | ||
| return nil | ||
| } | ||
| scannedFiles = append(scannedFiles, scannedFile{path: path, size: info.Size()}) |
There was a problem hiding this comment.
Safe today — goroutine 1 only writes referencedFiles, goroutine 2 only writes scannedFiles, no cross-reads. But the contract that these two are partitioned is invisible to a future maintainer, and there's no mutex or comment to flag it.
A short comment on the errgroup block ("goroutine 1 owns referencedFiles, goroutine 2 owns scannedFiles; no shared writes") would make the next edit safer. wdyt?
There was a problem hiding this comment.
Good point, added short comment.
| normalizedFile := normalizeFilePathWithConfig(file, cfg) | ||
|
|
||
| if referencedFiles[file] || referencedFiles[normalizedFile] { | ||
| if _, ok := referencedFiles[file]; ok { |
There was a problem hiding this comment.
This is actually a correctness fix hiding inside the refactor, not just a style change. The old referencedFiles[file] || referencedFiles[normalizedFile] returned the map value, which is false for metadata-file entries (manifest lists, manifests, statistics) — so those entries silently fell through into the checkPrefixMismatch branch and could trip PrefixMismatchError on a host mismatch. The comma-ok form correctly treats any key-in-map as referenced, regardless of the bool value.
Worth a one-line comment locking in the intent ("any presence in referencedFiles means referenced; the bool distinguishes data vs metadata for gc.enabled, not membership") and ideally a regression test for the metadata-file-with-host-mismatch case.
There was a problem hiding this comment.
Good catch, didn't notice that 😀 I added both a comment and unit tests.
Fixes #1132
Summary
getReferencedFilesusing a two-pass approach: collect unique manifests from manifest lists (pass 1), then read entries in parallel viaerrgroup.SetLimit(pass 2)time.Now()being evaluated per-file inside walk callback instead of once before the walkReduces
getReferencedFilesI/O from O(snapshots × manifests-per-snapshot) to O(unique manifests).