Skip to content

gix: cache packed-refs in a HashMap during fetch update_refs#2605

Closed
Nik B (nikicat) wants to merge 3 commits into
GitoxideLabs:mainfrom
nikicat:packed-refs-hashmap-in-fetch-update
Closed

gix: cache packed-refs in a HashMap during fetch update_refs#2605
Nik B (nikicat) wants to merge 3 commits into
GitoxideLabs:mainfrom
nikicat:packed-refs-hashmap-in-fetch-update

Conversation

@nikicat
Copy link
Copy Markdown

Summary

fetch::refs::update() calls repo.try_find_reference(name) once per advertised mapping. Each call does a filesystem stat for a loose ref followed by a binary search over packed-refs — and on wide-refs mirrors (e.g. AUR's 154k branches) that loop dominates wall time.

This PR snapshots packed-refs into a HashMap<BString, gix_ref::Reference> once and enumerates loose ref names into a HashSet once at the start of update(). Per-mapping lookups consult the HashMap when the name has no loose shadow, falling back to repo.try_find_reference() only when a loose ref exists (preserving the loose-shadows-packed precedence) or when the snapshot couldn't be built.

The optimization is purely additive: any failure to build the snapshots makes build_lookup_fast_path return None and the loop falls back to the unchanged slow path.

Measured impact

gitaur -Sy against the AUR mirror (154k packed refs, ~200 loose refs, warm cache, no incoming updates):

variant wall user CPU
unpatched main 11.0 s 8.1 s
this PR alone 8.0 s 5.8 s
this PR + #2604 (gix-ref binary-search skip-validation) 5.0 s 3.3 s

The two changes target different fetch phases:

  • This PR drops the post-pack ref-update loop from 154k binary searches to 154k hash lookups.
  • gix-ref: skip name validation in packed-refs binary search #2604 drops the per-comparison validation in packed::Buffer::binary_search_by, which speeds up the negotiation have-set build phase (still does many binary searches via Negotiate::mark_complete).

Together they bring gitaur's incremental fetch on the AUR mirror to roughly the same wall time as git fetch, down from ~2.2× slower previously.

Trade-offs / scope

  • One extra full iteration over packed-refs per update() call (cost amortized across all mappings). For small repos with few refs the overhead is in the microseconds — not benchmarked specifically, but the HashMap insertions are O(n) where n is the packed-refs entry count, and a gitoxide smoke run on a small repo showed no regression.
  • Loose-ref snapshot uses repo.refs.loose_iter() and bails to the slow path on iteration error. For repos where most refs are loose this is essentially the same work as today; for mirror-style repos with packed-heavy state it's the right shape.
  • The optimization could plausibly live deeper, inside packed::Buffer itself (e.g. a lazy index of name→offset), to also accelerate non-fetch callers of try_find_full_name. That's a larger change touching a cached data structure — happy to discuss whether you'd prefer the optimization in that form instead, but this localized version was easier to scope, measure, and review independently.

Test coverage

cargo test -p gix-ref passes (161/161). cargo test -p gix --features blocking-network-client --test gix remote::fetch::blocking_and_async_io passes the 7 tests that pass on main (3 tests fail with the same hash-mismatch on main without this PR — pre-existing, fixture-dependent).

AI disclosure

Per CONTRIBUTING.md, this change was drafted with the help of Claude Opus 4.7 (see the Co-authored-by: trailer on the commit). Investigation, profiling, and review were done by me.

`fetch::refs::update()` calls `repo.try_find_reference(name)` once per
advertised mapping. Each call does a filesystem stat for a loose ref
followed by a binary search over `packed-refs` — and on wide-refs
mirrors (e.g. AUR's 154k branches) that loop dominates wall time.

Snapshot `packed-refs` into a `HashMap<BString, gix_ref::Reference>`
once and enumerate loose refs into a `HashSet` once at the start of
`update()`. Per-mapping lookups consult the HashMap when the name has
no loose shadow, falling back to `repo.try_find_reference()` only when
a loose ref exists (preserving the precedence rule) or when the
snapshot couldn't be built. The optimization is purely additive: any
failure to build the snapshots falls back to the unchanged slow path.

Measured on a `gitaur -Sy` against the AUR mirror (154k packed refs,
~200 loose refs, warm cache, no incoming updates):

  unpatched   11.0s wall / 8.1s user CPU
  (C) alone    8.0s wall / 5.8s user CPU   (~27% faster)
  (B) + (C)    5.0s wall / 3.3s user CPU   (~55% faster, ≈ git CLI)

(B) is the sibling change in gix-ref that drops name validation from
the binary-search comparator; (C) and (B) target different phases —
the have-set build still drives binary searches (helped by B), while
update_refs itself is now a hash lookup (helped by C).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa9dc5550a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".

let packed: HashMap<BString, gix_ref::Reference> = iter
.filter_map(Result::ok)
.map(|r| {
let name = r.name.as_bstr().to_owned();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve namespace when keying packed-refs snapshot

Store entries in packed using namespace-aware names: this code keys by r.name as read from packed-refs, but lookups are performed with local mapping names (fp.lookup(name.as_bstr())) that are not namespace-prefixed. In repositories with core.refsNamespace set, try_find_reference() applies/strips the namespace during lookup, while this fast path does not, so existing packed refs can be misclassified as missing (or matched against non-namespaced refs), leading to incorrect update planning and possible transaction failures during fetch.

Useful? React with 👍 / 👎.

let buf = repo.refs.cached_packed_buffer().ok().flatten()?;
let iter = buf.iter().ok()?;
let packed: HashMap<BString, gix_ref::Reference> = iter
.filter_map(Result::ok)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop discarding ref-parse errors in snapshot construction

Do not ignore iterator errors here: filter_map(Result::ok) silently drops malformed packed/loose refs while building the fast-path snapshot. That changes behavior versus repo.try_find_reference(); for example, a malformed loose ref that shadows a packed ref would now be skipped, allowing the packed ref to be used instead of surfacing the corruption. The fast path should bail out to the slow path (or propagate) on iterator errors to preserve existing semantics.

Useful? React with 👍 / 👎.

Nik B (nikicat) added a commit to nikicat/gitaur that referenced this pull request May 17, 2026
…wins

Combines two upstream PRs that are not yet released:
  - GitoxideLabs/gitoxide#2604 — gix-ref: skip name validation in
    packed-refs binary search.
  - GitoxideLabs/gitoxide#2605 — gix: cache packed-refs in a HashMap
    during fetch update_refs.

On `gitaur -Sy` against the AUR mirror (154k branches in packed-refs,
warm cache, no incoming updates) the combo brings wall time from
~11.0s on stock gix 0.83 to ~3.4-5.0s, putting gitaur on par with
system `git fetch` on the same workload. Profile evidence:
docs/PROFILING.md plus the `gitaur-bc.json.gz` samply capture.

Revert to a versioned crates.io dependency once both PRs land and a
gix release ships with them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The packed-refs fast path keyed entries by their raw on-disk names, which
include the namespace prefix when `core.refsNamespace` is set, while
lookups use namespace-stripped local names — packed refs would be
misclassified as missing. Bail out when a namespace is configured so the
slow path applies/strips it correctly.

Also stop dropping iterator errors via `filter_map(Result::ok)`. A
malformed loose ref shadowing a packed entry would otherwise let the
fast path return the packed value instead of surfacing the corruption
the slow path raises.
@nikicat
Copy link
Copy Markdown
Author

Superseded by #2608, which moves the optimization inside gix_ref::packed::Buffer as a lazy name → record-offset index. Same speedup for fetch on wide-refs mirrors, plus every other caller of try_find_full_name benefits, no changes needed in update_refs. Closing this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant