Skip to content

gix-ref: skip name validation in packed-refs binary search#2604

Open
Nik B (nikicat) wants to merge 3 commits into
GitoxideLabs:mainfrom
nikicat:skip-validation-in-binary-search
Open

gix-ref: skip name validation in packed-refs binary search#2604
Nik B (nikicat) wants to merge 3 commits into
GitoxideLabs:mainfrom
nikicat:skip-validation-in-binary-search

Conversation

@nikicat
Copy link
Copy Markdown

Summary

packed::Buffer::binary_search_by currently calls decode::reference on every comparison step of the binary search (~17 per lookup on a 154k-ref store). Each decode::reference runs gix_validate::reference::namegix_validate::tag::name_inner and full hex validation on the hash. The binary search only needs the name bytes for ordered comparison; the final match in try_find_full_name already re-parses the record through decode::reference and surfaces any parse error.

This PR introduces decode::name_at_record_start, which scans for the <hash><space><name><newline> shape and returns just the name slice without validating either piece, and uses it from the binary search. Parse-failure detection is preserved via the same flag the previous implementation set.

Measured impact

Profiled on a real-world wide-refs incremental fetch — gitaur -Sy against the AUR mirror (154k branches in packed-refs, ~12 MB file) on a warm cache with no incoming ref updates, sampled with samply:

metric before after delta
wall time 11.0 s 7.7 s −30 %
user CPU 8.1 s 5.1 s −37 %
gix_validate::tag::name_inner self time 14.5 % 2.5 % −83 %
gix_ref::packed::decode::reference self time 13.7 % 2.5 % −82 %

gix_validate::tag::name_inner was the single largest CPU sink, all of it coming from inside this loop.

Test coverage

cargo test -p gix-ref passes locally (161/161). The existing tests for try_find / try_find_full_name exercise both the hit and miss paths against packed-refs fixtures; behavior on malformed records is preserved by detecting the shape mismatch and setting the same encountered_parse_failure flag as before, so a corrupt packed-refs file still surfaces as Error::Parse.

AI disclosure

Per CONTRIBUTING.md, this change was drafted with the help of Claude Opus 4.7 (see the Co-authored-by trailer on the commit). The investigation, profiling methodology, and final review were performed by me.

Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8726942cfc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".

Comment on lines +98 to +102
match packed::decode::name_at_record_start(line, self.object_hash) {
Some(name) => name,
None => {
encountered_parse_failure = true;
})
.unwrap_or(&[])
&[]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Surface malformed packed-refs records as parse failures

This match only treats shape mismatches as parse failures, so records with the right <hash><space><name><newline> layout but invalid hash/refname content are now considered comparable and no longer flip encountered_parse_failure. In a sorted packed-refs file containing such a malformed line (that is not the exact lookup target), try_find_full_name can return Ok(None) instead of Error::Parse, which hides on-disk corruption that the previous decode::reference-based comparison reported whenever the binary search touched that record.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pre-PR detection here was already opportunistic, not a guarantee: encountered_parse_failure only flipped for records the binary search happened to touch (~O(log n) per query), so corruption in untouched lines was never reported by try_find_full_name either. Skipping the full gix_validate::reference::name call on those comparisons is the whole point of the change.

Full validation still runs in the paths that matter:

  • on a match, try_find_full_name re-parses through decode::reference and returns Err(Parse) on bad content
  • full iteration (buf.iter(), used by loose_iter / iter_packed) validates every record

Leaving as-is.

`packed::Buffer::binary_search_by` calls `decode::reference` on every
comparison step (~17 per lookup on a 154k-ref store), and each decode
runs `gix_validate::reference::name` -> `gix_validate::tag::name_inner`
plus full hex validation on the hash. The binary search only needs the
name bytes for ordered comparison; the final match in `try_find_full_name`
already re-parses the record through `decode::reference` and surfaces any
parse error.

Introduce `decode::name_at_record_start`, which scans for the
`<hash><space><name><newline>` shape and returns just the name slice
without validating either piece, and use it from the binary search.
Parse-failure detection is preserved via the same flag the previous
implementation set.

On a wide-refs incremental fetch (gitaur AUR mirror, 154k branches)
this is the single largest CPU sink — `gix_validate::tag::name_inner`
shows up at ~14% of total fetch CPU in `samply`, all of it from this
loop. End-to-end on the same fetch the change brings wall time from
~11.0s to ~7.7s (~30%) and user CPU from ~8.1s to ~5.1s (~37%).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@nikicat Nik B (nikicat) force-pushed the skip-validation-in-binary-search branch from 8726942 to a8f7474 Compare May 17, 2026 17:00
@nikicat
Copy link
Copy Markdown
Author

Pushed a small follow-up that elides the redundant lifetime on name_at_record_start (clippy needless_lifetimes). The remaining gix-pack::pack multi_index::access::from_memory failure on the previous CI run looks unrelated to this PR — the test touches nothing under gix-ref and passes locally on this branch with cargo test -p gix-pack.

Nik B (nikicat) added a commit to nikicat/gitaur that referenced this pull request May 17, 2026
…wins

Combines two upstream PRs that are not yet released:
  - GitoxideLabs/gitoxide#2604 — gix-ref: skip name validation in
    packed-refs binary search.
  - GitoxideLabs/gitoxide#2605 — gix: cache packed-refs in a HashMap
    during fetch update_refs.

On `gitaur -Sy` against the AUR mirror (154k branches in packed-refs,
warm cache, no incoming updates) the combo brings wall time from
~11.0s on stock gix 0.83 to ~3.4-5.0s, putting gitaur on par with
system `git fetch` on the same workload. Profile evidence:
docs/PROFILING.md plus the `gitaur-bc.json.gz` samply capture.

Revert to a versioned crates.io dependency once both PRs land and a
gix release ships with them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cover the comparator's contract directly: valid records terminated by
`\n` or `\r`, the empty-name and missing-space rejections, truncated
input, inputs shorter than the hash, SHA-256 records, and the
"only the first line" rule that lets the match-site decoder handle the
optional peeled `^<hash>` follow-up.

The new helper sits inside the hot binary-search loop, so pinning its
shape contract independently of `decode::reference` guards against
accidental drift if either is touched later.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@nikicat Nik B (nikicat) force-pushed the skip-validation-in-binary-search branch from d778df6 to 0cf8980 Compare May 18, 2026 09:38
Two small polish changes to the packed-refs binary-search hot path:

- Reword the comment in `binary_search_by` to talk about `log₂(n)`
  redundant `gix_validate::reference::name` calls rather than the
  concrete "~17×" figure tied to a 154k-ref store, so the comment ages
  with the codebase rather than with a specific repo's ref count.
- Expand the doc on `decode::name_at_record_start` to make the
  parse-failure contract explicit: returning `None` is the signal for
  the caller to flip its parse-failure flag, which is what lets a
  search miss against a corrupt packed-refs surface as `Error::Parse`
  instead of `Ok(None)`.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant