feat: add support for GIT LFS checkout#3909
Conversation
f9e4186 to
255aa11
Compare
|
@lizrabuya could you ask for a review in #ask-agents or in our project channel tagging Josh/Ming once you address Dahtey's comments? |
…y are not runnable on windows
242cb29 to
9a649dd
Compare
|
going to tag and throw @buildsworth-bk in here for a review |
There was a problem hiding this comment.
Adds opt-in Git LFS support to the default checkout: a --git-lfs-enabled flag, GIT_LFS_SKIP_SMUDGE=1 so checkout writes pointer files rather than downloading inline, and an explicit git lfs fetch / git lfs checkout after checkout with its own bounded retrier. The double-retry concern from the earlier round looks resolved — exhausting the inner retrier returns WasRetried: true, which breaks the outer checkout retrier, so a persistent LFS failure fails once rather than being re-attempted ~30 times.
Two things to confirm and one small cleanup, all inline. The one I'd most want an answer on is the env var name: the SUP-6529 issue specifies BUILDKITE_GIT_LFS, but this reads BUILDKITE_GIT_LFS_ENABLED — if those don't match what the backend emits, the feature silently never activates.
The PR has no risk: label, so I'm leaving this as a comment rather than an approval.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 518, then answer my questions about the findings.
| cli.BoolFlag{ | ||
| Name: "git-lfs-enabled", | ||
| Usage: "Enable Git LFS object download during checkout (default: false)", | ||
| EnvVar: "BUILDKITE_GIT_LFS_ENABLED", |
There was a problem hiding this comment.
Question: The SUP-6529 issue specifies the env var as BUILDKITE_GIT_LFS ("the backend sets BUILDKITE_GIT_LFS=true"), but this reads BUILDKITE_GIT_LFS_ENABLED. The agent only enables LFS when this name matches exactly what the backend writes into the job environment, so if the backend still emits BUILDKITE_GIT_LFS the feature would silently never run. Has the backend side been confirmed to send BUILDKITE_GIT_LFS_ENABLED?
| // Fail fast before any git work if git-lfs is required but missing. | ||
| if e.GitLFSEnabled { | ||
| if _, err := exec.LookPath("git-lfs"); err != nil { | ||
| return fmt.Errorf("BUILDKITE_GIT_LFS_ENABLED=true but git-lfs binary is not found on PATH: %w", err) | ||
| } | ||
| } |
There was a problem hiding this comment.
Non-blocking: Two things about this fail-fast check. First, the comment says "before any git work", but the mirror update and clone earlier in defaultCheckoutPhase have already run by the time we reach here. Second — and more impactful — the returned error is a plain fmt.Errorf, not a *gitError, so the outer checkout retrier in checkout() doesn't recognise it: errors.As(err, &errGit) fails and it falls through to the default: branch, which removes the checkout dir and retries. A missing git-lfs binary is deterministic, so this re-clones the repo on every attempt (up to ~6×) before finally failing — the opposite of fail-fast. Moving the check above the mirror/clone block and returning something that breaks the outer retrier (e.g. a *gitError{..., WasRetried: true}, the way gitLFSFetchCheckout does on exhaustion) would let it fail once, immediately.
| // disk rather than downloading objects inline. Intentionally not | ||
| // restored — git lfs checkout materialises files from cache without | ||
| // triggering the smudge filter. | ||
| e.shell.Env.Set("GIT_LFS_SKIP_SMUDGE", "1") |
There was a problem hiding this comment.
Non-blocking: GIT_LFS_SKIP_SMUDGE=1 is already set unconditionally in setUp() (executor.go:921), which runs before the checkout phase and is never unset, so this re-set looks redundant — and the "Intentionally not restored" note reads a little oddly since there's no earlier value being restored here. Is there a path where the setUp() setting wouldn't already be in effect at this point? If not, this block could just be the git lfs install --local call.
There was a problem hiding this comment.
The interesting part here is the retry interaction. gitLFSFetchCheckout runs its own 5-attempt budget and tags an exhausted failure with WasRetried + gitErrorLFS, so the outer checkout retrier breaks (and gitErrorLFS falls through to return err without wiping the checkout dir) rather than multiplying this into ~30 attempts. I traced that path and it holds up. The fail-fast git-lfs lookup and the split fetch/checkout with distinct error prefixes are sensible.
Three non-blocking notes inline: a redundant GIT_LFS_SKIP_SMUDGE set, an env-var naming question worth confirming against the backend, and an error-path test that pays the full retry backoff. The PR also has no risk: label, so it falls outside AI-only approval.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 519, then answer my questions about the findings.
| GitSubmodules bool `env:"BUILDKITE_GIT_SUBMODULES"` | ||
|
|
||
| // Whether to enable Git LFS operations during checkout | ||
| GitLFSEnabled bool `env:"BUILDKITE_GIT_LFS_ENABLED"` |
There was a problem hiding this comment.
Question: the SUP-6529 spec uses BUILDKITE_GIT_LFS throughout, but this field (and the CLI flag / backend, per the PR description) uses BUILDKITE_GIT_LFS_ENABLED. Since the backend is what sets this var in the job env, a name mismatch would silently no-op the feature rather than erroring. Has the rename been confirmed end-to-end with the backend, and the spec updated to match?
| // disk rather than downloading objects inline. Intentionally not | ||
| // restored — git lfs checkout materialises files from cache without | ||
| // triggering the smudge filter. | ||
| e.shell.Env.Set("GIT_LFS_SKIP_SMUDGE", "1") |
There was a problem hiding this comment.
Non-blocking: GIT_LFS_SKIP_SMUDGE=1 is already set unconditionally in setUp() (executor.go:921), which always runs before the checkout phase, so this set is redundant. The comment here reads as though it's doing the work, which could lead a future reader to think the setUp() one is removable (or vice versa). Probably cleaner to drop this line and rely on the setUp() set, or note explicitly that it's belt-and-braces.
| wantErr: "installing git lfs filter", | ||
| }, | ||
| { | ||
| name: "LFS enabled git lfs fetch fails", |
There was a problem hiding this comment.
Non-blocking: this subtest reaches gitLFSFetchCheckout through the production path with Retry: true, so it burns the full 5-attempt backoff (~16s per the comment in git.go) on every run. Since the assertion is really about error wrapping ("git lfs fetch"), you could call gitLFSFetchCheckout directly with Retry: false to get the same coverage without the wait, and keep any retried-path assertions to a dedicated test.
Description
Add configuration and environment variables when the backend sets
BUILDKITE_GIT_LFS_ENABLED=truein the job environment. WhenBUILDKITE_GIT_LFS_ENABLED=trueis set, the agent will check if thegit-lfsbinary is installed and proceed to perform agit lfs fetch+git lfs checkout.Changes
Configuration & CLI (clicommand/bootstrap.go, internal/job/config.go)
Checkout behaviour (internal/job/checkout.go)
LFS fetch helper (internal/job/git.go)
Smudge bypass (internal/job/executor.go)
Tests (internal/job/checkout_test.go)
Windows test caveats
Two of the LFS error-path subtests (
LFS enabled git lfs command failsandLFS enabled git lfs fetch fails) are skipped on Windows. They cannot run there for a fundamental reason, not flakiness:PATHto point at a fakegit-lfsscript that exits non-zero, so we can assert our error-wrapping behaviour.git-lfs.exeinsideGIT_EXEC_PATH(the Git installation'slibexec/git-coredirectory). Whengitresolves a subcommand likegit lfs ..., it looks inGIT_EXEC_PATHbefore falling back toPATH.git-lfs.exeruns, succeeds, and the test fails because we expected our fake to return an error.gitLFSFetchCheckoutandgit lfs install --localfailure handling) is platform-agnostic, so the Linux runs give us the real coverage. Skipping on Windows is a test-harness limitation, not a behavioural gap.Additionally, the two happy-path subtests (
LFS disabled,LFS enabled binary present) useos.MkdirTemp+ a best-effortRemoveAllint.Cleanuprather thant.TempDir(). On Windows, git's child processes (credential helpers, thegit-lfsfilter-process) can keep file handles open past their parent's exit.t.TempDir()'s cleanup is strict and fails the test onunlinkat ... being used by another processerrors; the best-effort variant lets the test pass while still cleaning up what it can.Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Disclosures / Credits
The main function changes were implemented by me based on the recommendation from the research done for this function. I used Claude to review the code and to assist with writing the tests.