feat(pm): add native-git BFS resolution (2/3) #2623
Conversation
Summary of Changes — Hello @killagu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Code Review
This pull request introduces native git dependency resolution using the gix crate, gated behind a native-git feature flag. The implementation includes a new resolver/git.rs module with logic for cloning, caching, and resolving git dependencies. The changes are well-structured, with a clear separation of concerns between blocking git operations and the async resolver logic. The use of tokio::sync::OnceCell for deduplicating concurrent clones is a good approach. My feedback focuses on improving the robustness of error handling in the new git resolver by propagating errors instead of silently ignoring them, and refining error messages for clarity.
```rust
if entry.mode().kind() == gix::object::tree::EntryKind::BlobExecutable {
    use std::os::unix::fs::PermissionsExt;
    let perms = std::fs::Permissions::from_mode(0o755);
    std::fs::set_permissions(&entry_path, perms).ok();
```
Silently ignoring the error from set_permissions with .ok() could lead to unexpected behavior if the executable bit is not set correctly on systems where it matters. It's better to propagate the error to make the cloning process more robust and debuggable.
```diff
- std::fs::set_permissions(&entry_path, perms).ok();
+ std::fs::set_permissions(&entry_path, perms)?;
```
```rust
let obj = repo.find_object(entry.object_id())?;
let target = std::str::from_utf8(&obj.data).context("non-UTF-8 symlink target")?;
#[cfg(unix)]
std::os::unix::fs::symlink(target, &entry_path).ok();
```
Silently ignoring the error from symlink with .ok() could hide issues like lack of filesystem support or permission errors. It's better to propagate the error to ensure the cached package is a faithful representation of the git tree.
```diff
- std::os::unix::fs::symlink(target, &entry_path).ok();
+ std::os::unix::fs::symlink(target, &entry_path)?;
```
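A later commit in this PR also mentions a related symlink-escape fix ("validate targets stay within extraction root"). A minimal, purely lexical sketch of such a containment check — `symlink_stays_within` is a hypothetical helper, not the PR's actual implementation:

```rust
use std::path::{Component, Path, PathBuf};

/// Returns true if `target`, resolved relative to `link_dir`, stays inside
/// `root`. Purely lexical: `..` components are popped, no symlinks are
/// followed. Hypothetical helper illustrating the containment idea.
fn symlink_stays_within(root: &Path, link_dir: &Path, target: &str) -> bool {
    let target = Path::new(target);
    if target.is_absolute() {
        return false;
    }
    let mut resolved: PathBuf = link_dir.to_path_buf();
    for comp in target.components() {
        match comp {
            Component::ParentDir => {
                if !resolved.pop() {
                    return false;
                }
            }
            Component::Normal(c) => resolved.push(c),
            Component::CurDir => {}
            // Prefix / RootDir should not appear in a relative target
            _ => return false,
        }
    }
    resolved.starts_with(root)
}
```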
```rust
// Extract full tree to cache
if package_dir.exists() {
    std::fs::remove_dir_all(&package_dir).ok();
```
Silently ignoring the error from remove_dir_all with .ok() could hide underlying problems, such as file permission issues in the cache directory. This might lead to subsequent, harder-to-debug failures during extraction. It would be better to propagate the error.
```diff
- std::fs::remove_dir_all(&package_dir).ok();
+ std::fs::remove_dir_all(&package_dir)?;
```
```rust
    )
})
.await
.context("git resolver task panicked")?
```
The error message "git resolver task panicked" is not entirely accurate. A JoinError from tokio::task::spawn_blocking can occur if the task is cancelled, not just when it panics. A more general message would be more appropriate.
```diff
- .context("git resolver task panicked")?
+ .context("git resolver task failed")?
```
Pull request overview
This PR adds a native-git-gated, gix-powered git clone backend and wires non-registry dependency specs into the BFS resolver, using a shared clone-dedup cache and a configurable cache directory to persist git checkouts alongside registry tarball cache layout.
Changes:
- Introduce `resolver/git.rs` (clone + cache + git/GitHub spec resolution) behind the `native-git` feature flag.
- Route non-registry specs through the git resolver during BFS dependency processing, and add `BuildDepsConfig.cache_dir`.
- Add optional `gix`/`tempfile` dependencies and re-export `clone_repo` (feature-gated) from the crate.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| crates/ruborist/src/service/api.rs | Passes cache_dir into BuildDepsConfig for git clone caching. |
| crates/ruborist/src/resolver/registry.rs | Refines ResolveError::Git to include { url, message } for better error reporting. |
| crates/ruborist/src/resolver/mod.rs | Adds the new git resolver module under cfg(feature = "native-git"). |
| crates/ruborist/src/resolver/git.rs | Implements gix-based cloning, cache extraction, dedup via OnceCell, and manifest building from cached package.json. |
| crates/ruborist/src/resolver/builder.rs | Adds cache_dir to config and routes non-registry specs through git resolution in BFS. |
| crates/ruborist/src/lib.rs | Re-exports clone_repo under native-git. |
| crates/ruborist/Cargo.toml | Adds native-git feature and optional deps (gix, tempfile). |
```rust
// Extract full tree to cache
if package_dir.exists() {
    std::fs::remove_dir_all(&package_dir).ok();
```
If an existing package_dir is present, remove_dir_all errors are ignored (.ok()). If removal fails (permissions, locks, partial state), the subsequent extraction can leave a corrupted cache directory. Consider propagating the error (or at least logging and bailing) so callers don't proceed with a potentially inconsistent cache entry.
```diff
- std::fs::remove_dir_all(&package_dir).ok();
+ std::fs::remove_dir_all(&package_dir).with_context(|| {
+     format!(
+         "failed to remove existing git package cache directory: {}",
+         package_dir.display()
+     )
+ })?;
```
```rust
    .await?;

Ok(ResolvedPackage {
    name: dep_name.to_string(),
```
ResolvedPackage.name is set to dep_name, whereas registry resolution returns the package's canonical name from the manifest. For consistency (and to avoid surprising downstream consumers of ResolvedPackage), consider setting ResolvedPackage.name to result.name (the package.json name) and treating dep_name purely as the edge/install name handled by the graph.
```diff
- name: dep_name.to_string(),
+ name: result.name,
```
```rust
// Route non-registry specs (git, github, …) through the
// built-in git resolver; everything else goes to the registry.
let resolved = if is_non_registry_spec(&edge_info.spec) {
    match resolve_non_registry_dep(&config.cache_dir, &edge_info.name, &edge_info.spec)
        .await
```
is_non_registry_spec() includes local specs (file:/link:/workspace:/portal:). Routing all of these through resolve_non_registry_dep will currently bail as "Unsupported non-registry spec" and then be surfaced as ResolveError::Git, breaking valid workspace/local dependency specs. Consider narrowing the routing condition to only git-like specs (git+, git://, github:) or parsing with PackageSpec::parse() and only invoking the git resolver for PackageSpec::Git|GitHub; local specs should be handled separately (e.g., resolved to workspace/link nodes or skipped).
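The narrowing the comment suggests can be sketched as a small classifier. `SpecKind` and `classify_spec` are hypothetical names — the real code would delegate to `PackageSpec::parse()` — but the routing logic is the same:

```rust
/// Which resolver should handle a dependency spec string.
/// Hypothetical sketch of the suggested narrowing.
#[derive(Debug, PartialEq)]
enum SpecKind {
    Git,      // git+…, git://, github: — route to the git resolver
    Local,    // file:, link:, workspace:, portal: — handle separately
    Registry, // everything else (semver ranges, tags, aliases)
}

fn classify_spec(spec: &str) -> SpecKind {
    if spec.starts_with("git+") || spec.starts_with("git://") || spec.starts_with("github:") {
        SpecKind::Git
    } else if ["file:", "link:", "workspace:", "portal:"]
        .iter()
        .any(|p| spec.starts_with(p))
    {
        SpecKind::Local
    } else {
        SpecKind::Registry
    }
}
```

With this split, only `SpecKind::Git` specs reach the git resolver, and local specs can be resolved to workspace/link nodes instead of failing as "Unsupported non-registry spec".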
```rust
// Route non-registry specs (git, github, …) through the
// built-in git resolver; everything else goes to the registry.
let resolved = if is_non_registry_spec(&edge_info.spec) {
    match resolve_non_registry_dep(&config.cache_dir, &edge_info.name, &edge_info.spec)
        .await
    {
        Ok(r) => r,
        Err(_) if edge_info.edge_type == EdgeType::Optional => {
            tracing::debug!(
                "Skipped optional non-registry dependency {}@{}",
                edge_info.name,
                edge_info.spec
            );
            return Ok(ProcessResult::Skipped);
        }
        Err(e) => {
            return Err(ResolveError::Git {
                url: edge_info.spec.clone(),
                message: e.to_string(),
            })
        }
    }
```
There are existing async tests in this file, but none cover the new non-registry routing logic in process_dependency (git spec -> git resolver, optional non-registry failures -> skipped, non-optional failures -> ResolveError::Git). Adding targeted tests here would help prevent regressions (including the stubbed behavior when native-git is disabled).
```rust
let key = format!("{}#{}", url, commit_ish.unwrap_or("HEAD"));

let cell = {
    let mut cache = GIT_CLONE_CACHE.lock();
    cache
        .entry(key)
        .or_insert_with(|| Arc::new(tokio::sync::OnceCell::new()))
        .clone()
};

let clone_url = url.strip_prefix("git+").unwrap_or(url).to_string();
let original_url = url.to_string();
let cache_dir = cache_dir.to_path_buf();
let commit_ish_owned = commit_ish.map(|s| s.to_string());
```
The clone dedup key is computed from the raw url (format!("{}#{}", url, ...)) but the actual clone uses clone_url with the git+ prefix stripped. This means callers using logically equivalent URLs (with vs without git+) won't dedup, and (more importantly) the key does not include cache_dir, so two callers using different cache roots in the same process can incorrectly share a cached result and receive a cache_path in the wrong directory. Consider canonicalizing the URL used for the key (strip git+, normalize) and including cache_dir (or otherwise scoping the cache per cache root).
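A canonicalized key along the suggested lines — `clone_dedup_key` is a hypothetical helper that strips `git+` and scopes the key by cache root:

```rust
use std::path::Path;

/// Hypothetical canonical dedup key: strips the `git+` prefix so
/// `git+https://…` and `https://…` share one entry, and includes the cache
/// root so callers with different cache dirs never share a clone result.
fn clone_dedup_key(cache_dir: &Path, url: &str, commit_ish: Option<&str>) -> String {
    let canonical = url.strip_prefix("git+").unwrap_or(url);
    format!(
        "{}::{}#{}",
        cache_dir.display(),
        canonical,
        commit_ish.unwrap_or("HEAD")
    )
}
```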
```rust
// Configure shallow fetch (depth 1)
prepare = prepare.with_shallow(gix::remote::fetch::Shallow::DepthAtRemote(
    std::num::NonZeroU32::new(1).unwrap(),
));

// If commit_ish looks like a branch/tag, configure the ref to fetch
let is_full_sha =
    commit_ish.is_some_and(|c| c.len() == 40 && c.chars().all(|ch| ch.is_ascii_hexdigit()));

if let Some(ref_name) = commit_ish
    && !is_full_sha
{
    // Try as branch/tag ref
    prepare = prepare.with_ref_name(Some(ref_name))?;
}

let (checkout, _outcome) = prepare
    .fetch_only(gix::progress::Discard, &gix::interrupt::IS_INTERRUPTED)
    .map_err(|e| {
        let mut msg = format!("git fetch failed for '{}': {e}", clone_url);
        if github_auth_token().is_none()
            && (clone_url.contains("github.com") || clone_url.contains("gitlab.com"))
        {
            msg.push_str(
                "\n\nTip: set GITHUB_TOKEN or GH_TOKEN for private repos / higher rate limits",
            );
        }
        anyhow!(msg)
    })?;

// Resolve the commit
let commit_id = if is_full_sha {
    let sha = commit_ish.unwrap();
    gix::ObjectId::from_hex(sha.as_bytes())
        .map_err(|e| anyhow!("invalid commit SHA '{}': {e}", sha))?
} else {
    // HEAD of the fetched ref
    checkout
        .head_id()
        .map_err(|e| anyhow!("could not resolve HEAD after fetch: {e}"))?
        .detach()
};
```
clone_repo_blocking always configures a shallow fetch with depth 1, but the code path for a full 40-hex commit_ish just constructs an ObjectId and then immediately tries to look it up in the repo. For non-HEAD commits this object typically won't be present in a depth-1 fetch, so resolving by full SHA will fail deterministically. Consider fetching the specific commit (or disabling shallow / increasing depth) when commit_ish is a full SHA, or otherwise documenting that only refs (branch/tag) are supported with shallow cloning.
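The distinction the comment draws can be made explicit with a small fetch-planning step. `FetchPlan` and `plan_fetch` are hypothetical names sketching the suggested behavior (fetch the exact commit for a full SHA), not the PR's current code:

```rust
/// How to fetch, given the requested commit-ish. A depth-1 shallow fetch
/// only works for refs; a full 40-hex SHA generally needs the commit
/// fetched explicitly (or a non-shallow clone). Hypothetical sketch.
#[derive(Debug, PartialEq)]
enum FetchPlan {
    ShallowHead,        // no commit-ish: depth-1 fetch of the default branch
    ShallowRef(String), // branch/tag: depth-1 fetch of that ref
    ExactCommit(String), // full SHA: fetch that object specifically
}

fn plan_fetch(commit_ish: Option<&str>) -> FetchPlan {
    match commit_ish {
        None => FetchPlan::ShallowHead,
        Some(c) if c.len() == 40 && c.chars().all(|ch| ch.is_ascii_hexdigit()) => {
            FetchPlan::ExactCommit(c.to_string())
        }
        Some(c) => FetchPlan::ShallowRef(c.to_string()),
    }
}
```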
```rust
// Read name/version from package.json blob (cheap — single object read)
let (name, version) = read_pkg_name_version(&checkout, tree_id.into())?;

// Cache path: <cache_dir>/<name>/<commit_sha>/
// Using the full commit SHA as the version directory avoids collisions
// with registry versions and between different commits of the same repo.
// This also matches the layout registry tarballs use (<name>/<version>/),
// so `utoo clean` works uniformly.
let package_dir = cache_dir.join(&name).join(&commit_hex);
let resolved_marker = package_dir.join("_resolved");
```
package_dir is derived from the repo's package.json name and joined directly onto cache_dir. Since this value comes from untrusted content, a malicious package name containing path traversal (e.g. .. segments or platform separators) could escape the intended cache root and write files outside cache_dir. Consider validating/sanitizing the package name (e.g. enforce npm package name rules and reject .., absolute paths, Windows prefixes) before using it as a filesystem path component.
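A sketch of the suggested validation — `is_safe_cache_name` is hypothetical and looser than npm's full naming rules, but it rejects the traversal cases the comment is worried about:

```rust
/// Rejects package names that could escape the cache root when used as a
/// path component. Hypothetical sketch; npm's real name rules are stricter.
fn is_safe_cache_name(name: &str) -> bool {
    if name.is_empty() || name.len() > 214 {
        return false;
    }
    // Allow at most one '/' (for "@scope/name"); forbid traversal segments
    // and absolute / Windows-style path pieces.
    let segments: Vec<&str> = name.split('/').collect();
    if segments.len() > 2 || (segments.len() == 2 && !name.starts_with('@')) {
        return false;
    }
    segments.iter().all(|s| {
        !s.is_empty()
            && *s != "."
            && *s != ".."
            && !s.contains('\\')
            && !s.contains(':')
    })
}
```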
Code Guard Review — review dimensions: type modeling, naming semantics, `match` exhaustiveness, deprecation policy, error-message quality, project conventions. 🔴 Must Fix (3 items): 1.
```rust
/// re-reading the file from disk.
///
/// [`VersionManifest`]: crate::model::manifest::VersionManifest
pub pkg_json: serde_json::Value,
```
Do not use `serde_json::Value` here — use `VersionManifest` directly. Also be mindful of the difference between `package.json` and `VersionManifest`.
```rust
let parsed_spec = PackageSpec::from(edge_info.spec.as_str());
let resolved: ResolvedPackage = match &parsed_spec {
    PackageSpec::Git { .. } | PackageSpec::GitHub { .. } => {
        match resolve_non_registry_dep(config.cache_dir.as_deref(), &parsed_spec).await
```
`resolve_non_registry_dep` will eventually also need to implement spec => version staleness checks. And the current name `resolve_dependency` is not semantic enough — it doesn't convey the caching behavior.
```rust
            }
        }
    }
    PackageSpec::Local { .. } => {
```
```rust
///
/// Only reads the single `package.json` blob — does **not** extract the full
/// tree, so this is cheap even for large repositories.
fn read_pkg_json(repo: &gix::Repository, tree_id: gix::ObjectId) -> Result<serde_json::Value> {
```
```diff
@@ -0,0 +1,547 @@
+//! Git clone backend powered by `gix`.
```
`git.rs` currently contains only synchronous methods, while the pipeline runs on an async runtime — is there an async solution for this?
```rust
/// 2. Clones the repository via [`ensure_repo_cached`].
/// 3. Builds a [`VersionManifest`] from the cached `package.json` (no extra
///    disk I/O — the parsed JSON is already in [`GitCloneResult::pkg_json`]).
pub(crate) async fn resolve_non_registry_dep(
```
```diff
+ // TODO: refactor this to be more extendable
  pub(crate) async fn resolve_non_registry_dep(
```
- Add gix/tempfile as optional deps behind `native-git` feature flag
- Add resolver/git.rs with clone_repo (async), clone_repo_blocking,
resolve_non_registry_dep, build_manifest_from_git_cache
- Route non-registry specs through git resolver in BFS process_dependency
- Add GIT_CLONE_CACHE with tokio::sync::OnceCell for concurrent dedup
- Use tokio::fs for non-blocking package.json reads (arch review fix)
- Add BuildDepsConfig.cache_dir with builder pattern
- Update ResolveError::Git to structured { url, message } fields
- Keep BuildDepsOptions::new() (no public API removal)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```rust
    url: &str,
    commit_ish: Option<&str>,
) -> Result<Arc<GitCloneResult>> {
    // Normalize the dedup key: strip `git+` so `git+https://…` and `https://…`
```
`clone_repo_blocking` always hits the network and never reads the disk cache first:

```rust
fn clone_repo_blocking(...) -> Result<GitCloneResult> {
    let temp_dir = tempfile::tempdir()?;                  // tempdir created
    let (checkout, _outcome) = prepare.fetch_only(...)?;  // network fetch always runs
    // ... compute commit_hex, package_dir, resolved_marker ...
    if resolved_marker.exists() {                         // cache check happens AFTER the fetch!
        return Ok(GitCloneResult { ... });
    }
    // ... extract ...
}
```

The `_resolved` marker provides cross-process atomicity, but it is only checked after the network fetch completes. That means every `utoo install` run re-clones the repository, even when the package is already in the disk cache.

The cache path is fully determined before the fetch (`<cache_dir>/<name>/<sha>/`), so it can be checked up front. Sketch of a fix:

```rust
// Only a full SHA can be checked early (for branch/tag refs we must fetch
// to learn the current HEAD)
if let Some(sha) = commit_ish.filter(|s| s.len() == 40 && s.chars().all(|c| c.is_ascii_hexdigit())) {
    // ...but the name is unknown — read it from the on-disk package.json,
    // or maintain a url -> name index file
}
```
Scan the disk cache before any network I/O when commit_ish is a full 40-char hex SHA. This avoids re-cloning git dependencies on every `utoo install` when they are already cached at `<cache_dir>/<name>/<sha>/`. Branch/tag refs still always fetch (the ref may have moved forward). On cache-read failure the code falls through to the normal network path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```diff
- Git { url: String, message: String },
+ Git {
+     url: String,
+     source: Box<dyn std::error::Error + Send + Sync + 'static>,
```
The other `ResolveError` variants use concrete types (`deno_semver::VersionError`, `E`); only `Git` uses a trait object. This throws away the enum's type-safety advantage — as the Error trait chapter of Comprehensive Rust puts it, "boxing errors gives up the ability to cleanly handle different error cases."

Suggestion: use `#[from]` with a concrete `GitResolveError` type, or at least use `anyhow::Error` instead of a bare `Box` for consistency.
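A sketch of the concrete-type suggestion, written without `thiserror` to stay dependency-free; `GitResolveError` and its variants are hypothetical:

```rust
use std::fmt;

/// A concrete error type for the Git variant, as the review suggests,
/// instead of a bare `Box<dyn Error>`. Hypothetical sketch.
#[derive(Debug)]
enum GitResolveError {
    Fetch { url: String, message: String },
    InvalidSha(String),
}

impl fmt::Display for GitResolveError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            GitResolveError::Fetch { url, message } => {
                write!(f, "git fetch failed for '{url}': {message}")
            }
            GitResolveError::InvalidSha(sha) => write!(f, "invalid commit SHA '{sha}'"),
        }
    }
}

impl std::error::Error for GitResolveError {}
```

With a concrete type, `ResolveError::Git` can carry `#[from] GitResolveError` and callers can still match on the specific failure.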
```rust
/// same repo simultaneously, only one clone is performed; the others await
/// the result. Matches the `DOWNLOAD_CACHE` pattern in the PM crate's
/// `downloader.rs`.
type CloneCache = Mutex<HashMap<String, Arc<tokio::sync::OnceCell<Arc<GitCloneResult>>>>>;
```
```rust
type CloneCache = Mutex<HashMap<String, Arc<tokio::sync::OnceCell<Arc<GitCloneResult>>>>>;
static GIT_CLONE_CACHE: LazyLock<CloneCache> = LazyLock::new(|| Mutex::new(HashMap::new()));
```

The concurrency chapter of Comprehensive Rust stresses being careful about mutex choice in async contexts. Here a `parking_lot::Mutex` (a synchronous lock) guards a `HashMap` used inside async functions. The lock is held only briefly (just `entry().or_insert_with()`), so in theory it won't block the executor, but there are two problems:

- A global static can't be dropped or cleared by tests or `utoo clean`: the in-memory map grows for the lifetime of the process, while `utoo clean` only clears the disk.
- State leaks across tests: `#[tokio::test]` cases share the global.

Suggestion: make `CloneCache` a field on `BuildDepsConfig` or the resolver context (an `Arc`), rather than a static. The existing `VersionManifestCache` implementation is a good reference.
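The instance-scoping suggestion can be sketched with std types (the PR itself uses `tokio::sync::OnceCell` for async dedup); this simplified `CloneCache` is hypothetical:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Instance-scoped clone cache: lives on the resolver config instead of a
/// global static, so it is dropped with the session and does not leak
/// across tests. Simplified hypothetical sketch using std types only.
#[derive(Default, Clone)]
struct CloneCache {
    inner: Arc<Mutex<HashMap<String, Arc<String>>>>, // key -> cached result
}

impl CloneCache {
    /// First caller for a key computes the value; later callers share it.
    fn get_or_insert_with(&self, key: &str, f: impl FnOnce() -> String) -> Arc<String> {
        let mut map = self.inner.lock().unwrap();
        map.entry(key.to_string())
            .or_insert_with(|| Arc::new(f()))
            .clone()
    }
}
```

Cloning the struct shares the same `Arc`, so a `CloneCache` field on the config can be handed to concurrent tasks while the whole map is freed when the session ends.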
| /// `<cache_dir>/@<scope>/<name>/<sha>/_resolved` for scoped packages. | ||
| /// Returns the package directory path if found. | ||
| fn find_cached_git_package(cache_dir: &Path, sha: &str) -> Option<PathBuf> { | ||
| let entries = std::fs::read_dir(cache_dir).ok()?; |
There was a problem hiding this comment.
To find the cache directory for a given SHA, this walks every top-level directory under `cache_dir`, including the second-level directories of scoped packages. With a large cache this gets slow. We can instead read the name and check the path directly: we don't know the version, but the name is known.
We also need to revisit the current error-handling and global-cache approach — the structure should stay consistent.
…yle cleanup

- Delete directory scan (`find_cached_git_package`), use direct path check with caller-supplied name instead
- Move global static `GIT_CLONE_CACHE` to a `BuildDepsConfig` field (async `tokio::sync::Mutex`, scoped to session lifetime)
- `ResolveError::Git` source: `Box<dyn Error>` → `anyhow::Error`
- Fix symlink escape: validate targets stay within extraction root
- Fix path traversal: validate name before early cache lookup
- Fix token injection: skip URLs with existing credentials
- Fix staging dir race: use `tempfile::tempdir_in()` for unique dirs
- Add `PackageSpec::clone_url()` to centralize `git+` prefix stripping
- Add `GitCloneResult::new()` constructor with `debug_assert` consistency
- Unify tracing levels (warn → debug), divider style, doc headers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ResolveError::Git.source is now anyhow::Error, so the conversion from anyhow::Error is identity — clippy::useless_conversion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts:

- Cargo.toml: add gix/tempfile workspace deps with next's formatting
- crates/ruborist/Cargo.toml: keep native-only deps, fix tombi format
- crates/ruborist/src/resolver/builder.rs: reconcile next's refactored manifest deref with PR's `config.legacy_peer_deps`, add missing `HashMap` import
- Cargo.lock: preserve existing dependency versions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dback Address review comments from PR #2622/#2623: Phase 1 (correctness): - Add Local/Http variants to PackageSpec, unify is_non_registry_spec via parse delegation - Convert process_dependency to exhaustive match on PackageSpec variants - Fix TOCTOU race in cache writes with atomic tmp-dir + rename - Replace serde_json::Value with typed PkgNameVersion and direct VersionManifest deserialization Phase 2 (code quality): - Add thiserror derive for ResolveError, preserve error chain in Git variant - Convert GIT_CLONE_CACHE static to CloneCache instance variable in BuildDepsConfig - Consolidate cfg(native-git) checks inside git.rs, remove stubs from builder.rs - Move std::env reads out of ruborist into PM's git_resolver.rs (auth_token parameter) - Small fixes: NonZeroU32::MIN, with_cache_dir(PathBuf), cache_path→store_path, submodule skip trace, error message quote consistency Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Split from #2603 — Part 2 of 3 for git dependency resolution. Depends on #2622.
- `resolver/git.rs` with full gix clone implementation + BFS resolver
- `GIT_CLONE_CACHE` with `tokio::sync::OnceCell` for concurrent clone dedup
- `native-git` feature flag (`gix` + `tempfile` optional deps)
- Non-registry specs routed through the git resolver in BFS `process_dependency`
- `BuildDepsConfig.cache_dir` with builder pattern

Key implementation details

- `clone_repo` (async) wraps `clone_repo_blocking` via `spawn_blocking`
- `resolve_non_registry_dep`, `build_manifest_from_git_cache` live in `resolver/` (not `traits/`) per module boundary convention
- `build_manifest_from_git_cache` uses `tokio::fs::read_to_string` (non-blocking I/O)
- `BuildDepsOptions::new()` preserved — no public API removal
- `service/api.rs` unchanged

Cache layout

Git packages: `<cache_dir>/<name>/<commit_sha>/` — same `<name>/<key>/` layout as registry tarballs, so `utoo clean` works uniformly.

Test plan

- `cargo check -p utoo-ruborist` passes (no feature)
- `cargo check -p utoo-ruborist --features native-git` passes
- `cargo test -p utoo-ruborist` — all tests pass

🤖 Generated with Claude Code