Skip to content

Upgrading 18.0.x → 18.1.0 with a reused base graph triggers multi-hour queryTargetsDependingOnModules fan-out #335

@rdark

Description

@rdark

Summary

We upgraded bazel-diff 18.0.0 → 18.1.0. Our CI uses a base graph produced by a prior main commit (still on 18.0.0) and a head graph produced by the upgrading commit (now on 18.1.0). get-impacted-targets then spawned ~5,000 serial bazel query rdeps(//..., @@<repo>//...) subprocesses and took ~3 hours to complete. It was called twice per workflow, so CI took ~7 hours end-to-end.

The cause is #330's stderr redirect fix interacting with #314's module-change detection - not #331's exec-bit hash or the SourceFileHasher format change as I suggested earlier.

Root cause

#330 (shipped in v18.1.0) changed BazelModService:

-    stderr = Redirect.CAPTURE,
+    stderr = Redirect.SILENT,

That is correct — the old CAPTURE+CAPTURE combination hit captureAll in Process.kt, which calls redirectErrorStream(true), merging stderr into the captured stdout. So getModuleGraphJson() on 18.0.x returned "<INFO: Checking for file changes...>\n...<actual JSON>..." whenever bazel emitted anything on stderr during the mod graph --output=json call. The new version returns clean JSON.

The #330 commit message notes this explicitly: "Occasionally I see INFO: Checking for file changes...\n in the metadata when generating hashes, which is extra useless info that causes bazel-diff to think the whole graph has changed."

The fix shipped without a migration path for existing base graphs produced by the old version. In CalculateImpactedTargetsInteractor.execute():

  1. fromModuleGraphJson != toModuleGraphJson on line 44 > always true across the 18.0.x > 18.1.0 boundary (polluted string ≠ clean string).
  2. ModuleGraphParser.parseModuleGraph(fromGraph) - the stderr prefix makes the whole string invalid JSON;
try { ... } catch (e: Exception) {
  return emptyMap()
}
  1. parseModuleGraph(toGraph) parses cleanly to the full module map (~90 modules for us).
  2. findChangedModules(emptyMap, fullMap) reports every module in the new graph as "added".
  3. queryTargetsDependingOnModules(90 moduleKeys, allTargets) then substring-matches each module name against allTargets.keys at CalculateImpactedTargetsInteractor.kt:293. In a workspace with many module extensions, this is a very loose match: "aspect_bazel_lib".contains(...) matches aspect_bazel_lib+, aspect_bazel_lib++toolchains+bats_toolchains, aspect_bazel_lib++toolchains+..., etc. For us that produces ~5,000 matching repo patterns from ~4,500 materialised external repos.
  4. Each match spawns one bazel query rdeps(//..., @@<repo>//...) subprocess. JVM + bazel-client-connect ≈ 2s each, serial. ~5,000 × 2s = ~3h per get-impacted-targets call.

Most of those spawns log Loading: 0 packages loaded; the repo pattern doesn't resolve and the query fails instantly.

Potential fixes

In rough priority order:

  1. Make ModuleGraphParser.parseModuleGraph robust to non-JSON prefix/suffix. Strip or locate the JSON object within the captured string instead of failing the whole parse. This would have let 18.1.0 handle a polluted 18.0.x base graph correctly.
  2. Fall back to computeSimpleImpactedTargets when parse asymmetry is detected. If fromModuleGraphJson != toModuleGraphJson but one side parses to empty, treating every module as "added" is far more expensive than the per-target hash diff would have been. computeSimpleImpactedTargets on a near-total invalidation finished in 8 min in our workspace.
  3. Tighten the module > repo match at CalculateImpactedTargetsInteractor.kt:292-295. The current allTargets.keys.filter { it.contains(moduleName) } is O(modules x targets) with a very loose substring predicate. On workspaces with many module extensions (ours has ~4,500 canonical external repos for ~90 declared modules), a single "changed" module name can match thousands of canonical repo names. Match by canonical repo key or apparent name derived from bazel mod dump_repo_mapping instead. BazelQueryService.discoverRepoMapping already knows how to produce this.
  4. Union rdeps queries. If N modules need rdeps computed, one rdeps(//..., @@a//... + @@b//... + ... + @@n//...) costs roughly the same as one query; N separate subprocesses cost N × startup overhead.
  5. Add a formatVersion field to the hashes JSON metadata. Independent of the above, this would let future serialisation changes quickly fall back to computeSimpleImpactedTargets with a fail fast clear error or instead of silently triggering pathological code paths.

Environment

  • bazel-diff 18.0.0 → 18.1.0 (only change)
  • Bazel 8.6.0, bzlmod, RBE with --remote_download_minimal
  • ~95k targets in the graph, 90 declared bzlmod modules, 4,500+ canonical external repos materialised under $output_base/external/
  • Base graphs stored in cloud object storage between runs

## Misc

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions