Upgrading 18.0.x → 18.1.0 with a reused base graph triggers multi-hour `queryTargetsDependingOnModules` fan-out

## Summary

We upgraded `bazel-diff` 18.0.0 → 18.1.0. Our CI uses a base graph produced by a prior main commit (still on 18.0.0) and a head graph produced by the upgrading commit (now on 18.1.0). `get-impacted-targets` then spawned ~5,000 serial `bazel query rdeps(//..., @@<repo>//...)` subprocesses and took ~3 hours to complete. It was called twice per workflow, so CI took ~7 hours end-to-end.

The cause is [#330](https://github.com/Tinder/bazel-diff/pull/330)'s stderr redirect fix interacting with [#314](https://github.com/Tinder/bazel-diff/pull/314)'s module-change detection - not #331's exec-bit hash or the `SourceFileHasher` format change as I suggested earlier.

## Root cause

[#330](https://github.com/Tinder/bazel-diff/pull/330) (shipped in v18.1.0) changed `BazelModService`:

```diff
-    stderr = Redirect.CAPTURE,
+    stderr = Redirect.SILENT,
```

That is correct — the old CAPTURE+CAPTURE combination hit captureAll in Process.kt, which calls `redirectErrorStream(true)`, merging stderr into the captured stdout. So `getModuleGraphJson()` on 18.0.x returned `"<INFO: Checking for file changes...>\n...<actual JSON>..."` whenever bazel emitted anything on stderr during the mod graph `--output=json` call. The new version returns clean JSON.

The #330 commit message notes this explicitly: "Occasionally I see INFO: Checking for file changes...\n in the metadata when generating hashes, which is extra useless info that causes bazel-diff to think the whole graph has changed."

The fix shipped without a migration path for existing base graphs produced by the old version. In `CalculateImpactedTargetsInteractor.execute()`:

1. fromModuleGraphJson != toModuleGraphJson on line 44 > always true across the 18.0.x > 18.1.0 boundary (polluted string ≠ clean string).
2. ModuleGraphParser.parseModuleGraph(fromGraph) - the stderr prefix makes the whole string invalid JSON;
```kotlin
try { ... } catch (e: Exception) {
  return emptyMap()
}
```
3. parseModuleGraph(toGraph) parses cleanly to the full module map (~90 modules for us).
4. findChangedModules(emptyMap, fullMap) reports every module in the new graph as "added".
5. queryTargetsDependingOnModules(90 moduleKeys, allTargets) then substring-matches each module name against allTargets.keys at CalculateImpactedTargetsInteractor.kt:293. In a workspace with many module extensions, this is a very loose match: "aspect_bazel_lib".contains(...) matches aspect_bazel_lib+, aspect_bazel_lib++toolchains+bats_toolchains, aspect_bazel_lib++toolchains+..., etc. For us that produces ~5,000 matching repo patterns from ~4,500 materialised external repos.
6. Each match spawns one `bazel query rdeps(//..., @@<repo>//...)` subprocess. JVM + bazel-client-connect ≈ 2s each, serial. ~5,000 × 2s = ~3h per get-impacted-targets call.

Most of those spawns log `Loading: 0 packages loaded`; the repo pattern doesn't resolve and the query fails instantly.

## Potential fixes

In rough priority order:

1. Make ModuleGraphParser.parseModuleGraph robust to non-JSON prefix/suffix. Strip or locate the JSON object within the captured string instead of failing the whole parse. This would have let 18.1.0 handle a polluted 18.0.x base graph correctly.
2. Fall back to computeSimpleImpactedTargets when parse asymmetry is detected. If `fromModuleGraphJson != toModuleGraphJson` but one side parses to empty, treating every module as "added" is far more expensive than the per-target hash diff would have been. computeSimpleImpactedTargets on a near-total invalidation finished in 8 min in our workspace.
3. Tighten the module > repo match at `CalculateImpactedTargetsInteractor.kt:292-295`. The current `allTargets.keys.filter { it.contains(moduleName) }` is O(modules x targets) with a very loose substring predicate. On workspaces with many module extensions (ours has ~4,500 canonical external repos for ~90 declared modules), a single "changed" module name can match thousands of canonical repo names. Match by canonical repo key or apparent name derived from bazel mod dump_repo_mapping instead. BazelQueryService.discoverRepoMapping already knows how to produce this.
4. Union rdeps queries. If N modules need rdeps computed, one rdeps(//..., @@a//... + @@b//... + ... + @@n//...) costs roughly the same as one query; N separate subprocesses cost N × startup overhead.
5. Add a formatVersion field to the hashes JSON metadata. Independent of the above, this would let future serialisation changes quickly fall back to `computeSimpleImpactedTargets` with a fail fast clear error or instead of silently triggering pathological code paths.

## Environment

- bazel-diff 18.0.0 → 18.1.0 (only change)
- Bazel 8.6.0, bzlmod, RBE with --remote_download_minimal
- ~95k targets in the graph, 90 declared bzlmod modules, 4,500+ canonical external repos materialised under `$output_base/external/`
- Base graphs stored in cloud object storage between runs

## Misc
- [reproduce stderr pollution](https://github.com/rdark/bazel-diff/commit/490ae2e860b819485bf173de15543872c7158457)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrading 18.0.x → 18.1.0 with a reused base graph triggers multi-hour `queryTargetsDependingOnModules` fan-out #335

Summary

Root cause

Potential fixes

Environment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Upgrading 18.0.x → 18.1.0 with a reused base graph triggers multi-hour queryTargetsDependingOnModules fan-out #335

Description

Summary

Root cause

Potential fixes

Environment

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Upgrading 18.0.x → 18.1.0 with a reused base graph triggers multi-hour `queryTargetsDependingOnModules` fan-out #335