fix(generation): stabilize prompt hashes across re-runs #62

dmikushin wants to merge 1 commit into repowise-dev:main
Conversation
Graph edge ordering and community IDs were non-deterministic because files are parsed in parallel (ProcessPoolExecutor + as_completed), causing NetworkX node insertion order to vary between runs. Changes: - context_assembler: sort predecessors/successors before including them in FilePageContext so dependents/dependencies lists are identical across runs regardless of graph construction order - graph: rebuild a sorted copy of the undirected graph before passing it to louvain_communities so adjacency traversal order is reproducible; also sort the returned community list by each community's smallest member before assigning integer IDs via enumerate() Adds scripts/diagnose_hash_mismatch.py to verify the fix and identify any remaining sources of hash instability (dep_summaries, betweenness sampling, etc.). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
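The first change can be sketched as follows; a minimal sketch, where the file names and the `dependents`/`dependencies` variables are illustrative, not the PR's actual code:

```python
import networkx as nx

g = nx.DiGraph()
# Parallel parsing means edge insertion order varies run to run;
# simulate one possible ordering here.
g.add_edges_from([("c.py", "a.py"), ("b.py", "a.py")])

# predecessors()/successors() yield nodes in insertion order, so sort
# them before they reach the prompt context.
dependents = sorted(g.predecessors("a.py"))
dependencies = sorted(g.successors("c.py"))

print(dependents)  # ['b.py', 'c.py'] regardless of insertion order
```

Sorting at the point where the lists enter the context keeps the fix local and avoids having to make the whole graph construction deterministic.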
Nice analysis and clean fix. The sorted predecessors/successors and stabilized Louvain ordering both make sense, and the PR description is really well written. A few things before I merge:
Happy to merge once (1) and (2) are addressed.
swati510 left a comment:
Good fix; the non-determinism was real. A couple of things worth looking at:
- In `graph.py`, you rebuild the graph from sorted nodes/edges before Louvain. That works for stability, but `g_stable = nx.Graph()` drops any node/edge attributes that were on the original graph. If downstream code reads attrs off these nodes after `get_communities` returns, it'll silently lose them. Safer to use `nx.relabel_nodes` on a copy, or sort in place via a canonical representation. If no attrs matter here, a comment noting that would help future readers.
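If attributes do matter, another way to get both stability and attribute preservation is to copy node/edge data explicitly when rebuilding; a sketch under that assumption, not the PR's actual code:

```python
import networkx as nx

g = nx.Graph()
g.add_node("b.py", loc=120)
g.add_node("a.py", loc=40)
g.add_edge("b.py", "a.py", weight=3)

# Rebuild in sorted order, carrying node and edge attributes along
# via data=True instead of dropping them with a bare nx.Graph() rebuild.
g_stable = nx.Graph()
g_stable.add_nodes_from(sorted(g.nodes(data=True)))
g_stable.add_edges_from(sorted(g.edges(data=True)))

assert g_stable.nodes["a.py"]["loc"] == 40
assert g_stable.edges["a.py", "b.py"]["weight"] == 3
```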
- `scripts/diagnose_hash_mismatch.py` has hardcoded paths to `~/forge/free-code` and `~/forge/repowise` in the docstring usage example. Either parameterize via argparse (take `--repo-root`) or drop the script once the fix is verified. 226 lines is a lot of one-off debug tooling to keep in-tree.
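Parameterizing could be as small as this; a hypothetical sketch, with the `--repo-root` flag name taken from the comment above:

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    # Replaces hardcoded repository paths with an explicit argument.
    parser = argparse.ArgumentParser(
        description="Diagnose source_hash instability for a repo")
    parser.add_argument("--repo-root", type=Path, required=True,
                        help="root of the repository to diagnose")
    return parser.parse_args(argv)


args = parse_args(["--repo-root", "/tmp/example-repo"])
print(args.repo_root)  # /tmp/example-repo
```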
```python
# via ProcessPoolExecutor + as_completed → non-deterministic insertion
# order in the main graph).
g_und = g.to_undirected()
g_stable = nx.Graph()
```
Confirm this is safe: `nx.Graph()` drops node/edge attributes present on `g_und`. If any downstream consumer reads attrs off these nodes after `get_communities`, this change silently loses them. If no attrs matter, worth a one-line comment saying so.
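For context, the stabilization under discussion can be sketched end to end; a minimal sketch, where `stable_communities` is a hypothetical name and `karate_club_graph` stands in for the real dependency graph:

```python
import networkx as nx


def stable_communities(g: nx.Graph) -> dict:
    g_und = g.to_undirected()
    # Rebuild with nodes and edges added in sorted order so Louvain's
    # adjacency traversal no longer depends on insertion order.
    g_stable = nx.Graph()
    g_stable.add_nodes_from(sorted(g_und.nodes()))
    g_stable.add_edges_from(sorted(tuple(sorted(e)) for e in g_und.edges()))

    communities = nx.community.louvain_communities(g_stable, seed=42)
    # louvain_communities guarantees no list order: sort by each
    # community's smallest member before assigning integer IDs.
    communities = sorted(communities, key=min)
    return {node: i for i, comm in enumerate(communities) for node in comm}


ids = stable_communities(nx.karate_club_graph())
assert ids == stable_communities(nx.karate_club_graph())  # reproducible
```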
Problem
Every `repowise init` re-generates all wiki pages from scratch, even when the codebase hasn't changed. The root cause is non-deterministic `source_hash` values: the SHA-256 is computed over the rendered Jinja2 prompt, and two context variables were unstable across runs.

Source of non-determinism 1: graph edge ordering
Files are parsed in parallel via `ProcessPoolExecutor` + `as_completed`, so the order in which nodes and edges are inserted into the NetworkX graph is non-deterministic. `graph.predecessors()`/`graph.successors()` return nodes in insertion order, so the `dependents` and `dependencies` lists in `FilePageContext` shuffled between runs → different rendered prompt → different `source_hash`.

Fix: sort `predecessors`/`successors` before building `FilePageContext` in `ContextAssembler.assemble_file_page`.

Source of non-determinism 2: Louvain community IDs
`nx.community.louvain_communities` already receives `seed=42`, but the adjacency traversal order inside Louvain still depends on node insertion order (same root cause). Additionally, the community list returned by `louvain_communities` has no guaranteed order, so `enumerate()` assigned different integer IDs to the same community across runs.

Fix: before calling `louvain_communities`, rebuild a sorted copy of the undirected graph (`g_stable`) with nodes and edges added in alphabetical order. After the call, sort the returned community list by each community's lexicographically smallest member before `enumerate()`.

Impact
These two fixes make `source_hash` stable across re-runs for unchanged files, enabling the DB content cache (`_db_content_cache`, keyed by `source_hash`) to skip redundant LLM calls and save API costs.

Testing
Added `scripts/diagnose_hash_mismatch.py`, a diagnostic script that:

- runs `betweenness_centrality()` and `community_detection()` twice and reports any differing values
- for each `file_page` in `wiki.db`, renders the prompt fresh and compares its SHA-256 with the stored `source_hash`
- reports whether a mismatch comes from `dep_summaries` or another factor, with a unified diff

Run from the target repo directory:
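Conceptually, the stored-vs-fresh check the script performs can be sketched like this; illustrative only, with `stored_prompt`/`fresh_prompt` as stand-ins rather than the script's actual variables:

```python
import difflib
import hashlib


def sha256_hex(text: str) -> str:
    # Hash the fully rendered prompt text, as the source_hash is computed.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Stand-in prompts: a stored render vs. a fresh render of the same page.
stored_prompt = "dependents: ['c.py', 'b.py']"
fresh_prompt = "dependents: ['b.py', 'c.py']"

if sha256_hex(stored_prompt) != sha256_hex(fresh_prompt):
    # On mismatch, a unified diff pinpoints which context variable moved.
    diff = "\n".join(difflib.unified_diff(
        stored_prompt.splitlines(), fresh_prompt.splitlines(),
        fromfile="stored", tofile="fresh", lineterm=""))
    print(diff)
```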
Checklist
- `context_assembler.py`: sort predecessors/successors
- `graph.py`: sorted graph copy + sorted community list before enumerate
- `scripts/diagnose_hash_mismatch.py`: diagnostic tool