fix(generation): stabilize prompt hashes across re-runs #62

dmikushin wants to merge 1 commit into repowise-dev:main
Conversation
Graph edge ordering and community IDs were non-deterministic because files are parsed in parallel (ProcessPoolExecutor + as_completed), causing NetworkX node insertion order to vary between runs. Changes: - context_assembler: sort predecessors/successors before including them in FilePageContext so dependents/dependencies lists are identical across runs regardless of graph construction order - graph: rebuild a sorted copy of the undirected graph before passing it to louvain_communities so adjacency traversal order is reproducible; also sort the returned community list by each community's smallest member before assigning integer IDs via enumerate() Adds scripts/diagnose_hash_mismatch.py to verify the fix and identify any remaining sources of hash instability (dep_summaries, betweenness sampling, etc.). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
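The first change can be sketched as follows; a minimal sketch, where the file names and the `dependents`/`dependencies` variables are illustrative, not the PR's actual code:

```python
import networkx as nx

g = nx.DiGraph()
# Parallel parsing means edge insertion order varies run to run;
# simulate one possible ordering here.
g.add_edges_from([("c.py", "a.py"), ("b.py", "a.py")])

# predecessors()/successors() yield nodes in insertion order, so sort
# them before they reach the prompt context.
dependents = sorted(g.predecessors("a.py"))
dependencies = sorted(g.successors("c.py"))

print(dependents)  # ['b.py', 'c.py'] regardless of insertion order
```

Sorting at the point where the lists enter the context keeps the fix local and avoids having to make the whole graph construction deterministic.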
Nice analysis and clean fix. The sorted predecessors/successors and stabilized Louvain ordering both make sense, and the PR description is really well written. A few things before I merge:
Happy to merge once (1) and (2) are addressed.
swati510 left a comment:
Good fix; the non-determinism was real. A couple of things worth looking at:
- In `graph.py`, you rebuild the graph from sorted nodes/edges before Louvain. That works for stability, but `g_stable = nx.Graph()` drops any node/edge attributes that were on the original graph. If downstream code reads attrs off these nodes after `get_communities` returns, it'll silently lose them. Safer to use `nx.relabel_nodes` on a copy, or sort in place via a canonical representation. If no attrs matter here, a comment noting that would help future readers.
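If attributes do matter, another way to get both stability and attribute preservation is to copy node/edge data explicitly when rebuilding; a sketch under that assumption, not the PR's actual code:

```python
import networkx as nx

g = nx.Graph()
g.add_node("b.py", loc=120)
g.add_node("a.py", loc=40)
g.add_edge("b.py", "a.py", weight=3)

# Rebuild in sorted order, carrying node and edge attributes along
# via data=True instead of dropping them with a bare nx.Graph() rebuild.
g_stable = nx.Graph()
g_stable.add_nodes_from(sorted(g.nodes(data=True)))
g_stable.add_edges_from(sorted(g.edges(data=True)))

assert g_stable.nodes["a.py"]["loc"] == 40
assert g_stable.edges["a.py", "b.py"]["weight"] == 3
```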
- `scripts/diagnose_hash_mismatch.py` has hardcoded paths to `~/forge/free-code` and `~/forge/repowise` in the docstring usage example. Either parameterize via argparse (take `--repo-root`) or drop the script once the fix is verified. 226 lines is a lot of one-off debug tooling to keep in-tree.
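Parameterizing could be as small as this; a hypothetical sketch, with the `--repo-root` flag name taken from the comment above:

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    # Replaces hardcoded repository paths with an explicit argument.
    parser = argparse.ArgumentParser(
        description="Diagnose source_hash instability for a repo")
    parser.add_argument("--repo-root", type=Path, required=True,
                        help="root of the repository to diagnose")
    return parser.parse_args(argv)


args = parse_args(["--repo-root", "/tmp/example-repo"])
print(args.repo_root)  # /tmp/example-repo
```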
```python
# via ProcessPoolExecutor + as_completed → non-deterministic insertion
# order in the main graph).
g_und = g.to_undirected()
g_stable = nx.Graph()
```
Confirm this is safe: `nx.Graph()` drops node/edge attributes present on `g_und`. If any downstream consumer reads attrs off these nodes after `get_communities`, this change silently loses them. If no attrs matter, worth a one-line comment saying so.
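For context, the stabilization under discussion can be sketched end to end; a minimal sketch, where `stable_communities` is a hypothetical name and `karate_club_graph` stands in for the real dependency graph:

```python
import networkx as nx


def stable_communities(g: nx.Graph) -> dict:
    g_und = g.to_undirected()
    # Rebuild with nodes and edges added in sorted order so Louvain's
    # adjacency traversal no longer depends on insertion order.
    g_stable = nx.Graph()
    g_stable.add_nodes_from(sorted(g_und.nodes()))
    g_stable.add_edges_from(sorted(tuple(sorted(e)) for e in g_und.edges()))

    communities = nx.community.louvain_communities(g_stable, seed=42)
    # louvain_communities guarantees no list order: sort by each
    # community's smallest member before assigning integer IDs.
    communities = sorted(communities, key=min)
    return {node: i for i, comm in enumerate(communities) for node in comm}


ids = stable_communities(nx.karate_club_graph())
assert ids == stable_communities(nx.karate_club_graph())  # reproducible
```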
Problem
Every `repowise init` re-generates all wiki pages from scratch, even when the codebase hasn't changed. The root cause is non-deterministic `source_hash` values: the SHA-256 is computed over the rendered Jinja2 prompt, and two context variables were unstable across runs.

Source of non-determinism 1: graph edge ordering
Files are parsed in parallel via `ProcessPoolExecutor` + `as_completed`, so the order in which nodes and edges are inserted into the NetworkX graph is non-deterministic. `graph.predecessors()`/`graph.successors()` return nodes in insertion order, so the `dependents` and `dependencies` lists in `FilePageContext` shuffled between runs → different rendered prompt → different `source_hash`.

Fix: sort `predecessors`/`successors` before building `FilePageContext` in `ContextAssembler.assemble_file_page`.

Source of non-determinism 2: Louvain community IDs
`nx.community.louvain_communities` already receives `seed=42`, but the adjacency traversal order inside Louvain still depends on node insertion order (same root cause). Additionally, the community list returned by `louvain_communities` has no guaranteed order, so `enumerate()` assigned different integer IDs to the same community across runs.

Fix: before calling `louvain_communities`, rebuild a sorted copy of the undirected graph (`g_stable`) with nodes and edges added in alphabetical order. After the call, sort the returned community list by each community's lexicographically smallest member before `enumerate()`.

Impact
These two fixes make `source_hash` stable across re-runs for unchanged files, enabling the DB content cache (`_db_content_cache`, keyed by `source_hash`) to skip redundant LLM calls and save API costs.

Testing
Added `scripts/diagnose_hash_mismatch.py`, a diagnostic script that:

- runs `betweenness_centrality()` and `community_detection()` twice and reports any differing values
- for each `file_page` in `wiki.db`, renders the prompt fresh and compares its SHA-256 with the stored `source_hash`
- reports whether a mismatch comes from `dep_summaries` or another factor, with a unified diff

Run from the target repo directory:
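Conceptually, the stored-vs-fresh check the script performs can be sketched like this; illustrative only, with `stored_prompt`/`fresh_prompt` as stand-ins rather than the script's actual variables:

```python
import difflib
import hashlib


def sha256_hex(text: str) -> str:
    # Hash the fully rendered prompt text, as the source_hash is computed.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Stand-in prompts: a stored render vs. a fresh render of the same page.
stored_prompt = "dependents: ['c.py', 'b.py']"
fresh_prompt = "dependents: ['b.py', 'c.py']"

if sha256_hex(stored_prompt) != sha256_hex(fresh_prompt):
    # On mismatch, a unified diff pinpoints which context variable moved.
    diff = "\n".join(difflib.unified_diff(
        stored_prompt.splitlines(), fresh_prompt.splitlines(),
        fromfile="stored", tofile="fresh", lineterm=""))
    print(diff)
```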
Checklist
- `context_assembler.py`: sort predecessors/successors
- `graph.py`: sorted graph copy + sorted community list before enumerate
- `scripts/diagnose_hash_mismatch.py`: diagnostic tool