feat: LLM review stage for enhanced reachability detection#50
Conversation
Manual verificationOff by default. Requires API key. Cost note: enabling adds approximately one Opus call per 25 units.
|
Local test resultsReinstalled openant-core from this branch and ran Commands run: Outcome (against the manual-verification checklist):
Total cost: $0.024 (reachability $0.012 + analyze 1 unit $0.012). Well under the budget. |
|
🔴 High — Architecture · Issue: The LLM stage runs after What the stage does still deliver in current form:
What it cannot do:
|
The stage was running after the structural filter had already discarded units that weren't reachable from heuristic entry points — meaning the LLM could never find the missed entry points that are the feature's main value prop. Fix: - When --llm-reachability is set, parse with processing_level="all" so every unit is visible to the LLM. - After apply_signals promotes LLM-identified entry points, re-run the structural reachability filter (apply_reachability_filter) with those unit IDs added as extra BFS seeds. - The final dataset is filtered to the user's requested processing_level but expanded by any entry points the LLM found. Also: - Expose apply_reachability_filter as a public function in parser_adapter.py with an extra_entry_points parameter; preserve any is_entry_point=True already set by the LLM stage when re-stamping. - Update help text in cli.py and scan.go to reflect that cost scales with total repo size, not the filtered unit count. - Update llm_reachability.py docstring to document the correct pipeline ordering. Addresses review comment on PR knostic#50. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Good catch — fixed in faa8948. The stage was running against the already-filtered dataset, so any unit the structural pass dropped was gone before the LLM could see it. The fix:
All 29 existing tests still pass. |
|
@ar7casper responded to the comment |
ar7casper
left a comment
There was a problem hiding this comment.
Round-2 review after the faa8948 fix — the architectural rework for the Python path is correct. One High-severity finding I missed in the prior review, plus two Mediums and a few Lows.
🟠 High — Architecture · core/scanner.py:283-294 + core/parser_adapter.py:247-252
The post-LLM re-filter call to apply_reachability_filter requires call_graph.json on disk. Only the Python parser (parse_repository.py:173) and the Zig parser (zig/test_pipeline.py:123) write this file. JS, Go, C, Ruby, and PHP parsers do their reachability filtering internally inside their per-language test_pipeline.py and never persist call_graph.json.
When a user runs openant scan <ts-repo> --llm-reachability (default --level reachable):
effective_parse_level = "all"→ JS parser parses the full unfiltered codebase ✅- LLM reviews all units, promotes some via
is_entry_point=True✅ scanner.py:289callsapply_reachability_filter(dataset, output_dir, "reachable", extra_entry_points=...)❌- Function looks for
output_dir/call_graph.json→ not found parser_adapter.py:248-252logs[Warning] call_graph.json not found — skipping reachability filterto stderr and returns the dataset unchanged- Dataset remains the FULL unfiltered codebase
- Downstream stages (enhance, analyze) process every function in the repo
Net effect: --llm-reachability silently disables --level reachable on every non-Python (and non-Zig) codebase. A typical Express scan that should have processed ~30 reachable units processes ~500 instead — 10–20× cost spike in the analyze stage. The user has no easy way to know --level was ignored.
This is the same architectural class as the round-1 issue (LLM stage interacting with the structural filter), but the round-1 fix only addresses the Python path.
Suggestion (any one of these):
- Most thorough: make all parsers write
call_graph.jsontooutput_dir(uniform), soapply_reachability_filterworks for every language. - Pragmatic: skip the post-LLM re-filter for non-Python languages and print a loud warning that
--llm-reachabilitydoesn't yet support per-language re-filtering, including the cost implication. - Minimum acceptable: in
parser_adapter.py:247-252, raise aRuntimeError(or escalate the warning to a hard failure) when called with non-emptyextra_entry_pointsandcall_graph.jsonis missing — fail loudly so the bug is visible at scan time.
🟡 Medium — Quality · core/llm_reachability.py:51-53
MODEL_PRIMARY = "claude-opus-4-20250514" (Opus 4 from May 2025) is older than the codebase convention. analyzer.py, enhancer.py, reporter.py, finding_verifier.py all use the newer claude-opus-4-6. The source comment claims "matches the convention in core/analyzer.py / utilities/llm_client.py" but analyzer.py:316 is model_id = "claude-opus-4-6" if model == "opus" else "claude-sonnet-4-20250514" — the comment is misleading.
Suggestion: align MODEL_PRIMARY to claude-opus-4-6 to match the rest of the Opus-using stages, and fix the comment.
🟡 Medium — Coupling · core/scanner.py:267
analyze_reachability(..., max_units=limit) couples the LLM reachability stage's coverage to the analyze-stage --limit flag. A user running openant scan --limit 50 --llm-reachability to constrain analyze cost will silently limit LLM reachability to the first 50 units. If the entry point that would have promoted units 1–50 sits at unit 51, those units never enter the reachable set.
Suggestion: decouple — either a separate --llm-reachability-max-units flag, or don't apply limit to LLM reachability (let it always review the full dataset; cost is controlled by batch_size).
🟢 Lows (non-blocking, listed for completeness)
- Unused import —
core/llm_reachability.py:47:fieldimported but never used. PR 56's ruff rules don't catch this (F401 isn't enabled). Drop it. - Always-null prompt field —
core/llm_reachability.py:174:unit.get("reachable_from_entry")always returnsNonebecause nothing in the project writes that field. The reachability filter writesreachable: bool. Either remove from the prompt projection or rename toreachable. - Aggressive code trimming —
core/llm_reachability.py:61:MAX_CODE_BYTES = 1500(~30-50 lines) often misses entry-point indicators in handlers with embedded business logic. Consider 4-8 KB or env-var configurable. - Style inconsistency —
core/scanner.py:248,259,272,299: newwith open(..., encoding="utf-8") as f:calls work but don't follow PR 56's new convention ofread_json/write_jsonfromutilities.file_io. - Observability gap —
core/scanner.py:245:step_context inputs={"model": "opus"}is hardcoded string, not actual model ID. LLM-promoted units have emptyentry_point_reason(the structural detector only sets it for its own detections) — hard to tell from a unit's metadata that the LLM promoted it without cross-referencingllm_reachability_signals.
What's good
- Promote-only semantics with regression test against demotion is exactly right.
- Failure isolation per batch in
analyze_reachability(try/except per batch) is defensive in the right place. - 5 commits with self-iteration including 3 fixes from prior review feedback. Engineering hygiene.
- The architecture refactor commit (
3d83787e) self-caught a realapp_context_path = Noneordering bug. - Test coverage is excellent — 29 tests across 6 classes including the never-demote semantic and the non-positive batch-size guard.
- Help text honestly mentions cost scales with total repo size, not the filtered count.
Verdict: the High finding is the actionable ask — once that's resolved (or scoped to Python+Zig only with a loud warning), the rest are nits / follow-ups.
|
Addressed the
All five parsers that were missing
Also added
The
|
|
Addressed all findings from this review. Summary by severity: 🔴 High — Went with the most thorough fix: all five parsers now write
Added The 🟡 Medium — Fixed: 🟡 Medium — Fixed: removed 🟢 Lows
(Low 3 — |
Adds an opt-in LLM review stage (off by default, enabled via the new `--llm-reachability` flag on `openant scan`) that uses a strong model (Opus by default) to surface additional reachability signals beyond what the structural pass catches: - Likely entry points the structural analysis may miss (framework hooks, plugin/CLI registrations, message handlers). - External-input sites (HTTP request bodies, file/network reads, env/argv, stdin, untrusted IPC). - Cross-process / async data-flow indicators. Signals are advisory and *promote-only*: high-confidence entry-point signals can set `is_entry_point=True` on a unit, but no signal ever demotes a unit that the structural analysis already kept. This matches the "complements, does not replace" intent in issue #17. Output: - `llm_reachability.json` written to the scan dir with the full signal list. - Each unit gains an `llm_reachability_signals` array on the dataset. Cost & rate-limit safety: opt-in only, prompts are batched, and the client integration goes through the existing `AnthropicClient` (which respects `GlobalRateLimiter`). Refs #17.
The Python CLI defines --llm-reachability for the LLM reachability stage (issue #17), but the Go CLI proxy did not expose it. The test TestHelp::test_scan_help_advertises_llm_reachability inspects 'openant scan --help' (Go cobra output) and was failing on all 3 OS targets. Register --llm-reachability as a Bool flag on the Go scan command and pass it through to the Python invocation when set.
- scanner.py: forward-declare app_context_path before step 1.5 so the LLM reachability block doesn't hit a NameError when --llm-reachability is enabled (the block ran before the app-context step that defined it). - llm_reachability._chunk: non-positive batch_size used to reference an unbound loop variable; now collapses to a single batch covering all items. Adds a regression test. - Help text (Python CLI + Go CLI): note that --llm-reachability may incur additional LLM cost, per cost-safety review.
The LLM reachability stage threads app_context into its prompt to help the model reason about expected entry points (web_app vs cli_tool, etc). The previous ordering ran it before app-context generation, so the app_context_path was always None at the call site — the prompt threading silently no-op'd. Reordering the steps makes the threading actually work. This also retires the temporary forward-declaration introduced in the previous commit; app_context_path is now defined naturally by the preceding step before the LLM reachability block reads it.
The stage was running after the structural filter had already discarded units that weren't reachable from heuristic entry points — meaning the LLM could never find the missed entry points that are the feature's main value prop. Fix: - When --llm-reachability is set, parse with processing_level="all" so every unit is visible to the LLM. - After apply_signals promotes LLM-identified entry points, re-run the structural reachability filter (apply_reachability_filter) with those unit IDs added as extra BFS seeds. - The final dataset is filtered to the user's requested processing_level but expanded by any entry points the LLM found. Also: - Expose apply_reachability_filter as a public function in parser_adapter.py with an extra_entry_points parameter; preserve any is_entry_point=True already set by the LLM stage when re-stamping. - Update help text in cli.py and scan.go to reflect that cost scales with total repo size, not the filtered unit count. - Update llm_reachability.py docstring to document the correct pipeline ordering. Addresses review comment on PR knostic#50. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
High — JS/Go/C/Ruby/PHP don't persist call_graph.json so the post-LLM re-filter was silently falling back to returning the full unfiltered dataset. Guard the re-filter call with an os.path.exists check on call_graph.json; when absent, print a prominent WARNING with the unit count and cost implication so the user knows --level was not applied. Also add refilter_supported to the step report summary. Medium 1 — align MODEL_PRIMARY from "claude-opus-4-20250514" to "claude-opus-4-6" to match analyzer.py and the rest of the Opus-using stages. Fix the misleading comment. Medium 2 — remove max_units=limit from the analyze_reachability call. --limit governs the analyze stage; the LLM reachability pass must review the full codebase to find missed entry points. Coverage is controlled by batch_size. Low 1 — remove unused `field` import from llm_reachability.py. Low 2 — fix prompt projection: reachable_from_entry (never written by any path) → reachable (the actual field name written by the reachability filter). Low 5 — set entry_point_reason on LLM-promoted units in apply_signals so the promotion source is visible without cross-referencing llm_reachability_signals. Also surface the actual MODEL_PRIMARY constant in the step_context inputs instead of the hardcoded string "opus". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Enables the post-LLM reachability re-filter to work on JS, Go, C, Ruby,
and PHP repositories, not just Python and Zig.
C / Ruby / PHP (trivial — 3 lines each):
call_graph.json is written immediately after CallGraphBuilder.export()
in test_pipeline.py. graph_result already contains functions,
call_graph, and reverse_call_graph in the format apply_reachability_filter
expects.
JavaScript (easy):
call_graph.json is written at the end of run_typescript_analyzer(), right
after analyzer_output.json is available. Keys are normalised from
camelCase (callGraph, reverseCallGraph) to snake_case so the Python-side
filter can read them without any extra handling.
Go (moderate):
call_graph.json is written at the end of run_go_parser_all(), after both
analyzer_output.json and dataset.json are available. Functions are
normalised to the camelCase shape EntryPointDetector expects (same
conversion already done in apply_reachability_filter). Call graph edges
are reconstructed from unit metadata.direct_calls / direct_callers (the
same source apply_reachability_filter used).
Tests (test_call_graph_output.py):
- TestApplyReachabilityFilterPublicAPI (5 tests, always run): verifies
the public apply_reachability_filter API — basic filtering, extra
entry points expand the reachable set, is_entry_point stamping,
LLM-promoted flag preservation, missing-file graceful return.
- TestPythonCallGraphOutput (2 tests, always run): parse_repository with
processing_level=all and reachable both produce call_graph.json.
- TestJavaScript/Go/C/Ruby/PHP CallGraphOutput (2 tests each, skip-guarded):
same assertions, skip when the parser's runtime dependency is absent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Go toolchain setup and per-platform go_parser build steps to the python-tests CI job so call_graph.json tests for Go and JS don't silently skip on all platforms. Also harden _go_parser_available() to try-execute the binary (catching WinError 193 for cross-platform Linux ELF) and check both go_parser and go_parser.exe. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
c515fa4 to
cf005c6
Compare
|
@ar7casper ready for any further review you want to do |
ar7casper
left a comment
There was a problem hiding this comment.
Round-3 review — thanks for the substantive response to round-2.5. You nailed the High + both Mediums + 3 of 5 Lows.
Verified each fix:
| Round-2.5 | Status |
|---|---|
| 🟠 High — multi-language correctness gap | ✅ Fixed via the most thorough option. All 5 missing parsers (C/Ruby/PHP/JS/Go) now write call_graph.json (2b0dddf). PLUS defensive os.path.exists guard in scanner.py with a loud cost-impact WARNING if file is ever missing (3df8f0b). 422-line test_call_graph_output.py with 17 cross-parser tests locks in the contract. Belt-and-suspenders. |
| 🟡 Medium — model version | ✅ MODEL_PRIMARY = "claude-opus-4-6", comment fixed |
🟡 Medium — --limit coupling |
✅ Removed max_units=limit. Comment explains why. |
🟢 Low — unused field import |
✅ Removed |
🟢 Low — always-null reachable_from_entry field |
✅ Prompt projection now uses reachable (actually written by filter) |
| 🟢 Low — observability | ✅ step_context uses MODEL_PRIMARY constant; LLM-promoted units get entry_point_reason: "llm_reachability: ..." |
🟢 Low — MAX_CODE_BYTES = 1500 |
⏭️ Deferred (carry-over) |
🟢 Low — bare with open(...) style |
⏭️ Deferred (carry-over) |
Plus the CI tweak (cf005c6) to ensure go_parser binary is built in the python-tests job — necessary for the new cross-parser tests to actually run. All 9 CI checks green.
The commit messages on 3df8f0b and 2b0dddf are exemplary — itemize every finding addressed and the per-parser approach. Easy to verify against round-2.5.
Three small Lows for awareness (none blocking):
🟢 Low — Import-before-sys-path-fix in C/Ruby/PHP test_pipeline.py
2b0dddf added from utilities.file_io import ... to parsers/{c,ruby,php}/test_pipeline.py to support the new call_graph.json writes. But the order is wrong — the import comes before the sys.path setup:
# parsers/php/test_pipeline.py (current)
from utilities.file_io import open_utf8, read_json, run_utf8, write_json # line 45 — import FIRST
# ...
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))) # line 48 — sys.path LATERCompare to JS/Go (which got the correct order in #56):
# parsers/javascript/test_pipeline.py
_core_root = _parser_dir.parent.parent
if str(_core_root) not in sys.path:
sys.path.insert(0, str(_core_root)) # sys.path FIRST
# ...
from utilities.file_io import open_utf8, read_json, run_utf8, write_json # import AFTERCI passes because the python-tests workflow does pip install -e . which makes utilities globally importable from the venv (subprocess inherits). Production also works for the same reason. But local dev without pip install -e . hits ModuleNotFoundError: No module named 'utilities.file_io' when the parser subprocess is spawned — I confirmed this locally; 6 cross-parser tests fail without PYTHONPATH=. set; pass with it.
Suggestion: swap the order in C/Ruby/PHP test_pipeline.py to match the JS/Go pattern. Trivial 3-line change per file. Low impact (works in CI / production already), but consistency with #56's JS/Go pattern would close the local-dev gap.
🟢 Low — Cross-parser test contract doesn't verify unit_type field shape
_assert_call_graph_valid in test_call_graph_output.py:88-100 only checks that call_graph.json exists with the 3 required top-level keys. It doesn't verify the functions value shape — JS and Go's unitType (camelCase) versus Python/C/Ruby/PHP's unit_type (snake_case) inconsistency persists and isn't caught.
This is the documented pre-existing parser-issues #23-26 / Room-for-Improvement #10 issue. PR 50 didn't introduce it, but the new cross-parser test contract was a natural place to lock down the schema. Doesn't block this PR — for the LLM reachability functionality, LLM-promoted IDs flow through correctly via extra_entry_points. The structural-detection gap means JS/Go scans get fewer Check-1 (unit_type) entry points than they should, same as before this PR.
Suggestion (post-merge follow-up): add a contract assertion that unit_type exists in each function entry, surfacing the camelCase/snake_case drift if it ever causes a test to fail. Or normalize at the test level.
🟢 Low — PR body stale
Body still describes the original scope. Doesn't mention:
- The cross-parser
call_graph.jsonexpansion (5 more parsers now write it) - The defensive WARNING for unsupported languages
- That parse runs with
processing_level="all"when--llm-reachabilityis set
Suggestion: add a "Round-2.5 fix summary" section or rewrite the Output bullet list to include the new behavior.
Verdict: ✅ ready to merge. The 3 Lows are housekeeping. Approving via UI.
Covers the seven PRs in this release: - #35: parse --level default → reachable (CLI consistency with scan + Python CLI) - #36: auto-detect dep changes via ~/.openant/venv/.deps-hash - #37: lazy JS parser npm bootstrap on first use - #39: TypeScript/NestJS DI-aware call resolution (constructor + field + functional inject()) - #40: --language auto opt-in for openant init + non-git path support + shared config/languages.json - #49: Express anonymous route handler extraction (route_handler / route_middleware) - #50: --llm-reachability opt-in stage + cross-parser call_graph.json contract Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a new LLM review stage (off by default, enabled via
--llm-reachabilityonopenant scan) that uses a strong model (Opus by default) to surface additional reachability signals beyond what the structural analysis catches:Signals are advisory and only promote a unit's reachability — they never demote one that the structural analysis already kept. This matches the issue's "complements, not replaces" intent.
Output:
llm_reachability.jsonin the scan dir with the full signal list.llm_reachability_signals: [...]field on each unit indataset.json.entry_pointsignals setis_entry_point: trueon the target unit.Cost & rate-limit safety: reuses the existing
GlobalRateLimiterviaAnthropicClient, opt-in only, and prompts are batched (default 25 units/call).Addresses #17 (does not close — let the maintainer review the prompt + heuristics first).
Test plan
analyze_reachabilitywith mocked LLM: fixed JSON, malformed JSON, exception in client, app_context threading, batch chunking.apply_signals: promote-only semantics (high-confidence promotes, medium does not, never demotes), per-unit signal accumulation, unknown-id rejection.--llm-reachabilityappears inopenant scan --help; default does not pass the flag through toscan_repository; setting it threadsllm_reachability=True.