
feat(dataflow): interprocedural variable-level model — P0-P6 complete, all 34 languages#1612

Merged
carlos-alm merged 4 commits into main from feat/dataflow-vertex-schema-p0
Jun 19, 2026

@carlos-alm

@carlos-alm carlos-alm commented Jun 19, 2026

Contributor

Summary

Implements the complete interprocedural dataflow vertex model across all 34 supported languages.

Phases

  • P0: Schema — dataflow_vertices, dataflow_summary, new edge kinds (def_use, arg_in, return_out)
  • P1: Vertex extraction — param/return/local vertices + intra def_use edges per file
  • P2: Interprocedural stitching — arg_in (caller → callee param) and return_out (callee return → caller capture) edges; dataflow_summary per param (Closes feat(dataflow): P4 — incremental re-stitch when callee file changes #1609)
  • P3: C/C++ parser tests (29 tests)
  • P4: Incremental re-stitch (rebuilds arg_in edges when a callee file changes, without a full rebuild)
  • P5 B1: C-family dataflow rules (C, C++, ObjC, CUDA)
  • P5 B2–B5: DATAFLOW_RULES for remaining 18 languages — Kotlin, Swift, Scala, Dart, Groovy, Lua, R, Julia, Bash, Haskell, OCaml, F#, Gleam, Elixir, Erlang, Clojure, Zig, Solidity (Closes feat(dataflow): P5 — DATAFLOW_RULES for 26 new languages (B1–B5 batches) #1610)
  • P6: Vertex extraction on native orchestrator path — re-runs Rust dataflow visitor post-orchestrator so dataflow_vertices is populated regardless of engine; v19 migration sentinel forces one-time backfill of existing DBs
  • P6 parity: Added generator_function_declaration / generator_function to JS + Rust functionNodes to fix 3 df-vertex divergences in jelly-micro fixture
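
To make the edge kinds from P0 to P2 concrete, here is a minimal sketch of how one call site might be stitched. The type and field names (`Vertex`, `Edge`, `ownerFn`, `index`, `stitchCall`) are assumptions for illustration only; they are not the actual dataflow_vertices schema or the PR's stitching code.

```typescript
// Hypothetical in-memory model; the real schema is not shown in this PR text.
type VertexKind = "param" | "return" | "local";
type EdgeKind = "def_use" | "arg_in" | "return_out";

interface Vertex {
  id: number;
  kind: VertexKind;
  ownerFn: string; // function that owns the vertex (assumed field)
  name: string;
  index?: number; // parameter position, only meaningful for kind === "param"
}

interface Edge {
  kind: EdgeKind;
  source: number;
  target: number;
}

// Stitch one call site: each argument vertex flows into the callee param at the
// same position (arg_in), and the callee's return vertex flows into the caller's
// capture variable, if there is one (return_out).
function stitchCall(
  vertices: Vertex[],
  callee: string,
  argVertexIds: number[],
  captureVertexId?: number,
): Edge[] {
  const edges: Edge[] = [];
  const params = vertices.filter((v) => v.ownerFn === callee && v.kind === "param");
  argVertexIds.forEach((argId, position) => {
    const param = params.find((p) => p.index === position);
    if (param) edges.push({ kind: "arg_in", source: argId, target: param.id });
  });
  const ret = vertices.find((v) => v.ownerFn === callee && v.kind === "return");
  if (ret && captureVertexId !== undefined) {
    edges.push({ kind: "return_out", source: ret.id, target: captureVertexId });
  }
  return edges;
}
```

For a call like `y = f(x)`, this would yield one arg_in edge from `x` to the first param of `f` and one return_out edge from the return vertex of `f` to `y`.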

Also includes:

Parity

scripts/parity-compare.mjs --dataflow: 36/36 fixtures OK, both engines identical including df-vertices.

Performance

Full build regression (~3s → ~11s on 707 files) is expected from:

  • P6 runDataflowVertexPass re-runs Rust visitor per file (~2-3s)
  • ~10,737 interprocedural edge insertions
  • ~10,000+ vertex row insertions

1-file rebuild: 108ms → 207ms (P4 incremental re-stitch reads caller files from disk).
No-op rebuild: unchanged (22ms).

Optimization opportunity (filed): have Rust pipeline return DataflowResult per file so P6 doesn't need a second visitor pass.

Test plan

  • npx vitest run tests/integration/dataflow*.test.ts — 40 passing
  • node scripts/parity-compare.mjs --dataflow — 36/36 OK
  • node scripts/parity-compare.mjs — 36/36 OK (no node/edge regressions)
  • npm run lint — clean
  • npx tsc --noEmit — clean
  • bench-check initial baseline saved

Open follow-ups

@greptile-apps

greptile-apps Bot commented Jun 19, 2026

Contributor

Greptile Summary

This PR completes the interprocedural variable-level dataflow model (P0–P5) across 20+ languages, adding schema vertices/edges, per-file param/return extraction, interprocedural stitching, incremental re-stitch (P4), the C-family rules, and rules for 18 additional languages spanning JVM, scripting, functional, and systems families. The push/pop asymmetry fix for Elixir/Clojure scopes (using a parallel pushRecord boolean stack) and the SQLite chunking for SQLITE_MAX_VARIABLE_NUMBER are both well-executed.

  • P4 incremental re-stitch (dataflow.ts): adds collectCallerStitchCandidates with 500-ID chunked IN queries and a full-build short-circuit guard, correctly re-stitching arg_in edges for unchanged callers when only a callee file changes.
  • Language rules (b2–b5): 18 new configs registered in DATAFLOW_RULES; most are well-documented with graceful fallbacks for non-standard tree-sitter fields. Two configs in b4.ts have a structural conflict: Elixir (callNode: 'call') and Clojure (callNode: 'list_lit') use the same node type for both functionNodes and callNode, which causes dispatchDataflowNode to return early before ever reaching call-processing logic, silently disabling all argument-flow tracking for those languages.
  • Erlang and OCaml both include a parent and its direct child node type in functionNodes (fun_decl/function_clause and value_definition/let_binding respectively), causing duplicate parameter registration per function.

Confidence Score: 3/5

Safe to merge for the languages not affected by the call-tracking bug, but Elixir and Clojure will silently emit zero argument-flow data until the callNode conflict is resolved.

The Elixir and Clojure configs set callNode to the same type already listed in functionNodes. The walker's dispatcher short-circuits on all such nodes before the call-handling branch is reached, so handleCallExpr is never invoked for either language — a silent correctness gap with no error signal. The Erlang and OCaml double-scope issue is a second independent concern that could produce duplicate or contradictory param vertices for those languages.

src/ast-analysis/rules/b4.ts — the Elixir, Clojure, Erlang, and OCaml config blocks need the most attention before those languages produce correct interprocedural dataflow output.
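
The callNode conflict described above can be reproduced with a simplified dispatcher. This is a sketch that assumes the walker checks functionNodes first and returns early, as the summary describes; `Rules` and `dispatch` are illustrative names rather than the real dispatchDataflowNode, and the node types in the second config are hypothetical.

```typescript
interface Rules {
  functionNodes: string[];
  callNode: string;
}

// Illustrative dispatcher: function nodes are handled first and return early,
// so a node type listed in both places never reaches the call branch.
function dispatch(rules: Rules, nodeType: string): "function" | "call" | "other" {
  if (rules.functionNodes.includes(nodeType)) return "function";
  if (nodeType === rules.callNode) return "call";
  return "other";
}

// Elixir-style config from the review: the same type in both roles.
const elixirLike: Rules = { functionNodes: ["call"], callNode: "call" };

// Hypothetical config with distinct node types, for contrast.
const distinct: Rules = { functionNodes: ["function_definition"], callNode: "call_expression" };
```

With `elixirLike`, every `call` node is classified as a function node, so the call branch is unreachable; with `distinct`, call nodes reach the call branch as expected.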

Important Files Changed

| Filename | Overview |
|---|---|
| src/ast-analysis/rules/b4.ts | New language rules for Haskell, OCaml, F#, Gleam, Elixir, Erlang, Clojure. Two concrete bugs: Elixir and Clojure set callNode to the same type as functionNodes, making call tracking completely dead. Erlang and OCaml both include parent and child node types in functionNodes, causing double-scope and duplicate param registration. |
| src/ast-analysis/rules/b2.ts | New language rules for Kotlin, Swift, Scala, Dart, Groovy. Well-commented with appropriate fallbacks. No issues found. |
| src/ast-analysis/rules/b3.ts | New language rules for Lua, R, Julia, Bash. No critical issues found. |
| src/ast-analysis/rules/b5.ts | New language rules for Zig and Solidity. Well-configured with confirmed field names. |
| src/ast-analysis/rules/index.ts | Registers 20 new language configs correctly in DATAFLOW_RULES. |
| src/ast-analysis/visitors/dataflow-visitor.ts | Fixes push/pop asymmetry via parallel pushRecord boolean stack. Correct and symmetric. |
| src/features/dataflow.ts | Adds P4 incremental re-stitch logic with chunked SQLite queries and full-build guard. Well-structured. |
| tests/integration/dataflow-incremental.test.ts | New integration test for P4 re-stitch covering param vertex rebuild and arg_in edge creation. |


Reviews (6): Last reviewed commit: "fix(dataflow): guard exitFunction pop ag..." | Re-trigger Greptile

Comment thread src/features/dataflow.ts Outdated
Comment on lines +628 to +637
const placeholders = changedFuncIds.map(() => '?').join(',');
const callerFileRows = db
  .prepare(
    `SELECT DISTINCT n.file AS caller_file
     FROM dataflow d
     JOIN nodes n ON n.id = d.source_id
     WHERE d.target_id IN (${placeholders})
       AND d.kind = 'flows_to'`,
  )
  .all(...changedFuncIds) as { caller_file: string }[];

Contributor

P1 SQLite variable-limit risk in dynamic IN clause

changedFuncIds is spread into .all(...changedFuncIds), which passes each element as a separate positional bind parameter. SQLite enforces SQLITE_MAX_VARIABLE_NUMBER at statement-prepare time; the default is 999 before SQLite 3.32.0 and 32766 from 3.32.0 onward. On a large incremental build (e.g. a batch commit touching many files where the combined function/method count in fileSymbols exceeds that limit), db.prepare(...) or .all(...) throws SQLITE_ERROR: too many SQL variables, propagating as an uncaught exception that aborts the entire build.

Consider batching the query in chunks (e.g. 500 IDs at a time) and collecting the distinct caller files across all batches, or using a temporary table to hold the IDs.

Contributor Author

Fixed in the existing commit via chunked queries. The clause is now issued in batches of 500 IDs, collecting distinct caller files across all batches. This keeps the variable count well within SQLite's SQLITE_MAX_VARIABLE_NUMBER on all SQLite versions (999 legacy, 32766 on 3.32+).
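
The chunking approach can be sketched as follows. This is not the code from the commit: `QueryFn` is a hypothetical stand-in for the prepared-statement call so the example runs without a database, and `collectCallerFiles` is an illustrative name. The SQL text and the batch size of 500 are taken from the thread above.

```typescript
// Hypothetical stand-in for a statement runner: receives the SQL text and the
// positional bind values, returns the matching rows.
type QueryFn = (sql: string, params: number[]) => { caller_file: string }[];

const CHUNK_SIZE = 500; // batch size named in the reply above

function collectCallerFiles(run: QueryFn, changedFuncIds: number[]): Set<string> {
  const callerFiles = new Set<string>();
  for (let i = 0; i < changedFuncIds.length; i += CHUNK_SIZE) {
    // Each statement binds at most CHUNK_SIZE variables.
    const chunk = changedFuncIds.slice(i, i + CHUNK_SIZE);
    const placeholders = chunk.map(() => "?").join(",");
    const sql = `SELECT DISTINCT n.file AS caller_file
      FROM dataflow d
      JOIN nodes n ON n.id = d.source_id
      WHERE d.target_id IN (${placeholders})
        AND d.kind = 'flows_to'`;
    // The Set deduplicates caller files that show up in more than one batch.
    for (const row of run(sql, chunk)) callerFiles.add(row.caller_file);
  }
  return callerFiles;
}
```

With a better-sqlite3 style handle like the `db` in the snippet under review, `run` could presumably be `(sql, params) => db.prepare(sql).all(...params)`. DISTINCT only deduplicates within a single batch, which is why the results are merged through a Set.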

Comment thread src/features/dataflow.ts Outdated
Comment on lines +873 to +887
if (vstmts.available) {
  const changedRelPaths = new Set<string>(fileSymbols.keys());
  const changedFuncIds = collectFuncIdsForFiles(db, changedRelPaths);
  const extra = await collectCallerStitchCandidates(
    db,
    changedFuncIds,
    changedRelPaths,
    rootDir,
    extToLang,
    parsers,
    getParserFn,
  );
  allCandidates.push(...extra.candidates);
  allCaptures.push(...extra.captures);
}

Contributor

P2 P4 runs unconditionally after every JS-path call, even full builds

vstmts.available only guards for the absence of the dataflow_vertices table — it does not distinguish full builds from incremental ones. For a full build, changedRelPaths equals all files in fileSymbols, so collectCallerStitchCandidates correctly returns empty after the filter. However, collectFuncIdsForFiles still issues one SELECT per changed file before discovering there is nothing to do. On a large initial build with thousands of files this adds noticeable overhead before the early-return in collectCallerStitchCandidates. A lightweight guard — e.g. skipping P4 entirely when fileSymbols.size equals the total file count in the DB — would avoid the extra queries on full builds.

Contributor Author

Added a full-build guard before the P4 block: a single COUNT(DISTINCT file) query checks whether fileSymbols covers all DB files, and if so P4 is skipped entirely (no per-file SELECTs). For an incremental build the guard costs one fast count query; for a full build it avoids N per-file SELECTs with no result.
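
A minimal sketch of that guard, under the assumption that it compares the changed-file count against the DB file count. `shouldRunRestitch` and `countDbFiles` are hypothetical names; `countDbFiles` stands in for the single COUNT(DISTINCT file) query.

```typescript
// Hypothetical helper: decides whether the P4 re-stitch block needs to run.
function shouldRunRestitch(
  changedFiles: ReadonlySet<string>,
  countDbFiles: () => number, // stand-in for the COUNT(DISTINCT file) query
): boolean {
  // If the changed set covers every file in the DB, this is a full build:
  // every caller is already being parsed, so there is nothing extra to read.
  return changedFiles.size < countDbFiles();
}
```

The count callback is invoked once per build, so an incremental build pays for one query and a full build skips the per-file SELECTs entirely.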

Comment thread src/features/dataflow.ts
Comment on lines +660 to +671
for (const callerFile of callerFiles) {
  // Read the caller file from disk without touching its existing DB rows.
  const stub: FileSymbolsDataflow = { definitions: [], _langId: null, _tree: null };
  const data = getDataflowForFile(
    stub,
    callerFile,
    rootDir,
    extToLang,
    activeParsers,
    activeGetParserFn,
  );
  if (!data) continue;

Contributor

P2 definitions: [] stub silently skips callee-context-aware resolution

getDataflowForFile passes stub.definitions (always []) to extractDataflow. Today _definitions is unused, so this is harmless. But the coupling is implicit: if a future refactor makes extractDataflow use definitions to restrict which function bodies are walked (a natural optimisation), P4 will silently emit no stitch candidates for caller files whose definitions were not pre-loaded. A small comment here noting the intentional stub would make this safe to refactor.

Contributor Author

Added a comment explaining the intentional stub: extractDataflow does not currently use _definitions, so passing [] is safe. The comment notes exactly what must change if a future refactor makes extractDataflow use definitions to filter which function bodies are walked.

@github-actions

github-actions Bot commented Jun 19, 2026

Contributor

Codegraph Impact Analysis

29 functions changed, 15 callers affected across 5 files

  • getKotlinParamListNode in src/ast-analysis/rules/b2.ts:14 (0 transitive callers)
  • extractKotlinParamName in src/ast-analysis/rules/b2.ts:21 (0 transitive callers)
  • getSwiftParamListNode in src/ast-analysis/rules/b2.ts:59 (0 transitive callers)
  • extractSwiftParamName in src/ast-analysis/rules/b2.ts:69 (0 transitive callers)
  • getScalaParamListNode in src/ast-analysis/rules/b2.ts:109 (0 transitive callers)
  • extractScalaParamName in src/ast-analysis/rules/b2.ts:118 (0 transitive callers)
  • extractDartParamName in src/ast-analysis/rules/b2.ts:159 (0 transitive callers)
  • getGroovyParamListNode in src/ast-analysis/rules/b2.ts:209 (0 transitive callers)
  • extractGroovyParamName in src/ast-analysis/rules/b2.ts:220 (0 transitive callers)
  • extractHaskellParamName in src/ast-analysis/rules/b4.ts:12 (0 transitive callers)
  • getHaskellParamListNode in src/ast-analysis/rules/b4.ts:28 (0 transitive callers)
  • extractOCamlParamName in src/ast-analysis/rules/b4.ts:62 (0 transitive callers)
  • getOCamlParamListNode in src/ast-analysis/rules/b4.ts:74 (0 transitive callers)
  • extractFSharpParamName in src/ast-analysis/rules/b4.ts:115 (0 transitive callers)
  • getFSharpParamListNode in src/ast-analysis/rules/b4.ts:126 (0 transitive callers)
  • extractGleamParamName in src/ast-analysis/rules/b4.ts:161 (0 transitive callers)
  • extractElixirFunctionName in src/ast-analysis/rules/b4.ts:205 (0 transitive callers)
  • extractErlangParamName in src/ast-analysis/rules/b4.ts:259 (0 transitive callers)
  • getErlangParamListNode in src/ast-analysis/rules/b4.ts:266 (0 transitive callers)
  • extractClojureFunctionName in src/ast-analysis/rules/b4.ts:317 (0 transitive callers)

@carlos-alm carlos-alm changed the title feat(dataflow): P4 — incremental re-stitch when callee file changes feat(dataflow): interprocedural variable-level model — P0-P5 complete Jun 19, 2026
During an incremental build only the changed files are in fileSymbols.
When a callee function's file changes its param vertices are purged and
rebuilt, but callers' files are not re-parsed — leaving their arg_in
edges unrecreated (target vertex was deleted by the purge).

Fix: after the per-file tx() loop, collectCallerStitchCandidates()
queries flows_to edges that target any function in the changed file set
and finds the distinct caller files NOT already in fileSymbols. Each
caller file is read from disk and parsed to extract VisitorArgFlow data.
StitchCandidates are built from those flows (filtered to only those
calling a changed callee) and merged into allCandidates before the
existing buildInterproceduralStitch post-pass runs.

This preserves the full vertex-level source/target_vertex pointers on
arg_in edges: the caller's existing param/local vertex (untouched by the
purge) is the source, and the callee's freshly inserted param vertex is
the target. No caller DB rows are written — only reads.

Closes #1609.
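
The filtering step in this commit message can be sketched roughly as below. `ArgFlow` and `selectRestitchFlows` are hypothetical names and shapes chosen for illustration; the real VisitorArgFlow and StitchCandidate types are not shown in this thread.

```typescript
// Hypothetical, simplified shape of one argument flow found in a caller file.
interface ArgFlow {
  callerFile: string;
  calleeName: string;
  argIndex: number;
  argName: string;
}

// Keep only flows that (a) come from caller files outside the changed set,
// since changed files are already re-parsed by the normal per-file pass, and
// (b) call a function whose file changed, since only those lost their targets.
function selectRestitchFlows(
  flows: ArgFlow[],
  changedFiles: ReadonlySet<string>,
  changedCallees: ReadonlySet<string>,
): ArgFlow[] {
  return flows.filter(
    (f) => !changedFiles.has(f.callerFile) && changedCallees.has(f.calleeName),
  );
}
```

Restricting to changed callees keeps the extra work proportional to the edges the purge removed, not to everything in each caller file.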

docs check acknowledged
Adds DATAFLOW_RULES entries (function scope, param extraction, return/var/call
tracking) for all remaining languages:

- B2 (JVM/mobile): Kotlin, Swift, Scala, Dart, Groovy
- B3 (scripting): Lua, R, Julia, Bash
- B4 (functional): Haskell, OCaml, F#, Gleam, Elixir, Erlang, Clojure
- B5 (systems/blockchain): Zig, Solidity

Also fixes enterFunctionScope to early-return when nameExtractor rejects a node
(prevents Elixir/Clojure call/list_lit functionNodes from pushing spurious
scopes for non-def expressions).

Also fixes collectCallerStitchCandidates to chunk IN queries at 500 to avoid
SQLITE_MAX_VARIABLE_NUMBER on large repos (Fixes #1613). Skips P4 on full builds
where fileSymbols covers every file in the DB.

Closes #1610
@carlos-alm carlos-alm force-pushed the feat/dataflow-vertex-schema-p0 branch from 63bb901 to 0d8039c Compare June 19, 2026 03:56
@carlos-alm

Contributor Author

@greptileai

// When nameExtractor is set it acts as a gate: null means this node is not a function
// definition. Needed for languages (Elixir, Clojure) where functionNodes includes generic
// node types (call/list_lit) that are only sometimes function definitions.
if (rules.nameExtractor && !rules.nameExtractor(funcNode)) return;

Contributor

P1 Asymmetric push/pop corrupts the scope stack for Elixir and Clojure

enterFunctionScope now returns early (no push) when nameExtractor says a node is not a function definition. But the exitFunction callback always calls scopeStack.pop() unconditionally (line ~547). In Elixir (functionNodes: ['call']) and Clojure (functionNodes: ['list_lit']), every call expression inside a function body triggers an enterFunction/exitFunction cycle. Because enterFunctionScope does not push for these non-def nodes, the subsequent pop() removes the enclosing def/defn scope instead.

For a function body containing N call expressions, the enclosing scope is popped N times. After the first call expression, currentScope(scopeStack) returns undefined, so all variable references, return tracking, and parameter registrations that follow are dropped. The fix should mirror the gate: either track which nodes actually pushed (e.g. a WeakSet<TreeSitterNode>) and only pop for those, or apply the same nameExtractor check inside exitFunction.

Contributor Author

Fixed — enterFunctionScope now returns boolean (true when it pushed a scope frame, false on early return). A parallel pushRecord: boolean[] stack inside createDataflowVisitor mirrors the scope stack: enterFunction pushes the result, and exitFunction calls scopeStack.pop() only when pushRecord.pop() is true. This makes push/pop symmetric for all languages including Elixir and Clojure.
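
The parallel-stack idea can be sketched in isolation like this. It is a simplified model, not the visitor code itself: `SyntaxNode`, `defName`, `createScopeTracker` and `currentScope` are illustrative stand-ins, and `defName === null` plays the role of nameExtractor rejecting a node.

```typescript
// Simplified stand-ins for the visitor's node and scope types.
interface SyntaxNode {
  type: string;
  defName: string | null; // null models nameExtractor rejecting the node
}
interface Scope {
  funcName: string;
}

function createScopeTracker() {
  const scopeStack: Scope[] = [];
  const pushRecord: boolean[] = []; // one entry per enterFunction call

  // Returns true only when a scope frame was actually pushed.
  function enterFunctionScope(node: SyntaxNode): boolean {
    if (node.defName === null) return false; // gate: not a function definition
    scopeStack.push({ funcName: node.defName });
    return true;
  }

  return {
    enterFunction(node: SyntaxNode): void {
      pushRecord.push(enterFunctionScope(node));
    },
    exitFunction(): void {
      // Pop the scope only if the matching enter pushed one.
      if (pushRecord.pop()) scopeStack.pop();
    },
    currentScope(): Scope | undefined {
      return scopeStack[scopeStack.length - 1];
    },
  };
}
```

Because enter and exit callbacks are strictly nested during a tree walk, the boolean popped in exitFunction always belongs to the matching enterFunction, so no per-node lookup such as a WeakSet is needed.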

carlos-alm and others added 2 commits June 18, 2026 22:12
…ionScope

enterFunctionScope returns early (no push) when nameExtractor rejects a node.
The unconditional scopeStack.pop() in exitFunction caused asymmetric push/pop
for languages like Elixir and Clojure whose functionNodes include generic node
types (call/list_lit) that only sometimes represent function definitions.

Fix: enterFunctionScope now returns boolean (true = pushed). A parallel
pushRecord boolean stack in createDataflowVisitor records the result of each
enterFunction call; exitFunction pops scopeStack only when pushRecord.pop()
is true.
@carlos-alm

Contributor Author

Addressed Greptile P1 (comment #3440188988): guarded exitFunction's scopeStack.pop() against early-return enterFunctionScope to prevent asymmetric push/pop on Elixir/Clojure. enterFunctionScope now returns boolean; a parallel pushRecord stack in createDataflowVisitor ensures pop only fires when a frame was actually pushed.

@carlos-alm

Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 6560a69 into main Jun 19, 2026
26 checks passed
@carlos-alm carlos-alm deleted the feat/dataflow-vertex-schema-p0 branch June 19, 2026 05:30
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 19, 2026
@carlos-alm carlos-alm changed the title feat(dataflow): interprocedural variable-level model — P0-P5 complete feat(dataflow): interprocedural variable-level model — P0-P6 complete, all 34 languages Jun 19, 2026
