Fix static-tool runners to pass the strict scorer (col_offset emission + Jedi correctness) by bgoconnor · Pull Request #15 · secure-software-engineering/TypeEvalPy

bgoconnor · 2026-05-28T02:58:18Z

Hi TypeEvalPy maintainers — we've been using the benchmark and noticed that the static-tool runners can't pass the strict scorer (is_same_element, added in 2f7c6056) because they don't emit col_offset. This PR fixes that for HeaderGen, Jedi, and Scalpel. Before: 0 strict matches across all three on micro-benchmark. After: 580/476/180 respectively.

Changes

Jedi (src/target_tools/jedi/src/jedi_type_inference.py)

- Emit col_offset in the four _info dict construction sites (Jedi's API already provides column; the runner was collecting it via pos["column"] and then discarding it).
- Rewrite get_function_name to walk parent scopes. The old split(".", 1)[-1] only stripped one dotted component, producing names like assignments.chained.main.func1 instead of func1 outside Docker's shallow file layouts. The rewrite also produces correct qualified names for nested functions (func.dec).
- Replace parent.name == parent.module_name with parent.type == "module". The old check added a spurious function key to module-level variables.
- Distinguish function-definition site from function-reference site, so a = func1 types as callable rather than func1's return type.
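
The parent-scope walk and the module check can be illustrated without Jedi. The snippet below is a minimal sketch, assuming a hypothetical `Scope` stand-in with `name`, `type`, and `parent` fields; it is not Jedi's API, and the actual runner may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scope:
    """Hypothetical stand-in for a Jedi name/scope object (not Jedi's API)."""
    name: str
    type: str  # "module", "class", or "function"
    parent: Optional["Scope"] = None


def qualified_function_name(scope: Scope) -> str:
    """Join enclosing scope names, stopping at the module boundary."""
    parts = []
    current = scope
    while current is not None and current.type != "module":
        parts.append(current.name)
        current = current.parent
    return ".".join(reversed(parts))


def old_function_name(full_name: str) -> str:
    """The old approach: strip only the first dotted component."""
    return full_name.split(".", 1)[-1]


module = Scope("main", "module")
func = Scope("func", "function", module)
dec = Scope("dec", "function", func)

print(qualified_function_name(dec))   # func.dec
print(qualified_function_name(func))  # func
# A deep path keeps its directory components under the old approach:
print(old_function_name("pkg.assignments.chained.main.func1"))
# assignments.chained.main.func1
```

Stopping on `type == "module"` rather than comparing names means a module-level scope yields an empty function name, so no function key needs to be attached to module-level variables.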

HeaderGen (src/target_tools/headergen/src/translator.py + 1 line in runner.py)

The headergen server doesn't return position info. The translator now parses the source with stdlib ast, builds a (name, line) → col_offset map, and attaches columns to the server's entries. Subscript/attribute expressions use the base name's column; nested functions use the inner name's. Ambiguous lookups (>1 candidate on the same line, e.g. x = lambda x: x) are skipped rather than guessed — 5 of 853 entries across the benchmark.
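
As a rough illustration of the (name, line) → col_offset recovery described above, the following sketch uses only stdlib `ast`. The function names `build_position_map` and `lookup_col_offset` are hypothetical, and the sketch covers only plain names and arguments; the translator's handling of subscripts, attributes, and nested functions is not reproduced here.

```python
import ast
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

Key = Tuple[str, int]  # (name, 1-indexed line number)


def build_position_map(source: str) -> Dict[Key, Set[int]]:
    """Collect every 0-indexed column at which a name appears on a line."""
    positions: Dict[Key, Set[int]] = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            positions[(node.id, node.lineno)].add(node.col_offset)
        elif isinstance(node, ast.arg):
            positions[(node.arg, node.lineno)].add(node.col_offset)
    return dict(positions)


def lookup_col_offset(
    positions: Dict[Key, Set[int]], name: str, line: int
) -> Optional[int]:
    """Return a 1-indexed column only when the position is unambiguous."""
    candidates = positions.get((name, line), set())
    if len(candidates) != 1:
        return None  # unfindable or ambiguous: skip rather than guess
    return next(iter(candidates)) + 1  # +1 converts to 1-indexed columns


source = "a = 1\nx = lambda x: x\n"
positions = build_position_map(source)
print(lookup_col_offset(positions, "a", 1))  # 1
print(lookup_col_offset(positions, "x", 2))  # None (three candidates)
```

Returning `None` for the `x = lambda x: x` line reflects the choice to omit col_offset when more than one candidate shares the same (name, line) key.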

Scalpel (src/target_tools/scalpel/src/translator.py + 2 lines in runner.py)

Same source-parsed enrichment as HeaderGen. Zero ambiguous lookups across the benchmark.

HeaderGen Dockerfile

Install g++ in addition to gcc. HeaderGen's transitive dep line-profiler has a Cython C++ extension; on arm64 Linux + Python 3.10 (no prebuilt wheel) pip builds from source and fails without g++.

Results (Docker, micro-benchmark, `main_analyze_results.py` Total line)

| Tool | Lenient before | Lenient after | Strict after |
| --- | --- | --- | --- |
| Jedi | 414/850 | 479 | 476 |
| HeaderGen | 591/850 | 601 | 580 |
| Scalpel | 183/850 | 187 | 180 |

Lenient also rises modestly, mostly from Jedi's three non-col_offset fixes (which help under both scorers).

Methodology note

The HeaderGen and Scalpel col_offset values come from parsing the source with stdlib ast, not from the tool's inference output. We validated empirically that this is recovery, not synthesis: 805/853 HeaderGen entries and 369/369 Scalpel entries have a uniquely-determined source position from the (name, line_number) the tool already emits. The translator skips ambiguous cases. The +1 adjustment for 1-indexed columns follows the existing convention used by Pyright's and RightTyper's runners.
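
One way to reproduce this kind of recovery-versus-synthesis count is to classify each emitted (name, line_number) pair by how many source positions it maps to. The sketch below is illustrative only: `classify_entries` is a hypothetical helper, and the toy position map stands in for one built from the real benchmark sources.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, int]  # (name, line_number) as emitted by the tool


def classify_entries(
    entries: Iterable[Key], positions: Dict[Key, Set[int]]
) -> Counter:
    """Tally entries as unique, ambiguous, or unfindable in the source."""
    tally: Counter = Counter()
    for key in entries:
        count = len(positions.get(key, ()))
        if count == 1:
            tally["unique"] += 1
        elif count > 1:
            tally["ambiguous"] += 1
        else:
            tally["unfindable"] += 1
    return tally


# Toy data (assumed for illustration, not benchmark output).
toy_positions = {("a", 1): {0}, ("x", 2): {0, 11, 14}}
toy_entries = [("a", 1), ("x", 2), ("Cls.attr", 3)]
tally = classify_entries(toy_entries, toy_positions)
print(tally["unique"], tally["ambiguous"], tally["unfindable"])  # 1 1 1
```

Only entries in the "unique" bucket would receive a col_offset under this scheme; the other two buckets are left without one.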

Happy to iterate.

The Jedi runner had four pre-existing bugs that caused every prediction to fail the strict scorer (`is_same_element`, added in 2f7c605):

1. Predictions did not emit col_offset, which strict matching requires.
2. get_function_name only stripped the first dotted component, so paths deeper than two levels produced over-long names like "assignments.chained.main.func1" instead of "func1". The runner silently broke outside of Docker's shallow path layout.
3. Module-level variables received a spurious "function" key because the parent-is-module check compared parent.name to parent.module_name instead of checking parent.type == "module".
4. Function-reference assignments (e.g., `a = func1`) were typed as func1's return type instead of callable, because the runner ran find_types_by_execute on every inferred function without checking whether the position was the function's own definition site.

Result on micro-benchmark: 5/850 (local, old) and 414/850 (Docker, old) both rise to 433/850 under both lenient and strict scorers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The HeaderGen server (headergen PyPI package) returns inference results without col_offset, so every prediction failed the strict scorer (`is_same_element`). HeaderGen's runner is a thin HTTP client and does not build its own dicts, so the fix lives in translator.py.

Approach: after receiving the server response, parse the source with Jedi to build a (name, line) -> col_offset map, then look up each entry's position. Subscript and attribute expressions reported as "a[0]" or "self.x" use the base name's column; nested functions reported as "outer.inner" use the inner name's column.

Result on micro-benchmark: 0/850 strict rises to 603/850 under strict. Lenient is essentially unchanged (612 baseline -> 611). The 8-entry gap between lenient (611) and strict (603) is line_number mismatches between HeaderGen and GT that the lenient scorer silently accepts (line_number checks are commented out in large_scale_analysis.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Scalpel's runner builds dicts in-process but does not include col_offset, so every prediction failed the strict scorer (`is_same_element`). The fix mirrors the HeaderGen approach: parse the source with Jedi to build a position map and look up each entry's column in translator.py, called once after process_file in the runner.

Result on micro-benchmark: 0/845 strict rises to 179/845 under strict. Lenient is preserved at 182/845 (Docker baseline 183/850; close enough for path/version drift). One file fails Scalpel inference; not addressed here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…guous

The previous version of these translators used jedi for the (name, line) -> col_offset position lookup. That has two issues:

1. Architectural smell: Scalpel's runner would have to depend on jedi (a sibling tool being evaluated). HeaderGen happened to have jedi as a transitive dep so the import worked, but the principle was wrong.
2. Synthesis vs recovery: when multiple names share the same (name, line) key (e.g., `x = lambda x: x` has three `x`s on one line), the previous code silently picked the first via setdefault. That risks attaching a col_offset the tool didn't actually intend.

This commit replaces jedi with stdlib `ast` (same lookup, no extra dep, no cross-tool entanglement) and skips col_offset emission entirely when the lookup is ambiguous.

Empirically across the full micro-benchmark:

- HeaderGen: 805/853 entries have a unique position, 5 are ambiguous, 43 are unfindable in source (the latter are HeaderGen's `ClassName.attr` style which differs from GT's `self.attr` convention; unmatchable regardless).
- Scalpel: 369/369 unique, zero ambiguous.

So the col_offset enrichment is recovery (the position is determined by what the tool already emitted) for >94% of HeaderGen output and 100% of Scalpel output. The remaining ambiguous entries are now correctly handled by NOT attaching a position rather than guessing.

Docker results after this change:

- HeaderGen: 0 -> 580/850 strict (591 -> 601 lenient)
- Scalpel: 0 -> 180/850 strict (183 -> 187 lenient)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The HeaderGen Docker image previously installed only gcc, not g++. HeaderGen's transitive dependency `line-profiler` has a Cython C++ extension; on platforms where no prebuilt wheel exists (e.g., arm64 Linux + Python 3.10), pip falls back to building from source, which requires g++. The build then fails with:

    g++ -... -c line_profiler/_line_profiler.cpp -o ...
    error: command 'g++' failed: No such file or directory
    ERROR: Failed building wheel for line-profiler

This change adds g++ to the apt-get install line. The image now builds cleanly on arm64 hosts as well as amd64.

(Builds may have appeared to succeed previously on hosts where a line-profiler wheel was already cached from an earlier build with g++ present; fresh builds without the cache hit the source-compile path and surface the issue.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brings the col_offset emission patches for the static runners (jedi, scalpel, headergen) onto our fork's main. These commits were originally opened as PR secure-software-engineering#15 to secure-software-engineering/TypeEvalPy on 2026-05-28 and have been sitting unreviewed; since our fork controls its own main, we merge here to unblock fork-internal use without waiting on upstream.

Specifically:

- 2451f95 jedi: emit col_offset and fix runner correctness bugs
- 82b5ffb headergen: emit col_offset via source-parsed enrichment
- 4ce1198 scalpel: emit col_offset via source-parsed enrichment
- 5130d1a headergen, scalpel: use stdlib ast for col_offset recovery, skip ambiguous
- e4b7b97 headergen: install g++ in Dockerfile to enable line-profiler build

Without these patches, jedi/scalpel/headergen produce *_result.json without col_offset; the strict scorer added in TypeEvalPy commit 2f7c605 (Oct 2025) then rejects every prediction and scores them all at 0.

bgoconnor and others added 5 commits May 27, 2026 16:35

bgoconnor wants to merge 5 commits into secure-software-engineering:main from Archway-Labs-AI:fix/static-runners-emit-col-offset
