Fix static-tool runners to pass the strict scorer (col_offset emission + Jedi correctness) #15
Open
bgoconnor wants to merge 5 commits into
The Jedi runner had four pre-existing bugs that caused every prediction to fail the strict scorer (`is_same_element`, added in 2f7c605):

1. Predictions did not emit col_offset, which strict matching requires.
2. get_function_name only stripped the first dotted component, so paths deeper than two levels produced over-long names like "assignments.chained.main.func1" instead of "func1". The runner silently broke outside of Docker's shallow path layout.
3. Module-level variables received a spurious "function" key because the parent-is-module check compared parent.name to parent.module_name instead of checking parent.type == "module".
4. Function-reference assignments (e.g., `a = func1`) were typed as func1's return type instead of callable, because the runner ran find_types_by_execute on every inferred function without checking whether the position was the function's own definition site.

Result on micro-benchmark: 5/850 (local, old) and 414/850 (Docker, old) both rise to 433/850 under both lenient and strict scorers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
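The scope-walking idea behind items 2 and 3 can be sketched as follows. This is a minimal illustration, not the runner's code: `Scope` is a hypothetical stand-in with plain attributes (it is not Jedi's API), `qualified_function_name` is an invented name, and including every non-module enclosing scope in the result is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scope:
    """Hypothetical stand-in for an inferred name; not a Jedi class."""
    name: str
    type: str  # e.g. "module", "class", "function"
    parent: Optional["Scope"] = None


def qualified_function_name(scope: Scope) -> Optional[str]:
    """Join enclosing scope names, stopping at the module.

    Checking `type == "module"` (rather than comparing names) means the
    result does not depend on how deep the module's dotted path is.
    """
    parts = []
    current = scope
    while current is not None and current.type != "module":
        parts.append(current.name)
        current = current.parent
    # A module-level scope yields no parts, so no function name is reported.
    return ".".join(reversed(parts)) or None


module = Scope("assignments.chained.main", "module")
func1 = Scope("func1", "function", module)
dec = Scope("dec", "function", Scope("func", "function", module))
```

With these stand-ins, `qualified_function_name(func1)` gives `"func1"` regardless of the module path depth, the nested case gives `"func.dec"`, and the module itself gives `None`, so a module-level variable would not receive a function key.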
The HeaderGen server (headergen PyPI package) returns inference results without col_offset, so every prediction failed the strict scorer (`is_same_element`). HeaderGen's runner is a thin HTTP client and does not build its own dicts, so the fix lives in translator.py.

Approach: after receiving the server response, parse the source with Jedi to build a (name, line) -> col_offset map, then look up each entry's position. Subscript and attribute expressions reported as "a[0]" or "self.x" use the base name's column; nested functions reported as "outer.inner" use the inner name's column.

Result on micro-benchmark: 0/850 strict rises to 603/850 under strict. Lenient is essentially unchanged (612 baseline -> 611). The 8-entry gap between lenient (611) and strict (603) is line_number mismatches between HeaderGen and GT that the lenient scorer silently accepts (line_number checks are commented out in large_scale_analysis.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scalpel's runner builds dicts in-process but does not include col_offset, so every prediction failed the strict scorer (`is_same_element`). The fix mirrors the HeaderGen approach: parse the source with Jedi to build a position map and look up each entry's column in translator.py, called once after process_file in the runner.

Result on micro-benchmark: 0/845 strict rises to 179/845 under strict. Lenient is preserved at 182/845 (Docker baseline 183/850; close enough for path/version drift). One file fails Scalpel inference; not addressed here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…guous

The previous version of these translators used jedi for the (name, line) -> col_offset position lookup. That has two issues:

1. Architectural smell: Scalpel's runner would have to depend on jedi (a sibling tool being evaluated). HeaderGen happened to have jedi as a transitive dep so the import worked, but the principle was wrong.
2. Synthesis vs recovery: when multiple names share the same (name, line) key (e.g., `x = lambda x: x` has three `x`s on one line), the previous code silently picked the first via setdefault. That risks attaching a col_offset the tool didn't actually intend.

This commit replaces jedi with stdlib `ast` (same lookup, no extra dep, no cross-tool entanglement) and skips col_offset emission entirely when the lookup is ambiguous.

Empirically across the full micro-benchmark:
- HeaderGen: 805/853 entries have a unique position, 5 are ambiguous, 43 are unfindable in source (the latter are HeaderGen's `ClassName.attr` style which differs from GT's `self.attr` convention; unmatchable regardless).
- Scalpel: 369/369 unique, zero ambiguous.

So the col_offset enrichment is recovery (the position is determined by what the tool already emitted) for >94% of HeaderGen output and 100% of Scalpel output. The remaining ambiguous entries are now correctly handled by NOT attaching a position rather than guessing.

Docker results after this change:
- HeaderGen: 0 -> 580/850 strict (591 -> 601 lenient)
- Scalpel: 0 -> 180/850 strict (183 -> 187 lenient)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
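The ast-based lookup described in this commit could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in translator.py: `build_position_map` and `lookup_col_offset` are hypothetical names, only `ast.Name` and `ast.arg` nodes are indexed (function and class definition names are left out for brevity), and the `+1` conversion to 1-indexed columns is assumed from the convention mentioned in the PR description.

```python
import ast
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

PositionMap = Dict[Tuple[str, int], Set[int]]


def build_position_map(source: str) -> PositionMap:
    """Collect every 0-indexed column at which a name occurs on each line."""
    columns: PositionMap = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            columns[(node.id, node.lineno)].add(node.col_offset)
        elif isinstance(node, ast.arg):
            columns[(node.arg, node.lineno)].add(node.col_offset)
    return columns


def lookup_col_offset(columns: PositionMap, name: str, line: int) -> Optional[int]:
    """Return a 1-indexed column only when the position is unambiguous."""
    # "a[0]" and "self.x" are looked up by their base name ("a", "self").
    base = name.split("[", 1)[0].split(".", 1)[0]
    candidates = columns.get((base, line), set())
    if len(candidates) != 1:
        return None  # unfindable or ambiguous: skip rather than guess
    return next(iter(candidates)) + 1  # assumed 1-indexed convention


source = "a = [1, 2]\na[0] = 3\nx = lambda x: x\n"
position_map = build_position_map(source)
print(lookup_col_offset(position_map, "a[0]", 2))  # 1
print(lookup_col_offset(position_map, "x", 3))     # None (three x's on the line)
```

Storing a set of columns per key, instead of keeping the first one seen, is what makes the ambiguity visible: the caller can tell one candidate apart from several and decline to attach a position in the latter case.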
The HeaderGen Docker image previously installed only gcc, not g++.
HeaderGen's transitive dependency `line-profiler` has a Cython C++
extension; on platforms where no prebuilt wheel exists (e.g., arm64
Linux + Python 3.10), pip falls back to building from source, which
requires g++. The build then fails with:
g++ -... -c line_profiler/_line_profiler.cpp -o ...
error: command 'g++' failed: No such file or directory
ERROR: Failed building wheel for line-profiler
This change adds g++ to the apt-get install line. The image now builds
cleanly on arm64 hosts as well as amd64.
(Builds may have appeared to succeed previously on hosts where a
line-profiler wheel was already cached from an earlier build with g++
present; fresh builds without the cache hit the source-compile path
and surface the issue.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi TypeEvalPy maintainers, we've been using the benchmark and noticed that the static-tool runners can't pass the strict scorer (`is_same_element`, added in 2f7c6056) because they don't emit `col_offset`. This PR fixes that for HeaderGen, Jedi, and Scalpel. Before: 0 strict matches across all three on micro-benchmark. After: 580/476/180 respectively.

Changes

Jedi (`src/target_tools/jedi/src/jedi_type_inference.py`)
- Emit `col_offset` in the four `_info` dict construction sites (Jedi's API already provides `column`; the runner was collecting it via `pos["column"]` and then discarding it).
- Rewrite `get_function_name` to walk parent scopes. The old `split(".", 1)[-1]` only stripped one dotted component, producing names like `assignments.chained.main.func1` instead of `func1` outside Docker's shallow file layouts. The new version also correctly produces qualified names for nested functions (`func.dec`).
- Replace `parent.name == parent.module_name` with `parent.type == "module"`. The old check added a spurious `function` key to module-level variables.
- A function-reference assignment such as `a = func1` now types as `callable` rather than func1's return type.

HeaderGen (`src/target_tools/headergen/src/translator.py` + 1 line in `runner.py`)
- The `headergen` server doesn't return position info. The translator now parses the source with stdlib `ast`, builds a `(name, line) → col_offset` map, and attaches columns to the server's entries. Subscript/attribute expressions use the base name's column; nested functions use the inner name's. Ambiguous lookups (>1 candidate on the same line, e.g. `x = lambda x: x`) are skipped rather than guessed: 5 of 853 entries across the benchmark.

Scalpel (`src/target_tools/scalpel/src/translator.py` + 2 lines in `runner.py`)
- Same stdlib `ast` position lookup as HeaderGen, applied to the dicts the runner builds in-process.

HeaderGen Dockerfile
- Install `g++` in addition to `gcc`. HeaderGen's transitive dep `line-profiler` has a Cython C++ extension; on arm64 Linux + Python 3.10 (no prebuilt wheel) pip builds from source and fails without g++.

Results (Docker, micro-benchmark, `main_analyze_results.py` Total line)

Lenient also rises modestly, mostly from Jedi's three non-col_offset fixes (which help under both scorers).
Methodology note
The HeaderGen and Scalpel `col_offset` values come from parsing the source with stdlib `ast`, not from the tool's inference output. We validated empirically that this is recovery, not synthesis: 805/853 HeaderGen entries and 369/369 Scalpel entries have a uniquely-determined source position from the `(name, line_number)` the tool already emits. The translator skips ambiguous cases. The `+1` adjustment for 1-indexed columns follows the existing convention used by Pyright's and RightTyper's runners.

Happy to iterate.