Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[Unreleased]

Added

labeille tsan-run command for running extension test suites under ThreadSanitizer-enabled free-threaded Python. Captures data race reports for use with ft-review-toolkit's tsan-report-analyzer agent. Includes --test-script for running custom concurrent stress tests instead of the package's test suite, --quick mode for CI, --stress N for repeated runs, and bundled CPython TSan suppressions.
labeille cext-build command for generating compile_commands.json compilation databases from C extension packages. Uses Bear to intercept compiler invocations during the build, with automatic fallback to build-system-specific mechanisms for Meson and CMake projects.
CextBuildConfig, CextBuildResult, CextBuildMeta dataclasses in cext_build.py.
detect_bear() for auto-detecting bear installation and version.
detect_build_system() for identifying the build system from project files (meson, cmake, setuptools, flit, hatch, pdm).
extract_build_requires() for reading [build-system].requires from pyproject.toml to support --no-build-isolation builds.
find_compile_db() for searching repo and build directories for generated compilation databases.
postprocess_compile_db() for fixing source file paths in generated compilation databases.
Output includes per-package compile_commands.json, repo symlinks, build logs, and JSONL result summaries.
Designed for integration with cext-review-toolkit Tier 2 analysis (clang-tidy).

Enhanced

Extract _build_run_config from run_cmd in cli.py to separate config validation/construction from CLI execution.
Extract _md_table helper from export_registry_report_md in analyze_cli.py to deduplicate 8 markdown table constructions.
Extract _prepare_source, _classify_install_result, _check_import_result from _survey_package in compat.py, reducing complexity from 23 branches to ~8 per function.
Split bench_cli.py:compare into _compare_intra_run, _compare_cross_run, and _report_anomalies, eliminating a function with cyclomatic complexity (CC) of 62.
Extract _collect_package_data, _compute_trends, _classify_trends, _build_median_dicts from analyze_series_trends in bench/trends.py (CC=48 → ~10 per helper).
Extract _infer_field_type from batch_set_field in registry_ops.py to deduplicate type inference.
Type FTRunMeta.system_profile and python_profile as SystemProfile and PythonProfile instead of dict[str, Any].
Extract _validate_package_file from validate_registry in registry_ops.py (CC=59 → ~15 per function).
Extract SEVERITY_LABELS constant from 4 inline duplications in bench/display.py and bench/export.py.
Extract _make_anomaly helper in bench/anomaly.py to deduplicate 9 identical PackageAnomaly constructions in detect_condition_anomalies.
bench run and ft run now use shared setup_logging() for consistent log formatting.
Simplify _run_package_inner in runner.py by extracting 4 helper functions: _align_sdist_version, _setup_venv, _install_in_venv, _check_import.
Split _check_import_and_extras into _check_import and _install_extra_deps for single-responsibility.
Rename compat diff to compat compare for CLI vocabulary consistency.
Extract parse_env_pairs() and parse_csv_list() CLI helpers to eliminate duplicated parsing across cli.py, registry_cli.py, compat_cli.py, bench_cli.py, and ft_cli.py.
Decompose generate_registry_report (190 lines) into focused helpers: _accumulate_package_stats, _analyze_compat_blockers, _analyze_per_version, _analyze_download_tiers.
Decompose _print_run_summary (180 lines) into 7 section formatters for readability.
Unify run ID generation with generate_run_id() in io_utils.py — all subsystems now use UTC with consistent format.
Extract run_in_process_group() into io_utils.py as a shared subprocess lifecycle utility, used by both runner.py and bench/timing.py.
bench/results.py append_package_result now uses shared append_jsonl instead of raw open().
Replace RepoHostStats, InstallComplexity, CompatBlockers dataclasses with dict[str, int] counters on RegistryReport, eliminating getattr/setattr usage.
Replace Any type annotations with concrete types in ft/export.py, ft/compare.py, ft/display.py.
Simplify _build_bench_config by splitting into _build_base_config and _apply_config_overrides, eliminating locals() dict unpacking via @click.pass_context.
Use clean_env() in bench/system.py instead of inline env dict construction for pattern consistency.
Use append_jsonl() in migrations.py instead of raw open()/json.dumps for pattern consistency.

Fixed

Fix bench compare Click decorator chain pointing at wrong function after complexity split, restoring multi-directory comparison and --metric option.
Log PyPI metadata extraction failures in ft/runner.py instead of silently swallowing (KeyError, TypeError).
Wrap load_package in filter_packages with try/except to skip bad YAML instead of crashing the entire batch run.
Add logging to bisect.py git helpers (_resolve_rev, _get_commit_info) on failure.
Add OSError handler to _check_import_result in compat.py for missing venv python.
Use dataclass_from_dict for ErrorMatch deserialization in compat.py for forward compatibility.
Remove stale RunOutput as RunOutput re-export from runner.py.
Narrow FieldFilter.op to Literal type in registry_ops.py.
Wrap checkout_matching_tag git fetch --tags in try/except to prevent unhandled TimeoutExpired/OSError crashes.
Make load_ft_run lenient for missing ft_results.jsonl (returns empty list like load_bench_run).
Add missing installer_backend field to analyze.PackageResult to match runner_models.PackageResult.
Narrow IndexEntry.extension_type from str to Literal["pure", "extensions", "unknown"].
Narrow PackageResult.install_from and installer_backend from str to Literal types.
Fix --no-shallow / --clone-depth=0 to produce actual full clones (no --depth flag) instead of silently defaulting to depth=1.
Add warning logs to 4 silent return-None paths in repo_ops.py (clone_repo, pull_repo, checkout_revision, fetch_latest_pypi_version).
Promote pull_repo git reset/clean failure logs from DEBUG to WARNING for visibility.
Log warning for corrupted run_meta.json in analyze.py instead of silently returning empty metadata.
Add ignore_errors=True to shutil.rmtree in bench/runner.py venv refresh to prevent crashes on permission errors.
Standardize exception chaining in 3 bench_cli.py load_series handlers, replacing from None with from exc.
Type FTPackageResult.extension_compat as ExtensionCompat | None instead of dict[str, Any] | None, eliminating untyped dict access across ft subsystem.
Replace _setup_package return type dict[str, Any] with typed _PackageSetup dataclass in bench/runner.py.
Log ValueError/OSError in ft/runner.py stderr and stdout reader threads instead of silently ignoring.
Remove thin extract_minor_version wrapper from analyze.py; callers now use io_utils.extract_minor_version directly.
Replace ProgressCallback = Any with Callable[[BenchProgress], None] | None in bench/runner.py.
Replace list[Any] with list[PackageEntry] in bench/runner.py _run_sequential and _run_interleaved.
Replace cond: Any with BenchConditionResult in bench/compare.py _collect_call_durations.
Replace Any return/param types with BenchConfig in bench_cli.py _build_base_config and _apply_config_overrides.
Add Literal types for BisectStep.status, BisectConfig.installer, PackageAnomaly.anomaly_type/severity, ValidationError.severity, BenchConfig.installer, and analyze.PackageResult.status.
Log warning in runner.py:list_installed_packages when pip exits with non-zero code instead of silently returning empty.
Raise RegistrySchemaError in check_registry_schema when schema.yaml exists but is corrupt instead of silently ignoring.
Set import_ok = False in ft/runner.py when the extension compat check raises instead of leaving the default True.
Remove dead hasattr(result, "returncode") guard in bench/runner.py — install_with_fallback always returns CompletedProcess.
Use parse_env_pairs in ft_cli.py and bench_cli.py instead of inline env parsing that silently drops malformed pairs.
Use load_jsonl in migrations.py:read_migration_log instead of hand-rolled JSONL parsing.
Add missing docstrings to 8 to_dict methods in ft/analysis.py, ft/compare.py, ft/compat.py.
Replace 21 assertTrue(len(...)) with assertGreater/assertGreaterEqual in tests for better failure messages.
Convert mkdtemp() to TemporaryDirectory() in test_ft_compat.py for consistency.
Type pkg: Any parameters as PackageEntry in bench/runner.py and ft/runner.py for type safety.
Fix IndexEntry.package attribute access (should be .name) in ft/runner.py:_select_packages.
Add missing encoding="utf-8" to bench_cli.py export write_text() call.
Import check OSError now sets install_error status instead of silently continuing.
Standardize exception variable naming from as e: to as exc: in ft/compat.py, registry_ops.py, and registry_cli.py (5 instances).
Raise compat survey clone/build failure log levels from debug to warning for visibility.
Rename from_dict parameter d → data in CompatResult and CompatMeta for consistency.
Bisect extra deps install failure now returns a skip step instead of silently continuing.
Narrow except Exception to specific types in analyze_cli.py and ft/compat.py.
Add warning logs for silent failures in runner.py (JIT check), scan_deps.py (dir scan), and bisect.py (deps install).
Extension probe script now reports walk_error and skipped_modules instead of silently passing.
Guard all yaml.safe_load call sites against YAMLError: add safe_load_yaml utility to io_utils.py, protect registry.py, registry_ops.py, analyze_cli.py, migrations.py, and bench/config.py.
Add 120-second timeout to git subprocess calls in registry sync to prevent indefinite hangs.
Add ignore_errors=True to shutil.rmtree in venv refresh to prevent cleanup failures masking results.
Log PermissionError at WARNING in kill_process_group instead of silently ignoring.
Add from None to exception chains in bench_cli.py for cleaner tracebacks.
Include exception details in connection error log in resolve.py.
Guard rename in shield_source_dir finally block against missing source file.
Add load_json_file utility to io_utils.py and guard 5 unprotected json.loads call sites against JSONDecodeError.
Add JSONDecodeError to exception handling in bench/system.py Python profile probe.
Add KeyError handling for missing probe output fields in ft/compat.py.
Include exception details in schema.yaml parse warning in registry.py.
Catch ValueError from malformed tracking.json in bench_cli.py track commands.
Surface skipped-package counts in ft/runner.py, bench/runner.py, and bench/tracking.py so users know when results are incomplete.
Extract dataclass_from_dict utility to io_utils.py and deduplicate 13 identical from_dict implementations across bench/ and ft/ modules.
Use atomic_write_text for ft/results.py JSONL output and runner.py summary file to prevent corruption on interruption.
Remove 3 unnecessary private re-exports (_EXTRAS_RE, _SELF_INSTALL_RE, _TAG_PATTERNS) from runner.py.
Move dataclass_from_dict imports from method bodies to module level across 9 files.
Deduplicate extract_minor_version: move canonical implementation to io_utils.py, delegate from runner.py and analyze.py.
Remove deprecated extract_python_minor_version wrapper from runner.py; callers in cli.py and compat_cli.py now use io_utils.extract_minor_version directly.
Delegate format_signal_name in formatting.py to crash.signal_name instead of duplicating signal conversion logic.
Replace inline JSON write in bench/tracking.py with write_meta_json.
Guard _read_lines call sites in registry_ops.py against OSError/UnicodeDecodeError to prevent batch operations from crashing on corrupt files.
Promote list_installed_packages error log from info to warning in runner.py.
Promote iter_jsonl per-line error log from debug to warning in io_utils.py.
Guard build log write_text in compat.py against OSError to prevent filesystem errors from aborting runs.
Catch CalledProcessError and TimeoutExpired from git clone/fetch in bisect.py to provide clear error messages.
Include exception details in malformed series warning in bench/tracking.py.
Replace getattr with direct attribute access on typed dataclass objects in analyze.py, summary.py, bench/runner.py, and ft/runner.py.
Migrate registry_cli.py from sys.exit(1) to raise click.ClickException(...) for validation errors and sync failures, matching the rest of the CLI surface.
Extract _HOST_LABELS and _INSTALL_LABELS to module-level constants in analyze_cli.py, removing duplicates between terminal and markdown formatters.
Migrate RunMeta.from_dict and PackageResult.from_dict in analyze.py to use dataclass_from_dict utility.
Use safe_load_yaml in migrations.py instead of duplicating the YAML load-and-validate pattern.
Wire up setup_logging for --verbose flags in analyze registry, analyze run, and registry sync commands that previously accepted but ignored the flag.
Add load_yaml_strict utility to io_utils.py and migrate 4 inline YAML load-and-validate patterns in registry.py, compat.py, and bench/config.py.
Use atomic_write_text for bench/results.py JSONL batch write and remove unused to_jsonl_line/from_jsonl_line methods.
Use dataclass_from_dict in BenchIteration.from_dict instead of inline field-filtering reimplementation.
Use load_json_file in bench/tracking.py load_series instead of inline JSON parsing.
Add exc_info=True to 14 log.error() calls inside except blocks across runner.py, resolve.py, bench/runner.py, and ft/runner.py to preserve tracebacks for debugging.
Explicitly set import_ok = False in ft/compat.py JSONDecodeError, KeyError, and OSError handlers for robustness.
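
Several entries above refer to dataclass_from_dict and its field-filtering behaviour. A minimal sketch of that idea follows, assuming the utility simply discards keys that are not dataclass fields so that files written by a newer version still load. ExampleRecord is a hypothetical stand-in, not a class from this project, and the real utility in io_utils.py may do more.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, TypeVar

T = TypeVar("T")


def dataclass_from_dict(cls: type[T], data: dict[str, Any]) -> T:
    """Build cls from data, ignoring keys that are not dataclass fields."""
    if not is_dataclass(cls):
        raise TypeError(f"{cls!r} is not a dataclass")
    # Only fields accepted by __init__ can be passed as keyword arguments.
    known = {f.name for f in fields(cls) if f.init}
    return cls(**{key: value for key, value in data.items() if key in known})


@dataclass
class ExampleRecord:
    """Hypothetical record used only to demonstrate the helper."""

    category: str
    pattern: str = ""
```

Unknown keys such as a field added in a later schema version are dropped, while missing optional fields fall back to their dataclass defaults.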

Documentation

Add missing help= text to 21 CLI options across cli.py, analyze_cli.py, ft_cli.py, compat_cli.py, and bench_cli.py.

Tests

Add 11 tests for compat survey execution pipeline: _prepare_source, _classify_install_result, _check_import_result.
Add 5 tests for run_ft orchestrator and _select_packages (was 0% coverage).
Add 3 tests for _survey_package integration (build_ok, build_fail, timeout paths).
Add 3 tests for _align_sdist_version (source mode, sdist with tag, sdist without tag).
Add 26 tests for bench track subcommands: init, add, show, pin, unpin, list, trend, alert.
Add CLI tests for registry subcommands: rename-field, set-field (--all, --where, --packages), validate (--packages filter), migrate (list, unknown, missing name), sync (clone, pull, failure, non-git), add-index-field, remove-index-field, and group help.
Add test_cli_utils.py with 14 tests for parse_env_pairs and parse_csv_list.
Add 12 behavioral tests for ft compare, ft report, and ft export CLI commands.
Add test_repo_ops.py with 31 tests for clone_repo, pull_repo, checkout_revision, parse_package_specs, and parse_repo_overrides.
Add 35 tests to test_io_utils.py for load_yaml_strict, iter_jsonl, load_jsonl, append_jsonl, dataclass_from_dict, extract_minor_version, generate_run_id, and write_meta_json.
Add 11 tests for kill_process_group and run_in_process_group covering process group kills, timeout handling, and fallback to proc.kill().
Add test_cli.py with 21 tests for resolve, run, bisect, and scan-deps CLI commands covering parameter validation, error paths, and output formatting.

Documentation

Update CLAUDE.md: fix stale test count (546 → 2068) and add 13 missing modules to architecture section.

Added

labeille analyze registry now generates a comprehensive three-tier report: summary (default), detailed (--detail), and verbose (--detail --verbose).
--export-markdown flag for analyze registry generates a Markdown document suitable for inclusion in a repository.
-o/--output flag to write report output to a file.
--python-version is now repeatable for multi-version analysis.
New report sections: enrichment progress, per-version readiness, compatibility blockers (PyO3, Cython, Meson, CMake, Fortran, removed C API), repository hosting distribution, install command complexity, and download tier coverage.
RegistryReport dataclass with sub-reports: EnrichmentProgress, VersionAnalysis, RepoHostStats, InstallComplexity, CompatBlockers, DownloadTierCoverage.
generate_registry_report() function for comprehensive registry analysis in a single pass.
Helper classifiers: _classify_repo_host(), _classify_install_complexity(), _classify_compat_blocker().
labeille registry sync command to clone or update the laruche registry into the default location.
Schema version checking via schema.yaml at the registry root. labeille checks this at load time and gives an actionable error if the registry schema is incompatible.
default_registry_dir() and LABEILLE_REGISTRY_DIR environment variable for configuring the registry location.
RegistrySchemaError exception for incompatible registry schemas.
--adaptive flag for labeille bench run: stop iterating early when wall-time measurements converge (RSE below threshold).
--adaptive-threshold option (default 0.005 = 0.5% RSE) and --adaptive-min-iterations option (default 5) for fine-tuning convergence behavior.
adaptive, adaptive_threshold, and adaptive_min_iterations fields in YAML benchmark profiles.
converged_early field on BenchConditionResult, recorded in bench_results.jsonl.
relative_standard_error() function in bench/stats.py.
Adaptive convergence support in all three execution strategies (block, alternating, interleaved).
Quick mode (--quick) now enables adaptive convergence by default.
Convergence indicators in benchmark display: checkmark in table, count in quality summary, config line.
--trust-ft-wheels flag for labeille ft run: packages with free-threaded wheels (cpXYt ABI tag) for the target Python version are classified as compatible_by_wheel without running tests.
--trust-ft-wheels-any-version flag for labeille ft run: like --trust-ft-wheels but trusts free-threaded wheels built for any Python version. Implies --trust-ft-wheels.
COMPATIBLE_BY_WHEEL category in FailureCategory with ⊕ symbol and severity 0.
has_ft_wheel() function in classifier.py to detect free-threaded wheels in PyPI release metadata.
ft_wheel_found and ft_wheel_version fields on FTPackageResult for provenance tracking.
trust_ft_wheels and trust_ft_wheels_any_version fields on FTRunConfig, recorded in ft_meta.json.
FT wheel check reuses PyPI metadata for sdist version lookup when both --trust-ft-wheels and --install-from sdist are active.
--install-from {source|sdist} option for labeille run and labeille ft run: install packages from PyPI source distributions while running tests from cloned git repos.
Sdist version alignment: fetch_latest_pypi_version() queries PyPI, checkout_matching_tag() aligns the repo to the matching release tag.
Source directory shielding: shield_source_dir() temporarily renames flat-layout source dirs to prevent local imports from shadowing the sdist-installed package.
Install command splitting: split_install_command() and build_sdist_install_commands() separate self-install segments from test dependency segments.
install_from, sdist_version, and sdist_tag_matched fields on PackageResult, FTPackageResult, and analysis PackageResult.
labeille compat command group for C extension compatibility surveys: survey, show, diff, and patterns subcommands.
~30 built-in error classification patterns across 10+ categories (removed_c_api, cython_incompatible, pyo3_incompatible, numpy_c_api, missing_system_lib, etc.).
YAML-based custom error pattern support with override semantics.
Survey diff for tracking regressions and fixes between Python versions.
Markdown export for sharing compatibility survey results.
Optional uv integration for faster venv creation and package installation via --installer flag (auto/uv/pip).
InstallerBackend enum, detect_uv(), resolve_installer(), and _rewrite_install_command() in runner.py.
install_with_fallback() for automatic pip fallback when uv install fails.
installer_backend field on PackageResult and in run metadata.
--installer CLI option on run, bench run, and bisect commands.
Scaled registry from ~350 to 1500 packages: 720 active, 362 skip_versions (3.15 blockers), 418 fully skipped, with 86.4% working test harness coverage.
5-tier test directory detection in _auto_detect_test_dirs(): standard dirs (t/, spec/), package-named/internal dirs, monorepo subdirs, root-level test files, and scattered test files in package source.
Multi-forge URL normalization via _normalize_forge_url() supporting GitHub, GitLab, Bitbucket, and Codeberg.
Expanded extract_repo_url() with all-values project_urls scan and description field scanning as last resort.
recover-no-tests-found and recover-no-repo-url registry migrations for recovering falsely skipped packages.
trends.py module in bench subpackage with PackageTrend, RegressionAlert, and SeriesTrend dataclasses for longitudinal benchmark analysis.
compute_package_trend() with configurable regression/trend thresholds and sustained-change detection.
analyze_series_trends() for full series analysis: loads all runs, computes per-package trends, generates regression alerts.
Five alert types: new_regression, sustained_regression, recovery, new_instability, new_improvement.
labeille bench track trend command for viewing trend analysis with table, CSV, and Markdown output.
labeille bench track alert command for viewing regression alerts since the last run.
export_trend_markdown() and export_trend_csv() in bench/export.py for trend report generation.
format_series_trend() and format_regression_alerts() in bench/display.py for terminal output.
tracking.py module in bench subpackage with TrackingSeries and TrackingRunEntry dataclasses for longitudinal benchmark tracking.
compute_config_fingerprint() for comparing benchmark configurations across runs (ignores package list and system profile).
Series management: init_series(), add_run_to_series(), pin_baseline(), unpin_baseline(), load_series(), save_series(), list_series().
labeille bench track CLI subgroup with init, add, show, list, pin, and unpin commands.
Symlink-based run storage within tracking directories to avoid data duplication.
constraints.py module in bench subpackage with ResourceConstraints dataclass and ulimit/taskset command wrapping.
Resource constraints as part of the condition abstraction: --memory-limit, --cpu-affinity, and --cpu-time-limit CLI flags for labeille bench run.
Per-condition constraint specification in YAML benchmark profiles.
Inline condition constraint parsing (e.g. --condition "constrained:memory_limit=1024,cpu_affinity=0+1").
OOM detection via detect_oom_from_result() with new "oom" iteration status.
constraints_applied and oom_detected fields on BenchIteration.
constraints field on ConditionDef for per-condition resource limits.
cache.py module in bench subpackage for filesystem cache management during benchmarks.
--drop-caches flag for labeille bench run to drop filesystem caches between iterations for cold-cache benchmarking.
--warm-vs-cold flag to automatically compare warm-cache and cold-cache performance.
--run-dangerously-as-root safety flag — labeille refuses to run as root without it.
labeille bench setup-cache-drop command showing setup instructions for the sudoers-based cache-dropping helper.
generate_drop_caches_script() and format_setup_instructions() helpers for cache-drop setup.
caches_dropped field on BenchIteration to record cache state per iteration.
Per-test timing capture via pytest --durations=0 output, enabled with --per-test-timing flag on labeille bench run.
TestTiming and PerTestTimings dataclasses in bench/timing.py with pytest output parser.
compare_per_test() in bench/compare.py for per-test overhead analysis between conditions.
--per-test <package> option for labeille bench show and labeille bench compare to display per-test timing breakdown.
anomaly.py module in bench subpackage with PackageAnomaly and AnomalyReport dataclasses for proactive measurement-quality assessment.
detect_anomalies() with five anomaly types: high_cv, bimodal, outlier_heavy, status_mixed, and trend.
is_bimodal() gap-analysis heuristic and has_monotonic_trend() Spearman rank correlation for pure-Python distribution analysis.
--anomalies flag for labeille bench show to display measurement anomaly report.
Anomaly summary in labeille bench compare output when anomalies are detected.
## Anomalies section in Markdown export when anomalies are present.
ft/compare.py module with PackageTransition, FTComparisonResult dataclasses and compare_ft_runs() for cross-run comparison: category transitions, pass rate changes, new/resolved crashes, aggregate deltas.
ft/export.py module with export_csv(), export_json(), and generate_report() for CSV, JSON, and markdown report export of free-threading results.
ft/display.py module with terminal formatting for free-threading results: compatibility summaries, package tables, flakiness profiles, triage lists, GIL comparison reports, and progress output.
ft_cli.py module with labeille ft CLI subgroup: run, show, compare, compat, flaky, report, export commands for free-threading compatibility testing.
ft/analysis.py module with FlakyTest, FlakinessProfile, GILComparisonResult, TriageEntry, DurationAnomaly, and FTAnalysisReport dataclasses for free-threading result analysis.
analyze_flakiness() for detailed flakiness profiling with failure pattern classification and consecutive streak detection.
compare_gil_modes() for GIL-enabled vs free-threaded result comparison.
prioritize_triage() for severity-scored triage with extension and TSAN bonuses.
detect_duration_anomalies() using statistical outlier detection from bench/stats.py.
analyze_ft_run() full analysis pipeline producing FTAnalysisReport.
ft/runner.py module with FTRunConfig, OutputMonitor, run_single_iteration(), run_package_ft(), and run_ft() for free-threading test execution with crash/deadlock/TSAN detection and pytest output parsing.
ft/results.py module with FailureCategory enum, IterationOutcome, FTPackageResult, FTRunMeta, and FTRunSummary dataclasses for free-threading result storage, categorization, and JSONL/JSON serialization.
categorize_package() for priority-ordered classification (install failure > import failure > deadlock > crash > TSAN > GIL fallback > compatible > incompatible > intermittent).
ft subpackage with compat.py module for extension GIL compatibility detection: runtime probe via sys._is_gil_enabled() and source scan for Py_mod_gil declarations.
ExtensionInfo, SourceScanResult, ModGilDeclaration, and ExtensionCompat dataclasses with JSON serialization.
probe_gil_fallback() runtime probe, scan_source_for_mod_gil() source scanner, assess_extension_compat() combined assessment, and format_extension_compat() display helper.
guess_import_name() with _IMPORT_NAME_OVERRIDES table for PyPI-to-import name resolution.
bench subpackage with system.py module for capturing system profiles (CPU, RAM, OS, disk) and Python interpreter profiles (version, JIT/GIL state, build flags) for benchmark reproducibility.
SystemProfile, PythonProfile, StabilityCheck, and SystemSnapshot dataclasses with JSON serialization and terminal display formatting.
check_stability() pre-benchmark validation (load average, available RAM).
stats.py module in bench subpackage with pure-Python statistical functions: describe(), welch_ttest(), cohens_d(), bootstrap_ci(), detect_outliers(), and compute_overhead() for benchmark comparison.
DescriptiveStats, TTestResult, EffectSize, BootstrapCI, and OverheadResult dataclasses with scipy fallback for t-test p-values.
timing.py module in bench subpackage with run_timed() and run_timed_in_venv() for capturing wall time, CPU time (via resource.getrusage delta), and peak RSS (via GNU /usr/bin/time with ru_maxrss fallback).
results.py module in bench subpackage with BenchIteration, BenchConditionResult, BenchPackageResult, ConditionDef, and BenchMeta dataclasses for the full benchmark result hierarchy, plus JSONL/JSON serialization via save_bench_run(), load_bench_run(), and append_package_result().
config.py module in bench subpackage with BenchConfig dataclass, YAML profile loading, inline condition parsing, test command resolution, environment/deps merging, and configuration validation.
runner.py module in bench subpackage with BenchRunner class orchestrating the full benchmark lifecycle: system profiling, stability checks, package setup (clone/venv/install per condition), timed iteration execution, and incremental JSONL result writing. Supports block, alternating, and interleaved execution strategies with progress callbacks.
BenchProgress dataclass and quick_config() helper for rapid iteration during development.
display.py module in bench subpackage with terminal formatting for benchmark results: per-package timing tables, multi-condition comparison tables with overhead/CI/significance, measurement quality summaries, and aggregate comparison summaries.
compare.py module in bench subpackage with structured comparison analysis: PackageOverhead and ComparisonReport dataclasses with anomaly flags (high CV, status mismatch, outliers), compare_conditions() for within-run comparison, and compare_runs() for cross-run comparison.
export.py module in bench subpackage with CSV (long-format per-iteration and summary) and Markdown export for external analysis tools and reports.
bench_cli.py module with labeille bench CLI subgroup: run (execute benchmarks from profiles or inline conditions), show (display saved results), compare (compare conditions within or across runs), system (print system characterization), and export (CSV/Markdown/CSV-summary export).
labeille bisect command to binary-search a package's git history and find the first commit that introduced a crash.
bisect.py module with BisectConfig, BisectStep, BisectResult dataclasses and the run_bisect algorithm with skip-neighbor handling for unbuildable commits.
Commit-aware run comparison: analyze compare and analyze run show git commit changes alongside status changes with heuristic annotations (e.g. "unchanged — likely a CPython/JIT regression").
PackageComparison dataclass with commit_changed/commit_unchanged properties for per-package comparison data.
New crash summary statistics in compare output showing repo unchanged/changed/unknown counts.
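
To make the adaptive convergence entries above concrete, here is a small sketch of a relative standard error (RSE) check using the documented defaults (threshold 0.005, minimum 5 iterations). It assumes RSE means the standard error of the mean divided by the mean. The has_converged name and the handling of degenerate inputs are illustrative choices, not taken from bench/stats.py.

```python
import math
import statistics


def relative_standard_error(samples: list[float]) -> float:
    """Standard error of the mean divided by the mean (inf if undefined)."""
    if len(samples) < 2:
        return math.inf
    mean = statistics.fmean(samples)
    if mean == 0:
        return math.inf
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return sem / abs(mean)


def has_converged(
    wall_times: list[float],
    threshold: float = 0.005,
    min_iterations: int = 5,
) -> bool:
    """Hypothetical stop rule: enough iterations and RSE below threshold."""
    if len(wall_times) < min_iterations:
        return False
    return relative_standard_error(wall_times) < threshold
```

A runner using such a rule would call has_converged() after each timed iteration and stop early once it returns True, recording that fact in a field such as converged_early.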

Changed

The package registry has been moved to its own repository: laruche. Use labeille registry sync to fetch it.
Default --registry-dir changed from registry/ (local) to ~/.local/share/labeille/registry/ (user-level). Override with LABEILLE_REGISTRY_DIR env var.
All documentation updated to reflect the split. Enrichment docs now live in laruche.
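
The registry location lookup described above can be pictured with the sketch below. It assumes the environment variable, when set and non-empty, takes precedence over the user-level default. How this interacts with an explicit --registry-dir argument is not shown here and is left to the CLI.

```python
import os
from pathlib import Path


def default_registry_dir() -> Path:
    """Return the registry directory, honouring LABEILLE_REGISTRY_DIR."""
    override = os.environ.get("LABEILLE_REGISTRY_DIR")
    if override:
        # Assumption: "~" in the override is expanded for convenience.
        return Path(override).expanduser()
    return Path.home() / ".local" / "share" / "labeille" / "registry"
```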

Enhanced

Added Literal types to 6 string-constrained fields: IterationOutcome.status, RunnerConfig.installer/install_from, PackageEntry.extension_type/install_method/test_framework. Mypy now catches invalid values at type-check time.
Extracted shared kill_process_group() into io_utils.py, replacing 3 independent implementations in runner.py, bench/timing.py, and ft/runner.py. The FT runner now correctly uses os.getpgid() and signal.SIGKILL instead of raw os.killpg(pid, 9).
Extracted _build_bench_config() helper from bench_cli.run, reducing the command body from ~130 lines to ~15 lines. Organized 35 Click options into labeled sections (profile, execution, package selection, paths, stability, adaptive, advanced).
Added utc_now_iso() helper to io_utils.py, unifying 15+ timestamp generation sites across 8 modules to a single canonical UTC format with Z suffix.
Registry save_index() and save_package() now use atomic_write_text() for crash-safe writes, preventing corruption of the most sensitive files.
Added Literal types to PackageResult.status, ResolveResult.action, BenchIteration.status, and EffectSize.classification so typos are caught at type-check time.
Made DescriptiveStats, TTestResult, EffectSize, BootstrapCI, OverheadResult, and CrashInfo dataclasses frozen (immutable).
Narrowed 23 except Exception handlers in bench/system.py to _PROBE_ERRORS tuple, preventing accidental swallowing of programming errors while preserving best-effort system probing.
Added logging to 8 previously silent exception handlers in ft/runner.py, bisect.py, cli.py, runner.py, and bench/timing.py.
Added error handling to save_crash_stderr() with mkdir for the crashes directory.
Fixed ScanResult forward reference in cli.py; it now uses a TYPE_CHECKING import, removing 4 suppression markers and 2 runtime asserts.
Standardized --output options to click.Path(path_type=Path) in ft_cli.py and analyze_cli.py.
Added type=int to ft run --timeout for consistency with other commands.
Added docstrings to 7 undocumented public APIs in compat.py (properties, to_dict, from_dict).
Fixed cli.py module docstring to list all 9 subcommand groups.
Added test_logging.py (8 tests) and test_io_utils.py (10 tests) for previously untested foundation modules.
Derived _KNOWN_FIELDS and _FIELD_TYPES in registry_ops.py from PackageEntry dataclass metadata, preventing drift.
Added PackageResult.to_dict() method, simplifying append_result() from 22-field manual dict to asdict().
Eliminated redundant _atomic_write wrappers in registry_ops.py and migrations.py.
Added encoding="utf-8" to bench/runner.py mid-run metadata write.
Extracted shared atomic_write_text() utility in io_utils.py, replacing duplicate implementations in registry_ops.py, migrations.py, and bench/tracking.py.
Promoted _clean_env() to public API as clean_env(), replacing inline env sanitization in ft/compat.py and bench/runner.py.
Unified logger acquisition: all 20 bench/ and ft/ modules now use get_logger() with per-module names (e.g., bench.runner, ft.runner) for filterable log output.
Added encoding="utf-8" to all bench/ file I/O calls for cross-platform consistency.
Added show_default=True to --timeout options in bench_cli.py and ft_cli.py.
Standardized click.Path(path_type=Path) across bench_cli.py and ft_cli.py, removing manual Path() wrapping.
Added write_meta_json(), append_jsonl(), load_jsonl(), and iter_jsonl() utilities to io_utils.py, unifying persistence patterns across all four subsystems (runner, bench, ft, compat).
All meta.json writes now use atomic_write_text() for crash safety (previously only registry files were atomic).
All JSONL loads now use streaming iteration with error tolerance for malformed trailing lines.
Extracted _ensure_repo(), _run_install(), and _analyze_test_result() from _run_package_inner() in runner.py, reducing the 420-line monolith to ~180 lines and eliminating duplicated install error handling.
Narrowed remaining except Exception blocks: ft/runner.py:397 to OSError, ft/runner.py:700 to (TimeoutExpired, SubprocessError, OSError), bisect.py to (FileNotFoundError, OSError), cli.py to (OSError, ValueError, KeyError, TypeError).
Merged duplicate load_package() calls in bisect.py into a single call with warning-level logging.
Raised get_installed_packages logging from debug to info for better diagnostics.
Added from __future__ import annotations to all test files for consistency with source modules.
labeille analyze registry shows percentages alongside all counts.
Download tier coverage (top 100, 500, 1000, 2000) shows what fraction of the most-downloaded packages are testable.
Version readiness section shows per-Python-version active/skipped counts with top skip reasons.
--format counts is preserved as a backward-compatible alias for the summary format.
Updated 63 registry packages with accurate skip reasons from compat analysis: cleared vague 3.15 skip_versions, added specific failure categories (Meson, CMake, removed APIs), reclassified non-3.15 issues as skip with precise reasons, and unskipped 5 packages that now build on 3.15.
BenchRunner._run_iteration() applies resource constraints via command wrapping before execution.
BenchRunner.run() now checks for root execution and refuses unless --run-dangerously-as-root is passed.
macOS support for system profiling: CPU info from sysctl, memory from vm_stat, OS from sw_vers, disk type from diskutil. All existing Linux code paths preserved unchanged.
check_stability() and SystemSnapshot.capture() now use cross-platform _get_available_ram_gb() helper instead of Linux-only /proc/meminfo.
format_system_profile() no longer hardcodes "Linux" in the OS line; shows platform-appropriate output.
Switched build backend from setuptools to hatchling for better src layout support and lighter build dependencies.
Added minimum version pins to runtime dependencies (click>=8.0, pyyaml>=6.0, requests>=2.28).
Added py.typed marker for PEP 561 type checker support.
Added sdist/wheel exclusion rules to keep distribution lean (no tests, registry, results, or docs).
Added Installation section to README with pipx, pip, and from-source instructions.
Added Environment :: Console and Topic :: Software Development :: Quality Assurance classifiers.
Renamed Issues URL key to Bug Tracker in project metadata for PyPI display consistency.
Replaced raise SystemExit(130) # noqa: B904 with raise SystemExit(130) from None in bench_cli.py, removing the suppression.
Narrowed except Exception in bench/system.py JIT detection to except (AttributeError, TypeError).
Added explanatory comment for except BaseException in io_utils.py atomic write.
Narrowed 5 except Exception catches in bench/runner.py and 1 in ft/runner.py to specific exception tuples (OSError, subprocess.SubprocessError, ValueError, KeyError), removing all noqa: BLE001 suppressions.
Restructured cli.py subgroup registration into _register_subcommands() function, eliminating 5 noqa: E402 suppressions.
Fixed type: ignore[operator] in ft/display.py by adding an explicit None guard and in bench_cli.py by restructuring conditionals, and fixed type: ignore[no-any-return] in resolve.py by using an intermediate variable.
Extracted 7 helper functions from run_package_ft in ft/runner.py (360→~80 lines): _check_ft_wheel_trust, _clone_and_align_ft, _create_venv_and_install_ft, _install_sdist_mode, _install_source_mode, _run_ft_iterations, _run_gil_comparison.
Split runner.py (1972→1445 lines): extracted data models to runner_models.py and git/repo/sdist operations to repo_ops.py, with re-exports preserving all existing imports.
Reduced type: ignore markers in test files from 59 to 2 by typing builder kwargs as Any, using assert x is not None for type narrowing, adding explicit generic parameters, and typing mock parameters as MagicMock.
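
Several entries above rely on atomic_write_text() for crash-safe writes. The following is a minimal sketch of the general technique (write to a temporary file in the same directory, then rename over the target), assuming a signature of `(path, text, encoding)`; the real helper in io_utils.py may differ in its parameters and details.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text, encoding="utf-8"):
    """Write text so readers see either the old or the new file, never a partial one."""
    path = Path(path)
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, path)  # atomic rename over the target
    except BaseException:
        # Also clean up on KeyboardInterrupt/SystemExit, then re-raise unchanged.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Catching BaseException here is deliberate: the handler only removes the temporary file and re-raises, so an interrupted write does not leave stray `.tmp` files behind.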

Removed

Dead code: _log2() from bisect.py, RegistryStats/analyze_registry() from analyze.py (superseded by RegistryReport/generate_registry_report()), load_ft_summary() from ft/results.py, format_progress()/format_gil_comparison() from ft/display.py, unused _MOD_GIL_MENTION_PATTERN regex from ft/compat.py.
Removed 12 unused log = get_logger(...) variables and their imports from modules that had logger scaffolding but no log calls.

Fixed

Narrowed 3 broad except Exception handlers: ft/compat.py GIL probe (to ImportError, OSError, AttributeError), submodule import loop (to ImportError, OSError), and registry.py schema parsing (to yaml.YAMLError, OSError). Programming errors now propagate instead of being silently swallowed.
bench/timing.py now sanitizes subprocess environment via clean_env(), preventing PYTHONHOME/PYTHONPATH pollution in benchmark runs.
ft/runner.py now uses start_new_session=True instead of the discouraged preexec_fn=os.setpgrp, which is unsafe in multithreaded programs, fixing a thread-safety hazard with ThreadPoolExecutor.
Top-level error handlers in runner.py, ft/runner.py, and resolve.py now preserve tracebacks via exc_info=True.
Replaced SystemExit(1) with click.ClickException in bench_cli.py for consistent error formatting.
Manual review of all 1,798 enriched registry packages: corrected invalid extension_type values, added missing -p no:xdist flags, fixed inconsistent skip states, corrected repo URLs, collapsed multiline YAML, and fixed test commands.
update_index_from_packages() no longer crashes when skip_versions is None.
Bench runner install_package now receives a complete environment (starting from os.environ) instead of bare condition env vars, fixing install failures when build backends need PATH to find tools like git.
run_meta.json now stores actual CLI argument strings (sys.argv[1:]) instead of parameter names, making runs reproducible from metadata.
build_reproduce_command uses export PATH for venv activation instead of fragile .venv/bin/ prefix string replacement.
Deduplicated _signal_name (3 copies → format_signal_name in formatting.py).
Deduplicated _result_detail (3 copies → public result_detail in analyze.py).
Made _extract_minor_version public as extract_minor_version in analyze.py.
Removed redundant zero-check in compare_runs duration percentage calculation.
Fixed timeout documentation (300s → 600s) in doc/enrichment.md.
_quote_yaml_scalar now quotes all numeric strings (integers, scientific notation, octal-like), tilde, and additional YAML special characters.
find_field_extent no longer consumes trailing blank lines after scalar fields, fixing insert_field_after placement near blank lines.
Rewrote batch_set_field to use line-level manipulation instead of PyYAML round-trip, preserving YAML formatting.
Added set_field_value to yaml_lines.py for in-place field value replacement.
format_yaml_value and parse_default_value now handle None/null values.
_is_version_specific_skip now uses word-boundary regex patterns to prevent false positives (e.g. "trust" no longer matches the "rust" pattern).
scan-deps now warns about namespace packages (google, azure, zope, etc.) where pip resolution is uncertain, and tries full import paths before falling back to top-level modules.
IndexEntry now tracks skip_versions_keys for fast version-skip filtering without loading full YAML files.
filter_packages uses index-level skip_versions_keys to skip packages before loading YAML.
_dict_to_package coerces notes: null to empty string for type safety.
_dict_to_package logs unknown YAML keys at debug level to surface typos.
validate_registry checks uses_xdist/-p no:xdist consistency in both directions.
check_jit_enabled now uses explicit sys.flags.jit check instead of nonexistent sys._jit, with exact stdout comparison.
_parse_install_packages now handles python -m pip install, python3 -m pip install, and path-qualified pip invocations.
_package_to_dict accepts omit_defaults parameter to exclude default-valued fields from output.
run_test_command and install_package now kill the entire process tree on timeout via os.killpg, preventing orphaned grandchild processes from accumulating during batch runs.
RunData.result_for() now uses O(1) dict lookup instead of O(N) linear scan, with lazily-built _results_by_pkg cache.
compare_runs and _compute_status_changes use result_for() instead of building ad-hoc dicts.
Subprocess helpers (build_env, check_jit_enabled, create_venv, validate_target_python) now strip PYTHONHOME/PYTHONPATH via _clean_env() to prevent environment pollution.
CLI warns when only one of --repos-dir/--venvs-dir is set, since the other will use a temporary directory.
update_index_from_packages accepts optional modified_packages set to avoid O(N) disk reads when only a few packages changed.
_PLATFORM_INDICATORS now detects bare linux_x86_64/linux_aarch64 wheels in addition to manylinux/musllinux.
fetch_pypi_metadata and resolve_package accept an optional requests.Session for connection reuse; resolve_all uses shared/thread-local sessions.
_is_import_error_handler no longer treats except Exception as an import error handler, reducing false conditional import flags.
_parse_install_packages uses a regex instead of chained .split() to handle all PEP 440 specifiers (~=, !=, ; markers).
pull_repo uses git fetch + reset --hard FETCH_HEAD + clean -fdx instead of git pull --ff-only, handling dirty working trees left by test suites.
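
The word-boundary fix to _is_version_specific_skip listed above (where "trust" no longer matches the "rust" pattern) can be illustrated with a small sketch. The helper name `matches_keyword` and the sample skip reasons are hypothetical; only the word-boundary technique comes from the entry.

```python
import re


def matches_keyword(reason, keyword):
    """Return True if keyword appears in reason as a whole word (case-insensitive)."""
    # \b anchors the match at word boundaries, so "rust" does not match inside "trust".
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return pattern.search(reason) is not None


# A plain substring test would flag both strings; the word-boundary pattern
# only flags the first.
rust_needed = matches_keyword("Requires a Rust toolchain", "rust")
trust_only = matches_keyword("wheel trust check failed", "rust")
```

Escaping the keyword with re.escape keeps the sketch safe for keywords that contain regex metacharacters.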

Added

--extra-deps option to inject additional packages into every venv after the package's own dependencies.
--test-command-override option to replace the test command for all packages in a run.
--test-command-suffix option to append flags to each package's existing test command.
--repo-override PKG=URL option (repeatable) to test forks or PR branches without modifying registry.
--clone-depth and --no-shallow CLI options to override per-package clone depth; --clone-depth=0 or --no-shallow for full clones.
Per-package git revision support via --packages=pkg@revision syntax; accepts commit hashes, branches, tags, or relative refs like HEAD~10.
checkout_revision helper for checking out specific git refs after cloning.
parse_package_specs function for parsing name@revision package spec syntax.
requested_revision field in PackageResult to distinguish explicitly requested revisions from HEAD.
350 enriched package configurations with full test commands, install commands, and metadata.
Applied skip-to-skip-versions migration to 36 packages (PyO3, maturin, Cython, JIT crashes).
Config fixes for python-dateutil, pyyaml, msgpack, hatchling, openai, numpy, pytz, sqlalchemy, and 3 archived google packages.
registry/migrations.log tracking applied migration history.
labeille registry migrate command with a migration framework for registry schema transformations.
skip-to-skip-versions migration to convert 3.15-specific skip:true entries to skip_versions["3.15"].
Migration log (migrations.log) to track applied migrations and prevent re-application.
Dry-run support for migrations with preview of affected packages.
labeille scan-deps command for static test dependency discovery via AST-based import analysis.
import_map.py module with 100+ import-name-to-pip-package mappings for common mismatches (PIL->Pillow, yaml->PyYAML, etc.).
Three output formats for scan-deps: human-readable (default), JSON, and pip (for direct shell use).
Automatic test directory detection and local module filtering in scan-deps.
Comparison against existing install_command to identify missing deps.
Registry cross-referencing for import_name resolution in scan-deps.
labeille analyze CLI subgroup with five subcommands: registry, run, compare, history, package.
formatting.py shared formatting module (tables, histograms, sparklines, duration, status icons).
analyze.py data loading and analysis module (run data, registry stats, comparison, flaky detection).
labeille registry CLI subgroup for batch registry management (add-field, remove-field, rename-field, set-field, validate, add-index-field, remove-index-field).
Line-level YAML manipulation (yaml_lines.py) preserving exact formatting.
Batch operations module (registry_ops.py) with filtering, atomic writes, and dry-run previews.
Registry validation against the PackageEntry schema with labeille registry validate.
skip_versions registry field for per-Python-version skip reasons (e.g. 3.15: "PyO3 not supported").
--force-run flag to override skip and skip_versions for debugging.
--workers N option for parallel package testing in labeille run.
--workers N option for parallel PyPI resolution in labeille resolve.
Cancellation support for --stop-after-crash in parallel mode.
clone_depth registry field for packages needing git tags (e.g. setuptools-scm).
import_name registry field for packages whose import name differs from PyPI name.
summary.py module for formatting run results.
Enrichment best practices documented in CONTRIBUTING.md.
--refresh-venvs flag to delete and recreate existing venvs, ensuring updated install commands take effect.
Initial project scaffolding.
CLI skeleton with resolve and run subcommands.
Registry schema and data structures.
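
The name@revision syntax accepted by --packages can be sketched as below. This is an illustrative parser, not the project's parse_package_specs: the names `parse_spec_sketch` and `parse_specs_sketch` are hypothetical, and the comma-separated list format and the error handling are assumptions.

```python
from typing import List, Optional, Tuple


def parse_spec_sketch(spec: str) -> Tuple[str, Optional[str]]:
    """Split one 'name' or 'name@revision' spec into (name, revision or None)."""
    name, sep, revision = spec.strip().partition("@")
    if not name:
        raise ValueError(f"missing package name in {spec!r}")
    if sep and not revision:
        raise ValueError(f"missing revision after '@' in {spec!r}")
    # No '@' means no explicit revision was requested.
    return name, (revision if sep else None)


def parse_specs_sketch(value: str) -> List[Tuple[str, Optional[str]]]:
    """Parse a comma-separated list of specs (list format is an assumption)."""
    return [parse_spec_sketch(part) for part in value.split(",") if part.strip()]
```

Returning None for a missing revision mirrors the distinction the requested_revision field draws between an explicitly requested revision and plain HEAD.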

Documentation

Standalone guides for resolve-run workflow, benchmarking, free-threaded testing, and compatibility analysis (doc/workflow.md, doc/benchmarking.md, doc/free-threaded.md, doc/compat.md).
README sections for benchmarking, free-threaded testing, and compatibility analysis features with command examples and links to standalone guides.
Updated README Status and Project Structure sections to reflect bench, ft, and compat features.
Added Anthropic support acknowledgment section to README.md.
Added security warnings to README.md, runner.py module docstring, and CLAUDE.md.
Added Gemini acknowledgment to CREDITS.md.
Comprehensive enrichment guide with manual workflow, troubleshooting, and Claude Code prompts (doc/enrichment.md).
Updated README with enrichment overview and link to guide.
Parallel execution guidance, resource considerations, and ASAN vs non-ASAN trade-offs.

Enhanced

Refactored summary.py to use shared formatters from formatting.py.
Improved repo URL resolution with secondary keys (bug tracker, issues, changelog) and legacy field fallbacks (home_page, download_url).
Run summary shows version-skipped count separately when skip_versions is active.
Progress reporting adapted for parallel execution with per-completion status lines.
Rich end-of-run summary with per-package table, timing stats, and crash details.
Quiet mode shows only crash information; default mode hides passing packages.
Post-install import validation catches broken installs before running tests.
Add --work-dir, --repos-dir, and --venvs-dir options to run for persistent clone/venv directories that survive across runs.
Reuse existing repo clones (pull instead of re-clone) and venvs (skip create+install).
Log repo and venv paths in default output for each package.
Verbose mode (-v) now shows test subprocess stdout/stderr, resolved commands, install output, installed dependency list, git operations, and per-phase timing.
