Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[Unreleased]

Added

  • labeille tsan-run command for running extension test suites under ThreadSanitizer-enabled free-threaded Python. Captures data race reports for use with ft-review-toolkit's tsan-report-analyzer agent. Includes --test-script for running custom concurrent stress tests instead of the package's test suite, --quick mode for CI, --stress N for repeated runs, and bundled CPython TSan suppressions.
  • labeille cext-build command for generating compile_commands.json compilation databases from C extension packages. Uses Bear to intercept compiler invocations during the build, with automatic fallback to build-system-specific mechanisms for Meson and CMake projects.
  • CextBuildConfig, CextBuildResult, CextBuildMeta dataclasses in cext_build.py.
  • detect_bear() for auto-detecting bear installation and version.
  • detect_build_system() for identifying the build system from project files (meson, cmake, setuptools, flit, hatch, pdm).
  • extract_build_requires() for reading [build-system].requires from pyproject.toml to support --no-build-isolation builds.
  • find_compile_db() for searching repo and build directories for generated compilation databases.
  • postprocess_compile_db() for fixing source file paths in generated compilation databases.
  • Output includes per-package compile_commands.json, repo symlinks, build logs, and JSONL result summaries.
  • Designed for integration with cext-review-toolkit Tier 2 analysis (clang-tidy).

Enhanced

  • Extract _build_run_config from run_cmd in cli.py to separate config validation/construction from CLI execution.
  • Extract _md_table helper from export_registry_report_md in analyze_cli.py to deduplicate 8 markdown table constructions.
  • Extract _prepare_source, _classify_install_result, _check_import_result from _survey_package in compat.py, reducing complexity from 23 branches to ~8 per function.
  • Split bench_cli.py:compare into _compare_intra_run, _compare_cross_run, and _report_anomalies, eliminating CC=62.
  • Extract _collect_package_data, _compute_trends, _classify_trends, _build_median_dicts from analyze_series_trends in bench/trends.py (CC=48 → ~10 per helper).
  • Extract _infer_field_type from batch_set_field in registry_ops.py to deduplicate type inference.
  • Type FTRunMeta.system_profile and python_profile as SystemProfile and PythonProfile instead of dict[str, Any].
  • Extract _validate_package_file from validate_registry in registry_ops.py (CC=59 → ~15 per function).
  • Extract SEVERITY_LABELS constant from 4 inline duplications in bench/display.py and bench/export.py.
  • Extract _make_anomaly helper in bench/anomaly.py to deduplicate 9 identical PackageAnomaly constructions in detect_condition_anomalies.
  • bench run and ft run now use shared setup_logging() for consistent log formatting.
  • Simplify _run_package_inner in runner.py by extracting 4 helper functions: _align_sdist_version, _setup_venv, _install_in_venv, _check_import.
  • Split _check_import_and_extras into _check_import and _install_extra_deps for single-responsibility.
  • Rename compat diff to compat compare for CLI vocabulary consistency.
  • Extract parse_env_pairs() and parse_csv_list() CLI helpers to eliminate duplicated parsing across cli.py, registry_cli.py, compat_cli.py, bench_cli.py, and ft_cli.py.
  • Decompose generate_registry_report (190 lines) into focused helpers: _accumulate_package_stats, _analyze_compat_blockers, _analyze_per_version, _analyze_download_tiers.
  • Decompose _print_run_summary (180 lines) into 7 section formatters for readability.
  • Unify run ID generation with generate_run_id() in io_utils.py — all subsystems now use UTC with consistent format.
  • Extract run_in_process_group() into io_utils.py as a shared subprocess lifecycle utility, used by both runner.py and bench/timing.py.
  • bench/results.py append_package_result now uses shared append_jsonl instead of raw open().
  • Replace RepoHostStats, InstallComplexity, CompatBlockers dataclasses with dict[str, int] counters on RegistryReport, eliminating getattr/setattr usage.
  • Replace Any type annotations with concrete types in ft/export.py, ft/compare.py, ft/display.py.
  • Simplify _build_bench_config by splitting into _build_base_config and _apply_config_overrides, eliminating locals() dict unpacking via @click.pass_context.
  • Use clean_env() in bench/system.py instead of inline env dict construction for pattern consistency.
  • Use append_jsonl() in migrations.py instead of raw open()/json.dumps for pattern consistency.

Fixed

  • Fix bench compare Click decorator chain pointing at wrong function after complexity split, restoring multi-directory comparison and --metric option.
  • Log PyPI metadata extraction failures in ft/runner.py instead of silently swallowing (KeyError, TypeError).
  • Wrap load_package in filter_packages with try/except to skip bad YAML instead of crashing the entire batch run.
  • Add logging to bisect.py git helpers (_resolve_rev, _get_commit_info) on failure.
  • Add OSError handler to _check_import_result in compat.py for missing venv python.
  • Use dataclass_from_dict for ErrorMatch deserialization in compat.py for forward compatibility.
  • Remove stale RunOutput as RunOutput re-export from runner.py.
  • Narrow FieldFilter.op to Literal type in registry_ops.py.
  • Wrap checkout_matching_tag git fetch --tags in try/except to prevent unhandled TimeoutExpired/OSError crashes.
  • Make load_ft_run lenient for missing ft_results.jsonl (returns empty list like load_bench_run).
  • Add missing installer_backend field to analyze.PackageResult to match runner_models.PackageResult.
  • Narrow IndexEntry.extension_type from str to Literal["pure", "extensions", "unknown"].
  • Narrow PackageResult.install_from and installer_backend from str to Literal types.
  • Fix --no-shallow / --clone-depth=0 to produce actual full clones (no --depth flag) instead of silently defaulting to depth=1.
  • Add warning logs to 4 silent return-None paths in repo_ops.py (clone_repo, pull_repo, checkout_revision, fetch_latest_pypi_version).
  • Promote pull_repo git reset/clean failure logs from DEBUG to WARNING for visibility.
  • Log warning for corrupted run_meta.json in analyze.py instead of silently returning empty metadata.
  • Add ignore_errors=True to shutil.rmtree in bench/runner.py venv refresh to prevent crashes on permission errors.
  • Standardize exception chaining in 3 bench_cli.py load_series handlers, replacing "from None" with "from exc".
  • Type FTPackageResult.extension_compat as ExtensionCompat | None instead of dict[str, Any] | None, eliminating untyped dict access across ft subsystem.
  • Replace _setup_package return type dict[str, Any] with typed _PackageSetup dataclass in bench/runner.py.
  • Log ValueError/OSError in ft/runner.py stderr and stdout reader threads instead of silently ignoring.
  • Remove thin extract_minor_version wrapper from analyze.py; callers now use io_utils.extract_minor_version directly.
  • Replace ProgressCallback = Any with Callable[[BenchProgress], None] | None in bench/runner.py.
  • Replace list[Any] with list[PackageEntry] in bench/runner.py _run_sequential and _run_interleaved.
  • Replace cond: Any with BenchConditionResult in bench/compare.py _collect_call_durations.
  • Replace Any return/param types with BenchConfig in bench_cli.py _build_base_config and _apply_config_overrides.
  • Add Literal types for BisectStep.status, BisectConfig.installer, PackageAnomaly.anomaly_type/severity, ValidationError.severity, BenchConfig.installer, and analyze.PackageResult.status.
  • Log warning in runner.py:list_installed_packages when pip exits with non-zero code instead of silently returning empty.
  • Raise RegistrySchemaError in check_registry_schema when schema.yaml exists but is corrupt instead of silently ignoring.
  • Set import_ok = False in ft/runner.py when extension compat check raises instead of leaving default True.
  • Remove dead hasattr(result, "returncode") guard in bench/runner.py; install_with_fallback always returns CompletedProcess.
  • Use parse_env_pairs in ft_cli.py and bench_cli.py instead of inline env parsing that silently drops malformed pairs.
  • Use load_jsonl in migrations.py:read_migration_log instead of hand-rolled JSONL parsing.
  • Add missing docstrings to 8 to_dict methods in ft/analysis.py, ft/compare.py, ft/compat.py.
  • Replace 21 assertTrue(len(...)) with assertGreater/assertGreaterEqual in tests for better failure messages.
  • Convert mkdtemp() to TemporaryDirectory() in test_ft_compat.py for consistency.
  • Type pkg: Any parameters as PackageEntry in bench/runner.py and ft/runner.py for type safety.
  • Fix IndexEntry.package attribute access (should be .name) in ft/runner.py:_select_packages.
  • Add missing encoding="utf-8" to bench_cli.py export write_text() call.
  • Import check OSError now sets install_error status instead of silently continuing.
  • Standardize exception variable naming in ft/compat.py, registry_ops.py, and registry_cli.py, changing "as e:" to "as exc:" (5 instances).
  • Raise compat survey clone/build failure log levels from debug to warning for visibility.
  • Rename from_dict parameter d to data in CompatResult and CompatMeta for consistency.
  • Bisect extra deps install failure now returns skip step instead of silently continuing.
  • Narrow except Exception to specific types in analyze_cli.py and ft/compat.py.
  • Add warning logs for silent failures in runner.py (JIT check), scan_deps.py (dir scan), and bisect.py (deps install).
  • Extension probe script now reports walk_error and skipped_modules instead of silently passing.
  • Guard all yaml.safe_load call sites against YAMLError: add safe_load_yaml utility to io_utils.py, protect registry.py, registry_ops.py, analyze_cli.py, migrations.py, and bench/config.py.
  • Add 120-second timeout to git subprocess calls in registry sync to prevent indefinite hangs.
  • Add ignore_errors=True to shutil.rmtree in venv refresh to prevent cleanup failures masking results.
  • Log PermissionError at WARNING in kill_process_group instead of silently ignoring.
  • Add from None to exception chains in bench_cli.py for cleaner tracebacks.
  • Include exception details in connection error log in resolve.py.
  • Guard rename in shield_source_dir finally block against missing source file.
  • Add load_json_file utility to io_utils.py and guard 5 unprotected json.loads call sites against JSONDecodeError.
  • Add JSONDecodeError to exception handling in bench/system.py Python profile probe.
  • Add KeyError handling for missing probe output fields in ft/compat.py.
  • Include exception details in schema.yaml parse warning in registry.py.
  • Catch ValueError from malformed tracking.json in bench_cli.py track commands.
  • Surface skipped-package counts in ft/runner.py, bench/runner.py, and bench/tracking.py so users know when results are incomplete.
  • Extract dataclass_from_dict utility to io_utils.py and deduplicate 13 identical from_dict implementations across bench/ and ft/ modules.
  • Use atomic_write_text for ft/results.py JSONL output and runner.py summary file to prevent corruption on interruption.
  • Remove 3 unnecessary private re-exports (_EXTRAS_RE, _SELF_INSTALL_RE, _TAG_PATTERNS) from runner.py.
  • Move dataclass_from_dict imports from method bodies to module level across 9 files.
  • Deduplicate extract_minor_version: move canonical implementation to io_utils.py, delegate from runner.py and analyze.py.
  • Remove deprecated extract_python_minor_version wrapper from runner.py; callers in cli.py and compat_cli.py now use io_utils.extract_minor_version directly.
  • Delegate format_signal_name in formatting.py to crash.signal_name instead of duplicating signal conversion logic.
  • Replace inline JSON write in bench/tracking.py with write_meta_json.
  • Guard _read_lines call sites in registry_ops.py against OSError/UnicodeDecodeError to prevent batch operations from crashing on corrupt files.
  • Promote list_installed_packages error log from info to warning in runner.py.
  • Promote iter_jsonl per-line error log from debug to warning in io_utils.py.
  • Guard build log write_text in compat.py against OSError to prevent filesystem errors from aborting runs.
  • Catch CalledProcessError and TimeoutExpired from git clone/fetch in bisect.py to provide clear error messages.
  • Include exception details in malformed series warning in bench/tracking.py.
  • Replace getattr with direct attribute access on typed dataclass objects in analyze.py, summary.py, bench/runner.py, and ft/runner.py.
  • Migrate registry_cli.py from sys.exit(1) to raise click.ClickException(...) for validation errors and sync failures, matching the rest of the CLI surface.
  • Extract _HOST_LABELS and _INSTALL_LABELS to module-level constants in analyze_cli.py, removing duplicates between terminal and markdown formatters.
  • Migrate RunMeta.from_dict and PackageResult.from_dict in analyze.py to use dataclass_from_dict utility.
  • Use safe_load_yaml in migrations.py instead of duplicating the YAML load-and-validate pattern.
  • Wire up setup_logging for --verbose flags in analyze registry, analyze run, and registry sync commands that previously accepted but ignored the flag.
  • Add load_yaml_strict utility to io_utils.py and migrate 4 inline YAML load-and-validate patterns in registry.py, compat.py, and bench/config.py.
  • Use atomic_write_text for bench/results.py JSONL batch write and remove unused to_jsonl_line/from_jsonl_line methods.
  • Use dataclass_from_dict in BenchIteration.from_dict instead of inline field-filtering reimplementation.
  • Use load_json_file in bench/tracking.py load_series instead of inline JSON parsing.
  • Add exc_info=True to 14 log.error() calls inside except blocks across runner.py, resolve.py, bench/runner.py, and ft/runner.py to preserve tracebacks for debugging.
  • Explicitly set import_ok = False in ft/compat.py JSONDecodeError, KeyError, and OSError handlers for robustness.
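
Several entries above replace hand-written from_dict methods with a shared dataclass_from_dict utility. One plausible shape for such a helper is sketched below: it filters the input to the fields the dataclass declares, so records written by a newer version with extra keys still load. The signature and the ErrorMatchSketch class are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, TypeVar

T = TypeVar("T")


def dataclass_from_dict(cls: type[T], data: dict[str, Any]) -> T:
    """Build cls from data, ignoring keys that are not init fields of cls."""
    if not is_dataclass(cls):
        raise TypeError(f"{cls!r} is not a dataclass")
    known = {f.name for f in fields(cls) if f.init}
    return cls(**{key: value for key, value in data.items() if key in known})


@dataclass
class ErrorMatchSketch:
    """Hypothetical stand-in for a result record; not the real ErrorMatch."""

    pattern: str
    category: str = "unknown"


record = {"pattern": "PyFrame_*", "category": "removed_c_api", "added_later": 1}
match = dataclass_from_dict(ErrorMatchSketch, record)
```

Unknown keys are dropped, while a missing required field still raises TypeError from the dataclass constructor.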

Documentation

  • Add missing help= text to 21 CLI options across cli.py, analyze_cli.py, ft_cli.py, compat_cli.py, and bench_cli.py.

Tests

  • Add 11 tests for compat survey execution pipeline: _prepare_source, _classify_install_result, _check_import_result.
  • Add 5 tests for run_ft orchestrator and _select_packages (was 0% coverage).
  • Add 3 tests for _survey_package integration (build_ok, build_fail, timeout paths).
  • Add 3 tests for _align_sdist_version (source mode, sdist with tag, sdist without tag).
  • Add 26 tests for bench track subcommands: init, add, show, pin, unpin, list, trend, alert.
  • Add CLI tests for registry subcommands: rename-field, set-field (--all, --where, --packages), validate (--packages filter), migrate (list, unknown, missing name), sync (clone, pull, failure, non-git), add-index-field, remove-index-field, and group help.
  • Add test_cli_utils.py with 14 tests for parse_env_pairs and parse_csv_list.
  • Add 12 behavioral tests for ft compare, ft report, and ft export CLI commands.
  • Add test_repo_ops.py with 31 tests for clone_repo, pull_repo, checkout_revision, parse_package_specs, and parse_repo_overrides.
  • Add 35 tests to test_io_utils.py for load_yaml_strict, iter_jsonl, load_jsonl, append_jsonl, dataclass_from_dict, extract_minor_version, generate_run_id, and write_meta_json.
  • Add 11 tests for kill_process_group and run_in_process_group covering process group kills, timeout handling, and fallback to proc.kill().
  • Add test_cli.py with 21 tests for resolve, run, bisect, and scan-deps CLI commands covering parameter validation, error paths, and output formatting.

Documentation

  • Update CLAUDE.md: fix stale test count (546 → 2068) and add 13 missing modules to architecture section.

Added

  • labeille analyze registry now generates a comprehensive three-tier report: summary (default), detailed (--detail), and verbose (--detail --verbose).
  • --export-markdown flag for analyze registry generates a Markdown document suitable for inclusion in a repository.
  • -o/--output flag to write report output to a file.
  • --python-version is now repeatable for multi-version analysis.
  • New report sections: enrichment progress, per-version readiness, compatibility blockers (PyO3, Cython, Meson, CMake, Fortran, removed C API), repository hosting distribution, install command complexity, and download tier coverage.
  • RegistryReport dataclass with sub-reports: EnrichmentProgress, VersionAnalysis, RepoHostStats, InstallComplexity, CompatBlockers, DownloadTierCoverage.
  • generate_registry_report() function for comprehensive registry analysis in a single pass.
  • Helper classifiers: _classify_repo_host(), _classify_install_complexity(), _classify_compat_blocker().
  • labeille registry sync command to clone or update the laruche registry into the default location.
  • Schema version checking via schema.yaml at the registry root. labeille checks this at load time and gives an actionable error if the registry schema is incompatible.
  • default_registry_dir() and LABEILLE_REGISTRY_DIR environment variable for configuring the registry location.
  • RegistrySchemaError exception for incompatible registry schemas.
  • --adaptive flag for labeille bench run: stop iterating early when wall-time measurements converge (RSE below threshold).
  • --adaptive-threshold option (default 0.005 = 0.5% RSE) and --adaptive-min-iterations option (default 5) for fine-tuning convergence behavior.
  • adaptive, adaptive_threshold, and adaptive_min_iterations fields in YAML benchmark profiles.
  • converged_early field on BenchConditionResult, recorded in bench_results.jsonl.
  • relative_standard_error() function in bench/stats.py.
  • Adaptive convergence support in all three execution strategies (block, alternating, interleaved).
  • Quick mode (--quick) now enables adaptive convergence by default.
  • Convergence indicators in benchmark display: checkmark in table, count in quality summary, config line.
  • --trust-ft-wheels flag for labeille ft run: packages with free-threaded wheels (cpXYt ABI tag) for the target Python version are classified as compatible_by_wheel without running tests.
  • --trust-ft-wheels-any-version flag for labeille ft run: like --trust-ft-wheels but trusts free-threaded wheels built for any Python version. Implies --trust-ft-wheels.
  • COMPATIBLE_BY_WHEEL category in FailureCategory with symbol and severity 0.
  • has_ft_wheel() function in classifier.py to detect free-threaded wheels in PyPI release metadata.
  • ft_wheel_found and ft_wheel_version fields on FTPackageResult for provenance tracking.
  • trust_ft_wheels and trust_ft_wheels_any_version fields on FTRunConfig, recorded in ft_meta.json.
  • FT wheel check reuses PyPI metadata for sdist version lookup when both --trust-ft-wheels and --install-from sdist are active.
  • --install-from {source|sdist} option for labeille run and labeille ft run: install packages from PyPI source distributions while running tests from cloned git repos.
  • Sdist version alignment: fetch_latest_pypi_version() queries PyPI, checkout_matching_tag() aligns the repo to the matching release tag.
  • Source directory shielding: shield_source_dir() temporarily renames flat-layout source dirs to prevent local imports from shadowing the sdist-installed package.
  • Install command splitting: split_install_command() and build_sdist_install_commands() separate self-install segments from test dependency segments.
  • install_from, sdist_version, and sdist_tag_matched fields on PackageResult, FTPackageResult, and analysis PackageResult.
  • labeille compat command group for C extension compatibility surveys: survey, show, diff, and patterns subcommands.
  • ~30 built-in error classification patterns across 10+ categories (removed_c_api, cython_incompatible, pyo3_incompatible, numpy_c_api, missing_system_lib, etc.).
  • YAML-based custom error pattern support with override semantics.
  • Survey diff for tracking regressions and fixes between Python versions.
  • Markdown export for sharing compatibility survey results.
  • Optional uv integration for faster venv creation and package installation via --installer flag (auto/uv/pip).
  • InstallerBackend enum, detect_uv(), resolve_installer(), and _rewrite_install_command() in runner.py.
  • install_with_fallback() for automatic pip fallback when uv install fails.
  • installer_backend field on PackageResult and in run metadata.
  • --installer CLI option on run, bench run, and bisect commands.
  • Scaled registry from ~350 to 1500 packages: 720 active, 362 skip_versions (3.15 blockers), 418 fully skipped, with 86.4% working test harness coverage.
  • 5-tier test directory detection in _auto_detect_test_dirs(): standard dirs (t/, spec/), package-named/internal dirs, monorepo subdirs, root-level test files, and scattered test files in package source.
  • Multi-forge URL normalization via _normalize_forge_url() supporting GitHub, GitLab, Bitbucket, and Codeberg.
  • Expanded extract_repo_url() with all-values project_urls scan and description field scanning as last resort.
  • recover-no-tests-found and recover-no-repo-url registry migrations for recovering falsely skipped packages.
  • trends.py module in bench subpackage with PackageTrend, RegressionAlert, and SeriesTrend dataclasses for longitudinal benchmark analysis.
  • compute_package_trend() with configurable regression/trend thresholds and sustained-change detection.
  • analyze_series_trends() for full series analysis: loads all runs, computes per-package trends, generates regression alerts.
  • Five alert types: new_regression, sustained_regression, recovery, new_instability, new_improvement.
  • labeille bench track trend command for viewing trend analysis with table, CSV, and Markdown output.
  • labeille bench track alert command for viewing regression alerts since the last run.
  • export_trend_markdown() and export_trend_csv() in bench/export.py for trend report generation.
  • format_series_trend() and format_regression_alerts() in bench/display.py for terminal output.
  • tracking.py module in bench subpackage with TrackingSeries and TrackingRunEntry dataclasses for longitudinal benchmark tracking.
  • compute_config_fingerprint() for comparing benchmark configurations across runs (ignores package list and system profile).
  • Series management: init_series(), add_run_to_series(), pin_baseline(), unpin_baseline(), load_series(), save_series(), list_series().
  • labeille bench track CLI subgroup with init, add, show, list, pin, and unpin commands.
  • Symlink-based run storage within tracking directories to avoid data duplication.
  • constraints.py module in bench subpackage with ResourceConstraints dataclass and ulimit/taskset command wrapping.
  • Resource constraints as part of the condition abstraction: --memory-limit, --cpu-affinity, and --cpu-time-limit CLI flags for labeille bench run.
  • Per-condition constraint specification in YAML benchmark profiles.
  • Inline condition constraint parsing (e.g. --condition "constrained:memory_limit=1024,cpu_affinity=0+1").
  • OOM detection via detect_oom_from_result() with new "oom" iteration status.
  • constraints_applied and oom_detected fields on BenchIteration.
  • constraints field on ConditionDef for per-condition resource limits.
  • cache.py module in bench subpackage for filesystem cache management during benchmarks.
  • --drop-caches flag for labeille bench run to drop filesystem caches between iterations for cold-cache benchmarking.
  • --warm-vs-cold flag to automatically compare warm-cache and cold-cache performance.
  • --run-dangerously-as-root safety flag — labeille refuses to run as root without it.
  • labeille bench setup-cache-drop command showing setup instructions for the sudoers-based cache-dropping helper.
  • generate_drop_caches_script() and format_setup_instructions() helpers for cache-drop setup.
  • caches_dropped field on BenchIteration to record cache state per iteration.
  • Per-test timing capture via pytest --durations=0 output, enabled with --per-test-timing flag on labeille bench run.
  • TestTiming and PerTestTimings dataclasses in bench/timing.py with pytest output parser.
  • compare_per_test() in bench/compare.py for per-test overhead analysis between conditions.
  • --per-test <package> option for labeille bench show and labeille bench compare to display per-test timing breakdown.
  • anomaly.py module in bench subpackage with PackageAnomaly and AnomalyReport dataclasses for proactive measurement-quality assessment.
  • detect_anomalies() with five anomaly types: high_cv, bimodal, outlier_heavy, status_mixed, and trend.
  • is_bimodal() gap-analysis heuristic and has_monotonic_trend() Spearman rank correlation for pure-Python distribution analysis.
  • --anomalies flag for labeille bench show to display measurement anomaly report.
  • Anomaly summary in labeille bench compare output when anomalies are detected.
  • ## Anomalies section in Markdown export when anomalies are present.
  • ft/compare.py module with PackageTransition, FTComparisonResult dataclasses and compare_ft_runs() for cross-run comparison: category transitions, pass rate changes, new/resolved crashes, aggregate deltas.
  • ft/export.py module with export_csv(), export_json(), and generate_report() for CSV, JSON, and markdown report export of free-threading results.
  • ft/display.py module with terminal formatting for free-threading results: compatibility summaries, package tables, flakiness profiles, triage lists, GIL comparison reports, and progress output.
  • ft_cli.py module with labeille ft CLI subgroup: run, show, compare, compat, flaky, report, export commands for free-threading compatibility testing.
  • ft/analysis.py module with FlakyTest, FlakinessProfile, GILComparisonResult, TriageEntry, DurationAnomaly, and FTAnalysisReport dataclasses for free-threading result analysis.
  • analyze_flakiness() for detailed flakiness profiling with failure pattern classification and consecutive streak detection.
  • compare_gil_modes() for GIL-enabled vs free-threaded result comparison.
  • prioritize_triage() severity-scored triage with extension and TSAN bonuses.
  • detect_duration_anomalies() using statistical outlier detection from bench/stats.py.
  • analyze_ft_run() full analysis pipeline producing FTAnalysisReport.
  • ft/runner.py module with FTRunConfig, OutputMonitor, run_single_iteration(), run_package_ft(), and run_ft() for free-threading test execution with crash/deadlock/TSAN detection and pytest output parsing.
  • ft/results.py module with FailureCategory enum, IterationOutcome, FTPackageResult, FTRunMeta, and FTRunSummary dataclasses for free-threading result storage, categorization, and JSONL/JSON serialization.
  • categorize_package() priority-ordered classification (install failure > import failure > deadlock > crash > TSAN > GIL fallback > compatible > incompatible > intermittent).
  • ft subpackage with compat.py module for extension GIL compatibility detection: runtime probe via sys._is_gil_enabled() and source scan for Py_mod_gil declarations.
  • ExtensionInfo, SourceScanResult, ModGilDeclaration, and ExtensionCompat dataclasses with JSON serialization.
  • probe_gil_fallback() runtime probe, scan_source_for_mod_gil() source scanner, assess_extension_compat() combined assessment, and format_extension_compat() display helper.
  • guess_import_name() with _IMPORT_NAME_OVERRIDES table for PyPI-to-import name resolution.
  • bench subpackage with system.py module for capturing system profiles (CPU, RAM, OS, disk) and Python interpreter profiles (version, JIT/GIL state, build flags) for benchmark reproducibility.
  • SystemProfile, PythonProfile, StabilityCheck, and SystemSnapshot dataclasses with JSON serialization and terminal display formatting.
  • check_stability() pre-benchmark validation (load average, available RAM).
  • stats.py module in bench subpackage with pure-Python statistical functions: describe(), welch_ttest(), cohens_d(), bootstrap_ci(), detect_outliers(), and compute_overhead() for benchmark comparison.
  • DescriptiveStats, TTestResult, EffectSize, BootstrapCI, and OverheadResult dataclasses with scipy fallback for t-test p-values.
  • timing.py module in bench subpackage with run_timed() and run_timed_in_venv() for capturing wall time, CPU time (via resource.getrusage delta), and peak RSS (via GNU /usr/bin/time with ru_maxrss fallback).
  • results.py module in bench subpackage with BenchIteration, BenchConditionResult, BenchPackageResult, ConditionDef, and BenchMeta dataclasses for the full benchmark result hierarchy, plus JSONL/JSON serialization via save_bench_run(), load_bench_run(), and append_package_result().
  • config.py module in bench subpackage with BenchConfig dataclass, YAML profile loading, inline condition parsing, test command resolution, environment/deps merging, and configuration validation.
  • runner.py module in bench subpackage with BenchRunner class orchestrating the full benchmark lifecycle: system profiling, stability checks, package setup (clone/venv/install per condition), timed iteration execution, and incremental JSONL result writing. Supports block, alternating, and interleaved execution strategies with progress callbacks.
  • BenchProgress dataclass and quick_config() helper for rapid iteration during development.
  • display.py module in bench subpackage with terminal formatting for benchmark results: per-package timing tables, multi-condition comparison tables with overhead/CI/significance, measurement quality summaries, and aggregate comparison summaries.
  • compare.py module in bench subpackage with structured comparison analysis: PackageOverhead and ComparisonReport dataclasses with anomaly flags (high CV, status mismatch, outliers), compare_conditions() for within-run comparison, and compare_runs() for cross-run comparison.
  • export.py module in bench subpackage with CSV (long-format per-iteration and summary) and Markdown export for external analysis tools and reports.
  • bench_cli.py module with labeille bench CLI subgroup: run (execute benchmarks from profiles or inline conditions), show (display saved results), compare (compare conditions within or across runs), system (print system characterization), and export (CSV/Markdown/CSV-summary export).
  • labeille bisect command to binary-search a package's git history and find the first commit that introduced a crash.
  • bisect.py module with BisectConfig, BisectStep, BisectResult dataclasses and the run_bisect algorithm with skip-neighbor handling for unbuildable commits.
  • Commit-aware run comparison: analyze compare and analyze run show git commit changes alongside status changes with heuristic annotations (e.g. "unchanged — likely a CPython/JIT regression").
  • PackageComparison dataclass with commit_changed/commit_unchanged properties for per-package comparison data.
  • New crash summary statistics in compare output showing repo unchanged/changed/unknown counts.
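
The adaptive convergence entries above can be illustrated with a small sketch. It assumes the usual definition of relative standard error (standard error of the mean divided by the mean) and reuses the documented defaults of 0.005 and 5 iterations. The has_converged helper is hypothetical, and the real relative_standard_error() in bench/stats.py may handle edge cases differently.

```python
import math
import statistics


def relative_standard_error(samples: list[float]) -> float:
    """Standard error of the mean divided by the absolute mean.

    Returns infinity when fewer than two samples exist or the mean is zero,
    so callers never treat such data as converged.
    """
    n = len(samples)
    if n < 2:
        return math.inf
    mean = statistics.fmean(samples)
    if mean == 0:
        return math.inf
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    return standard_error / abs(mean)


def has_converged(
    wall_times: list[float],
    threshold: float = 0.005,  # 0.5% RSE, the documented default
    min_iterations: int = 5,  # documented default for --adaptive-min-iterations
) -> bool:
    """Hypothetical stopping rule: enough iterations and RSE below threshold."""
    if len(wall_times) < min_iterations:
        return False
    return relative_standard_error(wall_times) < threshold


stable = [10.0, 10.01, 9.99, 10.0, 10.02]
noisy = [10.0, 12.0, 8.0, 11.0, 9.0]
```

Here has_converged(stable) is true, while has_converged(noisy) and has_converged(stable[:4]) are false, so a benchmark would keep iterating in the latter two cases.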

Changed

  • The package registry has been moved to its own repository: laruche. Use labeille registry sync to fetch it.
  • Default --registry-dir changed from registry/ (local) to ~/.local/share/labeille/registry/ (user-level). Override with LABEILLE_REGISTRY_DIR env var.
  • All documentation updated to reflect the split. Enrichment docs now live in laruche.

Enhanced

  • Added Literal types to 6 string-constrained fields: IterationOutcome.status, RunnerConfig.installer/install_from, PackageEntry.extension_type/install_method/test_framework. Mypy now catches invalid values at type-check time.
  • Extracted shared kill_process_group() into io_utils.py, replacing 3 independent implementations in runner.py, bench/timing.py, and ft/runner.py. The FT runner now correctly uses os.getpgid() and signal.SIGKILL instead of raw os.killpg(pid, 9).
  • Extracted _build_bench_config() helper from bench_cli.run, reducing the command body from ~130 lines to ~15 lines. Organized 35 Click options into labeled sections (profile, execution, package selection, paths, stability, adaptive, advanced).
  • Added utc_now_iso() helper to io_utils.py, unifying 15+ timestamp generation sites across 8 modules to a single canonical UTC format with Z suffix.
  • Registry save_index() and save_package() now use atomic_write_text() for crash-safe writes, preventing corruption of the most sensitive files.
  • Added Literal types to PackageResult.status, ResolveResult.action, BenchIteration.status, and EffectSize.classification so that typos are caught at type-check time.
  • Made DescriptiveStats, TTestResult, EffectSize, BootstrapCI, OverheadResult, and CrashInfo dataclasses frozen (immutable).
  • Narrowed 23 except Exception handlers in bench/system.py to _PROBE_ERRORS tuple, preventing accidental swallowing of programming errors while preserving best-effort system probing.
  • Added logging to 8 previously silent exception handlers in ft/runner.py, bisect.py, cli.py, runner.py, and bench/timing.py.
  • Added error handling to save_crash_stderr() with mkdir for the crashes directory.
  • Fixed ScanResult forward reference in cli.py: it now uses a TYPE_CHECKING import, removing 4 suppression markers and 2 runtime asserts.
  • Standardized --output options to click.Path(path_type=Path) in ft_cli.py and analyze_cli.py.
  • Added type=int to ft run --timeout for consistency with other commands.
  • Added docstrings to 7 undocumented public APIs in compat.py (properties, to_dict, from_dict).
  • Fixed cli.py module docstring to list all 9 subcommand groups.
  • Added test_logging.py (8 tests) and test_io_utils.py (10 tests) for previously untested foundation modules.
  • Derived _KNOWN_FIELDS and _FIELD_TYPES in registry_ops.py from PackageEntry dataclass metadata, preventing drift.
  • Added PackageResult.to_dict() method, simplifying append_result() from 22-field manual dict to asdict().
  • Eliminated redundant _atomic_write wrappers in registry_ops.py and migrations.py.
  • Added encoding="utf-8" to bench/runner.py mid-run metadata write.
  • Extracted shared atomic_write_text() utility in io_utils.py, replacing duplicate implementations in registry_ops.py, migrations.py, and bench/tracking.py.
  • Promoted _clean_env() to public API as clean_env(), replacing inline env sanitization in ft/compat.py and bench/runner.py.
  • Unified logger acquisition: all 20 bench/ and ft/ modules now use get_logger() with per-module names (e.g., bench.runner, ft.runner) for filterable log output.
  • Added encoding="utf-8" to all bench/ file I/O calls for cross-platform consistency.
  • Added show_default=True to --timeout options in bench_cli.py and ft_cli.py.
  • Standardized click.Path(path_type=Path) across bench_cli.py and ft_cli.py, removing manual Path() wrapping.
  • Added write_meta_json(), append_jsonl(), load_jsonl(), and iter_jsonl() utilities to io_utils.py, unifying persistence patterns across all four subsystems (runner, bench, ft, compat).
  • All meta.json writes now use atomic_write_text() for crash safety (previously only registry files were atomic).
  • All JSONL loads now use streaming iteration with error tolerance for malformed trailing lines.
  • Extracted _ensure_repo(), _run_install(), and _analyze_test_result() from _run_package_inner() in runner.py, reducing the 420-line monolith to ~180 lines and eliminating duplicated install error handling.
  • Narrowed remaining except Exception blocks: ft/runner.py:397 to OSError, ft/runner.py:700 to (TimeoutExpired, SubprocessError, OSError), bisect.py to (FileNotFoundError, OSError), cli.py to (OSError, ValueError, KeyError, TypeError).
  • Merged duplicate load_package() calls in bisect.py into a single call with warning-level logging.
  • Raised get_installed_packages logging from debug to info for better diagnostics.
  • Added from __future__ import annotations to all test files for consistency with source modules.
  • labeille analyze registry shows percentages alongside all counts.
  • Download tier coverage (top 100, 500, 1000, 2000) shows what fraction of the most-downloaded packages are testable.
  • Version readiness section shows per-Python-version active/skipped counts with top skip reasons.
  • --format counts is preserved as a backward-compatible alias for the summary format.
  • Updated 63 registry packages with accurate skip reasons from compat analysis: cleared vague 3.15 skip_versions, added specific failure categories (Meson, CMake, removed APIs), reclassified non-3.15 issues as skip with precise reasons, and unskipped 5 packages that now build on 3.15.
  • BenchRunner._run_iteration() applies resource constraints via command wrapping before execution.
  • BenchRunner.run() now checks for root execution and refuses unless --run-dangerously-as-root is passed.
  • macOS support for system profiling: CPU info from sysctl, memory from vm_stat, OS from sw_vers, disk type from diskutil. All existing Linux code paths preserved unchanged.
  • check_stability() and SystemSnapshot.capture() now use cross-platform _get_available_ram_gb() helper instead of Linux-only /proc/meminfo.
  • format_system_profile() no longer hardcodes "Linux" in the OS line; shows platform-appropriate output.
  • Switched build backend from setuptools to hatchling for better src layout support and lighter build dependencies.
  • Added minimum version pins to runtime dependencies (click>=8.0, pyyaml>=6.0, requests>=2.28).
  • Added py.typed marker for PEP 561 type checker support.
  • Added sdist/wheel exclusion rules to keep distribution lean (no tests, registry, results, or docs).
  • Added Installation section to README with pipx, pip, and from-source instructions.
  • Added Environment :: Console and Topic :: Software Development :: Quality Assurance classifiers.
  • Renamed Issues URL key to Bug Tracker in project metadata for PyPI display consistency.
  • Replaced raise SystemExit(130) # noqa: B904 with raise SystemExit(130) from None in bench_cli.py, removing the suppression.
  • Narrowed except Exception in bench/system.py JIT detection to except (AttributeError, TypeError).
  • Added explanatory comment for except BaseException in io_utils.py atomic write.
  • Narrowed 5 except Exception catches in bench/runner.py and 1 in ft/runner.py to specific exception tuples (OSError, subprocess.SubprocessError, ValueError, KeyError), removing all noqa: BLE001 suppressions.
  • Restructured cli.py subgroup registration into _register_subcommands() function, eliminating 5 noqa: E402 suppressions.
  • Fixed type: ignore[operator] in ft/display.py by adding explicit None guard, in bench_cli.py by restructuring conditionals, and type: ignore[no-any-return] in resolve.py by using intermediate variable.
  • Extracted 7 helper functions from run_package_ft in ft/runner.py (360→~80 lines): _check_ft_wheel_trust, _clone_and_align_ft, _create_venv_and_install_ft, _install_sdist_mode, _install_source_mode, _run_ft_iterations, _run_gil_comparison.
  • Split runner.py (1972→1445 lines): extracted data models to runner_models.py and git/repo/sdist operations to repo_ops.py, with re-exports preserving all existing imports.
  • Reduced type: ignore markers in test files from 59 to 2 by typing builder kwargs as Any, using assert x is not None for type narrowing, adding explicit generic parameters, and typing mock parameters as MagicMock.
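
Several entries above concern crash-safe writes and tolerant JSONL loading. The sketch below shows one common way to get both behaviors. The function names are deliberately different from the project's atomic_write_text() and iter_jsonl(), whose actual signatures are not shown in this changelog. The sketch also skips every malformed line, which is an assumption broader than the "malformed trailing lines" wording above.

```python
import json
import os
import tempfile
from pathlib import Path


def write_text_atomically(path, text, encoding="utf-8"):
    """Write text so readers see either the old file or the complete new one."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        # Remove the temp file even on KeyboardInterrupt, then re-raise.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def iter_jsonl_tolerant(path):
    """Yield one parsed object per line, skipping blank and malformed lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a truncated last line from an interrupted run
```

Writing to a sibling temp file and renaming it means an interrupted run leaves the previous file intact instead of a half-written one.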

Removed

  • Dead code: _log2() from bisect.py, RegistryStats/analyze_registry() from analyze.py (superseded by RegistryReport/generate_registry_report()), load_ft_summary() from ft/results.py, format_progress()/format_gil_comparison() from ft/display.py, unused _MOD_GIL_MENTION_PATTERN regex from ft/compat.py.
  • Removed 12 unused log = get_logger(...) variables and their imports from modules that had logger scaffolding but no log calls.

Fixed

  • Narrowed 3 broad except Exception handlers: ft/compat.py GIL probe (to ImportError, OSError, AttributeError), submodule import loop (to ImportError, OSError), and registry.py schema parsing (to yaml.YAMLError, OSError). Programming errors now propagate instead of being silently swallowed.
  • bench/timing.py now sanitizes subprocess environment via clean_env(), preventing PYTHONHOME/PYTHONPATH pollution in benchmark runs.
  • ft/runner.py now uses start_new_session=True instead of deprecated preexec_fn=os.setpgrp, fixing a thread-safety hazard with ThreadPoolExecutor.
  • Top-level error handlers in runner.py, ft/runner.py, and resolve.py now preserve tracebacks via exc_info=True.
  • Replaced SystemExit(1) with click.ClickException in bench_cli.py for consistent error formatting.
  • Manual review of all 1,798 enriched registry packages: corrected invalid extension_type values, added missing -p no:xdist flags, fixed inconsistent skip states, corrected repo URLs, collapsed multiline YAML, and fixed test commands.
  • update_index_from_packages() no longer crashes when skip_versions is None.
  • Bench runner install_package now receives a complete environment (starting from os.environ) instead of bare condition env vars, fixing install failures when build backends need PATH to find tools like git.
  • run_meta.json now stores actual CLI argument strings (sys.argv[1:]) instead of parameter names, making runs reproducible from metadata.
  • build_reproduce_command uses export PATH for venv activation instead of fragile .venv/bin/ prefix string replacement.
  • Deduplicated _signal_name (3 copies → format_signal_name in formatting.py).
  • Deduplicated _result_detail (3 copies → public result_detail in analyze.py).
  • Made _extract_minor_version public as extract_minor_version in analyze.py.
  • Removed redundant zero-check in compare_runs duration percentage calculation.
  • Fixed timeout documentation (300s → 600s) in doc/enrichment.md.
  • _quote_yaml_scalar now quotes all numeric strings (integers, scientific notation, octal-like), tilde, and additional YAML special characters.
  • find_field_extent no longer consumes trailing blank lines after scalar fields, fixing insert_field_after placement near blank lines.
  • Rewrote batch_set_field to use line-level manipulation instead of PyYAML round-trip, preserving YAML formatting.
  • Added set_field_value to yaml_lines.py for in-place field value replacement.
  • format_yaml_value and parse_default_value now handle None/null values.
  • _is_version_specific_skip now uses word-boundary regex patterns to prevent false positives (e.g. "trust" no longer matches the "rust" pattern).
  • scan-deps now warns about namespace packages (google, azure, zope, etc.) where pip resolution is uncertain, and tries full import paths before falling back to top-level modules.
  • IndexEntry now tracks skip_versions_keys for fast version-skip filtering without loading full YAML files.
  • filter_packages uses index-level skip_versions_keys to skip packages before loading YAML.
  • _dict_to_package coerces notes: null to empty string for type safety.
  • _dict_to_package logs unknown YAML keys at debug level to surface typos.
  • validate_registry checks uses_xdist/-p no:xdist consistency in both directions.
  • check_jit_enabled now uses explicit sys.flags.jit check instead of nonexistent sys._jit, with exact stdout comparison.
  • _parse_install_packages now handles python -m pip install, python3 -m pip install, and path-qualified pip invocations.
  • _package_to_dict accepts omit_defaults parameter to exclude default-valued fields from output.
  • run_test_command and install_package now kill the entire process tree on timeout via os.killpg, preventing orphaned grandchild processes from accumulating during batch runs.
  • RunData.result_for() now uses O(1) dict lookup instead of O(N) linear scan, with lazily-built _results_by_pkg cache.
  • compare_runs and _compute_status_changes use result_for() instead of building ad-hoc dicts.
  • Subprocess helpers (build_env, check_jit_enabled, create_venv, validate_target_python) now strip PYTHONHOME/PYTHONPATH via _clean_env() to prevent environment pollution.
  • CLI warns when only one of --repos-dir/--venvs-dir is set, since the other will use a temporary directory.
  • update_index_from_packages accepts optional modified_packages set to avoid O(N) disk reads when only a few packages changed.
  • _PLATFORM_INDICATORS now detects bare linux_x86_64/linux_aarch64 wheels in addition to manylinux/musllinux.
  • fetch_pypi_metadata and resolve_package accept an optional requests.Session for connection reuse; resolve_all uses shared/thread-local sessions.
  • _is_import_error_handler no longer treats except Exception as an import error handler, reducing false conditional import flags.
  • _parse_install_packages uses a regex instead of chained .split() to handle all PEP 440 version specifiers (~=, !=) and PEP 508 environment markers (introduced by ;).
  • pull_repo uses git fetch + reset --hard FETCH_HEAD + clean -fdx instead of git pull --ff-only, handling dirty working trees left by test suites.
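
Two of the fixes above (start_new_session=True and killing the whole process tree on timeout) work together. The following is a rough POSIX-only sketch of that pattern. The function name and return shape are hypothetical, and it is not the code used by run_test_command or install_package.

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout):
    """Run cmd in its own session; on timeout, kill its whole process group.

    Returns (returncode, stdout, stderr, timed_out). POSIX only, because it
    relies on os.killpg and os.getpgid.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,  # child leads a new session and process group
    )
    try:
        out, err = proc.communicate(timeout=timeout)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        try:
            # Signal the group, so grandchildren started by the command die too.
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the timeout and the kill
        out, err = proc.communicate()  # reap the child and drain its pipes
        return proc.returncode, out, err, True
```

Because the child leads its own group, the signal does not reach the parent process, and start_new_session=True avoids running Python code between fork and exec, which is the hazard of preexec_fn in threaded programs.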

Added

  • --extra-deps option to inject additional packages into every venv after the package's own dependencies.
  • --test-command-override option to replace the test command for all packages in a run.
  • --test-command-suffix option to append flags to each package's existing test command.
  • --repo-override PKG=URL option (repeatable) to test forks or PR branches without modifying registry.
  • --clone-depth and --no-shallow CLI options to override per-package clone depth; --clone-depth=0 or --no-shallow for full clones.
  • Per-package git revision support via --packages=pkg@revision syntax; accepts commit hashes, branches, tags, or relative refs like HEAD~10.
  • checkout_revision helper for checking out specific git refs after cloning.
  • parse_package_specs function for parsing name@revision package spec syntax.
  • requested_revision field in PackageResult to distinguish explicitly requested revisions from HEAD.
  • 350 enriched package configurations with full test commands, install commands, and metadata.
  • Applied skip-to-skip-versions migration on 36 packages (PyO3, maturin, Cython, JIT crashes).
  • Config fixes for python-dateutil, pyyaml, msgpack, hatchling, openai, numpy, pytz, sqlalchemy, and 3 archived google packages.
  • registry/migrations.log tracking applied migration history.
  • labeille registry migrate command with a migration framework for registry schema transformations.
  • skip-to-skip-versions migration to convert 3.15-specific skip:true entries to skip_versions["3.15"].
  • Migration log (migrations.log) to track applied migrations and prevent re-application.
  • Dry-run support for migrations with preview of affected packages.
  • labeille scan-deps command for static test dependency discovery via AST-based import analysis.
  • import_map.py module with 100+ import-name-to-pip-package mappings for common mismatches (PIL->Pillow, yaml->PyYAML, etc.).
  • Three output formats for scan-deps: human-readable (default), JSON, and pip (for direct shell use).
  • Automatic test directory detection and local module filtering in scan-deps.
  • Comparison against existing install_command to identify missing deps.
  • Registry cross-referencing for import_name resolution in scan-deps.
  • labeille analyze CLI subgroup with five subcommands: registry, run, compare, history, package.
  • formatting.py shared formatting module (tables, histograms, sparklines, duration, status icons).
  • analyze.py data loading and analysis module (run data, registry stats, comparison, flaky detection).
  • labeille registry CLI subgroup for batch registry management (add-field, remove-field, rename-field, set-field, validate, add-index-field, remove-index-field).
  • Line-level YAML manipulation (yaml_lines.py) preserving exact formatting.
  • Batch operations module (registry_ops.py) with filtering, atomic writes, and dry-run previews.
  • Registry validation against the PackageEntry schema with labeille registry validate.
  • skip_versions registry field for per-Python-version skip reasons (e.g. 3.15: "PyO3 not supported").
  • --force-run flag to override skip and skip_versions for debugging.
  • --workers N option for parallel package testing in labeille run.
  • --workers N option for parallel PyPI resolution in labeille resolve.
  • Cancellation support for --stop-after-crash in parallel mode.
  • clone_depth registry field for packages needing git tags (e.g. setuptools-scm).
  • import_name registry field for packages whose import name differs from PyPI name.
  • summary.py module for formatting run results.
  • Enrichment best practices documented in CONTRIBUTING.md.
  • --refresh-venvs flag to delete and recreate existing venvs, ensuring updated install commands take effect.
  • Initial project scaffolding.
  • CLI skeleton with resolve and run subcommands.
  • Registry schema and data structures.
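
The --packages=pkg@revision syntax listed above can be illustrated with a short parser. This is only a sketch: the function names, the comma-separated input, and the dict return type are assumptions, and the project's own parse_package_specs may behave differently.

```python
def split_spec(spec):
    """Split "name@revision" into (name, revision); revision is None if absent."""
    # partition() splits at the first "@", so revisions such as "HEAD~10" pass through.
    name, sep, revision = spec.strip().partition("@")
    if not name:
        raise ValueError(f"missing package name in {spec!r}")
    if sep and not revision:
        raise ValueError(f"missing revision after '@' in {spec!r}")
    return name, (revision or None)


def parse_specs(value):
    """Parse a comma-separated list of specs into a {name: revision} dict."""
    specs = {}
    for item in value.split(","):
        if item.strip():
            name, revision = split_spec(item)
            specs[name] = revision
    return specs
```

A revision of None stands for "use the default HEAD", which mirrors the distinction the requested_revision field is described as recording.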

Documentation

  • Standalone guides for resolve-run workflow, benchmarking, free-threaded testing, and compatibility analysis (doc/workflow.md, doc/benchmarking.md, doc/free-threaded.md, doc/compat.md).
  • README sections for benchmarking, free-threaded testing, and compatibility analysis features with command examples and links to standalone guides.
  • Updated README Status and Project Structure sections to reflect bench, ft, and compat features.
  • Added Anthropic support acknowledgment section to README.md.
  • Added security warnings to README.md, runner.py module docstring, and CLAUDE.md.
  • Added Gemini acknowledgment to CREDITS.md.
  • Comprehensive enrichment guide with manual workflow, troubleshooting, and Claude Code prompts (doc/enrichment.md).
  • Updated README with enrichment overview and link to guide.
  • Parallel execution guidance, resource considerations, and ASAN vs non-ASAN trade-offs.

Enhanced

  • Refactored summary.py to use shared formatters from formatting.py.
  • Improved repo URL resolution with secondary keys (bug tracker, issues, changelog) and legacy field fallbacks (home_page, download_url).
  • Run summary shows version-skipped count separately when skip_versions is active.
  • Progress reporting adapted for parallel execution with per-completion status lines.
  • Rich end-of-run summary with per-package table, timing stats, and crash details.
  • Quiet mode shows only crash information; default mode hides passing packages.
  • Post-install import validation catches broken installs before running tests.
  • Added --work-dir, --repos-dir, and --venvs-dir options to run for persistent clone/venv directories that survive across runs.
  • Existing repo clones are now reused (pull instead of re-clone), as are existing venvs (create+install is skipped).
  • Default output now logs the repo and venv paths for each package.
  • Verbose mode (-v) now shows test subprocess stdout/stderr, resolved commands, install output, installed dependency list, git operations, and per-phase timing.