Free-threaded CPython (PEP 703) removes the Global Interpreter Lock (GIL), enabling true parallelism but introducing new failure modes: data races, deadlocks, and crashes in code that relied on the GIL for thread safety.

`labeille ft` tests packages against free-threaded CPython builds by running each package's test suite multiple times with `PYTHON_GIL=0`. Multiple iterations catch intermittent failures, because race conditions don't reproduce deterministically.

Each package is classified into a compatibility category: fully compatible, compatible with GIL fallback, intermittent failures, incompatible, crash, or deadlock.


Prerequisites:

- A free-threaded CPython build (`./configure --disable-gil`)
- Optional: a TSAN build for race condition detection (`./configure --disable-gil --with-thread-sanitizer`)
- Packages enriched in the registry (see workflow.md)


Basic usage:

    # Run each package 10 times with PYTHON_GIL=0
    labeille ft run --target-python ~/cpython-ft/python \
        --work-dir ~/ft-work \
        --top 50
    # View results
    labeille ft show results/ft_*

Stop after the first passing iteration per package:

    labeille ft run --target-python ~/cpython-ft/python \
        --work-dir ~/ft-work \
        --stop-on-first-pass --top 100

Also run each package with the GIL enabled, for comparison:

    labeille ft run --target-python ~/cpython-ft/python \
        --work-dir ~/ft-work \
        --compare-with-gil --iterations 10 --top 50

Run against a TSAN build and parse its warnings:

    labeille ft run --target-python ~/cpython-tsan/python \
        --work-dir ~/ft-work \
        --tsan --iterations 5 --top 30

Each package is classified based on its behavior across all iterations:

| Category | Description |
|---|---|
| `compatible` | All iterations pass; fully compatible with free-threading |
| `compatible_gil_fallback` | Passes, but the GIL is re-enabled at runtime (C extension without `Py_mod_gil`) |
| `intermittent` | Some iterations pass, some fail; likely race conditions |
| `incompatible` | All iterations fail consistently (not a race, just broken) |
| `crash` | Segfault, abort, or assertion failure under free-threading |
| `deadlock` | Test suite stalls with no output (detected via stall threshold) |
| `tsan_warnings` | TSAN detects data races (may still pass functionally) |
| `install_failure` | Could not install the package |
| `import_failure` | Package installs but fails to import |
| `unknown` | Uncategorizable result |

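
The pass/fail-based rows of the table can be illustrated with a minimal sketch. The function name `classify_outcomes` and the outcome strings are hypothetical, chosen for illustration. labeille's actual classifier may weigh signals differently, and GIL fallback, TSAN warnings, and install or import failures are not modeled here.

```python
def classify_outcomes(outcomes):
    """Map per-iteration outcomes to a category (simplified illustration).

    `outcomes` is a list of strings such as "pass", "fail", "crash", or
    "deadlock". These names are assumptions, not labeille's real schema.
    """
    if not outcomes:
        return "unknown"
    # Assumption: a crash or deadlock in any iteration outranks plain failures.
    if "crash" in outcomes:
        return "crash"
    if "deadlock" in outcomes:
        return "deadlock"
    passes = outcomes.count("pass")
    if passes == len(outcomes):
        return "compatible"
    if passes == 0:
        return "incompatible"
    # A mix of passes and failures points at non-deterministic behavior.
    return "intermittent"


print(classify_outcomes(["pass"] * 10))             # compatible
print(classify_outcomes(["pass", "fail", "pass"]))  # intermittent
print(classify_outcomes(["fail"] * 5))              # incompatible
```
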

Free-threading bugs are often non-deterministic. A single run may pass even if the code has race conditions. The default of 10 iterations provides a reasonable detection probability, assuming each iteration fails independently:

- A race that manifests ~50% of the time will be caught with >99.9% probability in 10 runs
- A race that manifests ~10% of the time will be caught with ~65% probability in 10 runs

Increase `--iterations` for rare races, or use `--stop-on-first-pass` for quick coverage scans where you only need to know whether the suite can pass.

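
The figures above follow from the formula P = 1 - (1 - p)^n, where p is the per-run failure rate and n the number of iterations. The sketch below reproduces them and estimates how many iterations a rarer race would need. It assumes iterations are independent, which real races only approximate, and the function names are illustrative.

```python
import math


def detection_probability(failure_rate, iterations):
    """Probability that at least one of `iterations` runs fails."""
    return 1.0 - (1.0 - failure_rate) ** iterations


def iterations_needed(failure_rate, target_probability):
    """Smallest iteration count that reaches `target_probability`.

    Both arguments must lie strictly between 0 and 1.
    """
    return math.ceil(
        math.log(1.0 - target_probability) / math.log(1.0 - failure_rate)
    )


print(f"{detection_probability(0.5, 10):.4f}")  # 0.9990
print(f"{detection_probability(0.1, 10):.4f}")  # 0.6513
print(iterations_needed(0.01, 0.95))            # 299
```

For a race that shows up in only about 1% of runs, the default of 10 iterations is far too few, which is when raising `--iterations` pays off.
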

`--stall-threshold SECONDS` (default: 60) monitors test suite output. If no output is produced for the threshold duration, the iteration is classified as a deadlock and the process is killed.

Increase the threshold for packages with naturally slow tests (heavy computation, network waits).

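
To make the stall rule concrete, here is a minimal sketch of an output watchdog. It is not labeille's implementation: the function name, the polling interval, and the returned strings are assumptions for illustration.

```python
import subprocess
import sys
import threading
import time


def run_with_stall_detection(command, stall_threshold):
    """Run `command`; kill it if it prints nothing for `stall_threshold` seconds.

    Returns "deadlock" if the process was killed for stalling, otherwise
    "pass" or "fail" depending on the exit code.
    """
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    last_output = [time.monotonic()]

    def pump_output():
        # Every line of output resets the stall clock.
        for _line in process.stdout:
            last_output[0] = time.monotonic()

    reader = threading.Thread(target=pump_output, daemon=True)
    reader.start()

    stalled = False
    while process.poll() is None:
        if time.monotonic() - last_output[0] > stall_threshold:
            stalled = True
            process.kill()
            break
        time.sleep(0.05)  # polling interval, chosen arbitrarily

    process.wait()
    reader.join(timeout=1.0)
    if stalled:
        return "deadlock"
    return "pass" if process.returncode == 0 else "fail"


if __name__ == "__main__":
    silent = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_stall_detection(silent, stall_threshold=1.0))  # deadlock
```

The example also shows why slow but healthy tests can be misclassified: a process that is merely silent for longer than the threshold is indistinguishable from a hung one.
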

`--compare-with-gil` runs each package twice:

- With `PYTHON_GIL=0` (free-threaded)
- With `PYTHON_GIL=1` (GIL enabled)

This isolates free-threading-specific failures: if a package fails with both GIL=0 and GIL=1, the failure is unrelated to free-threading (e.g., a pre-existing test bug). Only failures unique to GIL=0 are true free-threading issues.

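
The comparison logic described above amounts to a small decision table. The sketch below spells it out; the function name and the returned labels are hypothetical and do not correspond to labeille's output.

```python
def attribute_failure(passed_without_gil, passed_with_gil):
    """Decide whether a failure is specific to free-threading.

    Both arguments are booleans describing whether the test suite passed
    with PYTHON_GIL=0 and with PYTHON_GIL=1 respectively.
    """
    if passed_without_gil:
        # Nothing to attribute: the free-threaded run passed.
        return "no free-threading failure"
    if passed_with_gil:
        # Fails only when the GIL is disabled.
        return "free-threading-specific"
    # Fails either way, e.g. a pre-existing test bug.
    return "unrelated to free-threading"


print(attribute_failure(False, True))   # free-threading-specific
print(attribute_failure(False, False))  # unrelated to free-threading
```

Because failures can be intermittent, a single pair of runs may mislead; combining this comparison with several iterations gives a more reliable picture.
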

When `--detect-extensions` is enabled (default), labeille probes each package's C extensions for free-threading compatibility:

- Checks if extensions declare `Py_mod_gil` (the opt-in for free-threading)
- Monitors `sys._is_gil_enabled()` at runtime to detect GIL fallback
- Reports which packages trigger GIL re-enablement due to incompatible extensions

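
The runtime check in the list above can be tried by hand. The following sketch reports whether the running interpreter is a free-threaded build and whether the GIL is currently active. The helper name `gil_status` and the dictionary keys are illustrative; how labeille performs this check internally is not shown here.

```python
import sys
import sysconfig


def gil_status():
    """Describe the GIL state of the running interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; older versions
    # always run with the GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
        # A free-threaded build with an active GIL suggests a fallback
        # (or that PYTHON_GIL=1 / -X gil=1 was set explicitly).
        "gil_fallback": free_threaded_build and gil_enabled,
    }


print(gil_status())
```

Calling `gil_status()` after importing a package's extension modules shows whether one of those imports re-enabled the GIL.
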

Options for `labeille ft run`, grouped by purpose:

    --packages requests,click     # Specific packages
    --top 50                      # Top N by downloads

    --iterations 10               # Measured iterations per package (default: 10)
    --stop-on-first-pass          # Stop after first passing iteration

    --timeout 600                 # Per-iteration timeout in seconds (default: 600)
    --stall-threshold 60          # Seconds without output before deadlock (default: 60)

    --extra-deps "trustme,uvicorn"                     # Inject deps into every venv
    --test-command-suffix "--tb=short"                 # Append to test commands
    --test-command-override "python -m pytest tests/"  # Replace all test commands

    --env PYTHONMALLOC=debug      # Extra env vars (repeatable)
    --env PYTHONTRACEMALLOC=5

`PYTHON_GIL=0` is set automatically; you don't need to specify it.

    --repos-dir ~/repos           # Reuse repo clones
    --venvs-dir ~/venvs           # Reuse venvs
    --results-dir results         # Output directory (default: results)

| Option | Description |
|---|---|
| `--target-python PATH` | Free-threaded Python build (required) |
| `--iterations N` | Iterations per package (default: 10) |
| `--timeout SECONDS` | Per-iteration timeout (default: 600) |
| `--stall-threshold SECONDS` | Deadlock detection threshold (default: 60) |
| `--packages CSV` | Comma-separated package filter |
| `--top N` | Top N packages by downloads |
| `--compare-with-gil` | Also run with GIL enabled for comparison |
| `--stop-on-first-pass` | Stop after first passing iteration |
| `--detect-extensions` | Check extension GIL compatibility (default: on) |
| `--tsan` | Parse TSAN warnings from stderr |
| `--check-stability` | Check system stability before starting |
| `--extra-deps CSV` | Extra dependencies for every venv |
| `--test-command-suffix STR` | Append to test commands |
| `--test-command-override STR` | Override all test commands |
| `--env KEY=VALUE` | Extra environment variables (repeatable) |
| `--registry-dir PATH` | Registry directory (default: `~/.local/share/labeille/registry/`) |
| `--repos-dir PATH` | Persistent repos (default: `repos`) |
| `--venvs-dir PATH` | Persistent venvs (default: `venvs`) |
| `--results-dir PATH` | Output directory (default: `results`) |
| `-v, --verbose` | Detailed output |


Display a summary of free-threading test results:

    labeille ft show results/ft_20260303_140000
    labeille ft show results/ft_20260303_140000 --sort pass_rate
    labeille ft show results/ft_20260303_140000 --limit 20

Shows: system and Python info, compatibility summary (counts per category), and a per-package table with category, pass rate, and iteration details.

Sorting options: `category` (default), `pass_rate`, `name`.


Investigate intermittent failures in detail:

    # Overview of all flaky packages
    labeille ft flaky results/ft_20260303_140000
    # Deep dive on one package
    labeille ft flaky results/ft_20260303_140000 --package urllib3

Shows which tests fail intermittently, failure patterns across iterations, and crash signature consistency.


Check which packages have C extensions and their GIL compatibility status:

    labeille ft compat results/ft_20260303_140000
    labeille ft compat results/ft_20260303_140000 --extensions-only

Reports `Py_mod_gil` declarations, GIL fallback status, and whether extensions are free-threading-safe.


Track free-threading compatibility across CPython releases:

    labeille ft compare results/ft_314a1 results/ft_314b2

Shows:

- Category transitions (e.g., `crash` -> `compatible`)
- Pass rate changes
- New issues and resolved issues

Use case: run the same packages against CPython 3.14a1 and 3.14b2 to track which packages gain or lose free-threading compatibility.

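
The idea of a category transition can be shown with a short sketch that compares two mappings from package name to category. The function name, the placeholder package names, and the input data are invented for illustration; this is not how `labeille ft compare` is implemented.

```python
from collections import Counter


def category_transitions(before, after):
    """Count (old_category, new_category) pairs for packages in both runs."""
    transitions = Counter()
    for package in before.keys() & after.keys():
        old, new = before[package], after[package]
        if old != new:
            transitions[(old, new)] += 1
    return transitions


# Made-up example data, not real results.
before = {"pkg_a": "crash", "pkg_b": "compatible", "pkg_c": "intermittent"}
after = {"pkg_a": "compatible", "pkg_b": "compatible", "pkg_c": "compatible"}

for (old, new), count in sorted(category_transitions(before, after).items()):
    print(f"{old} -> {new}: {count}")
# crash -> compatible: 1
# intermittent -> compatible: 1
```
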

Generate a full compatibility report for sharing:

    labeille ft report results/ft_20260303_140000
    labeille ft report results/ft_20260303_140000 --format markdown -o report.md
    labeille ft report results/ft_20260303_140000 --format text

Export results as CSV or JSON:

    # CSV: one row per package
    labeille ft export results/ft_20260303_140000 --format csv -o results.csv
    # JSON: structured data
    labeille ft export results/ft_20260303_140000 --format json -o results.json

A free-threading run produces:

    results/ft_20260303_140000/
    ├── ft_meta.json        # Run metadata (target Python, config, system info)
    └── ft_results.jsonl    # One JSON line per package with all iterations

`ft_results.jsonl` contains per-package:

- Failure category classification
- Per-iteration outcomes (pass/fail/crash/deadlock, exit code, duration)
- Pass rate and iteration count
- Extension GIL compatibility info (if `--detect-extensions`)
- TSAN warnings (if `--tsan`)
- GIL-enabled comparison results (if `--compare-with-gil`)

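
Since `ft_results.jsonl` holds one JSON object per line, it can be post-processed with a few lines of Python. The sketch below tallies records by an arbitrary field. The exact field names in the file are not documented here, so the field is passed in as a parameter rather than hard-coded, and `count_by_field` is a hypothetical helper.

```python
import json
from collections import Counter


def count_by_field(jsonl_path, field):
    """Tally JSON-lines records by the value of `field`.

    Records that lack the field are counted under "<missing>".
    """
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts[record.get(field, "<missing>")] += 1
    return counts
```

A call such as `count_by_field("results/ft_20260303_140000/ft_results.jsonl", "category")` would give counts per category, assuming the classification is stored under a key named `category`; check a line of the actual file for the real key name first.
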

Quick compatibility scan, summary, and report:

    # Quick scan of top 100 packages
    labeille ft run --target-python ~/cpython-ft/python \
        --work-dir ~/ft-work \
        --stop-on-first-pass --top 100
    # View summary
    labeille ft show results/ft_*
    # Export report for the free-threading compatibility tracker
    labeille ft report results/ft_* --format markdown -o ft-compat-report.md

Investigate a single flaky package:

    # Run many iterations for a specific package
    labeille ft run --target-python ~/cpython-ft/python \
        --work-dir ~/ft-work \
        --packages urllib3 --iterations 50
    # Analyze flakiness
    labeille ft flaky results/ft_* --package urllib3

Compare two CPython releases:

    # Run against 3.14a1
    labeille ft run --target-python ~/cpython-314a1/python \
        --work-dir ~/ft-work --top 50 \
        --results-dir results/ft_314a1
    # Run against 3.14b2
    labeille ft run --target-python ~/cpython-314b2/python \
        --work-dir ~/ft-work --top 50 \
        --results-dir results/ft_314b2
    # Compare
    labeille ft compare results/ft_314a1 results/ft_314b2

Detect data races with TSAN:

    # Use a TSAN-instrumented build
    labeille ft run --target-python ~/cpython-tsan/python \
        --work-dir ~/ft-work \
        --tsan --iterations 5 --top 30
    # Check for packages with tsan_warnings category
    labeille ft show results/ft_* --sort category

Packages classified as `compatible` pass all iterations with `PYTHON_GIL=0`. They are safe to use with free-threaded CPython.


`compatible_gil_fallback` means the package works, but the GIL is re-enabled at runtime because a C extension doesn't declare `Py_mod_gil`. The package is functionally compatible but doesn't benefit from true parallelism.

`intermittent` packages likely have race conditions. Use `ft flaky --package NAME` to see:

- Which specific tests fail and how often
- Whether failures are concentrated in specific tests or spread across the suite
- Whether crash signatures are consistent (same race) or varied (multiple races)

`crash` packages segfault or abort under free-threading. Reproduce with:

    PYTHON_GIL=0 ~/cpython-ft/python -m pytest tests/

If the crash doesn't reproduce, increase iterations; it may be intermittent.

`deadlock` packages stall without producing output. Reproduce manually and attach a debugger to inspect thread states. Common causes: lock ordering inversions, missing lock releases on exception paths, busy-wait loops on GIL-protected state.

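
When a debugger is not at hand, the standard library's `faulthandler` module can dump the Python stack of every thread, which is often enough to see where each one is blocked. The sketch below is one possible approach and not part of labeille; the helper names and the demo scenario are invented for illustration.

```python
import faulthandler
import os
import tempfile
import threading


def dump_thread_states(path):
    """Write the Python traceback of every thread to `path`."""
    with open(path, "w") as handle:
        faulthandler.dump_traceback(file=handle, all_threads=True)


def demo_blocked_thread_dump():
    """Block a worker on a lock, dump all stacks, and return the dump text."""
    lock = threading.Lock()
    started = threading.Event()

    def blocked_worker():
        started.set()
        with lock:  # blocks until the main thread releases the lock
            pass

    lock.acquire()
    worker = threading.Thread(target=blocked_worker)
    worker.start()
    started.wait()  # make sure the worker is running before dumping

    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        dump_thread_states(path)
        with open(path) as handle:
            return handle.read()
    finally:
        lock.release()
        worker.join()
        os.remove(path)


if __name__ == "__main__":
    print(demo_blocked_thread_dump())
```

The dump lists one stack per thread, including the frame of `blocked_worker`. For a test suite that is already hanging, `faulthandler.dump_traceback_later()` can be armed at startup to produce the same kind of dump after a timeout.
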

If results suggest the GIL is still enabled, the free-threaded Python build may not be properly configured. Verify:

    ~/cpython-ft/python -c "import sys; print(sys._is_gil_enabled())"

This should print `False` if the GIL is disabled by default.

If slow packages are being classified as deadlocks, increase `--stall-threshold`. The default of 60 seconds may be too short for heavy computation or network-dependent tests.

TSAN requires a TSAN-instrumented CPython build (`--with-thread-sanitizer`). The `--tsan` flag only tells labeille to parse TSAN output; it doesn't enable TSAN instrumentation in CPython itself.

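
To give a sense of what parsing TSAN output involves, the sketch below counts ThreadSanitizer report headers in captured stderr text. It relies on the header line format that ThreadSanitizer prints (`WARNING: ThreadSanitizer: <kind> (pid=<n>)`). The function name and the sample text are illustrative, and labeille's own parser may extract more detail.

```python
import re
from collections import Counter

# Matches the first line of a ThreadSanitizer report.
TSAN_HEADER = re.compile(
    r"^WARNING: ThreadSanitizer: (?P<kind>.+?) \(pid=\d+\)",
    re.MULTILINE,
)


def count_tsan_warnings(stderr_text):
    """Count TSAN reports in `stderr_text`, grouped by report kind."""
    return Counter(
        match.group("kind") for match in TSAN_HEADER.finditer(stderr_text)
    )


# Hand-written sample for illustration, not output from a real run.
sample = """\
test_pool.py ..
WARNING: ThreadSanitizer: data race (pid=4242)
  Write of size 8 at 0x000000000000 by thread T2:
WARNING: ThreadSanitizer: data race (pid=4242)
  Read of size 8 at 0x000000000000 by main thread:
"""
print(count_tsan_warnings(sample))  # Counter({'data race': 2})
```

If this count stays at zero for every package, the target interpreter is probably not a TSAN-instrumented build.
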