This commit overhauls our ability to run the stdlib test harness and enables the stdlib test harness in CI for builds that we can run natively in CI.

Previously, `testdist.py` called a `run_tests.py` script that was bundled in the distribution. This script was simply a wrapper that called `python -m test --slow-ci`. And `--slow-ci` currently expands to `--multiprocess 0 --randomize --fail-env-changed --rerun --print-slow --verbose3 -u all --timeout 1200`. This commit effectively inlines `run_tests.py` into `testdist.py` and greatly expands the functionality for running the test harness.

When enabling the stdlib test harness in CI as part of this commit, several test failures were encountered, especially in non-standard builds like `static` and `debug`. Even the `freethreaded` builds encountered a significant number of failures (many of them intermittent), implying that the official CPython CI fails to catch a lot of legitimate test failures.

We want PBS to run stdlib tests to help us catch changes in behavior. And we can only do that if the CI pass/fail signal is high quality: we don't want CI "passing" if there are changes to test pass/fail behavior. Achieving this requires annotating all tests that can potentially fail. The test harness then needs to validate that these annotations are accurate (read: that annotated tests actually fail).

So this commit introduces a `stdlib-test-annotations.yml` file in the root directory. It contains rules that filter on the build configuration and 3 sections that describe specific annotations:

1. Skip running the test harness completely. This is necessary on some builds that are so broken it wasn't worth annotating tests because so many tests failed.
2. Exclude all tests within a given Python module. This is reserved for scenarios where importing the test module fails and causes most/all tests to fail. Again, a mechanism to short-circuit having to annotate every failing test.
3. Expected test failures.
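The commit message doesn't spell out the file's schema, but as a rough illustration, the three kinds of annotations could look something like this (all field, module, and test names below are hypothetical, not the real schema):

```yaml
# Hypothetical sketch of stdlib-test-annotations.yml -- illustrative only.
- rules:
    # This group only applies to builds matching these filters.
    - build: static
      target: x86_64-unknown-linux-musl
  # 1. Skip running the test harness completely on this build.
  skip-all: true

- rules:
    - build: debug
  # 2. Exclude every test in a module whose import fails.
  exclude-modules:
    - test.test_example
  # 3. Individual tests (or glob patterns) expected to fail.
  expected-failures:
    - test: test.test_example2.ExampleTests.test_feature
    - test: "test.test_example3.*"
      intermittent: true
    - test: test.test_example4.OtherTests.test_flaky
      dont-verify: true
```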
Expected test failures are the most common annotation. These annotations describe individual tests, or glob patterns of tests, that are "expected" to fail. Entries can additionally be annotated as "intermittent" or "dont-verify" to allow the test to pass without failing our test harness.

Most of the new code is in support of reading and applying these annotations.

At build time, we read the `stdlib-test-annotations.yml` file and derive a new `stdlib-test-annotations.json` file containing only the active annotations matching the build configuration. This file is included in the build distribution as `python/build/stdlib-test-annotations.json`. It has to be JSON so the Python test harness runner can read the file using just the stdlib.

`test-distributions.py` has gained some new functionality, including the ability to run the stdlib test harness with raw arguments and to emit a JUnit XML file with test results.

One of the things the test harness now does is attempt to ensure that tests annotated as failing actually fail. However, this isn't enforced for tests marked "intermittent" or "dont-verify": you need an asynchronous mechanism looking at historical execution results to assess whether an "intermittent" test really is intermittent. We facilitate this by uploading a JUnit XML artifact with details of test execution. But the mining of historical test results is not implemented. (And I'm not sure it is worth implementing.)

It took dozens of iterations to get a reliably working set of test annotations. There's just a lot of variability across build configurations and Python versions. Despite best efforts, there are likely a few lingering intermittent failures that aren't yet annotated.
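The build-time filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the group/rule structure and field names are assumptions, and the real code reads the YAML file rather than an in-memory list.

```python
import json

def rule_matches(rule: dict, config: dict) -> bool:
    # A rule matches when every key it constrains equals the build
    # configuration's value for that key.
    return all(config.get(key) == value for key, value in rule.items())

def derive_annotations(groups: list, config: dict) -> dict:
    # Collapse all annotation groups whose rules match this build into a
    # single structure destined for stdlib-test-annotations.json.
    active = {"skip-all": False, "exclude-modules": [], "expected-failures": []}
    for group in groups:
        if not any(rule_matches(rule, config) for rule in group.get("rules", [])):
            continue
        active["skip-all"] = active["skip-all"] or group.get("skip-all", False)
        active["exclude-modules"] += group.get("exclude-modules", [])
        active["expected-failures"] += group.get("expected-failures", [])
    return active

# Hypothetical annotation groups, as they might be parsed from the YAML file.
groups = [
    {"rules": [{"build": "debug"}], "exclude-modules": ["test.test_example"]},
    {"rules": [{"build": "static"}], "skip-all": True},
]

active = derive_annotations(groups, {"python": "3.13", "build": "debug"})
# Emit JSON (not YAML) so the in-distribution runner needs only the stdlib.
print(json.dumps(active, indent=2))
```

Keeping the derived artifact as plain JSON is the key design constraint here: the runner executes inside the distribution under test, where third-party packages like PyYAML cannot be assumed to exist.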
Force-pushed from e3cfcef to 6149b24.
Running the stdlib test suite adds a significant amount of time to CI, ~2-20 minutes for each job where the suite is run. Limiting the test suite to a subset of the build options, perhaps only those which are released, would reduce this cost.
CI usage could also be reduced by only testing a subset of targets by default on pull requests (#1075). This comes with a trade-off: less CI usage, but a chance that changes or failures are not detected until after merging. This happened in #1070, which used tags to limit CI targets; only after merge was a validation issue found with riscv64 (#1092).