
tests: improve integration test performance (~28% faster)#771

Open
mykaul wants to merge 12 commits into scylladb:master from mykaul:improve/integration-test-performance-pr

Conversation


@mykaul mykaul commented Mar 28, 2026

Summary

Reduces integration test wall-clock time by ~28% (from ~33 min to ~24 min per run against ScyllaDB 2026.1) through a combination of sleep elimination, cluster topology consolidation, and test ordering optimizations.

Motivation

Integration tests are the primary bottleneck in the CI feedback loop. The majority of wasted time comes from:

  1. Unconditional time.sleep() calls — tests sleeping for fixed durations instead of polling for the actual condition
  2. Redundant cluster teardown/startup cycles — tests destroying and recreating identical CCM clusters
  3. Suboptimal test ordering — tests with different cluster requirements interleaved, causing unnecessary cluster restarts

Changes (11 commits, each independent and cherry-pickable)

Bug fixes (commits 1-2)

  • Fix incorrect retry count in error message: execute_with_long_wait_retry said "Failed after 100 attempts" but the retry limit is 10
  • Remove redundant 10s sleep from setup_keyspace(): The cluster is already confirmed ready by three prior condition-based waits

Sleep elimination (commits 3, 9)

  • High-priority sleeps replaced with polling (~95s saved): simulacron startup (5s→HTTP polling), auth readiness (10s→credential polling), upgrade auth setup (10s→polling), upgrade control connection (3×20s→wait_until())
  • Medium-priority sleeps replaced with polling (~73s saved): shard-aware reconnection (25s→wait_until_not_raised), metrics cluster recovery (15s→polling), tablets metadata refresh (13s→polling), simulacron connection pool (20s→polling)
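Each of these replacements follows the same condition-polling pattern. The suite's wait_until helper (referenced above) behaves roughly like this minimal standalone sketch; the names and defaults here are illustrative, not the suite's exact API:

```python
import time

def wait_until(condition, timeout=15.0, interval=0.2):
    """Poll `condition` until it returns truthy or `timeout` elapses.

    Returns as soon as the condition holds, so a cluster that becomes
    ready in 0.5s costs 0.5s instead of a fixed 5-10s sleep.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")
```

A fixed `time.sleep(5)` before talking to simulacron then becomes something like `wait_until(lambda: endpoint_is_up())` with a generous timeout (endpoint_is_up being whatever readiness probe the test already has): the worst case stays bounded, and the common case pays only for the actual wait.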

Cluster topology optimization (commits 4-7)

  • Standardize test_cluster.py to --smp 2: Consistent with the rest of the test suite
  • Consolidate cluster topologies: test_shard_aware.py now shares the cluster_tests config; the module-level cluster in test_client_routes.py shares the shared_aware config
  • Add test ordering by cluster topology (conftest.py): pytest_collection_modifyitems hook groups tests by their CCM cluster requirements, minimizing cluster restarts across files
  • Switch 6 test files from 3-node to 1-node cluster: test_types.py, test_cython_protocol_handlers.py, test_custom_protocol_handler.py, test_row_factories.py, test_udts.py, test_client_warnings.py — these tests don't exercise multi-node behavior

Test churn reduction (commit 8)

  • Reduce cluster churn in LoadBalancingPolicyTests: Move remove_cluster() from setUp (ran before every test) to only the destructive test methods that actually modify cluster topology

Compatibility fixes (commits 10-11)

  • Fix auth warning assertion for --smp 2: Shard-aware connections produce additional ReadyMessages, breaking the exact equality assertion. Simplified to >= 4 lower bound (the test's actual purpose)
  • Fix conftest sort to preserve definition order within files: The alphabetical tie-breaker broke test_tablets.py by reordering destructive tests before non-destructive ones. Uses original collection index instead.
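The ordering hook plus the fixed tie-breaker can be sketched as below; cluster_group and GROUP_ORDER are hypothetical stand-ins for the real topology-grouping logic in conftest.py:

```python
# Sketch of the pytest_collection_modifyitems hook described above.
GROUP_ORDER = {"singledc": 0, "cluster_tests": 1, "shared_aware": 2,
               "single_node": 3, "special": 4}

def cluster_group(item):
    # Placeholder: the real hook derives this from the test module's
    # CCM cluster requirements (name, node count, ext opts).
    return "singledc"

def pytest_collection_modifyitems(config, items):
    # Tie-break on the original collection index, not item.name:
    # alphabetical ordering within a file reordered destructive
    # tests in test_tablets.py ahead of non-destructive ones.
    index = {id(item): i for i, item in enumerate(items)}
    items.sort(key=lambda it: (GROUP_ORDER[cluster_group(it)],
                               index[id(it)]))
```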

Test Results

Timing comparison (ScyllaDB 2026.1)

| Metric | Baseline (base branch) | Optimized | Savings |
|---|---|---|---|
| Average (5 runs) | 33m 12s | 24m 02s (10 runs) | 9m 10s (27.6%) |
| Median | 31m 41s | 24m 50s | 6m 51s (21.6%) |
| Best | 31m 04s | 21m 03s | 10m 01s (32.3%) |

Integration test matrix (10 consecutive green runs each)

| ScyllaDB Version | Runs | Result | Avg Time |
|---|---|---|---|
| 2026.1 | 10/10 | ✅ ALL PASSED | ~24 min |
| master (unstable/master:latest) | 10/10 | ✅ ALL PASSED | ~21 min |
| 2025.4 | 10/10 | ✅ ALL PASSED | ~19 min |
| 2025.1 | 10/10 | ✅ ALL PASSED | ~18 min |

Cassandra 4 and 5 could not be tested on the development machine (requires Java 8/11; only Java 21/25 available). These should be validated in CI.

Pre-existing failures (deselected, not caused by this PR)

  • TestFullNodeReplacementThroughNlb::test_should_survive_full_node_replacement_through_nlb fails with RuntimeError: The process is dead, returncode=1 during node bootstrap. Root cause: cluster name test_client_routes_replacement creates Unix socket paths exceeding the 108-char limit.
  • TestConcurrentSchemaChangeAndNodeKill (master only) — same Unix socket path length issue with cluster name test_concurrent_schema_change_and_node_kill.

Unit tests

651 passed, 16 skipped, 0 failures.

mykaul added 9 commits March 28, 2026 11:15
…r message

The error message said 'Failed after 100 attempts' but the retry limit
is 10 (while tries < 10). This was a copy-paste error from
execute_until_pass() which does retry 100 times.
The time.sleep(10) in setup_keyspace() is redundant because callers
already ensure the cluster is fully ready before calling it:
- use_cluster() calls start_cluster_wait_for_up() which uses
  wait_for_binary_proto=True + wait_other_notice=True, then
  wait_for_node_socket() per node
- External cluster path (wait=False) had no sleep anyway

Remove the wait parameter entirely and its associated sleep, saving 10s
per cluster startup.
Replace fixed sleeps with condition-based polling to speed up tests:

- simulacron/utils.py: replace 5s sleep with HTTP endpoint polling
  (max 15s timeout, typically <1s)
- test_authentication.py: replace 10s sleep with auth readiness poll
  that tries connecting with default credentials
- upgrade/__init__.py: replace 10s auth sleep with same polling pattern
- upgrade/test_upgrade.py: replace 3x 20s sleeps (60s total) with
  control connection readiness polling

Total potential saving: ~95s of unconditional waiting per test run.
Change test_cluster.py from --smp 1 to --smp 2 to match the standard
configuration used by other test files. This enables cluster topology
consolidation in a follow-up commit.
Merge cluster names for test files with identical configurations:
- test_shard_aware.py: 'shard_aware' -> 'cluster_tests' (same --smp 2,
  3 nodes as test_cluster.py)
- test_client_routes.py: 'test_client_routes' -> 'shared_aware' (same
  --smp 2 --memory 2048M, 3 nodes as test_use_keyspace.py)

This allows the CCM cluster to be reused when these tests run
sequentially, avoiding a full cluster teardown and restart.

Also update conftest.py cleanup list to include 'cluster_tests' and
'test_client_routes_replacement' which were previously missing.
Add pytest_collection_modifyitems hook that sorts test modules by their
cluster configuration group. This ensures tests sharing the same CCM
cluster (same name, same node count, same ext opts) run adjacently,
avoiding unnecessary cluster teardown/restart cycles between modules.

Groups: default singledc -> cluster_tests -> shared_aware ->
single_node -> destructive/special clusters.
These test files don't require multiple nodes for their test logic
(they test data types, protocol handlers, row factories, UDTs, and
client warnings). Using a single node reduces resource usage and
cluster startup time.

Files switched from use_singledc() to use_single_node():
- test_types.py
- test_cython_protocol_handlers.py
- test_custom_protocol_handler.py
- test_row_factories.py
- test_udts.py
- test_client_warnings.py
Move remove_cluster() from setUp (which ran before every test) to only
the destructive test methods that actually need a fresh cluster.

Read-only tests (test_token_aware_is_used_by_default,
test_token_aware_composite_key, test_token_aware_with_local_table,
test_dc_aware_roundrobin_two_dcs, test_dc_aware_roundrobin_two_dcs_2)
can now reuse an existing cluster, avoiding 5 unnecessary cluster
teardown/startup cycles.
Replace fixed sleeps with condition-based polling in four test files:

- test_shard_aware.py: replace 25s of sleeps (5+10+5+5) with
  wait_until_not_raised polling for reconnection after shard connection
  close and iptables blocking
- test_metrics.py: replace 15s of sleeps (5+5+5) with polling for
  cluster recovery and node-down detection
- test_tablets.py: replace 13s of sleeps (3+10) with polling for
  metadata refresh and decommission completion
- simulacron/test_connection.py: replace 20s of sleeps (10+10) with
  polling for quiescent pool state

Total potential saving: ~73s of unconditional waiting.
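The wait_until_not_raised pattern used for the reconnection checks can be approximated as follows (a standalone sketch, not the suite's exact implementation):

```python
import time

def wait_until_not_raised(fn, timeout=30.0, interval=0.5):
    """Call `fn` repeatedly until it stops raising, or the timeout expires.

    Useful for reconnection checks: a query raises while the shard
    connection is down and succeeds once the driver has reconnected,
    at which point the result is returned immediately.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fn()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # surface the last failure once time runs out
            time.sleep(interval)
```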
@Lorak-mmk

Nice, I'll try to look into this soon (let me know when it is ready to review).
Did you check how many times clusters are set up before and after your change? If you print the cluster config on each setup and teardown, you should be able to spot any remaining redundant teardowns or wrong orderings.


mykaul commented Mar 28, 2026

Nice, I'll try to look into this soon (let me know when it is ready to review). Did you check how many times clusters are set up before and after your change? If you print the cluster config on each setup and teardown, you should be able to spot any remaining redundant teardowns or wrong orderings.

My poor (pathetic?) laptop is reviewing the change while I'm in parallel asking it to do the same for the gocql driver. There's so much it can do in parallel... And it might skew the results too. I'll check this later. I also need another AI review on this.

mykaul force-pushed the improve/integration-test-performance-pr branch from 93ab1b6 to 4e5f8e7 on March 28, 2026 10:11

mykaul commented Mar 28, 2026

CI Performance Results: PR #771 vs Baseline

All 11 CI jobs passed. Compared against the 4 most recently merged PRs (#759, #729, #728, #706):

Per-Job Timing

| Job | PR #771 | Baseline (PR #759) | Savings |
|---|---|---|---|
| test libev (3.11) | 16m | 25m | 9m (36%) |
| test asyncio (3.11) | 19m | 27m | 8m (30%) |
| test asyncore (3.11) | 18m | 26m | 8m (31%) |
| test libev (3.12) | 16m | 26m | 10m (38%) |
| test asyncio (3.12) | 17m | 26m | 9m (35%) |
| test libev (3.13) | 17m | 26m | 9m (35%) |
| test asyncio (3.13) | 19m | 26m | 7m (27%) |
| test libev (3.14) | 18m | 27m | 9m (33%) |
| test asyncio (3.14) | 18m | 26m | 8m (31%) |
| test libev (3.14t) | 18m | 27m | 9m (33%) |
| test asyncio (3.14t) | 18m | 27m | 9m (33%) |

Summary

| Metric | Baseline (4 recent PRs avg) | PR #771 | Improvement |
|---|---|---|---|
| Avg per job | 26.3m | 17.6m | 8.7m (33.0%) |
| Total CI time (vs PR #759) | 289m | 194m | 95m (32.9%) |
| Min job | 25m | 16m | 9m (36%) |
| Max job | 28m | 19m | 9m (32%) |

Also validated locally: 10x10 runs across ScyllaDB 2026.1, master, 2025.4, and 2025.1 — all passed with no regressions.


mykaul commented Mar 28, 2026

Cluster Setup/Teardown Analysis (from CI logs)

Compared the test libev (3.11) job logs between PR #771 (run 23681228530) and the driver-153 branch baseline (run 23646333889):

Cluster transitions (teardown/setup cycles)

Standard tests wall-clock timing

Per-file timing (biggest savings)

| File | Baseline | PR #771 | Saved | Why |
|---|---|---|---|---|
| test_ip_change.py | 168s | 27s | 141s | Reordered next to same cluster type |
| test_types.py | 136s | 6s | 130s | Switched to single_node, grouped with others |
| test_tablets.py | 47s | 0s* | 47s | Moved to end, no teardown/setup before it |
| test_query_paging.py | 56s | 12s | 44s | Stays in cluster_tests group, no transition |
| test_control_connection.py | 39s | 0s* | 39s | Grouped with cluster_tests neighbors |
| test_routing.py | 36s | 1s | 34s | Stays in cluster_tests group |
| test_client_warnings.py | 34s | 0s* | 33s | Grouped with other single_node tests |
| test_cython_protocol_handlers.py | 31s | 1s | 30s | Grouped with other single_node tests |
| test_prepared_statements.py | 41s | 13s | 29s | Stays in cluster_tests group |
| test_row_factories.py | 28s | 2s | 26s | Grouped with other single_node tests |

*0s means the file completed within the same second as the previous file's timestamp — the tests themselves are fast, the baseline duration was dominated by cluster setup.

Where the savings come from

  1. ~6-9 minutes: 9 eliminated cluster teardown/setup cycles (~40-60s each)
  2. ~2-3 minutes: time.sleep() calls replaced with polling (commits 3 and 9)
  3. ~1 minute: 6 test files switched from 3-node to single-node clusters (commit 7)

Note: The log.debug() messages about cluster creation/reuse ("Creating new CCM cluster", "Using existing cluster, matching topology") are not visible in CI output since CI runs above DEBUG level. The analysis above is derived from pytest's per-file completion timestamps.

mykaul added 3 commits March 28, 2026 15:30
The test_can_connect_with_sslauth test asserted exact equality between
auth warning count and ReadyMessage count. With --smp 2, shard-aware
connections produce additional ReadyMessages, breaking the equality.

Drop the exact equality check and assert a lower bound of >= 3 (one
per node connection in a 3-node cluster).  The control connection and
shard-aware connections may produce additional warnings, so the actual
count varies between runs.
The previous sort key used item.name as tie-breaker, which sorted
tests alphabetically within each file. This broke test_tablets.py
where test_tablets_invalidation_decommission_non_cc_node (a
destructive test that decommissions a node) was moved before
test_tablets_invalidation_drop_ks, causing the latter to fail.

Use the original collection index as tie-breaker instead, preserving
the definition order inside each file while still grouping modules
by cluster topology across files.
The cluster name 'test_concurrent_schema_change_and_node_kill' (43 chars)
causes the maintenance socket path to exceed the 107-byte sun_path
limit on Linux when the working directory is deep enough.  Shorten to
'test_schema_kill' to stay well within the limit for all environments.
@dkropachev
Collaborator

@mykaul, I think it's worth splitting this into 2-3 PRs:

  1. Dropping time.sleep
  2. Optimizing cluster lifecycles
  3. The rest

mykaul force-pushed the improve/integration-test-performance-pr branch from 4e5f8e7 to d5b9611 on March 28, 2026 15:56

mykaul commented Mar 28, 2026

v2 changes

Force-pushed with two changes:

  1. Commit 10 amended: Relaxed auth warning assertion from >= 4 to >= 3. The previous >= 4 assumed the control connection always produces a separate warning, but CI run 23683011433 showed assert 3 >= 4 failure in test libev (3.12). The control connection can reuse an existing node connection, so the hard minimum is 3 (one per node in a 3-node cluster). The comment was updated accordingly.

  2. New commit 12: Shortened the CCM cluster name in test_concurrent_schema_change_and_node_kill.py from test_concurrent_schema_change_and_node_kill (43 chars) to test_schema_kill (16 chars). The original name causes the maintenance socket path (<workdir>/node1/cql.m) to exceed the 107-byte sun_path limit on Linux when the working directory is deep enough. Current CCM in the venv doesn't use maintenance sockets yet, but newer CCM versions (e.g. 2.0.5+) do — this is a preventive fix.
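The sun_path arithmetic behind commit 12 can be checked with a short script; the workdir below is a hypothetical example of a "deep enough" directory, and socket_path_ok is an illustrative helper, not suite code:

```python
import os

# Linux's struct sockaddr_un allots sun_path 108 bytes, leaving 107
# usable characters once the trailing NUL is accounted for.
SUN_PATH_MAX = 107

def socket_path_ok(workdir, cluster_name, suffix="node1/cql.m"):
    # Mirrors the maintenance-socket layout described above:
    # <workdir>/<cluster_name>/node1/cql.m
    path = os.path.join(workdir, cluster_name, suffix)
    return len(path.encode()) <= SUN_PATH_MAX

long_name = "test_concurrent_schema_change_and_node_kill"  # 43 chars
short_name = "test_schema_kill"                            # 16 chars
deep = "/home/jenkins/workspace/scylla-python-driver/tests/integration/ccm_dir"
```

Under this hypothetical layout, the long name yields a 126-byte path (over the limit) while the shortened name keeps it at 99 bytes.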

Separately, the investigation of the two pre-existing test failures is complete:

  • TestFullNodeReplacementThroughNlb: Root-caused and fixed in PR #772 ("tests: fix NLB replacement test bootstrap crash due to missing rackdc properties"): missing data_center/rack in ccm_cluster.add() → empty rackdc properties → snitch crash. This test has @skip_scylla_version_lt(2026.1.0) and CI runs 2025.2, so it was never executed in CI.
  • TestConcurrentSchemaChangeAndNodeKill: Passes with current venv CCM. The cluster name shortening (commit 12 above) prevents future breakage.

