
tests: improve integration test performance (~28% faster)#771

Open
mykaul wants to merge 12 commits into scylladb:master from mykaul:improve/integration-test-performance-pr

Conversation


@mykaul mykaul commented Mar 28, 2026

Summary

Reduces integration test wall-clock time by ~28% (from ~33 min to ~24 min per run against ScyllaDB 2026.1) through a combination of sleep elimination, cluster topology consolidation, and test ordering optimizations.

Motivation

Integration tests are the primary bottleneck in the CI feedback loop. The majority of wasted time comes from:

  1. Unconditional time.sleep() calls — tests sleeping for fixed durations instead of polling for the actual condition
  2. Redundant cluster teardown/startup cycles — tests destroying and recreating identical CCM clusters
  3. Suboptimal test ordering — tests with different cluster requirements interleaved, causing unnecessary cluster restarts

Changes (11 commits, each independent and cherry-pickable)

Bug fixes (commits 1-2)

  • Fix incorrect retry count in error message: execute_with_long_wait_retry said "Failed after 100 attempts" but the retry limit is 10
  • Remove redundant 10s sleep from setup_keyspace(): The cluster is already confirmed ready by three prior condition-based waits

Sleep elimination (commits 3, 9)

  • High-priority sleeps replaced with polling (~95s saved): simulacron startup (5s→HTTP polling), auth readiness (10s→credential polling), upgrade auth setup (10s→polling), upgrade control connection (3×20s→wait_until())
  • Medium-priority sleeps replaced with polling (~73s saved): shard-aware reconnection (25s→wait_until_not_raised), metrics cluster recovery (15s→polling), tablets metadata refresh (13s→polling), simulacron connection pool (20s→polling)
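Each of these replacements follows the same condition-polling pattern. The suite's wait_until helper (referenced above) behaves roughly like this minimal standalone sketch; the names and defaults here are illustrative, not the suite's exact API:

```python
import time

def wait_until(condition, timeout=15.0, interval=0.2):
    """Poll `condition` until it returns truthy or `timeout` elapses.

    Returns as soon as the condition holds, so a cluster that becomes
    ready in 0.5s costs 0.5s instead of a fixed 5-10s sleep.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")
```

A fixed `time.sleep(5)` before talking to simulacron then becomes something like `wait_until(lambda: endpoint_is_up())` with a generous timeout (endpoint_is_up being whatever readiness probe the test already has): the worst case stays bounded, and the common case pays only for the actual wait.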

Cluster topology optimization (commits 4-7)

  • Standardize test_cluster.py to --smp 2: Consistent with the rest of the test suite
  • Consolidate cluster topologies: test_shard_aware.py now shares the cluster_tests config; the module-level cluster in test_client_routes.py shares the shared_aware config
  • Add test ordering by cluster topology (conftest.py): pytest_collection_modifyitems hook groups tests by their CCM cluster requirements, minimizing cluster restarts across files
  • Switch 6 test files from 3-node to 1-node cluster: test_types.py, test_cython_protocol_handlers.py, test_custom_protocol_handler.py, test_row_factories.py, test_udts.py, test_client_warnings.py — these tests don't exercise multi-node behavior

Test churn reduction (commit 8)

  • Reduce cluster churn in LoadBalancingPolicyTests: Move remove_cluster() from setUp (ran before every test) to only the destructive test methods that actually modify cluster topology

Compatibility fixes (commits 10-11)

  • Fix auth warning assertion for --smp 2: Shard-aware connections produce additional ReadyMessages, breaking the exact equality assertion. Simplified to >= 4 lower bound (the test's actual purpose)
  • Fix conftest sort to preserve definition order within files: The alphabetical tie-breaker broke test_tablets.py by reordering destructive tests before non-destructive ones. Uses original collection index instead.
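The ordering hook plus the fixed tie-breaker can be sketched as below; cluster_group and GROUP_ORDER are hypothetical stand-ins for the real topology-grouping logic in conftest.py:

```python
# Sketch of the pytest_collection_modifyitems hook described above.
GROUP_ORDER = {"singledc": 0, "cluster_tests": 1, "shared_aware": 2,
               "single_node": 3, "special": 4}

def cluster_group(item):
    # Placeholder: the real hook derives this from the test module's
    # CCM cluster requirements (name, node count, ext opts).
    return "singledc"

def pytest_collection_modifyitems(config, items):
    # Tie-break on the original collection index, not item.name:
    # alphabetical ordering within a file reordered destructive
    # tests in test_tablets.py ahead of non-destructive ones.
    index = {id(item): i for i, item in enumerate(items)}
    items.sort(key=lambda it: (GROUP_ORDER[cluster_group(it)],
                               index[id(it)]))
```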

Test Results

Timing comparison (ScyllaDB 2026.1)

| Metric | Baseline (base branch) | Optimized | Savings |
|---|---|---|---|
| Average (5 runs) | 33m 12s | 24m 02s (10 runs) | 9m 10s (27.6%) |
| Median | 31m 41s | 24m 50s | 6m 51s (21.6%) |
| Best | 31m 04s | 21m 03s | 10m 01s (32.3%) |

Integration test matrix (10 consecutive green runs each)

| ScyllaDB Version | Runs | Result | Avg Time |
|---|---|---|---|
| 2026.1 | 10/10 | ✅ ALL PASSED | ~24 min |
| master (unstable/master:latest) | 10/10 | ✅ ALL PASSED | ~21 min |
| 2025.4 | 10/10 | ✅ ALL PASSED | ~19 min |
| 2025.1 | 10/10 | ✅ ALL PASSED | ~18 min |

Cassandra 4 and 5 could not be tested on the development machine (requires Java 8/11; only Java 21/25 available). These should be validated in CI.

Pre-existing failures (deselected, not caused by this PR)

  • TestFullNodeReplacementThroughNlb::test_should_survive_full_node_replacement_through_nlb fails with RuntimeError: The process is dead, returncode=1 during node bootstrap. Root cause: cluster name test_client_routes_replacement creates Unix socket paths exceeding the 108-char limit.
  • TestConcurrentSchemaChangeAndNodeKill (master only) — same Unix socket path length issue with cluster name test_concurrent_schema_change_and_node_kill.

Unit tests

651 passed, 16 skipped, 0 failures.

mykaul added 9 commits March 28, 2026 11:15
…r message

The error message said 'Failed after 100 attempts' but the retry limit
is 10 (while tries < 10). This was a copy-paste error from
execute_until_pass() which does retry 100 times.
The time.sleep(10) in setup_keyspace() is redundant because callers
already ensure the cluster is fully ready before calling it:
- use_cluster() calls start_cluster_wait_for_up() which uses
  wait_for_binary_proto=True + wait_other_notice=True, then
  wait_for_node_socket() per node
- External cluster path (wait=False) had no sleep anyway

Remove the wait parameter entirely and its associated sleep, saving 10s
per cluster startup.
Replace fixed sleeps with condition-based polling to speed up tests:

- simulacron/utils.py: replace 5s sleep with HTTP endpoint polling
  (max 15s timeout, typically <1s)
- test_authentication.py: replace 10s sleep with auth readiness poll
  that tries connecting with default credentials
- upgrade/__init__.py: replace 10s auth sleep with same polling pattern
- upgrade/test_upgrade.py: replace 3x 20s sleeps (60s total) with
  control connection readiness polling

Total potential saving: ~95s of unconditional waiting per test run.
Change test_cluster.py from --smp 1 to --smp 2 to match the standard
configuration used by other test files. This enables cluster topology
consolidation in a follow-up commit.
Merge cluster names for test files with identical configurations:
- test_shard_aware.py: 'shard_aware' -> 'cluster_tests' (same --smp 2,
  3 nodes as test_cluster.py)
- test_client_routes.py: 'test_client_routes' -> 'shared_aware' (same
  --smp 2 --memory 2048M, 3 nodes as test_use_keyspace.py)

This allows the CCM cluster to be reused when these tests run
sequentially, avoiding a full cluster teardown and restart.

Also update conftest.py cleanup list to include 'cluster_tests' and
'test_client_routes_replacement' which were previously missing.
Add pytest_collection_modifyitems hook that sorts test modules by their
cluster configuration group. This ensures tests sharing the same CCM
cluster (same name, same node count, same ext opts) run adjacently,
avoiding unnecessary cluster teardown/restart cycles between modules.

Groups: default singledc -> cluster_tests -> shared_aware ->
single_node -> destructive/special clusters.
These test files don't require multiple nodes for their test logic
(they test data types, protocol handlers, row factories, UDTs, and
client warnings). Using a single node reduces resource usage and
cluster startup time.

Files switched from use_singledc() to use_single_node():
- test_types.py
- test_cython_protocol_handlers.py
- test_custom_protocol_handler.py
- test_row_factories.py
- test_udts.py
- test_client_warnings.py
Move remove_cluster() from setUp (which ran before every test) to only
the destructive test methods that actually need a fresh cluster.

Read-only tests (test_token_aware_is_used_by_default,
test_token_aware_composite_key, test_token_aware_with_local_table,
test_dc_aware_roundrobin_two_dcs, test_dc_aware_roundrobin_two_dcs_2)
can now reuse an existing cluster, avoiding 5 unnecessary cluster
teardown/startup cycles.
Replace fixed sleeps with condition-based polling in four test files:

- test_shard_aware.py: replace 25s of sleeps (5+10+5+5) with
  wait_until_not_raised polling for reconnection after shard connection
  close and iptables blocking
- test_metrics.py: replace 15s of sleeps (5+5+5) with polling for
  cluster recovery and node-down detection
- test_tablets.py: replace 13s of sleeps (3+10) with polling for
  metadata refresh and decommission completion
- simulacron/test_connection.py: replace 20s of sleeps (10+10) with
  polling for quiescent pool state

Total potential saving: ~73s of unconditional waiting.
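The wait_until_not_raised pattern used for the reconnection checks can be approximated as follows (a standalone sketch, not the suite's exact implementation):

```python
import time

def wait_until_not_raised(fn, timeout=30.0, interval=0.5):
    """Call `fn` repeatedly until it stops raising, or the timeout expires.

    Useful for reconnection checks: a query raises while the shard
    connection is down and succeeds once the driver has reconnected,
    at which point the result is returned immediately.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fn()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # surface the last failure once time runs out
            time.sleep(interval)
```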
@Lorak-mmk

Nice, I'll try to look into this soon (let me know when it is ready to review).
Did you check how many times clusters are set up before and after your change? If you print the cluster config on each setup and teardown, you should be able to spot any remaining redundant teardowns or wrong orderings.


mykaul commented Mar 28, 2026

Nice, I'll try to look into this soon (let me know when it is ready to review). Did you check how many times clusters are set up before and after your change? If you print the cluster config on each setup and teardown, you should be able to spot any remaining redundant teardowns or wrong orderings.

My poor (pathetic?) laptop is reviewing the change while I'm in parallel asking it to do the same for the gocql driver. There's so much it can do in parallel... And it might skew the results too. I'll check this later. I also need another AI review on this.

mykaul force-pushed the improve/integration-test-performance-pr branch from 93ab1b6 to 4e5f8e7 on March 28, 2026 10:11

mykaul commented Mar 28, 2026

CI Performance Results: PR #771 vs Baseline

All 11 CI jobs passed. Compared against the 4 most recently merged PRs (#759, #729, #728, #706):

Per-Job Timing

| Job | PR #771 | Baseline (PR #759) | Savings |
|---|---|---|---|
| test libev (3.11) | 16m | 25m | 9m (36%) |
| test asyncio (3.11) | 19m | 27m | 8m (30%) |
| test asyncore (3.11) | 18m | 26m | 8m (31%) |
| test libev (3.12) | 16m | 26m | 10m (38%) |
| test asyncio (3.12) | 17m | 26m | 9m (35%) |
| test libev (3.13) | 17m | 26m | 9m (35%) |
| test asyncio (3.13) | 19m | 26m | 7m (27%) |
| test libev (3.14) | 18m | 27m | 9m (33%) |
| test asyncio (3.14) | 18m | 26m | 8m (31%) |
| test libev (3.14t) | 18m | 27m | 9m (33%) |
| test asyncio (3.14t) | 18m | 27m | 9m (33%) |

Summary

| Metric | Baseline (4 recent PRs avg) | PR #771 | Improvement |
|---|---|---|---|
| Avg per job | 26.3m | 17.6m | 8.7m (33.0%) |
| Total CI time (vs PR #759) | 289m | 194m | 95m (32.9%) |
| Min job | 25m | 16m | 9m (36%) |
| Max job | 28m | 19m | 9m (32%) |

Also validated locally: 10x10 runs across ScyllaDB 2026.1, master, 2025.4, and 2025.1 — all passed with no regressions.


mykaul commented Mar 28, 2026

Cluster Setup/Teardown Analysis (from CI logs)

Compared the test libev (3.11) job logs between PR #771 (run 23681228530) and the driver-153 branch baseline (run 23646333889):

Cluster transitions (teardown/setup cycles)

Standard tests wall-clock timing

Per-file timing (biggest savings)

| File | Baseline | PR #771 | Saved | Why |
|---|---|---|---|---|
| test_ip_change.py | 168s | 27s | 141s | Reordered next to same cluster type |
| test_types.py | 136s | 6s | 130s | Switched to single_node, grouped with others |
| test_tablets.py | 47s | 0s* | 47s | Moved to end, no teardown/setup before it |
| test_query_paging.py | 56s | 12s | 44s | Stays in cluster_tests group, no transition |
| test_control_connection.py | 39s | 0s* | 39s | Grouped with cluster_tests neighbors |
| test_routing.py | 36s | 1s | 34s | Stays in cluster_tests group |
| test_client_warnings.py | 34s | 0s* | 33s | Grouped with other single_node tests |
| test_cython_protocol_handlers.py | 31s | 1s | 30s | Grouped with other single_node tests |
| test_prepared_statements.py | 41s | 13s | 29s | Stays in cluster_tests group |
| test_row_factories.py | 28s | 2s | 26s | Grouped with other single_node tests |

*0s means the file completed within the same second as the previous file's timestamp — the tests themselves are fast, the baseline duration was dominated by cluster setup.

Where the savings come from

  1. ~6-9 minutes: 9 eliminated cluster teardown/setup cycles (~40-60s each)
  2. ~2-3 minutes: time.sleep() calls replaced with polling (commits 3 and 9)
  3. ~1 minute: 6 test files switched from 3-node to single-node clusters (commit 7)

Note: The log.debug() messages about cluster creation/reuse ("Creating new CCM cluster", "Using existing cluster, matching topology") are not visible in CI output since CI runs above DEBUG level. The analysis above is derived from pytest's per-file completion timestamps.

mykaul added 3 commits March 28, 2026 15:30
The test_can_connect_with_sslauth test asserted exact equality between
auth warning count and ReadyMessage count. With --smp 2, shard-aware
connections produce additional ReadyMessages, breaking the equality.

Drop the exact equality check and assert a lower bound of >= 3 (one
per node connection in a 3-node cluster).  The control connection and
shard-aware connections may produce additional warnings, so the actual
count varies between runs.
The previous sort key used item.name as tie-breaker, which sorted
tests alphabetically within each file. This broke test_tablets.py
where test_tablets_invalidation_decommission_non_cc_node (a
destructive test that decommissions a node) was moved before
test_tablets_invalidation_drop_ks, causing the latter to fail.

Use the original collection index as tie-breaker instead, preserving
the definition order inside each file while still grouping modules
by cluster topology across files.
The cluster name 'test_concurrent_schema_change_and_node_kill' (43 chars)
causes the maintenance socket path to exceed the 107-byte sun_path
limit on Linux when the working directory is deep enough.  Shorten to
'test_schema_kill' to stay well within the limit for all environments.
@dkropachev
Collaborator

@mykaul, I think it's worth splitting this into 2-3 PRs:

  1. Dropping time.sleep
  2. Optimizing cluster lifecycles
  3. The rest

mykaul force-pushed the improve/integration-test-performance-pr branch from 4e5f8e7 to d5b9611 on March 28, 2026 15:56

mykaul commented Mar 28, 2026

v2 changes

Force-pushed with two changes:

  1. Commit 10 amended: Relaxed auth warning assertion from >= 4 to >= 3. The previous >= 4 assumed the control connection always produces a separate warning, but CI run 23683011433 showed assert 3 >= 4 failure in test libev (3.12). The control connection can reuse an existing node connection, so the hard minimum is 3 (one per node in a 3-node cluster). The comment was updated accordingly.

  2. New commit 12: Shortened the CCM cluster name in test_concurrent_schema_change_and_node_kill.py from test_concurrent_schema_change_and_node_kill (43 chars) to test_schema_kill (16 chars). The original name causes the maintenance socket path (<workdir>/node1/cql.m) to exceed the 107-byte sun_path limit on Linux when the working directory is deep enough. Current CCM in the venv doesn't use maintenance sockets yet, but newer CCM versions (e.g. 2.0.5+) do — this is a preventive fix.
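The sun_path arithmetic behind commit 12 can be checked with a short script; the workdir below is a hypothetical example of a "deep enough" directory, and socket_path_ok is an illustrative helper, not suite code:

```python
import os

# Linux's struct sockaddr_un allots sun_path 108 bytes, leaving 107
# usable characters once the trailing NUL is accounted for.
SUN_PATH_MAX = 107

def socket_path_ok(workdir, cluster_name, suffix="node1/cql.m"):
    # Mirrors the maintenance-socket layout described above:
    # <workdir>/<cluster_name>/node1/cql.m
    path = os.path.join(workdir, cluster_name, suffix)
    return len(path.encode()) <= SUN_PATH_MAX

long_name = "test_concurrent_schema_change_and_node_kill"  # 43 chars
short_name = "test_schema_kill"                            # 16 chars
deep = "/home/jenkins/workspace/scylla-python-driver/tests/integration/ccm_dir"
```

Under this hypothetical layout, the long name yields a 126-byte path (over the limit) while the shortened name keeps it at 99 bytes.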

Separately, the investigation of the two pre-existing test failures is complete:

  • TestFullNodeReplacementThroughNlb: Root-caused and fixed in PR #772 ("tests: fix NLB replacement test bootstrap crash due to missing rackdc properties"): missing data_center/rack in ccm_cluster.add() → empty rackdc properties → snitch crash. This test has @skip_scylla_version_lt(2026.1.0) and CI runs 2025.2, so it was never executed in CI.
  • TestConcurrentSchemaChangeAndNodeKill: Passes with current venv CCM. The cluster name shortening (commit 12 above) prevents future breakage.

