tests: improve integration test performance (~28% faster) #771
mykaul wants to merge 12 commits into scylladb:master
Conversation
The error message said 'Failed after 100 attempts', but the retry limit is 10 (while tries < 10). This was a copy-paste error from execute_until_pass(), which does retry 100 times.
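The corrected pattern, as a minimal sketch (the helper name follows the commit message; the body and its `delay`/`max_tries` parameters are illustrative, not the driver's actual code):

```python
import time

def execute_with_long_wait_retry(fn, delay=1.0, max_tries=10):
    """Retry fn() up to max_tries times, sleeping between attempts."""
    tries = 0
    last_exc = None
    while tries < max_tries:
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            tries += 1
            time.sleep(delay)
    # The message must report the actual bound, not the 100 used
    # by the separate execute_until_pass() helper.
    raise AssertionError(f"Failed after {max_tries} attempts") from last_exc
```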
The time.sleep(10) in setup_keyspace() is redundant because callers already ensure the cluster is fully ready before calling it:
- use_cluster() calls start_cluster_wait_for_up(), which uses wait_for_binary_proto=True + wait_other_notice=True, then wait_for_node_socket() per node
- The external cluster path (wait=False) had no sleep anyway

Remove the wait parameter entirely and its associated sleep, saving 10s per cluster startup.
Replace fixed sleeps with condition-based polling to speed up tests:
- simulacron/utils.py: replace 5s sleep with HTTP endpoint polling (max 15s timeout, typically <1s)
- test_authentication.py: replace 10s sleep with an auth readiness poll that tries connecting with default credentials
- upgrade/__init__.py: replace 10s auth sleep with the same polling pattern
- upgrade/test_upgrade.py: replace 3x 20s sleeps (60s total) with control connection readiness polling

Total potential saving: ~95s of unconditional waiting per test run.
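The polling approach can be sketched as a small generic helper (the name `wait_for` and its parameters are hypothetical; the suite's own utility may differ):

```python
import time

def wait_for(condition, timeout=15.0, interval=0.2):
    """Poll condition() until it returns True or the timeout expires.

    Replaces a fixed time.sleep(): returns as soon as the condition
    holds (typically well under a second) instead of always paying
    the worst-case wait.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")
```

A caller would pass the actual readiness check, e.g. `wait_for(lambda: simulacron_http_ready(host, port), timeout=15)` where `simulacron_http_ready` is whatever probe the test defines.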
Change test_cluster.py from --smp 1 to --smp 2 to match the standard configuration used by other test files. This enables cluster topology consolidation in a follow-up commit.
Merge cluster names for test files with identical configurations:
- test_shard_aware.py: 'shard_aware' -> 'cluster_tests' (same --smp 2, 3 nodes as test_cluster.py)
- test_client_routes.py: 'test_client_routes' -> 'shared_aware' (same --smp 2 --memory 2048M, 3 nodes as test_use_keyspace.py)

This allows the CCM cluster to be reused when these tests run sequentially, avoiding a full cluster teardown and restart. Also update the conftest.py cleanup list to include 'cluster_tests' and 'test_client_routes_replacement', which were previously missing.
Add pytest_collection_modifyitems hook that sorts test modules by their cluster configuration group. This ensures tests sharing the same CCM cluster (same name, same node count, same ext opts) run adjacently, avoiding unnecessary cluster teardown/restart cycles between modules. Groups: default singledc -> cluster_tests -> shared_aware -> single_node -> destructive/special clusters.
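A sketch of such a hook, assuming a hypothetical `CLUSTER_GROUP` mapping from module name to topology group; the original collection index serves as a stable tie-breaker so in-file order is preserved:

```python
# Hypothetical grouping: module name -> cluster-topology group rank.
CLUSTER_GROUP = {
    "test_cluster": 1,        # cluster_tests
    "test_shard_aware": 1,
    "test_use_keyspace": 2,   # shared_aware
    "test_client_routes": 2,
    "test_types": 3,          # single_node
}

def pytest_collection_modifyitems(session, config, items):
    """Group test modules by CCM cluster topology so modules sharing
    a cluster run adjacently, avoiding teardown/restart between them."""
    def sort_key(indexed_item):
        index, item = indexed_item
        group = CLUSTER_GROUP.get(item.module.__name__, 99)
        # Original collection index as tie-breaker keeps the
        # definition order inside each file intact.
        return (group, index)
    items[:] = [item for _, item in sorted(enumerate(items), key=sort_key)]
```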
These test files don't require multiple nodes for their test logic (they test data types, protocol handlers, row factories, UDTs, and client warnings). Using a single node reduces resource usage and cluster startup time. Files switched from use_singledc() to use_single_node():
- test_types.py
- test_cython_protocol_handlers.py
- test_custom_protocol_handler.py
- test_row_factories.py
- test_udts.py
- test_client_warnings.py
Move remove_cluster() from setUp (which ran before every test) to only the destructive test methods that actually need a fresh cluster. Read-only tests (test_token_aware_is_used_by_default, test_token_aware_composite_key, test_token_aware_with_local_table, test_dc_aware_roundrobin_two_dcs, test_dc_aware_roundrobin_two_dcs_2) can now reuse an existing cluster, avoiding 5 unnecessary cluster teardown/startup cycles.
Replace fixed sleeps with condition-based polling in four test files:
- test_shard_aware.py: replace 25s of sleeps (5+10+5+5) with wait_until_not_raised polling for reconnection after shard connection close and iptables blocking
- test_metrics.py: replace 15s of sleeps (5+5+5) with polling for cluster recovery and node-down detection
- test_tablets.py: replace 13s of sleeps (3+10) with polling for metadata refresh and decommission completion
- simulacron/test_connection.py: replace 20s of sleeps (10+10) with polling for quiescent pool state

Total potential saving: ~73s of unconditional waiting.
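A plausible implementation of the `wait_until_not_raised` helper named above; the signature here is an assumption, the real utility lives in the suite's test support code:

```python
import time

def wait_until_not_raised(fn, interval=0.5, timeout=15.0):
    """Call fn() repeatedly until it stops raising, or fail at timeout.

    Useful for 'reconnection completed' style conditions where
    success is expressed as a call that no longer throws.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fn()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # surface the last failure at timeout
            time.sleep(interval)
```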
Nice, I'll try to look into this soon (let me know when it is ready to review).
My poor (pathetic?) laptop is reviewing the change while I'm in parallel asking it to do the same for the gocql driver. There's so much it can do in parallel... And it might skew the results too. I'll check this later. I also need another AI review on this.
Force-pushed 93ab1b6 to 4e5f8e7
CI Performance Results: PR #771 vs Baseline

All 11 CI jobs passed. Compared against the 4 most recently merged PRs (#759, #729, #728, #706):

Per-Job Timing
Summary
Also validated locally: 10x10 runs across ScyllaDB 2026.1, master, 2025.4, and 2025.1 — all passed with no regressions.
Cluster Setup/Teardown Analysis (from CI logs)

Cluster transitions (teardown/setup cycles)
Standard tests wall-clock timing
Per-file timing (biggest savings)
*0s means the file completed within the same second as the previous file's timestamp — the tests themselves are fast; the baseline duration was dominated by cluster setup.

Where the savings come from
The test_can_connect_with_sslauth test asserted exact equality between auth warning count and ReadyMessage count. With --smp 2, shard-aware connections produce additional ReadyMessages, breaking the equality. Drop the exact equality check and assert a lower bound of >= 3 (one per node connection in a 3-node cluster). The control connection and shard-aware connections may produce additional warnings, so the actual count varies between runs.
The previous sort key used item.name as tie-breaker, which sorted tests alphabetically within each file. This broke test_tablets.py where test_tablets_invalidation_decommission_non_cc_node (a destructive test that decommissions a node) was moved before test_tablets_invalidation_drop_ks, causing the latter to fail. Use the original collection index as tie-breaker instead, preserving the definition order inside each file while still grouping modules by cluster topology across files.
The cluster name 'test_concurrent_schema_change_and_node_kill' (43 chars) causes the maintenance socket path to exceed the 107-byte sun_path limit on Linux when the working directory is deep enough. Shorten to 'test_schema_kill' to stay well within the limit for all environments.
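A guard of this kind can be sketched as follows; the path layout and function names are hypothetical, but the 107-usable-byte limit matches Linux's 108-byte sun_path (including the trailing NUL):

```python
import os

# Linux sockaddr_un.sun_path is 108 bytes including the trailing NUL,
# leaving 107 usable bytes for the path itself.
SUN_PATH_MAX = 107

def maintenance_socket_path(workdir, cluster_name, node="node1"):
    """Hypothetical layout of a CCM maintenance socket path."""
    return os.path.join(workdir, cluster_name, node, "cql.m")

def check_socket_path(workdir, cluster_name):
    """Fail fast if the socket path would exceed the sun_path limit."""
    path = maintenance_socket_path(workdir, cluster_name)
    if len(path.encode()) > SUN_PATH_MAX:
        raise ValueError(
            f"socket path is {len(path.encode())} bytes, exceeding the "
            f"{SUN_PATH_MAX}-byte sun_path limit; shorten cluster name "
            f"{cluster_name!r}")
    return path
```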
@mykaul, I think it's worth splitting this into 2-3 PRs:
Force-pushed 4e5f8e7 to d5b9611
v2 changes

Force-pushed with two changes:
Separately, the investigation of the two pre-existing test failures is complete:
Summary
Reduces integration test wall-clock time by ~28% (from ~33 min to ~24 min per run against ScyllaDB 2026.1) through a combination of sleep elimination, cluster topology consolidation, and test ordering optimizations.
Motivation
Integration tests are the primary bottleneck in the CI feedback loop. The majority of wasted time comes from:

- Fixed `time.sleep()` calls — tests sleeping for fixed durations instead of polling for the actual condition

Changes (11 commits, each independent and cherry-pickable)
Bug fixes (commits 1-2)
- `execute_with_long_wait_retry` said "Failed after 100 attempts" but the retry limit is 10
- `setup_keyspace()`: the cluster is already confirmed ready by three prior condition-based waits

Sleep elimination (commits 3, 9)
- Fixed sleeps replaced with condition-based polling: simulacron startup (5s → HTTP polling via `wait_until()`), shard-aware reconnection (25s → `wait_until_not_raised`), metrics cluster recovery (15s → polling), tablets metadata refresh (13s → polling), simulacron connection pool (20s → polling)

Cluster topology optimization (commits 4-7)
- `test_cluster.py` to `--smp 2`: consistent with the rest of the test suite
- Cluster name consolidation: `test_shard_aware.py` shares the `cluster_tests` config; `test_client_routes.py` module-level shares the `shared_aware` config
- Test ordering (`conftest.py`): a `pytest_collection_modifyitems` hook groups tests by their CCM cluster requirements, minimizing cluster restarts across files
- Single-node clusters for `test_types.py`, `test_cython_protocol_handlers.py`, `test_custom_protocol_handler.py`, `test_row_factories.py`, `test_udts.py`, `test_client_warnings.py` — these tests don't exercise multi-node behavior

Test churn reduction (commit 8)
- `LoadBalancingPolicyTests`: move `remove_cluster()` from `setUp` (ran before every test) to only the destructive test methods that actually modify cluster topology

Compatibility fixes (commits 10-11)
- `test_can_connect_with_sslauth` with `--smp 2`: shard-aware connections produce additional `ReadyMessage`s, breaking the exact equality assertion. Simplified to a `>= 4` lower bound (the test's actual purpose)
- Fixed a test-ordering regression that broke `test_tablets.py` by reordering destructive tests before non-destructive ones. Uses the original collection index as tie-breaker instead.

Test Results
Timing comparison (ScyllaDB 2026.1)
Integration test matrix (10 consecutive green runs each)
- ScyllaDB `unstable/master:latest`

Cassandra 4 and 5 could not be tested on the development machine (requires Java 8/11; only Java 21/25 available). These should be validated in CI.
Pre-existing failures (deselected, not caused by this PR)
- `TestFullNodeReplacementThroughNlb::test_should_survive_full_node_replacement_through_nlb` — `RuntimeError: The process is dead, returncode=1` during node bootstrap. Root cause: cluster name `test_client_routes_replacement` creates Unix socket paths exceeding the 108-char limit.
- `TestConcurrentSchemaChangeAndNodeKill` (master only) — same Unix socket path length issue with cluster name `test_concurrent_schema_change_and_node_kill`.

Unit tests
651 passed, 16 skipped, 0 failures.