
test: fix flaky LLMQ signing recovery timeout#7233

Closed
thepastaclaw wants to merge 1 commit into dashpay:develop from thepastaclaw:fix-flaky-llmq-tests

Conversation


@thepastaclaw thepastaclaw commented Mar 17, 2026

Summary

Fix a flaky timeout in feature_llmq_signing.py (lines 200-201).

The --spork21 reconnect test bumped mocktime by 2 seconds and waited 2 seconds for signature recovery. This was insufficient: the daemon's signing session cleanup cadence is 5 seconds (CLEANUP_INTERVAL in signing_shares.cpp:1194), so the test's wait window was shorter than one cleanup cycle, meaning recovery responsibility never actually rotated.

Change: bump_mocktime(2) → bump_mocktime(10), wait_for_sigs(..., 2) → wait_for_sigs(..., 15).

The wait_for_sigs timeout is a ceiling, not a polling interval — it just allows more time for propagation.
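The cleanup-cadence arithmetic above can be sanity-checked with a small standalone sketch. The CLEANUP_INTERVAL value is taken from the description; the helper below is illustrative, not framework code:

```python
CLEANUP_INTERVAL = 5  # seconds, per signing_shares.cpp as cited above

def cleanup_ticks(start, bump):
    """How many cleanup ticks fire in the window (start, start + bump]."""
    return (start + bump) // CLEANUP_INTERVAL - start // CLEANUP_INTERVAL

# A 2-second mocktime bump can fall entirely between two cleanup runs...
assert cleanup_ticks(0, 2) == 0
# ...while a 10-second bump always spans at least one full cycle,
# no matter where in the cycle the bump begins.
assert all(cleanup_ticks(s, 10) >= 1 for s in range(60))
```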


The PoSe ban assertion fix (previously bundled here) has been split into #7254 for independent review.

@thepastaclaw (Author)

@coderabbitai review


coderabbitai bot commented Mar 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


github-actions bot commented Mar 17, 2026

✅ No Merge Conflicts Detected

This PR currently has no conflicts with other open PRs.


coderabbitai bot commented Mar 17, 2026

Walkthrough

Two functional test files were modified to adjust synchronization and timing behavior. In test/functional/feature_llmq_signing.py, mocktime advancement was increased from 2 seconds to 5 seconds and the wait timeout for recovered signatures was extended from 2 seconds to 15 seconds. In test/functional/feature_llmq_simplepose.py, a hardcoded assertion checking if a Masternode is banned was replaced with a polling mechanism that waits up to a timeout for the banned condition to become true.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Docstring Coverage — Passed: No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.
  • Title check — Passed: The title accurately summarizes the main change: fixing a flaky LLMQ signing recovery timeout by adjusting mocktime and wait values.
  • Description check — Passed: The description is directly related to the changeset, providing detailed technical context about the timeout issue and the specific adjustments made.


thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 91c12f9 to 154d8e7 (March 17, 2026 04:03)

knst commented Mar 17, 2026

The sleep=0.05 is meant to be a sleep of 0.05 s; otherwise all functional tests that use DKG run MUCH longer. Not just (0.5 - 0.05 = 0.45 s) slower per wait, but minutes slower overall, because the delay accumulates at each stage, for each new quorum, for each test.

Just increasing the timeout doesn't fix the root cause of the flakiness.

  1. Show proof that each particular change decreases flakiness in feature_llmq_signing.py (do 100 runs and collect statistically sound data).
  2. Check that the performance of all functional tests that use DKG doesn't get significantly slower (1% slower for the overall run is OK; 30% slower is not).
  3. [most important] Look for a more reliable way to detect DKG status transitions during quorum generation, based on node logs, rather than increasing timeouts and hoping that nodes are ready for the next block and DKG round.
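Point 3 could look roughly like the following minimal, self-contained sketch of log-based waiting. The real test framework has its own log-scanning helpers; the function name, path, and marker strings here are placeholders:

```python
import time

def wait_for_log_line(path, needle, timeout=60, poll=0.1):
    """Poll a node's debug.log until `needle` appears, instead of
    sleeping a fixed amount and hoping the node has advanced."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with open(path, encoding="utf8", errors="replace") as f:
            if needle in f.read():
                return
        time.sleep(poll)
    raise TimeoutError(f"{needle!r} not found in {path} within {timeout}s")
```

This turns "hope the node is ready" into "observe that the node reported the state", which is the kind of synchronization the comment above asks for.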

thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 154d8e7 to a0283cf (March 17, 2026 08:06)
@thepastaclaw (Author)

Thanks for the thorough review @knst. You're right on all three points.

I've force-pushed a stripped-down version that drops the entire second commit (DKG pipeline resilience / timeout inflation / sleep changes). What remains is only the two targeted fixes:

1. feature_llmq_signing.py — bump_mocktime(2) → bump_mocktime(10)

The daemon's signing session cleanup runs every 5 seconds (CLEANUP_INTERVAL in signing_shares.cpp:1194). The old bump_mocktime(2) advanced mocktime by less than one cleanup cycle, so the recovery responsibility never actually rotated to the next member. bump_mocktime(10) guarantees at least one full cycle completes. The wait_for_sigs timeout goes from 2s→15s purely for propagation margin (this is a ceiling, not a polling interval).

2. feature_llmq_simplepose.py — bare assert → wait_until(..., timeout=10)

After mine_quorum() returns, the PoSe ban state update (which happens during block validation) may not be visible via RPC immediately under CPU contention. This is a classic polling fix — the assertion logic is identical, it just retries for up to 10 seconds.
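The polling pattern described here can be sketched as a minimal stand-in for the framework's wait_until helper. The mn_is_banned predicate and protx_hash name are hypothetical placeholders, not real framework calls:

```python
import time

def wait_until(predicate, timeout=10, sleep=0.5):
    """Retry the same boolean check until it holds or the timeout expires;
    the assertion logic is unchanged, it just tolerates RPC-visibility lag."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return
        time.sleep(sleep)
    raise AssertionError(f"predicate still false after {timeout}s")

# Before (racy):  assert mn_is_banned(node, protx_hash)
# After:          wait_until(lambda: mn_is_banned(node, protx_hash), timeout=10)
```

Under no contention the predicate passes on the first poll, so the happy path costs nothing extra; only the failure window widens.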

No framework changes, no sleep modifications, no DKG timeout inflation.

Re: your point 3 (log-based DKG status detection) — that's a great idea for a broader follow-up but out of scope for these two specific bugs, which are purely about the signing recovery test and the PoSe assertion. Neither involves DKG phase transitions.


knst commented Mar 17, 2026

@thepastaclaw I ran develop and a0283cffc1640a1294145e0c58e8697a0f219227 twice each with this command:

test/functional/test_runner.py -j22 feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py

The failure rates for both versions are close to each other. Can you do more testing to be sure that the changes from this PR are useful?

thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 6d48db0 to d1450ae (March 25, 2026 16:28)
thepastaclaw changed the title from "test: fix flaky LLMQ signing recovery and PoSe ban assertions" to "test: fix flaky LLMQ signing recovery timeout" (Mar 25, 2026)
@thepastaclaw (Author)

Closing — testing showed the bump_mocktime change doesn't measurably reduce flakiness (43.3% failure rate on both develop and PR branch with 30× parallel --spork21 runs). knst was right.

The PoSe ban assertion fix has been split into #7254 and remains valid.
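A quick back-of-the-envelope check shows why identical failure rates on both branches give no signal. This uses a normal-approximation two-proportion z statistic in pure stdlib Python; 13/30 failures is an assumption chosen to match the quoted 43.3%:

```python
from math import sqrt

def two_proportion_z(f1, n1, f2, n2):
    """z statistic for the difference of two failure rates (pooled SE)."""
    p1, p2 = f1 / n1, f2 / n2
    p = (f1 + f2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se else 0.0

# Identical counts on both branches: z == 0, nothing to detect.
assert two_proportion_z(13, 30, 13, 30) == 0.0
# Even a sizeable gap needs roughly |z| > 2 to be convincing at n = 30.
assert abs(two_proportion_z(20, 30, 10, 30)) > 2
```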
