
test: fix flaky LLMQ signing recovery timeout#7233

Closed
thepastaclaw wants to merge 1 commit into dashpay:develop from thepastaclaw:fix-flaky-llmq-tests

Conversation


@thepastaclaw thepastaclaw commented Mar 17, 2026

Summary

Fix a flaky timeout in feature_llmq_signing.py (lines 200-201).

The --spork21 reconnect test bumped mocktime by 2 seconds and waited 2 seconds for signature recovery. This was insufficient: the daemon's signing session cleanup cadence is 5 seconds (CLEANUP_INTERVAL in signing_shares.cpp:1194), so the test's wait window was shorter than one cleanup cycle, meaning recovery responsibility never actually rotated.

Change: bump_mocktime(2) → bump_mocktime(10), wait_for_sigs(..., 2) → wait_for_sigs(..., 15).

The wait_for_sigs timeout is a ceiling, not a polling interval — it just allows more time for propagation.
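The cleanup-cadence arithmetic above can be sanity-checked with a small standalone sketch. The CLEANUP_INTERVAL value is taken from the description; the helper below is illustrative, not framework code:

```python
CLEANUP_INTERVAL = 5  # seconds, per signing_shares.cpp as cited above

def cleanup_ticks(start, bump):
    """How many cleanup ticks fire in the window (start, start + bump]."""
    return (start + bump) // CLEANUP_INTERVAL - start // CLEANUP_INTERVAL

# A 2-second mocktime bump can fall entirely between two cleanup runs...
assert cleanup_ticks(0, 2) == 0
# ...while a 10-second bump always spans at least one full cycle,
# no matter where in the cycle the bump begins.
assert all(cleanup_ticks(s, 10) >= 1 for s in range(60))
```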


The PoSe ban assertion fix (previously bundled here) has been split into #7254 for independent review.

@thepastaclaw (Author)

@coderabbitai review


coderabbitai bot commented Mar 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


github-actions bot commented Mar 17, 2026

✅ No Merge Conflicts Detected

This PR currently has no conflicts with other open PRs.


coderabbitai bot commented Mar 17, 2026

Walkthrough

Two functional test files were modified to adjust synchronization and timing behavior. In test/functional/feature_llmq_signing.py, mocktime advancement was increased from 2 seconds to 5 seconds and the wait timeout for recovered signatures was extended from 2 seconds to 15 seconds. In test/functional/feature_llmq_simplepose.py, a hardcoded assertion checking if a Masternode is banned was replaced with a polling mechanism that waits up to a timeout for the banned condition to become true.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Docstring Coverage — Passed: No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.
  • Title check — Passed: The title accurately summarizes the main change: fixing a flaky LLMQ signing recovery timeout by adjusting mocktime and wait values.
  • Description check — Passed: The description is directly related to the changeset, providing detailed technical context about the timeout issue and the specific adjustments made.


thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 91c12f9 to 154d8e7 (March 17, 2026 04:03)

knst commented Mar 17, 2026

The sleep=0.05 is meant to be a sleep of 0.05 s; otherwise all functional tests that use DKG run MUCH longer. Not just (0.5 - 0.05 = 0.45 s) slower per wait, but minutes slower overall, because the delay accumulates at each stage, for each new quorum, for each test.

Just increasing the timeout doesn't fix the root cause of the flakiness.

  1. Show proof that each particular change decreases flakiness in feature_llmq_signing.py (do 100 runs and collect statistically sound data).
  2. Check that the performance of all functional tests that use DKG doesn't get significantly slower (1% slower for the overall run is OK; 30% slower is not).
  3. [most important] Look for a more reliable way to detect DKG status transitions during quorum generation, based on node logs, rather than increasing timeouts and hoping that nodes are ready for the next block and DKG round.
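Point 3 could look roughly like the following minimal, self-contained sketch of log-based waiting. The real test framework has its own log-scanning helpers; the function name, path, and marker strings here are placeholders:

```python
import time

def wait_for_log_line(path, needle, timeout=60, poll=0.1):
    """Poll a node's debug.log until `needle` appears, instead of
    sleeping a fixed amount and hoping the node has advanced."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with open(path, encoding="utf8", errors="replace") as f:
            if needle in f.read():
                return
        time.sleep(poll)
    raise TimeoutError(f"{needle!r} not found in {path} within {timeout}s")
```

This turns "hope the node is ready" into "observe that the node reported the state", which is the kind of synchronization the comment above asks for.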

thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 154d8e7 to a0283cf (March 17, 2026 08:06)
@thepastaclaw (Author)

Thanks for the thorough review @knst. You're right on all three points.

I've force-pushed a stripped-down version that drops the entire second commit (DKG pipeline resilience / timeout inflation / sleep changes). What remains is only the two targeted fixes:

1. feature_llmq_signing.py — bump_mocktime(2) → bump_mocktime(10)

The daemon's signing session cleanup runs every 5 seconds (CLEANUP_INTERVAL in signing_shares.cpp:1194). The old bump_mocktime(2) advanced mocktime by less than one cleanup cycle, so the recovery responsibility never actually rotated to the next member. bump_mocktime(10) guarantees at least one full cycle completes. The wait_for_sigs timeout goes from 2s→15s purely for propagation margin (this is a ceiling, not a polling interval).

2. feature_llmq_simplepose.py — bare assert → wait_until(..., timeout=10)

After mine_quorum() returns, the PoSe ban state update (which happens during block validation) may not be visible via RPC immediately under CPU contention. This is a classic polling fix — the assertion logic is identical, it just retries for up to 10 seconds.
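The polling pattern described here can be sketched as a minimal stand-in for the framework's wait_until helper. The mn_is_banned predicate and protx_hash name are hypothetical placeholders, not real framework calls:

```python
import time

def wait_until(predicate, timeout=10, sleep=0.5):
    """Retry the same boolean check until it holds or the timeout expires;
    the assertion logic is unchanged, it just tolerates RPC-visibility lag."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return
        time.sleep(sleep)
    raise AssertionError(f"predicate still false after {timeout}s")

# Before (racy):  assert mn_is_banned(node, protx_hash)
# After:          wait_until(lambda: mn_is_banned(node, protx_hash), timeout=10)
```

Under no contention the predicate passes on the first poll, so the happy path costs nothing extra; only the failure window widens.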

No framework changes, no sleep modifications, no DKG timeout inflation.

Re: your point 3 (log-based DKG status detection) — that's a great idea for a broader follow-up but out of scope for these two specific bugs, which are purely about the signing recovery test and the PoSe assertion. Neither involves DKG phase transitions.


knst commented Mar 17, 2026

@thepastaclaw I ran develop and a0283cffc1640a1294145e0c58e8697a0f219227 twice each with this command:

test/functional/test_runner.py -j22 feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py feature_llmq_signing.py

The failure rates for both versions are close to each other. Can you do more testing to be sure that the changes from this PR are useful?

thepastaclaw force-pushed the fix-flaky-llmq-tests branch from 6d48db0 to d1450ae (March 25, 2026 16:28)
thepastaclaw changed the title from "test: fix flaky LLMQ signing recovery and PoSe ban assertions" to "test: fix flaky LLMQ signing recovery timeout" (Mar 25, 2026)
@thepastaclaw (Author)

Closing — testing showed the bump_mocktime change doesn't measurably reduce flakiness (43.3% failure rate on both develop and PR branch with 30× parallel --spork21 runs). knst was right.

The PoSe ban assertion fix has been split into #7254 and remains valid.
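A quick back-of-the-envelope check shows why identical failure rates on both branches give no signal. This uses a normal-approximation two-proportion z statistic in pure stdlib Python; 13/30 failures is an assumption chosen to match the quoted 43.3%:

```python
from math import sqrt

def two_proportion_z(f1, n1, f2, n2):
    """z statistic for the difference of two failure rates (pooled SE)."""
    p1, p2 = f1 / n1, f2 / n2
    p = (f1 + f2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se else 0.0

# Identical counts on both branches: z == 0, nothing to detect.
assert two_proportion_z(13, 30, 13, 30) == 0.0
# Even a sizeable gap needs roughly |z| > 2 to be convincing at n = 30.
assert abs(two_proportion_z(20, 30, 10, 30)) > 2
```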
