test: fix flaky LLMQ signing recovery timeout#7233
test: fix flaky LLMQ signing recovery timeout#7233thepastaclaw wants to merge 1 commit intodashpay:developfrom
Conversation
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
✅ No Merge Conflicts DetectedThis PR currently has no conflicts with other open PRs. |
WalkthroughTwo functional test files were modified to adjust synchronization and timing behavior. In Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
91c12f9 to
154d8e7
Compare
|
Sleep Just increasing timeout doesn't fix core reason for flackines.
|
154d8e7 to
a0283cf
Compare
|
Thanks for the thorough review @knst. You're right on all three points. I've force-pushed a stripped-down version that drops the entire second commit (DKG pipeline resilience / timeout inflation / sleep changes). What remains is only the two targeted fixes: 1. The daemon's signing session cleanup runs every 5 seconds ( 2. After No framework changes, no sleep modifications, no DKG timeout inflation. Re: your point 3 (log-based DKG status detection) — that's a great idea for a broader follow-up but out of scope for these two specific bugs, which are purely about the signing recovery test and the PoSe assertion. Neither involves DKG phase transitions. |
|
@thepastaclaw I run Failure rate for both version close to each other. Can you do more testing to be sure that changes from PR is useful? |
The daemon's signing session cleanup runs every 5 seconds (CLEANUP_INTERVAL in signing_shares.cpp:1194). The old bump_mocktime(2) advanced mocktime by less than one cleanup cycle, so recovery responsibility never actually rotated to the next member. bump_mocktime(10) guarantees at least one full cycle completes. The wait_for_sigs timeout goes from 2s to 15s purely for propagation margin (this is a ceiling, not a polling interval).
6d48db0 to
d1450ae
Compare
|
Closing — testing showed the bump_mocktime change doesn't measurably reduce flakiness (43.3% failure rate on both develop and PR branch with 30× parallel --spork21 runs). knst was right. The PoSe ban assertion fix has been split into #7254 and remains valid. |
Summary
Fix a flaky timeout in
feature_llmq_signing.py(lines 200-201).The
--spork21reconnect test bumped mocktime by 2 seconds and waited 2 seconds for signature recovery. This was insufficient: the daemon's signing session cleanup cadence is 5 seconds (CLEANUP_INTERVALinsigning_shares.cpp:1194), so the test's wait window was shorter than one cleanup cycle, meaning recovery responsibility never actually rotated.Change:
bump_mocktime(2)→bump_mocktime(10),wait_for_sigs(..., 2)→wait_for_sigs(..., 15).The
wait_for_sigstimeout is a ceiling, not a polling interval — it just allows more time for propagation.The PoSe ban assertion fix (previously bundled here) has been split into #7254 for independent review.