refactor: simplify architecture of quorum observer and quorum participant #7247
knst wants to merge 9 commits into dashpay:develop from
Conversation
Walkthrough

This PR introduces a new llmq::QuorumRole abstraction and moves quorum lifecycle, connection maintenance, and data-recovery logic into it (implemented in quorumsman). QuorumObserver/QuorumParticipant implementations and their files were removed or consolidated; ActiveContext and ObserverContext now implement/override QuorumRole hooks. CQuorumManager was adapted to reference QuorumRole. Several llmq/observer headers/sources were merged into llmq/observer.h/.cpp. CDKGSession visibility was widened and hash-based lookup accessors were added to support the new flow. Build files were updated to reflect the consolidated observer sources/headers.

Sequence Diagram(s)

sequenceDiagram
participant Chain as Blockchain (UpdatedBlockTip)
participant QuorumRole as llmq::QuorumRole
participant QMan as CQuorumManager
participant Worker as WorkerPool
participant Peers as Network/Connman
participant DB as QuorumDataDB
Chain->>QuorumRole: UpdatedBlockTip(pindexNew, fInitialDownload)
QuorumRole->>QMan: InitializeQuorumConnections / CheckQuorumConnections(llmqParams, pindexNew)
QMan-->>QuorumRole: deletableQuorums / connection decisions
QuorumRole->>QuorumRole: TriggerQuorumDataRecoveryThreads(pindexNew)
QuorumRole->>Worker: Enqueue DataRecoveryThread(work item)
Worker->>Peers: Request missing vvec / encrypted contributions (QGETDATA)
Peers-->>Worker: Respond with QDATA / QVEC messages
Worker->>QuorumRole: ProcessContribQDATA / ProcessContribQGETDATA
QuorumRole->>DB: Store or Cleanup quorum data
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (1)
src/llmq/dkgsession.h (1)
281-339: Prefer narrower hooks over widening most of the session state to protected.

This trades a small set of explicit friends for subclass access to invCs, the message maps, and most of the mutable DKG session state. I'd keep as much of this private as possible and add a few targeted hooks/accessors for the remaining subclass needs, so the locking and ordering invariants stay localized.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llmq/dkgsession.h` around lines 281 - 339, Make the large protected block private and provide narrow accessor/hooks for subclasses instead of exposing most session state; keep invCs, cs_pending, and the message maps (contributions, complaints, justifications, prematureCommitments, validCommitments) private and add specific thread-safe methods (e.g., bool TryGetContribution(uint256 hash, CDKGContribution& out), void ForEachContributionLocked(fn), AddComplaint/RemoveComplaint, ValidateAndMarkCommitment, GetPendingContributionVerifications(), etc.) that acquire the proper locks and expose only the minimal operations subclasses need; similarly keep members, membersMap, relayMembers, vvecContribution, receivedVvecs, vecEncryptedContributions, m_sk_contributions, myProTxHash/myId/myIdx private and add targeted getters/setters or iteration hooks (e.g., GetMemberByIndex, FindMemberId, AppendEncryptedContribution) to preserve invariants and localize locking around invCs and cs_pending while removing the broad protected visibility.
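The narrow, lock-localizing accessors suggested above can be sketched in miniature. This is an illustrative stand-in only, not the real CDKGSession (which keys on uint256 hashes and stores CDKGContribution values); the point is that the mutex and map stay private, so the locking invariant lives in one place:

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical sketch: a narrow accessor pair replacing broad protected
// access to session state. Subclasses get a locked lookup instead of
// direct access to the map and its mutex.
class Session
{
public:
    void AddContribution(int hash, std::string contrib)
    {
        std::lock_guard<std::mutex> lock(m_cs);
        m_contributions.emplace(hash, std::move(contrib));
    }

    bool TryGetContribution(int hash, std::string& out) const
    {
        std::lock_guard<std::mutex> lock(m_cs);
        const auto it = m_contributions.find(hash);
        if (it == m_contributions.end()) return false;
        out = it->second;
        return true;
    }

private:
    mutable std::mutex m_cs;                    // never exposed to subclasses
    std::map<int, std::string> m_contributions; // never exposed to subclasses
};
```

Every operation a subclass needs acquires the lock internally, so widening visibility of the raw members becomes unnecessary.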
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/active/context.cpp`:
- Line 40: The file uses standard algorithms (std::sort and std::ranges::any_of)
but only includes <ranges>, which risks build breakage; add a direct include for
<algorithm> at the top of the file (alongside the existing `#include <ranges>`) so
that usages in functions referencing std::sort and std::ranges::any_of are
guaranteed to compile across STL implementations.
In `@src/llmq/observer.cpp`:
- Around line 35-38: The destructor ObserverContext::~ObserverContext is not
formatted per project style; run clang-format (or apply the repository's
formatting tool) on the hunk containing ObserverContext::~ObserverContext and
the call to m_qman.DisconnectManagers() so the braces, indentation and spacing
match the project's clang-format rules and the diff-format job will pass.
In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The QuorumRole destructor currently calls Stop()
(QuorumRole::~QuorumRole()) which can run after derived members are destroyed
and allows ActiveContext's workerPool jobs to dereference torn-down derived
state; remove the Stop() call from the base destructor and instead
enforce/verify that shutdown happened earlier by asserting or logging if Stop()
wasn't already invoked. Locate where derived classes or owners create/own
ActiveContext/workerPool and ensure they call Stop() (or otherwise quiesce
workerPool) before derived member teardown; add an assertion in
QuorumRole::~QuorumRole() that Stop() has already been called (or that the
workerPool is quiesced) to fail-fast if shutdown ordering is violated. Ensure
any code paths that currently rely on the base destructor to stop workers are
updated to call Stop() explicitly before destruction.
- Around line 913-918: The loop currently increments nTries before calling
m_qman.IsDataRequestPending and then continues without hitting the 1s sleep,
which burns attempts on pending requests; fix by not incrementing nTries until
you actually make a request: compute pCurrentMemberHash using (start_offset +
nTries) without ++, call m_qman.IsDataRequestPending(*pCurrentMemberHash,...),
and only increment nTries (or advance start_offset) when you proceed to send a
request; alternatively ensure the continue path still triggers the per-iteration
sleep — adjust the logic around pCurrentMemberHash, nTries, and the continue so
IsDataRequestPending does not consume an attempt and the loop still sleeps on
pending checks.
In `@src/llmq/quorumsman.h`:
- Line 75: Add a forward declaration for llmq::QuorumRole before the
CQuorumManager class so the member llmq::QuorumRole* m_handler (and any other
uses of llmq::QuorumRole in CQuorumManager) are valid; specifically, insert
"namespace llmq { class QuorumRole; }" (or equivalent forward declaration) above
the CQuorumManager definition to resolve the use-before-declaration of
llmq::QuorumRole referenced by m_handler and any other members/methods in
CQuorumManager.
---
Nitpick comments:
In `@src/llmq/dkgsession.h`:
- Around line 281-339: Make the large protected block private and provide narrow
accessor/hooks for subclasses instead of exposing most session state; keep
invCs, cs_pending, and the message maps (contributions, complaints,
justifications, prematureCommitments, validCommitments) private and add specific
thread-safe methods (e.g., bool TryGetContribution(uint256 hash,
CDKGContribution& out), void ForEachContributionLocked(fn),
AddComplaint/RemoveComplaint, ValidateAndMarkCommitment,
GetPendingContributionVerifications(), etc.) that acquire the proper locks and
expose only the minimal operations subclasses need; similarly keep members,
membersMap, relayMembers, vvecContribution, receivedVvecs,
vecEncryptedContributions, m_sk_contributions, myProTxHash/myId/myIdx private
and add targeted getters/setters or iteration hooks (e.g., GetMemberByIndex,
FindMemberId, AppendEncryptedContribution) to preserve invariants and localize
locking around invCs and cs_pending while removing the broad protected
visibility.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f51bb0d0-4ebb-4427-8fb1-dc54321440d3
📒 Files selected for processing (20)
src/Makefile.am, src/active/context.cpp, src/active/context.h, src/active/dkgsessionhandler.cpp, src/active/quorums.cpp, src/active/quorums.h, src/init.cpp, src/llmq/dkgsession.cpp, src/llmq/dkgsession.h, src/llmq/observer.cpp, src/llmq/observer.h, src/llmq/observer/quorums.cpp, src/llmq/observer/quorums.h, src/llmq/quorums.h, src/llmq/quorumsman.cpp, src/llmq/quorumsman.h, src/net_processing.cpp, src/node/context.cpp, src/rpc/quorums.cpp, test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
- test/util/data/non-backported.txt
- src/active/quorums.cpp
- src/llmq/observer/quorums.cpp
- src/llmq/observer/quorums.h
- src/active/quorums.h
#include <validation.h>
#include <validationinterface.h>

#include <ranges>
🧩 Analysis chain
🏁 Script executed:
# Find and inspect the context.cpp file
find . -name "context.cpp" -path "*/active/*" -type f
Repository: dashpay/dash
Length of output: 80
🏁 Script executed:
# Check if the file exists and view the includes around line 40
head -50 src/active/context.cpp
Repository: dashpay/dash
Length of output: 2113
🏁 Script executed:
# Check lines around 153, 241, and 301 for the claimed algorithm usage
sed -n '150,156p' src/active/context.cpp
sed -n '238,244p' src/active/context.cpp
sed -n '298,304p' src/active/context.cpp
Repository: dashpay/dash
Length of output: 1333
🏁 Script executed:
# Check all `#include` statements in the file
grep -n "^#include" src/active/context.cpp
Repository: dashpay/dash
Length of output: 1104
Add #include <algorithm> for standard library algorithms.
The file uses std::sort at line 153 and std::ranges::any_of at lines 241 and 301, both of which require <algorithm>. Including only <ranges> creates a build fragility risk: while the code may currently compile due to transitive includes, different STL implementations or future include reshuffles could break it.
Minimal fix:
+#include <algorithm>
 #include <ranges>
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
#include <algorithm>
#include <ranges>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/active/context.cpp` at line 40, The file uses standard algorithms
(std::sort and std::ranges::any_of) but only includes <ranges>, which risks
build breakage; add a direct include for <algorithm> at the top of the file
(alongside the existing `#include <ranges>`) so that usages in functions
referencing std::sort and std::ranges::any_of are guaranteed to compile across
STL implementations.
ObserverContext::~ObserverContext()
{
    m_qman.DisconnectManagers();
}
Run clang-format on this destructor hunk.
The diff-format job is already failing on this block, so this needs to be reformatted before merge.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmq/observer.cpp` around lines 35 - 38, The destructor
ObserverContext::~ObserverContext is not formatted per project style; run
clang-format (or apply the repository's formatting tool) on the hunk containing
ObserverContext::~ObserverContext and the call to m_qman.DisconnectManagers() so
the braces, indentation and spacing match the project's clang-format rules and
the diff-format job will pass.
QuorumRole::~QuorumRole()
{
    Stop();
}
Don't rely on the base destructor to stop quorum workers.
Line 774 calls Stop() only from QuorumRole::~QuorumRole(), but base destructors run after derived members are already gone. ActiveContext queues workerPool jobs that dereference derived state in src/active/context.cpp, Lines 283-285, so destroying an ActiveContext without an earlier explicit stop leaves a shutdown-time UAF window here. The pool needs to be quiesced before derived member teardown; this destructor should at most assert that shutdown already happened.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmq/quorumsman.cpp` around lines 772 - 775, The QuorumRole destructor
currently calls Stop() (QuorumRole::~QuorumRole()) which can run after derived
members are destroyed and allows ActiveContext's workerPool jobs to dereference
torn-down derived state; remove the Stop() call from the base destructor and
instead enforce/verify that shutdown happened earlier by asserting or logging if
Stop() wasn't already invoked. Locate where derived classes or owners create/own
ActiveContext/workerPool and ensure they call Stop() (or otherwise quiesce
workerPool) before derived member teardown; add an assertion in
QuorumRole::~QuorumRole() that Stop() has already been called (or that the
workerPool is quiesced) to fail-fast if shutdown ordering is violated. Ensure
any code paths that currently rely on the base destructor to stop workers are
updated to call Stop() explicitly before destruction.
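The ordering hazard described here is a general C++ rule: a derived object's members are destroyed before the base destructor body runs, so a Stop() call placed only in the base destructor fires after the derived state it is meant to protect is already gone. A minimal sketch (all names illustrative, not the PR's types) makes the order observable:

```cpp
#include <string>
#include <vector>

// Records teardown events so the ordering is visible.
static std::vector<std::string> g_order;

struct Base {
    // Stand-in for QuorumRole: the safety-net Stop() lives in the base dtor.
    ~Base() { g_order.push_back("Base dtor (Stop runs here)"); }
};

struct Member {
    // Stand-in for derived-owned state such as nodeman.
    ~Member() { g_order.push_back("member destroyed"); }
};

struct Derived : Base {
    // Stand-in for ActiveContext.
    Member m;
    ~Derived() { g_order.push_back("Derived dtor"); }
};

// Destroys a Derived and returns the observed teardown order.
std::vector<std::string> teardown_order()
{
    g_order.clear();
    { Derived d; }
    return g_order;
}
```

Because "Base dtor" is the last event, any worker still running at that point can touch the already-destroyed member, which is exactly the use-after-free window the review flags.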
pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries++) % vecMemberHashes.size()];
if (m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
                                pQuorum->qc->llmqType)) {
    printLog("Already asked");
    continue;
}
Don't burn recovery attempts on still-pending requests.
Line 913 increments nTries before IsDataRequestPending(), and the continue path on Line 917 skips the normal 1-second sleep at the bottom of the loop. Once a few older requests are still pending, this can spin across vecMemberHashes and hit "All tried but failed" without actually waiting for any of those requests to complete.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmq/quorumsman.cpp` around lines 913 - 918, The loop currently
increments nTries before calling m_qman.IsDataRequestPending and then continues
without hitting the 1s sleep, which burns attempts on pending requests; fix by
not incrementing nTries until you actually make a request: compute
pCurrentMemberHash using (start_offset + nTries) without ++, call
m_qman.IsDataRequestPending(*pCurrentMemberHash,...), and only increment nTries
(or advance start_offset) when you proceed to send a request; alternatively
ensure the continue path still triggers the per-iteration sleep — adjust the
logic around pCurrentMemberHash, nTries, and the continue so
IsDataRequestPending does not consume an attempt and the loop still sleeps on
pending checks.
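The intended behavior — skip members with still-pending requests without consuming an attempt, and tell the caller to sleep rather than give up when everything is pending — can be isolated into a small helper. This is a hypothetical sketch of the suggested fix, not the PR's actual code:

```cpp
#include <cstddef>
#include <vector>

// Probe members starting at (start_offset + nTries), skipping pending ones
// WITHOUT consuming an attempt. Returns the index to request from, or -1
// when every member still has a pending request, in which case the caller
// should sleep and retry instead of reporting "All tried but failed".
int next_member_index(std::size_t start_offset, std::size_t nTries,
                      const std::vector<bool>& pending)
{
    if (pending.empty()) return -1;
    for (std::size_t probe = 0; probe < pending.size(); ++probe) {
        const std::size_t idx = (start_offset + nTries + probe) % pending.size();
        // Only a member we can actually ask is returned; the caller
        // increments nTries after sending the request.
        if (!pending[idx]) return static_cast<int>(idx);
    }
    return -1; // all pending: wait, don't burn attempts
}
```

Separating the probe from the attempt counter makes the invariant explicit: nTries only advances when a request is actually sent.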
 const ChainstateManager& m_chainman;
 llmq::CDKGSessionManager* m_qdkgsman{nullptr};
-llmq::QuorumObserver* m_handler{nullptr};
+llmq::QuorumRole* m_handler{nullptr};
🧩 Analysis chain
🏁 Script executed:
cat -n src/llmq/quorumsman.h | head -250
Repository: dashpay/dash
Length of output: 12660
Add forward declaration for QuorumRole before its use in CQuorumManager.
Line 75 and line 111 use llmq::QuorumRole* before the class is declared at line 197, making this header invalid C++. Add a forward declaration in the llmq namespace before CQuorumManager is defined.
Suggested fix
namespace llmq {
+class QuorumRole;
+
 enum class VerifyRecSigStatus : uint8_t {

Also applies to: 111-111
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmq/quorumsman.h` at line 75, Add a forward declaration for
llmq::QuorumRole before the CQuorumManager class so the member llmq::QuorumRole*
m_handler (and any other uses of llmq::QuorumRole in CQuorumManager) are valid;
specifically, insert "namespace llmq { class QuorumRole; }" (or equivalent
forward declaration) above the CQuorumManager definition to resolve the
use-before-declaration of llmq::QuorumRole referenced by m_handler and any other
members/methods in CQuorumManager.
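The fix relies on a standard C++ rule: declaring a pointer member only requires the pointee type to be declared, not defined. A minimal sketch of the suggested header layout (the accessor is illustrative, not part of the real CQuorumManager):

```cpp
// A forward declaration makes the pointer member valid without pulling in
// QuorumRole's full definition.
namespace llmq {
class QuorumRole; // forward declaration only; defined elsewhere

class CQuorumManager
{
public:
    QuorumRole* handler() const { return m_handler; }

private:
    QuorumRole* m_handler{nullptr}; // pointer to incomplete type is well-formed
};
} // namespace llmq
```

Only code that dereferences the pointer or needs QuorumRole's size must see the full definition, so the .cpp file includes the real header while this one stays lightweight.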
thepastaclaw
left a comment
Code Review
Clean, well-structured refactoring that successfully collapses the QuorumObserver/QuorumParticipant/QuorumObserverParent hierarchy into a single QuorumRole base class with two concrete subclasses (ObserverContext and ActiveContext). The code movement is mechanically correct with no logic changes, CDKGSession friend removal is properly replaced with getters and protected access, and the destruction/initialization ordering is preserved.
Reviewed commit: 40ab4f9
🟡 2 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 244-249: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads should be protected, not public
These two virtual methods are internal implementation hooks of the Template Method pattern — they're called by `UpdatedBlockTip` and overridden by `ActiveContext`, but never called externally through a `QuorumRole*`. In the old `QuorumObserver` class they were `protected virtual`. Making them public unnecessarily exposes internal state management.
In `ActiveContext::context.h` these overrides are correctly in the `protected` section (line 83–87), which reinforces that this is the intended visibility.
In `src/active/context.cpp`:
- [SUGGESTION] lines 283-286: Worker pool lifetime vs ActiveContext member lifetime during abnormal destruction
The worker pool now lives in the `QuorumRole` base class, while tasks queued via `StartDataRecoveryThread` capture `this` (an `ActiveContext*`) and dereference `nodeman` (line 285). C++ destruction order is: derived destructor → member destruction → base destructor. So if `ActiveContext` is ever destroyed without a prior explicit `Stop()` call, there's a window where `nodeman` is already destroyed but the worker pool hasn't been stopped yet — `QuorumRole::~QuorumRole()` calls `Stop()` too late.
In practice the normal shutdown path (`init.cpp:293`) always calls `Stop()` before destruction, and `QuorumRole::~QuorumRole()` has a safety-net `Stop()` call. But in the old code, the worker pool was inside the owned `qman_handler` member, so it was destroyed in member-destruction order alongside the other members — there was no split between member lifetimes and pool lifetime. This refactor introduced a subtle ordering change that could bite during exception-unwind or future refactors.
Consider adding an `assert` or explicit `Stop()` in `ActiveContext::~ActiveContext()` to make the contract self-enforcing.
virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
                                    gsl::not_null<const CBlockIndex*> pindexNew) const;

//! Observer default: triggers vvec sync threads only.
//! QuorumParticipant overrides to also recover sk shares for member quorums.
virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;
🟡 Suggestion: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads should be protected, not public
These two virtual methods are internal implementation hooks of the Template Method pattern — they're called by UpdatedBlockTip and overridden by ActiveContext, but never called externally through a QuorumRole*. In the old QuorumObserver class they were protected virtual. Making them public unnecessarily exposes internal state management.
In ActiveContext::context.h these overrides are correctly in the protected section (line 83–87), which reinforces that this is the intended visibility.
💡 Suggested change
-    virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
-                                        gsl::not_null<const CBlockIndex*> pindexNew) const;
-    //! Observer default: triggers vvec sync threads only.
-    //! QuorumParticipant overrides to also recover sk shares for member quorums.
-    virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;
+protected:
+    //! Observer default: connects quorum peers with is_masternode=false.
+    //! QuorumParticipant overrides with masternode-aware connection logic.
+    virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
+                                        gsl::not_null<const CBlockIndex*> pindexNew) const;
+    //! Observer default: triggers vvec sync threads only.
+    //! QuorumParticipant overrides to also recover sk shares for member quorums.
+    virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;
source: ['claude']
🤖 Fix this with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 244-249: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads should be protected, not public
These two virtual methods are internal implementation hooks of the Template Method pattern — they're called by `UpdatedBlockTip` and overridden by `ActiveContext`, but never called externally through a `QuorumRole*`. In the old `QuorumObserver` class they were `protected virtual`. Making them public unnecessarily exposes internal state management.
In `ActiveContext::context.h` these overrides are correctly in the `protected` section (line 83–87), which reinforces that this is the intended visibility.
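The Template Method shape referenced in this suggestion can be sketched in miniature: the public entry point fixes the algorithm's skeleton, while the protected virtual hooks vary per role and cannot be invoked through a base pointer from outside. Names and return strings below are illustrative, not the PR's actual signatures:

```cpp
#include <string>

// Template Method pattern: callers only see UpdatedBlockTip(); the hooks
// it drives are protected, so subclasses can override them but external
// code cannot call them directly.
class Role
{
public:
    virtual ~Role() = default;

    // Template method: fixed skeleton, variable steps.
    std::string UpdatedBlockTip()
    {
        return CheckConnections() + " + " + TriggerRecovery();
    }

protected:
    virtual std::string CheckConnections() { return "observer-connect"; }
    virtual std::string TriggerRecovery() { return "vvec-sync"; }
};

class Active : public Role
{
protected:
    std::string CheckConnections() override { return "masternode-connect"; }
    std::string TriggerRecovery() override { return "vvec-sync, sk-shares"; }
};
```

With the hooks public instead, any holder of a Role* could call a single step out of order; protected visibility keeps the skeleton the only entry point.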
workerPool.push([pQuorum = std::move(pQuorum), pIndex, nDataMaskIn, this](int threadId) mutable {
    const size_t size_offset = GetQuorumRecoveryStartOffset(*pQuorum, pIndex);
    DataRecoveryThread(pIndex, std::move(pQuorum), nDataMaskIn, nodeman->GetProTxHash(), size_offset);
});
🟡 Suggestion: Worker pool lifetime vs ActiveContext member lifetime during abnormal destruction
The worker pool now lives in the QuorumRole base class, while tasks queued via StartDataRecoveryThread capture this (an ActiveContext*) and dereference nodeman (line 285). C++ destruction order is: derived destructor → member destruction → base destructor. So if ActiveContext is ever destroyed without a prior explicit Stop() call, there's a window where nodeman is already destroyed but the worker pool hasn't been stopped yet — QuorumRole::~QuorumRole() calls Stop() too late.
In practice the normal shutdown path (init.cpp:293) always calls Stop() before destruction, and QuorumRole::~QuorumRole() has a safety-net Stop() call. But in the old code, the worker pool was inside the owned qman_handler member, so it was destroyed in member-destruction order alongside the other members — there was no split between member lifetimes and pool lifetime. This refactor introduced a subtle ordering change that could bite during exception-unwind or future refactors.
Consider adding an assert or explicit Stop() in ActiveContext::~ActiveContext() to make the contract self-enforcing.
source: ['codex']
🤖 Fix this with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `src/active/context.cpp`:
- [SUGGESTION] lines 283-286: Worker pool lifetime vs ActiveContext member lifetime during abnormal destruction
The worker pool now lives in the `QuorumRole` base class, while tasks queued via `StartDataRecoveryThread` capture `this` (an `ActiveContext*`) and dereference `nodeman` (line 285). C++ destruction order is: derived destructor → member destruction → base destructor. So if `ActiveContext` is ever destroyed without a prior explicit `Stop()` call, there's a window where `nodeman` is already destroyed but the worker pool hasn't been stopped yet — `QuorumRole::~QuorumRole()` calls `Stop()` too late.
In practice the normal shutdown path (`init.cpp:293`) always calls `Stop()` before destruction, and `QuorumRole::~QuorumRole()` has a safety-net `Stop()` call. But in the old code, the worker pool was inside the owned `qman_handler` member, so it was destroyed in member-destruction order alongside the other members — there was no split between member lifetimes and pool lifetime. This refactor introduced a subtle ordering change that could bite during exception-unwind or future refactors.
Consider adding an `assert` or explicit `Stop()` in `ActiveContext::~ActiveContext()` to make the contract self-enforcing.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54d9a6b83a
workerPool.push([pQuorum = std::move(pQuorum), pIndex, nDataMaskIn, this](int threadId) mutable {
    const size_t size_offset = GetQuorumRecoveryStartOffset(*pQuorum, pIndex);
    DataRecoveryThread(pIndex, std::move(pQuorum), nDataMaskIn, nodeman->GetProTxHash(), size_offset);
});
Stop quorum workers before destroying masternode state
This worker task captures this and dereferences nodeman, but after this refactor ActiveContext no longer owns a separate QuorumParticipant member whose destructor stopped quorum workers before other members were torn down. If ActiveContext is ever destroyed without an explicit prior Stop() (e.g., a future call site or test harness), QuorumRole::~QuorumRole() will run too late (after member destruction), and queued/running tasks here can use freed nodeman, causing a use-after-free crash.
This pull request has conflicts, please rebase.
54d9a6b to 323c98a (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/llmq/quorumsman.cpp (2)
772-775: ⚠️ Potential issue | 🟠 Major

Base destructor calls Stop(), which may cause UAF issues.

The base class destructor (QuorumRole::~QuorumRole()) calls Stop(), but base destructors run after derived member destruction. Since ActiveContext queues workerPool jobs that may reference derived state, this creates a shutdown-time use-after-free window.

Consider asserting that Stop() was already called, rather than calling it from the destructor:

 QuorumRole::~QuorumRole()
 {
-    Stop();
+    // Derived classes must call Stop() before destruction
+    assert(workerPool.size() == 0 || !workerPool.is_running());
 }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llmq/quorumsman.cpp` around lines 772 - 775, The base destructor QuorumRole::~QuorumRole() must not call Stop() because derived state may already be destroyed and ActiveContext/workerPool tasks can reference that derived state; remove the Stop() call from QuorumRole::~QuorumRole() and instead assert that Stop() was already invoked (e.g. check a boolean like m_running or a finished/shutdown flag) to catch incorrect shutdown ordering, or move the Stop() invocation into derived class destructors where derived resources are still valid; reference QuorumRole::~QuorumRole, Stop(), and ActiveContext/workerPool when making the change.
912-922: ⚠️ Potential issue | 🟠 Major

Recovery loop burns attempts on pending requests without sleeping.

When IsDataRequestPending() returns true, the loop increments nTries and continues without hitting the 1-second sleep at the bottom. This can rapidly exhaust all members and report "All tried but failed" without actually waiting for pending requests to complete.

Consider not incrementing nTries until a request is actually sent:

- if ((GetTime<std::chrono::seconds>().count() - nTimeLastSuccess) > nRequestTimeout) {
-     if (nTries >= vecMemberHashes.size()) {
-         printLog("All tried but failed");
-         break;
-     }
-     pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries++) % vecMemberHashes.size()];
-     if (m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
-                                     pQuorum->qc->llmqType)) {
-         printLog("Already asked");
-         continue;
-     }
+ if ((GetTime<std::chrono::seconds>().count() - nTimeLastSuccess) > nRequestTimeout) {
+     // Find next member without a pending request
+     size_t attempts = 0;
+     while (attempts < vecMemberHashes.size()) {
+         pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries) % vecMemberHashes.size()];
+         if (!m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
+                                          pQuorum->qc->llmqType)) {
+             break;
+         }
+         nTries++;
+         attempts++;
+     }
+     if (attempts >= vecMemberHashes.size()) {
+         printLog("All pending or tried");
+         // Don't break - wait for pending to complete
+     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llmq/quorumsman.cpp` around lines 912 - 922, The loop currently increments nTries when selecting pCurrentMemberHash before calling IsDataRequestPending, which causes pending-request branches to still consume attempts and skip the 1s sleep; change the logic in the recovery loop around IsDataRequestPending so that nTries is only incremented when an actual request is sent (i.e., move the nTries++ to after the code path that sends a request or use a separate index variable for probing and only increment nTries on a real send), keep the IsDataRequestPending check as-is (call on *pCurrentMemberHash), and ensure that when the branch finds a pending request it does not continue past the sleep (so the loop will hit the existing 1-second sleep) and preserve the existing printLog(...) messages (e.g., "Already asked" / "All tried but failed").
🧹 Nitpick comments (1)
src/active/context.cpp (1)
182-182: Redundant assertion on gsl::not_null parameter.

The block_index parameter is declared as gsl::not_null<const CBlockIndex*>, which already guarantees the pointer is non-null at the call site. This assertion is unnecessary.

Proposed removal:

  if (request.GetDataMask() & llmq::CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
-     assert(block_index);
      int memberIdx = quorum.GetMemberIndex(request.GetProTxHash());

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/active/context.cpp` at line 182, Remove the redundant runtime assertion for the non-null pointer: delete the call to assert(block_index); in the function that takes gsl::not_null<const CBlockIndex*> block_index (i.e., where the parameter is declared as gsl::not_null<const CBlockIndex*> block_index). The gsl::not_null type already enforces non-null at call sites, so simply remove the assert and leave the parameter usage as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/active/context.cpp`:
- Line 173: The modulo expression "nIndex % quorum.qc->validMembers.size()" can
divide by zero if quorum.qc->validMembers is empty; add a defensive check before
the modulo (inspect quorum.qc and quorum.qc->validMembers.empty()) and handle
the empty case (e.g., return 0 or throw/log a clear error) so you never perform
the modulo on size() == 0; update the function containing that return to use
this guard and keep the rest of the logic unchanged.
---
Duplicate comments:
In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The base destructor QuorumRole::~QuorumRole() must not
call Stop() because derived state may already be destroyed and
ActiveContext/workerPool tasks can reference that derived state; remove the
Stop() call from QuorumRole::~QuorumRole() and instead assert that Stop() was
already invoked (e.g. check a boolean like m_running or a finished/shutdown
flag) to catch incorrect shutdown ordering, or move the Stop() invocation into
derived class destructors where derived resources are still valid; reference
QuorumRole::~QuorumRole, Stop(), and ActiveContext/workerPool when making the
change.
- Around line 912-922: The loop currently increments nTries when selecting
pCurrentMemberHash before calling IsDataRequestPending, which causes
pending-request branches to still consume attempts and skip the 1s sleep; change
the logic in the recovery loop around IsDataRequestPending so that nTries is
only incremented when an actual request is sent (i.e., move the nTries++ to
after the code path that sends a request or use a separate index variable for
probing and only increment nTries on a real send), keep the IsDataRequestPending
check as-is (call on *pCurrentMemberHash), and ensure that when the branch finds
a pending request it does not continue past the sleep (so the loop will hit the
existing 1-second sleep) and preserve the existing printLog(...) messages (e.g.,
"Already asked" / "All tried but failed").
---
Nitpick comments:
In `@src/active/context.cpp`:
- Line 182: Remove the redundant runtime assertion for the non-null pointer:
delete the call to assert(block_index); in the function that takes
gsl::not_null<const CBlockIndex*> block_index (i.e., where the parameter is
declared as gsl::not_null<const CBlockIndex*> block_index). The gsl::not_null
type already enforces non-null at call sites, so simply remove the assert and
leave the parameter usage as-is.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: a10c1fef-7887-4c73-b10b-d5db9bfee1be
📒 Files selected for processing (21)
- src/Makefile.am
- src/active/context.cpp
- src/active/context.h
- src/active/dkgsession.cpp
- src/active/dkgsessionhandler.cpp
- src/active/quorums.cpp
- src/active/quorums.h
- src/init.cpp
- src/llmq/dkgsession.cpp
- src/llmq/dkgsession.h
- src/llmq/observer.cpp
- src/llmq/observer.h
- src/llmq/observer/quorums.cpp
- src/llmq/observer/quorums.h
- src/llmq/quorums.h
- src/llmq/quorumsman.cpp
- src/llmq/quorumsman.h
- src/net_processing.cpp
- src/node/context.cpp
- src/rpc/quorums.cpp
- test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
- test/util/data/non-backported.txt
- src/llmq/observer/quorums.cpp
- src/active/quorums.h
- src/active/quorums.cpp
- src/llmq/observer/quorums.h
✅ Files skipped from review due to trivial changes (2)
- src/node/context.cpp
- src/net_processing.cpp
🚧 Files skipped from review as they are similar to previous changes (4)
- src/rpc/quorums.cpp
- src/active/dkgsession.cpp
- src/llmq/quorums.h
- src/init.cpp
      }
    }
  }
  return nIndex % quorum.qc->validMembers.size();
Potential division by zero if validMembers is empty.
If quorum.qc->validMembers.size() is 0, this modulo operation causes undefined behavior. While a properly initialized quorum should always have valid members, adding a defensive check would prevent crashes from edge cases or future refactoring.
Proposed defensive check
- return nIndex % quorum.qc->validMembers.size();
+ const size_t validMembersCount = quorum.qc->validMembers.size();
+ assert(validMembersCount > 0);
+ return nIndex % validMembersCount;

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- return nIndex % quorum.qc->validMembers.size();
+ const size_t validMembersCount = quorum.qc->validMembers.size();
+ assert(validMembersCount > 0);
+ return nIndex % validMembersCount;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/active/context.cpp` at line 173, The modulo expression "nIndex %
quorum.qc->validMembers.size()" can divide by zero if quorum.qc->validMembers is
empty; add a defensive check before the modulo (inspect quorum.qc and
quorum.qc->validMembers.empty()) and handle the empty case (e.g., return 0 or
throw/log a clear error) so you never perform the modulo on size() == 0; update
the function containing that return to use this guard and keep the rest of the
logic unchanged.
… no more child of 'observer'
… entity to reduce amount of abstractions
…on' [-Werror,-Wshadow-field]
… and InitializeQuorumConnections
force-pushed from 323c98a to 98399b4
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/llmq/quorumsman.h (1)
135-159: ⚠️ Potential issue | 🟡 Minor: Re-run clang-format on the new declarations.
CI is already red here; these updated declarations and virtual signatures do not match the repository formatting yet.
Also applies to: 235-256
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llmq/quorumsman.h` around lines 135 - 159, Re-run clang-format on the updated declarations in quorumsman.h and fix formatting to match the repository style for the following declarations: RequestQuorumData, GetQuorum, both ScanQuorums overloads, IsMasternode, IsWatching, IsDataRequestPending, GetDataRequestStatus, CleanupExpiredDataRequests, and CleanupOldQuorumData; ensure spacing, line breaks, parameter wrapping and the placement/spacing of the EXCLUSIVE_LOCKS_REQUIRED annotations match existing code style (and reformat the other affected virtual/signature blocks in the file the same way).
♻️ Duplicate comments (4)
src/llmq/quorumsman.cpp (2)
772-775: ⚠️ Potential issue | 🟠 Major: Don't make the base destructor the first place that stops the quorum pool.
`QuorumRole::~QuorumRole()` runs after derived destructors, so it is too late to be the primary shutdown point for queued jobs. `Stop()` needs to happen before `ObserverContext`/`ActiveContext` tear down their own state; the base destructor should only verify that shutdown already happened.
Please verify that derived teardown already stops the shared pool before member destruction and that the shared pool still runs derived work. Expected: an explicit early `Stop()` in derived destructors or owners; otherwise this remains a shutdown-time lifetime hazard.

#!/bin/bash
set -euo pipefail
echo "Base shutdown path:"
rg -n -C2 'QuorumRole::~QuorumRole|void QuorumRole::Stop|void QuorumRole::Start' src/llmq/quorumsman.cpp
echo
echo "Derived destructors / explicit Stop() calls:"
rg -n -C2 '(ActiveContext|ObserverContext)::(~|Stop\()' src/active src/llmq
echo
echo "Shared quorum-pool jobs:"
rg -n -C3 'workerPool\.push|TriggerQuorumDataRecoveryThreads|StartDataRecoveryThread|StartVvecSyncThread' src/active src/llmq

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llmq/quorumsman.cpp` around lines 772 - 775, The base destructor QuorumRole::~QuorumRole currently calls Stop() too late (after derived destructors run); move the primary shutdown of the shared worker pool out of the base destructor and ensure derived classes (e.g., ObserverContext and ActiveContext) explicitly call Stop() early in their own destructors or owners perform an early Stop() before tearing down derived members; update or add explicit Stop() calls in ActiveContext::~ActiveContext and ObserverContext::~ObserverContext (or their owning shutdown paths) so the workerPool and methods like TriggerQuorumDataRecoveryThreads/StartDataRecoveryThread/StartVvecSyncThread are stopped before derived member destruction, and keep QuorumRole::~QuorumRole as a defensive check that verifies shutdown already occurred rather than performing it.
917-923: ⚠️ Potential issue | 🟠 Major: Pending requests shouldn't burn recovery attempts.
`nTries` is incremented before `IsDataRequestPending()`, and the `continue` skips the normal 1-second backoff. A few outstanding requests can make this loop reach "All tried but failed" without actually giving any of them time to complete.
Proposed fix

- pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries++) % vecMemberHashes.size()];
+ pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries) % vecMemberHashes.size()];
  if (m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash, pQuorum->qc->llmqType)) {
      printLog("Already asked");
+     quorumThreadInterrupt.sleep_for(std::chrono::seconds(1));
      continue;
  }
+ ++nTries;
  quorumThreadInterrupt.sleep_for(std::chrono::milliseconds(start_offset * 100));

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llmq/quorumsman.cpp` around lines 917 - 923, The loop currently increments nTries before calling m_qman.IsDataRequestPending and then uses continue, which consumes an attempt without waiting; to fix, compute the candidate member index using a temporary variable (e.g., idx = (start_offset + nTries) % vecMemberHashes.size()) and call IsDataRequestPending with that candidate without incrementing nTries, then only increment nTries when you actually perform an attempt (after passing the pending check) or otherwise perform the 1-second backoff before continuing; adjust quorumThreadInterrupt.sleep_for usage so the pending-case flows through the same backoff path instead of skipping it.

src/llmq/quorumsman.h (1)
75-75: ⚠️ Potential issue | 🔴 Critical: Forward-declare `QuorumRole` before `CQuorumManager`.
Line 75 and Line 111 use `llmq::QuorumRole` before the class is declared at Line 197, so this header is no longer self-contained and can fail to parse when included on its own. Add a forward declaration next to the other `llmq` forward declarations before `CQuorumManager`.
Minimal fix

  class CDKGSessionManager;
  class CQuorumBlockProcessor;
  class CQuorumSnapshotManager;
+ class QuorumRole;

Also applies to: 111-111
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llmq/quorumsman.h` at line 75, Add a forward declaration for llmq::QuorumRole before the declaration of CQuorumManager so the header is self-contained: locate the block with the other llmq forward declarations (where CQuorumManager is declared) and add "namespace llmq { class QuorumRole; }" (or equivalent) so members like llmq::QuorumRole* m_handler and any use at lines referencing QuorumRole compile without needing the full class definition.

src/active/context.cpp (1)
291-294: ⚠️ Potential issue | 🟠 Major: Stop the recovery pool before these jobs can outlive `ActiveContext`.
Lines 291-294 capture `this` and dereference `nodeman`, but `workerPool` now lives in `QuorumRole`. If destruction ever happens without an earlier explicit `Stop()`, this lambda can run after `ActiveContext` members start tearing down. Make shutdown self-enforcing here by stopping the pool before member destruction, or assert that shutdown already happened.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/active/context.cpp` around lines 291 - 294, The lambda pushed into workerPool captures this and dereferences nodeman, which can outlive ActiveContext if the pool is owned by QuorumRole; ensure shutdown is enforced by either stopping the pool before any ActiveContext members are destroyed or asserting shutdown has already occurred. Modify the code around workerPool.push (the lambda that calls DataRecoveryThread with pQuorum/pIndex and nodeman->GetProTxHash()) to first call or verify Stop() (or an equivalent self-enforcing shutdown on workerPool) or add an assertion that ActiveContext has already been stopped, so the lambda cannot run after ActiveContext destruction; update any related lifetimes or comments accordingly to reference workerPool, Stop(), ActiveContext, QuorumRole, nodeman, and DataRecoveryThread.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/active/context.cpp`:
- Around line 153-154: Reformat the function signature and wrapped lines for
ActiveContext::GetQuorumRecoveryStartOffset and the nearby blocks mentioned (the
function headers and wrapped calls/conditionals in the chunks around lines
191-193, 203-205, and 241-330) with clang-format so they match the project's
style; run clang-format (or the repo's formatting command) on
src/active/context.cpp to reflow the long signatures, wrapped calls, and
conditionals and commit the updated formatting ensuring the signatures like
ActiveContext::GetQuorumRecoveryStartOffset are a single consistent style as
enforced by CI.
In `@src/llmq/observer.cpp`:
- Around line 57-62: ObserverContext::InitializeCurrentBlockTip is calling
llmq::QuorumRole::InitializeQuorumConnections(tip) after calling
UpdatedBlockTip(tip, nullptr, ibd), which duplicates the quorum init logic and
bypasses the IBD guard inside QuorumRole::UpdatedBlockTip; remove the explicit
call to llmq::QuorumRole::InitializeQuorumConnections(tip) from
InitializeCurrentBlockTip so quorum connections are only initialized via
UpdatedBlockTip/QuorumRole::UpdatedBlockTip (which already checks ibd) and let
the existing UpdatedBlockTip flow handle connection setup.
- Around line 44-47: The code currently sets ENCRYPTED_CONTRIBUTIONS_MISSING and
returns MisbehavingError when request_limit_exceeded is true, which penalizes
peers for a condition observers can't serve; change the branch in observer.cpp
so that after
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING)
you always return MessageProcessingResult{} (no MisbehavingError) regardless of
request_limit_exceeded, matching CQuorumManager::sendQDATA() behavior and
avoiding misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING.
---
Outside diff comments:
In `@src/llmq/quorumsman.h`:
- Around line 135-159: Re-run clang-format on the updated declarations in
quorumsman.h and fix formatting to match the repository style for the following
declarations: RequestQuorumData, GetQuorum, both ScanQuorums overloads,
IsMasternode, IsWatching, IsDataRequestPending, GetDataRequestStatus,
CleanupExpiredDataRequests, and CleanupOldQuorumData; ensure spacing, line
breaks, parameter wrapping and the placement/spacing of the
EXCLUSIVE_LOCKS_REQUIRED annotations match existing code style (and reformat the
other affected virtual/signature blocks in the file the same way).
---
Duplicate comments:
In `@src/active/context.cpp`:
- Around line 291-294: The lambda pushed into workerPool captures this and
dereferences nodeman, which can outlive ActiveContext if the pool is owned by
QuorumRole; ensure shutdown is enforced by either stopping the pool before any
ActiveContext members are destroyed or asserting shutdown has already occurred.
Modify the code around workerPool.push (the lambda that calls DataRecoveryThread
with pQuorum/pIndex and nodeman->GetProTxHash()) to first call or verify Stop()
(or an equivalent self-enforcing shutdown on workerPool) or add an assertion
that ActiveContext has already been stopped, so the lambda cannot run after
ActiveContext destruction; update any related lifetimes or comments accordingly
to reference workerPool, Stop(), ActiveContext, QuorumRole, nodeman, and
DataRecoveryThread.
In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The base destructor QuorumRole::~QuorumRole currently
calls Stop() too late (after derived destructors run); move the primary shutdown
of the shared worker pool out of the base destructor and ensure derived classes
(e.g., ObserverContext and ActiveContext) explicitly call Stop() early in their
own destructors or owners perform an early Stop() before tearing down derived
members; update or add explicit Stop() calls in ActiveContext::~ActiveContext
and ObserverContext::~ObserverContext (or their owning shutdown paths) so the
workerPool and methods like
TriggerQuorumDataRecoveryThreads/StartDataRecoveryThread/StartVvecSyncThread are
stopped before derived member destruction, and keep QuorumRole::~QuorumRole as a
defensive check that verifies shutdown already occurred rather than performing
it.
- Around line 917-923: The loop currently increments nTries before calling
m_qman.IsDataRequestPending and then uses continue, which consumes an attempt
without waiting; to fix, compute the candidate member index using a temporary
variable (e.g., idx = (start_offset + nTries) % vecMemberHashes.size()) and call
IsDataRequestPending with that candidate without incrementing nTries, then only
increment nTries when you actually perform an attempt (after passing the pending
check) or otherwise perform the 1-second backoff before continuing; adjust
quorumThreadInterrupt.sleep_for usage so the pending-case flows through the same
backoff path instead of skipping it.
In `@src/llmq/quorumsman.h`:
- Line 75: Add a forward declaration for llmq::QuorumRole before the declaration
of CQuorumManager so the header is self-contained: locate the block with the
other llmq forward declarations (where CQuorumManager is declared) and add
"namespace llmq { class QuorumRole; }" (or equivalent) so members like
llmq::QuorumRole* m_handler and any use at lines referencing QuorumRole compile
without needing the full class definition.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: abc2aee3-2464-4810-95b6-71fe1dde4c97
📒 Files selected for processing (21)
- src/Makefile.am
- src/active/context.cpp
- src/active/context.h
- src/active/dkgsession.cpp
- src/active/dkgsessionhandler.cpp
- src/active/quorums.cpp
- src/active/quorums.h
- src/init.cpp
- src/llmq/dkgsession.cpp
- src/llmq/dkgsession.h
- src/llmq/observer.cpp
- src/llmq/observer.h
- src/llmq/observer/quorums.cpp
- src/llmq/observer/quorums.h
- src/llmq/quorums.h
- src/llmq/quorumsman.cpp
- src/llmq/quorumsman.h
- src/net_processing.cpp
- src/node/context.cpp
- src/rpc/quorums.cpp
- test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
- test/util/data/non-backported.txt
- src/llmq/observer/quorums.h
- src/llmq/observer/quorums.cpp
- src/active/quorums.h
- src/active/quorums.cpp
✅ Files skipped from review due to trivial changes (3)
- src/rpc/quorums.cpp
- src/node/context.cpp
- src/net_processing.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
- src/active/dkgsession.cpp
- src/init.cpp
- src/llmq/quorums.h
size_t ActiveContext::GetQuorumRecoveryStartOffset(const llmq::CQuorum& quorum,
                                                   gsl::not_null<const CBlockIndex*> pIndex) const
Re-run clang-format on this block.
The diff-format job is already failing on these signatures, wrapped calls, and conditionals, so this chunk will keep CI red until it is reformatted.
Also applies to: 191-193, 203-205, 241-330
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/active/context.cpp` around lines 153 - 154, Reformat the function
signature and wrapped lines for ActiveContext::GetQuorumRecoveryStartOffset and
the nearby blocks mentioned (the function headers and wrapped calls/conditionals
in the chunks around lines 191-193, 203-205, and 241-330) with clang-format so
they match the project's style; run clang-format (or the repo's formatting
command) on src/active/context.cpp to reflow the long signatures, wrapped calls,
and conditionals and commit the updated formatting ensuring the signatures like
ActiveContext::GetQuorumRecoveryStartOffset are a single consistent style as
enforced by CI.
if (request.GetDataMask() & CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
    request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING);
    return request_limit_exceeded ? MisbehavingError{25, "request limit exceeded"} : MessageProcessingResult{};
}
Don't punish retries for data that observers can never serve.
CQuorumManager::sendQDATA() explicitly skips misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING, but this branch returns MisbehavingError for the same case when the request repeats. That turns a benign "data not available here" response into a peer penalty.
Proposed fix
if (request.GetDataMask() & CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING);
- return request_limit_exceeded ? MisbehavingError{25, "request limit exceeded"} : MessageProcessingResult{};
+ return {};
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmq/observer.cpp` around lines 44 - 47, The code currently sets
ENCRYPTED_CONTRIBUTIONS_MISSING and returns MisbehavingError when
request_limit_exceeded is true, which penalizes peers for a condition observers
can't serve; change the branch in observer.cpp so that after
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING)
you always return MessageProcessingResult{} (no MisbehavingError) regardless of
request_limit_exceeded, matching CQuorumManager::sendQDATA() behavior and
avoiding misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING.
void ObserverContext::InitializeCurrentBlockTip(const CBlockIndex* tip, bool ibd)
{
    UpdatedBlockTip(tip, nullptr, ibd);
    if (tip) {
-       qman_handler->InitializeQuorumConnections(tip);
+       llmq::QuorumRole::InitializeQuorumConnections(tip);
    }
This duplicates quorum connection setup and bypasses the IBD guard.
UpdatedBlockTip(tip, nullptr, ibd) already reaches QuorumRole::UpdatedBlockTip(), which initializes quorum connections when appropriate. The extra InitializeQuorumConnections(tip) repeats that work, and when ibd is true it still opens quorum connections on startup.
Proposed fix
void ObserverContext::InitializeCurrentBlockTip(const CBlockIndex* tip, bool ibd)
{
    UpdatedBlockTip(tip, nullptr, ibd);
-   if (tip) {
-       llmq::QuorumRole::InitializeQuorumConnections(tip);
-   }
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmq/observer.cpp` around lines 57 - 62,
ObserverContext::InitializeCurrentBlockTip is calling
llmq::QuorumRole::InitializeQuorumConnections(tip) after calling
UpdatedBlockTip(tip, nullptr, ibd), which duplicates the quorum init logic and
bypasses the IBD guard inside QuorumRole::UpdatedBlockTip; remove the explicit
call to llmq::QuorumRole::InitializeQuorumConnections(tip) from
InitializeCurrentBlockTip so quorum connections are only initialized via
UpdatedBlockTip/QuorumRole::UpdatedBlockTip (which already checks ibd) and let
the existing UpdatedBlockTip flow handle connection setup.
thepastaclaw left a comment
Code Review
The incremental push rebases the original refactoring on top of three merged bugfix/test PRs (#7240, #7241, #7252) and adds two new clean refactoring commits: de-duplicating UpdatedBlockTip/InitializeQuorumConnections code and simplifying the observer_ctx guard in init.cpp. The two prior review findings (protected visibility for Template Method hooks, worker pool lifetime safety) remain unaddressed.
Reviewed commit: 98399b4
🟡 2 suggestion(s)
1 additional finding
🟡 suggestion: Worker pool tasks can outlive ActiveContext members on abnormal destruction
src/active/context.cpp (lines 76-80)
Worker pool tasks launched by StartDataRecoveryThread capture this (ActiveContext*) and dereference nodeman->GetProTxHash(). The worker pool Stop() lives in ~QuorumRole(), but C++ destruction order destroys ActiveContext members (including nodeman) before ~QuorumRole() runs. The normal shutdown path calls ActiveContext::Stop() first, but if destruction occurs without an explicit Stop() (e.g., exception during initialization), in-flight tasks could use a destroyed nodeman. Adding QuorumRole::Stop() in ~ActiveContext() would make the contract self-enforcing. (Same finding as prior review — not yet addressed.)
💡 Suggested change
ActiveContext::~ActiveContext()
{
    // Ensure worker pool is stopped before our members are destroyed,
    // since tasks may capture `this` and dereference nodeman.
    llmq::QuorumRole::Stop();
    m_qman.DisconnectManagers();
    m_isman.DisconnectSigner();
}
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 245-250: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads are still public — should be protected
These two virtual methods are Template Method hooks — they are only ever called from within the QuorumRole hierarchy (by `InitializeQuorumConnections` and `UpdatedBlockTip`). Exposing them as `public` allows external callers to invoke per-LLMQ-type methods directly, bypassing the intended `InitializeQuorumConnections` entry point which iterates all types. In `ActiveContext::context.h` the overrides are correctly declared in the `protected` section, confirming the intended visibility. (Same finding as prior review — not yet addressed.)
In `src/active/context.cpp`:
- [SUGGESTION] lines 76-80: Worker pool tasks can outlive ActiveContext members on abnormal destruction
Worker pool tasks launched by `StartDataRecoveryThread` capture `this` (ActiveContext*) and dereference `nodeman->GetProTxHash()`. The worker pool `Stop()` lives in `~QuorumRole()`, but C++ destruction order destroys ActiveContext members (including `nodeman`) *before* `~QuorumRole()` runs. The normal shutdown path calls `ActiveContext::Stop()` first, but if destruction occurs without an explicit `Stop()` (e.g., exception during initialization), in-flight tasks could use a destroyed `nodeman`. Adding `QuorumRole::Stop()` in `~ActiveContext()` would make the contract self-enforcing. (Same finding as prior review — not yet addressed.)
virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
                                    gsl::not_null<const CBlockIndex*> pindexNew) const;

//! Observer default: triggers vvec sync threads only.
//! QuorumParticipant overrides to also recover sk shares for member quorums.
virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;
🟡 Suggestion: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads are still public — should be protected
These two virtual methods are Template Method hooks — they are only ever called from within the QuorumRole hierarchy (by InitializeQuorumConnections and UpdatedBlockTip). Exposing them as public allows external callers to invoke per-LLMQ-type methods directly, bypassing the intended InitializeQuorumConnections entry point which iterates all types. In ActiveContext::context.h the overrides are correctly declared in the protected section, confirming the intended visibility. (Same finding as prior review — not yet addressed.)
💡 Suggested change
protected:
    //! Observer default: connects quorum peers with is_masternode=false.
    //! QuorumParticipant overrides with masternode-aware connection logic.
    virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
                                        gsl::not_null<const CBlockIndex*> pindexNew) const;

    //! Observer default: triggers vvec sync threads only.
    //! QuorumParticipant overrides to also recover sk shares for member quorums.
    virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;
source: ['claude']
🤖 Fix this with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 245-250: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads are still public — should be protected
These two virtual methods are Template Method hooks — they are only ever called from within the QuorumRole hierarchy (by `InitializeQuorumConnections` and `UpdatedBlockTip`). Exposing them as `public` allows external callers to invoke per-LLMQ-type methods directly, bypassing the intended `InitializeQuorumConnections` entry point which iterates all types. In `ActiveContext::context.h` the overrides are correctly declared in the `protected` section, confirming the intended visibility. (Same finding as prior review — not yet addressed.)
I have a couple of changes to add to this PR; please hold off on the review. |
Issue being fixed or feature implemented
It's a follow-up for #7066 and related PRs
This PR reduces the mental effort required to read quorum-related code by significantly cutting the number of abstraction levels and the amount of classes inheriting from each other.
Performance is not expected to improve by any noticeable margin.
Compilation time is expected to improve marginally, since there are four fewer files to compile.
What was done?
These changes were spotted during #7234 and partially done as part of a local branch, but this PR is currently not a direct dependency of #7234.
How Has This Been Tested?
Run unit & functional tests
Breaking Changes
N/A
Checklist: