feat(metrics): add block transaction count and SR set change monitoring #6624
warku123 wants to merge 10 commits into tronprotocol:develop
Conversation
…monitoring

Replace the dedicated tron:block_empty_total counter with a more comprehensive tron:block_transaction_count histogram that tracks the distribution of transaction counts per block.

Changes:
- Add overloaded init() method in MetricsHistogram to support custom buckets
- Add BLOCK_TRANSACTION_COUNT histogram with buckets [0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
- Record the transaction count for all blocks (including empty blocks with txCount=0)
- Empty blocks can be queried via the bucket le=0.0
- Remove the unused BLOCK_EMPTY counter

This provides richer insight for network analysis while still supporting empty-block monitoring via histogram bucket queries.

Closes tronprotocol#6590
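For readers less familiar with Prometheus histogram semantics: buckets are cumulative, so each le series counts observations less than or equal to its bound, which is why empty blocks remain queryable as the le="0.0" series. A minimal self-contained sketch of that bookkeeping (plain Java, illustrative only, not the project's MetricsHistogram):

```java
/** Toy cumulative-bucket histogram mirroring Prometheus semantics. */
public class TxCountHistogram {
  // Same upper bounds as the PR's BLOCK_TRANSACTION_COUNT histogram.
  static final double[] BUCKETS =
      {0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000};
  final long[] cumulative = new long[BUCKETS.length + 1]; // last slot = +Inf
  long count;

  /** Record one block's transaction count. */
  public void observe(int txCount) {
    count++;
    for (int i = 0; i < BUCKETS.length; i++) {
      if (txCount <= BUCKETS[i]) {
        cumulative[i]++; // cumulative: every bucket whose bound >= value
      }
    }
    cumulative[BUCKETS.length]++; // the +Inf bucket counts everything
  }

  /** Number of observations with value <= bound (the le="bound" series). */
  public long bucket(double bound) {
    for (int i = 0; i < BUCKETS.length; i++) {
      if (BUCKETS[i] == bound) {
        return cumulative[i];
      }
    }
    return cumulative[BUCKETS.length];
  }

  public static void main(String[] args) {
    TxCountHistogram h = new TxCountHistogram();
    h.observe(0);    // empty block
    h.observe(7);
    h.observe(120);
    // Empty blocks are exactly the le="0.0" bucket.
    System.out.println("empty=" + h.bucket(0) + " le10=" + h.bucket(10)
        + " total=" + h.count);  // prints "empty=1 le10=2 total=3"
  }
}
```

In PromQL terms, an empty-block query against this metric would select the le="0.0" bucket series of tron:block_transaction_count.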
Replace duplicated miner string literal with MINER_LABEL constant to fix SonarQube code smell. Relates tronprotocol#6624
- Replace duplicated miner string literal with MINER_LABEL constant in PrometheusApiServiceTest to fix SonarQube code smell
- Add //NOSONAR comment for new BLOCK_TRANSACTION_COUNT histogram label name in MetricsHistogram

Relates tronprotocol#6624
4433c52 to 93d0083
Extract repeated 'miner' string literal into MINER_LABEL constant in MetricsHistogram to follow DRY principle and fix SonarQube warning. Also update PrometheusApiServiceTest to use MINER_LABEL constant for test assertions. Relates tronprotocol#6624
93d0083 to 6315ffc
6315ffc to e92b8e2
```java
init(MetricKeys.Counter.TXS, "tron txs info .", "type", "detail");
init(MetricKeys.Counter.MINER, "tron miner info .", "miner", "type");
init(MetricKeys.Counter.BLOCK_FORK, "tron block fork info .", "type");
init(MetricKeys.Counter.SR_SET_CHANGE, "tron sr set change .", "action", "witness");
```
The witness label uses addresses. Although the Active SR pool is relatively limited (27 slots), as SRs rotate in and out over time, label values will accumulate monotonically — they only grow, never shrink. This is a known Prometheus anti-pattern.
Practical impact assessment:
- TRON's SR candidate pool is not infinite, and rotation frequency is low, so it won't explode in the short term.
- However, as a long-running node metric, months or years of operation could still accumulate a significant number of time series.
Possible approaches:
- If the address dimension is truly needed, keep the current design, but document this characteristic.
- Or remove the witness label entirely, keeping only action (add/remove), and output address details via logs instead. Follow the principle that Prometheus should be used for alerting and awareness, while logs are for precise investigation and root-cause analysis.
Thanks for raising the high-cardinality concern — it's a valid Prometheus best practice to watch for.
However, I believe the witness label is acceptable here given TRON's specific constraints:
- Bounded candidate pool: The SR candidate pool is finite. Even over years of operation, the total number of distinct addresses that have ever held an SR slot has a practical ceiling — far from the unbounded growth typical of high-cardinality anti-patterns.
- Low rotation frequency: SR set changes only occur at maintenance intervals (every 6 hours), so new label values are introduced very infrequently.
- Diagnostic value: The primary purpose of this metric is to identify which SR was added or removed. Dropping the witness label would reduce the metric to just "something changed", requiring operators to cross-reference logs — losing the at-a-glance visibility in Grafana dashboards that makes this metric useful in the first place.
In summary, the worst-case time series count is roughly candidate_pool_size × 2 (add/remove), which stays well within Prometheus's comfortable operating range.
That said, I agree it's worth documenting this characteristic. Could you suggest what form you'd prefer — a code comment at the metric registration site, a note in the PR description, or something else?
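The pool-size arithmetic above can be sanity-checked with a short simulation: with a bounded candidate pool, the set of distinct (witness, action) label pairs plateaus at pool_size * 2 no matter how many rotations occur. The pool size and cycle count below are made-up illustration values, not TRON's actual numbers:

```java
import java.util.HashSet;
import java.util.Set;

/** Back-of-the-envelope check of the series-ceiling argument. */
public class SeriesCeiling {
  /** Worst-case distinct time series: one per (candidate, action) pair. */
  public static int worstCaseSeries(int candidatePoolSize) {
    return candidatePoolSize * 2; // actions: add, remove
  }

  public static void main(String[] args) {
    int pool = 127; // hypothetical candidate pool size, for illustration
    Set<String> series = new HashSet<>();
    // Simulate many maintenance cycles: every candidate is eventually
    // both added and removed at least once.
    for (int cycle = 0; cycle < 100_000; cycle++) {
      String witness = "SR" + (cycle % pool);
      series.add(witness + "|add");
      series.add(witness + "|remove");
    }
    // The series count plateaus at the ceiling instead of growing forever.
    System.out.println(series.size() + " <= " + worstCaseSeries(pool));
  }
}
```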
I'd prefer a code comment at the metric registration site.
Resolved review threads:
- framework/src/test/java/org/tron/core/metrics/prometheus/PrometheusApiServiceTest.java (outdated)
- framework/src/test/java/org/tron/core/metrics/prometheus/PrometheusApiServiceTest.java
- framework/src/main/java/org/tron/core/metrics/blockchain/BlockChainMetricManager.java
- Extract MINER_LABEL to MetricLabels.Histogram.MINER as shared constant
- Use int instead of double for block count in tests
- Store witness addresses as base58 directly, removing unnecessary hex round-trip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```java
    MetricLabels.Counter.TXS_SUCCESS, MetricLabels.Counter.TXS_SUCCESS);
}
// Record transaction count distribution for all blocks (including empty blocks)
int txCount = block.getTransactions().size();
```
L174, L175, and L179 can use txCount.
Two questions regarding observability adoption of the new metrics:
```java
}
// Record transaction count distribution for all blocks (including empty blocks)
int txCount = block.getTransactions().size();
Metrics.histogramObserve(MetricKeys.Histogram.BLOCK_TRANSACTION_COUNT, txCount,
```
metricsService.applyBlock(block) runs before dbManager.pushBlock(block) in TronNetDelegate.processBlock(), so this histogram is updated for blocks that are later rejected or only live on a fork. For an empty-block monitoring metric, that produces false positives per witness and lets invalid peer traffic skew the distribution. This should be recorded only after the block has been applied successfully.
Good catch. I actually noticed this ordering issue during my investigation — metricsService.applyBlock() is called before dbManager.pushBlock(), meaning all metrics in BlockChainMetricManager.applyBlock() are recorded for blocks that may later be rejected.
However, this is not specific to the histogram I added — the existing metrics in the same method (e.g., MINER_LATENCY, TXS, NET_LATENCY, duplicate witness detection) all share the same problem. I chose to keep the same pattern to maintain architectural consistency and avoid expanding the scope of this PR.
If you agree, I'd like to open a separate issue to move metricsService.applyBlock() after dbManager.pushBlock() succeeds in TronNetDelegate.processBlock(), which would fix the timing problem for all metrics at once.
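The proposed reordering can be sketched as follows. The interfaces here are simplified stand-ins for TronNetDelegate's collaborators, not the real signatures; the point is only that metrics observation moves after a successful push, so rejected or fork-only blocks never reach the histogram:

```java
/** Sketch of the follow-up fix: record metrics only after pushBlock succeeds. */
public class ProcessBlockSketch {
  interface Db { void pushBlock(String block) throws Exception; }
  interface MetricsService { void applyBlock(String block); }

  static void processBlock(Db db, MetricsService metrics, String block)
      throws Exception {
    // Current order records metrics first, so blocks that are later
    // rejected still skew the distribution. Proposed order:
    db.pushBlock(block);       // throws if the block is rejected
    metrics.applyBlock(block); // reached only for accepted blocks
  }

  public static void main(String[] args) {
    StringBuilder recorded = new StringBuilder();
    Db db = b -> {
      if (b.startsWith("bad")) {
        throw new Exception("rejected"); // simulate validation failure
      }
    };
    MetricsService metrics = recorded::append;
    try { processBlock(db, metrics, "bad-block"); } catch (Exception ignored) { }
    try { processBlock(db, metrics, "good-block"); } catch (Exception ignored) { }
    System.out.println(recorded); // only the accepted block was observed
  }
}
```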
```java
    StringUtil.encode58Check(address));

// SR set change detection
long nextMaintenanceTime = dbManager.getDynamicPropertiesStore().getNextMaintenanceTime();
```
The SR-set change detection is reading nextMaintenanceTime and getActiveWitnesses() before the current block is applied. In the real pipeline those values are updated during dbManager.pushBlock() / consensus.applyBlock(), so on the maintenance block this branch still sees the old schedule and old maintenance timestamp. That makes tron:sr_set_change show up one block late, and the added test masks the problem by mutating both stores manually before generating the next block.
Same root cause as above — since metricsService.applyBlock() runs before dbManager.pushBlock(), the maintenance time and active witness list haven't been updated yet at the point of reading.
In fact, this is precisely why the SR-set change detection relies on nextMaintenanceTime change as the trigger — because getActiveWitnesses() still returns the old schedule at this point, comparing against the previous nextMaintenanceTime is the only way to detect that a maintenance cycle has occurred and capture the witness list transition.
This would be addressed together in that follow-up issue: once metricsService.applyBlock() is moved after a successful pushBlock(), the SR-set change detection will naturally read the up-to-date state without needing this workaround.
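For illustration, once the maintenance-time change fires, deriving the SR_ADD / SR_REMOVE events is a plain set difference between the previous and current witness lists. The names below are illustrative, not the actual BlockChainMetricManager code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Sketch: derive add/remove actions from two witness sets. */
public class SrSetDiff {
  static Map<String, String> diff(Set<String> previous, Set<String> current) {
    Map<String, String> actions = new LinkedHashMap<>();
    for (String w : current) {
      if (!previous.contains(w)) {
        actions.put(w, "add");    // would increment sr_set_change{action="add"}
      }
    }
    for (String w : previous) {
      if (!current.contains(w)) {
        actions.put(w, "remove"); // sr_set_change{action="remove"}
      }
    }
    return actions;
  }

  public static void main(String[] args) {
    // Hypothetical base58 witness names, for illustration only.
    Map<String, String> actions =
        diff(Set.of("TSr1", "TSr2", "TSr3"), Set.of("TSr1", "TSr3", "TSr4"));
    System.out.println(actions); // prints "{TSr4=add, TSr2=remove}"
  }
}
```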
What does this PR do?
Add two new metrics enhancements for better network monitoring:
Block Transaction Count Histogram
- Replace the tron:block_empty_total counter with a tron:block_transaction_count histogram
- Empty blocks can still be queried via the le=0.0 bucket

SR Set Change Monitoring
- Add a tron:sr_set_change counter to track SR membership changes
- Labels: action (add/remove), witness (SR address)

Changes:
- Add overloaded init() method in MetricsHistogram for custom buckets
- Add BLOCK_TRANSACTION_COUNT histogram metric
- Add SR_SET_CHANGE counter with SR_ADD/SR_REMOVE labels
- Update BlockChainMetricManager
- Update PrometheusApiServiceTest with comprehensive test coverage

Why are these changes required?
Compared to #6602, this PR provides a cleaner implementation focused solely on metrics enhancement without additional complexity.
This PR has been tested by:
- Unit tests (PrometheusApiServiceTest)

Follow up
N/A
Extra details