
feat(metrics): add block transaction count and SR set change monitoring#6624

Open
warku123 wants to merge 10 commits into tronprotocol:develop from warku123:metric_modify_clean

Conversation

@warku123 warku123 commented Apr 1, 2026

What does this PR do?

Add two new metrics enhancements for better network monitoring:

  1. Block Transaction Count Histogram

    • Replace tron:block_empty_total counter with tron:block_transaction_count histogram
    • Track transaction count distribution per block with buckets: [0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
    • Empty blocks can be monitored via le=0.0 bucket
  2. SR Set Change Monitoring

    • Add tron:sr_set_change counter to track SR membership changes
    • Detect SR additions and removals at each maintenance time interval
    • Labels: action (add/remove), witness (SR address)
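For illustration, assuming Prometheus's standard exposition conventions for histograms and counters (the `_bucket`/`_count` suffixes below follow those conventions and are not confirmed by this PR), typical queries over these metrics could look like:

```promql
# Share of empty blocks over the last hour, via the le="0.0" bucket
sum(rate(tron:block_transaction_count_bucket{le="0.0"}[1h]))
  / sum(rate(tron:block_transaction_count_count[1h]))

# Approximate P99 transactions per block over the last hour
histogram_quantile(0.99, sum(rate(tron:block_transaction_count_bucket[1h])) by (le))

# SR additions/removals over the last maintenance interval (6h)
sum(increase(tron:sr_set_change[6h])) by (action, witness)
```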

Changes:

  • Add overloaded init() method in MetricsHistogram for custom buckets
  • Add BLOCK_TRANSACTION_COUNT histogram metric
  • Add SR_SET_CHANGE counter with SR_ADD/SR_REMOVE labels
  • Implement SR set change detection logic in BlockChainMetricManager
  • Update PrometheusApiServiceTest with comprehensive test coverage

Why are these changes required?

  • The transaction histogram provides richer insights than a simple counter for network analysis, while maintaining empty-block detection capability
  • SR set change monitoring enables tracking of network consensus participant changes for governance analysis

Compared to #6602, this PR provides a cleaner implementation focused solely on metrics enhancement without additional complexity.

This PR has been tested by:

  • Unit Tests (updated PrometheusApiServiceTest)
  • Manual Testing

Follow up

N/A

Extra details

…monitoring

Replace the dedicated tron:block_empty_total counter with a more comprehensive
tron:block_transaction_count histogram that tracks the distribution of
transaction counts per block.

Changes:
- Add overloaded init() method in MetricsHistogram to support custom buckets
- Add BLOCK_TRANSACTION_COUNT histogram with buckets [0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000]
- Record transaction count for all blocks (including empty blocks with txCount=0)
- Empty blocks can be queried via bucket le=0.0
- Remove unused BLOCK_EMPTY counter

This provides richer insights for network analysis while still supporting
empty block monitoring via histogram bucket queries.
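As a minimal stdlib-only sketch (not java-tron code; class and method names are illustrative), Prometheus-style cumulative buckets with the boundaries above place every empty block in the le=0.0 bucket, which is what makes the le="0.0" query work:

```java
import java.util.Arrays;

// Illustrative sketch of cumulative histogram buckets (Prometheus-style).
// Bucket boundaries mirror the PR description; this is not java-tron code.
public class BucketSketch {
  static final double[] BOUNDS = {0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000};

  // Returns cumulative bucket counts (one per bound, plus a final +Inf slot).
  static long[] observeAll(int[] txCounts) {
    long[] cumulative = new long[BOUNDS.length + 1];
    for (int txCount : txCounts) {
      for (int i = 0; i < BOUNDS.length; i++) {
        if (txCount <= BOUNDS[i]) {
          cumulative[i]++;          // a sample lands in every bucket with le >= value
        }
      }
      cumulative[BOUNDS.length]++;  // the +Inf bucket counts every sample
    }
    return cumulative;
  }

  public static void main(String[] args) {
    long[] c = observeAll(new int[] {0, 0, 7, 120, 3000});
    System.out.println("le=0.0 bucket (empty blocks): " + c[0]);        // 2
    System.out.println("+Inf bucket (all blocks): " + c[BOUNDS.length]); // 5
  }
}
```

Because buckets are cumulative, the empty-block count is simply the le=0.0 bucket value, with no dedicated counter needed.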

Closes tronprotocol#6590
warku123 changed the title from "feat(metrics): add block transaction count histogram for empty block monitoring" to "feat(metrics): add block transaction count and SR set change monitoring" on Apr 1, 2026
warku123 added a commit to warku123/java-tron that referenced this pull request Apr 2, 2026
Replace duplicated miner string literal with MINER_LABEL constant
to fix SonarQube code smell.

Relates tronprotocol#6624
warku123 added a commit to warku123/java-tron that referenced this pull request Apr 2, 2026
- Replace duplicated miner string literal with MINER_LABEL constant
  in PrometheusApiServiceTest to fix SonarQube code smell
- Add //NOSONAR comment for new BLOCK_TRANSACTION_COUNT histogram
  label name in MetricsHistogram

Relates tronprotocol#6624
warku123 force-pushed the metric_modify_clean branch from 4433c52 to 93d0083 (April 2, 2026 03:44)
warku123 added a commit to warku123/java-tron that referenced this pull request Apr 2, 2026
Extract repeated 'miner' string literal into MINER_LABEL constant
in MetricsHistogram to follow DRY principle and fix SonarQube warning.

Also update PrometheusApiServiceTest to use MINER_LABEL constant
for test assertions.

Relates tronprotocol#6624
warku123 force-pushed the metric_modify_clean branch from 93d0083 to 6315ffc (April 2, 2026 03:54)
init(MetricKeys.Counter.TXS, "tron txs info .", "type", "detail");
init(MetricKeys.Counter.MINER, "tron miner info .", "miner", "type");
init(MetricKeys.Counter.BLOCK_FORK, "tron block fork info .", "type");
init(MetricKeys.Counter.SR_SET_CHANGE, "tron sr set change .", "action", "witness");
Collaborator


The witness label uses addresses. Although the Active SR pool is relatively limited (27 slots), as SRs rotate in and out over time, label values will accumulate monotonically — they only grow, never shrink. This is a known Prometheus anti-pattern.

Practical impact assessment:

  • TRON's SR candidate pool is not infinite, and rotation frequency is low, so it won't explode in the short term

  • However, as a long-running node metric, months or years of operation could still accumulate a significant number of time series

Possible approaches:

  • If the address dimension is truly needed, keep the current design, but document this characteristic

  • Or remove the witness label entirely, keeping only action (add/remove), and output address details via logs instead. Follow the principle that Prometheus should be used for alerting and awareness, while logs are for precise investigation and root-cause analysis.

Author


Thanks for raising the high-cardinality concern — it's a valid Prometheus best practice to watch for.

However, I believe the witness label is acceptable here given TRON's specific constraints:

  1. Bounded candidate pool: The SR candidate pool is finite. Even over years of operation, the total number of distinct addresses that have ever held an SR slot has a practical ceiling — far from the
    unbounded growth typical of high-cardinality anti-patterns.
  2. Low rotation frequency: SR set changes only occur at maintenance intervals (every 6 hours), so new label values are introduced very infrequently.
  3. Diagnostic value: The primary purpose of this metric is to identify which SR was added or removed. Dropping the witness label would reduce the metric to just "something changed", requiring operators to
    cross-reference logs — losing the at-a-glance visibility in Grafana dashboards that makes this metric useful in the first place.

In summary, the worst-case time series count is roughly candidate_pool_size × 2 (add/remove), which stays well within Prometheus's comfortable operating range.

That said, I agree it's worth documenting this characteristic. Could you suggest what form you'd prefer — a code comment at the metric registration site, a note in the PR description, or something else?

Collaborator


I'd prefer a code comment at the metric registration site.

- Extract MINER_LABEL to MetricLabels.Histogram.MINER as shared constant
- Use int instead of double for block count in tests
- Store witness addresses as base58 directly, removing unnecessary hex round-trip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@warku123 warku123 requested a review from halibobo1205 April 9, 2026 02:46
MetricLabels.Counter.TXS_SUCCESS, MetricLabels.Counter.TXS_SUCCESS);
}
// Record transaction count distribution for all blocks (including empty blocks)
int txCount = block.getTransactions().size();
Collaborator


L174, L175, and L179 can use txCount.

@halibobo1205
Collaborator

Two questions regarding observability adoption of the new metrics:

  1. PromQL query guide: Could you provide some common query examples for the new tron:block_transaction_count histogram and tron:sr_set_change counter? For instance:

    • How to query empty block rate via the le="0.0" bucket
    • How to calculate P50/P99 transaction counts over a given time window
    • How to query SR membership changes from the most recent maintenance cycle
  2. Grafana dashboard examples: Are there any accompanying Grafana dashboard JSONs or screenshots? If so, it would be helpful to include them in the PR description or documentation, so node operators can import and use them directly, reducing the onboarding effort.

}
// Record transaction count distribution for all blocks (including empty blocks)
int txCount = block.getTransactions().size();
Metrics.histogramObserve(MetricKeys.Histogram.BLOCK_TRANSACTION_COUNT, txCount,
Collaborator


metricsService.applyBlock(block) runs before dbManager.pushBlock(block) in TronNetDelegate.processBlock(), so this histogram is updated for blocks that are later rejected or only live on a fork. For an empty-block monitoring metric, that produces false positives per witness and lets invalid peer traffic skew the distribution. This should be recorded only after the block has been applied successfully.

Author


Good catch. I actually noticed this ordering issue during my investigation — metricsService.applyBlock() is called before dbManager.pushBlock(), meaning all metrics in BlockChainMetricManager.applyBlock() are recorded for blocks that may later be rejected.

However, this is not specific to the histogram I added — the existing metrics in the same method (e.g., MINER_LATENCY, TXS, NET_LATENCY, duplicate witness detection) all share the same problem. I chose to keep the same pattern to maintain architectural consistency and avoid expanding the scope of this PR.

If you agree, I'd like to open a separate issue to move metricsService.applyBlock() after dbManager.pushBlock() succeeds in TronNetDelegate.processBlock(), which would fix the timing problem for all metrics at once.
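As a stdlib-only sketch of that proposed ordering (all names here are illustrative, not java-tron APIs): metrics are recorded only after the block is accepted, so rejected or fork-only blocks never skew the histogram.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the ordering fix discussed above: observe the
// histogram only after a successful push, never for rejected blocks.
public class OrderingSketch {
  final List<Integer> recordedTxCounts = new ArrayList<>();

  // Stand-in for dbManager.pushBlock(): returns false for rejected blocks.
  boolean pushBlock(int txCount, boolean valid) {
    return valid;
  }

  void processBlock(int txCount, boolean valid) {
    if (pushBlock(txCount, valid)) {  // apply the block first
      recordedTxCounts.add(txCount);  // then record metrics for accepted blocks only
    }
  }

  public static void main(String[] args) {
    OrderingSketch s = new OrderingSketch();
    s.processBlock(0, true);    // accepted empty block: recorded
    s.processBlock(42, false);  // rejected block: not recorded
    System.out.println(s.recordedTxCounts); // [0]
  }
}
```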

StringUtil.encode58Check(address));

// SR set change detection
long nextMaintenanceTime = dbManager.getDynamicPropertiesStore().getNextMaintenanceTime();
Collaborator


The SR-set change detection is reading nextMaintenanceTime and getActiveWitnesses() before the current block is applied. In the real pipeline those values are updated during dbManager.pushBlock() / consensus.applyBlock(), so on the maintenance block this branch still sees the old schedule and old maintenance timestamp. That makes tron:sr_set_change show up one block late, and the added test masks the problem by mutating both stores manually before generating the next block.

Author


Same root cause as above — since metricsService.applyBlock() runs before dbManager.pushBlock(), the maintenance time and active witness list haven't been updated yet at the point of reading.

In fact, this is precisely why the SR-set change detection relies on nextMaintenanceTime change as the trigger — because getActiveWitnesses() still returns the old schedule at this point, comparing against the previous nextMaintenanceTime is the only way to detect that a maintenance cycle has occurred and capture the witness list transition.

This would be addressed together in that follow-up issue: once metricsService.applyBlock() is moved after a successful pushBlock(), the SR-set change detection will naturally read the up-to-date state without needing this workaround.
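For reference, the set-diff at the heart of that detection can be sketched with stdlib collections (illustrative only, not java-tron code): once a maintenance boundary is detected, additions are the new set minus the old, and removals are the old set minus the new.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of SR-set diffing: on a maintenance-time change,
// compare the previous and current active-witness sets and emit
// add/remove events per address. Not java-tron code.
public class SrDiffSketch {
  // Elements present in `to` but absent from `from`.
  static Set<String> diff(Set<String> from, Set<String> to) {
    Set<String> out = new TreeSet<>(to);
    out.removeAll(from);
    return out;
  }

  public static void main(String[] args) {
    Set<String> prev = new TreeSet<>(Set.of("SR_A", "SR_B", "SR_C"));
    Set<String> curr = new TreeSet<>(Set.of("SR_B", "SR_C", "SR_D"));
    System.out.println("added=" + diff(prev, curr));   // [SR_D]  -> action="add"
    System.out.println("removed=" + diff(curr, prev)); // [SR_A]  -> action="remove"
  }
}
```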

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

[Feature] Add Prometheus metrics for empty blocks and SR set changes

5 participants