HDDS-13192. Adding pending delete bytes info in storage report of DN #8755

Closed
priyeshkaratha wants to merge 1 commit into apache:HDDS-13177 from priyeshkaratha:HDDS-13192

Conversation

@priyeshkaratha
Contributor

What changes were proposed in this pull request?

Accurate accounting of pending-delete bytes per container

  • Added incrPendingDeletions(long) and decrPendingDeletions(long) to ContainerData with a new pendingDeletionBytes field that is persisted in container metadata.
  • Integrated the counter with DeleteBlocksCommandHandler:
    • Introduced helper method incrPendingDeletionBytes() which sums the sizes of blocks in each DeletedBlocksTransaction and updates the container’s counter when blocks are logically marked for deletion.
    • When blocks are physically removed in deleteViaTransactionStore(), the counter is decremented accordingly.
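
To make the flow concrete, here is a minimal, self-contained sketch of such a counter. It is not the actual ContainerData / DeleteBlocksCommandHandler code from this PR; the class and helper names below are illustrative stand-ins only.

```java
// Simplified stand-in for the per-container counter described above.
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class PendingDeletionTracker {
  // Bytes logically marked for deletion but not yet reclaimed on disk.
  private final AtomicLong pendingDeletionBytes = new AtomicLong();

  // Called when a delete transaction marks blocks for deletion.
  void incrPendingDeletions(long bytes) {
    pendingDeletionBytes.addAndGet(bytes);
  }

  // Called when blocks are physically removed (e.g. deleteViaTransactionStore()).
  void decrPendingDeletions(long bytes) {
    pendingDeletionBytes.updateAndGet(v -> Math.max(0, v - bytes));
  }

  long getPendingDeletionBytes() {
    return pendingDeletionBytes.get();
  }

  // Mirrors the incrPendingDeletionBytes() helper: sum the sizes of the blocks
  // referenced by one transaction and bump the counter once.
  void onBlocksMarkedForDeletion(List<Long> blockSizes) {
    long total = blockSizes.stream().mapToLong(Long::longValue).sum();
    incrPendingDeletions(total);
  }
}
```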

Datanode-level aggregation & reporting

  • StorageReport now includes an aggregated pendingDeletionBytes value (sum across all containers).
  • Introduced a short-lived in-memory cache (default TTL: 10 minutes) for the aggregated value to avoid recomputing on every heartbeat, reducing CPU overhead.
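
A rough sketch of what such a TTL-guarded cache can look like, assuming a simple "recompute when older than the TTL" policy; HddsVolume's actual implementation may differ and computeAggregate below is a hypothetical supplier.

```java
// Illustrative TTL cache for the aggregated pending-deletion bytes.
import java.util.function.LongSupplier;

class CachedPendingDeletionBytes {
  private final LongSupplier computeAggregate; // sums the value across all containers
  private final long ttlMillis;                // default per the PR description: 10 minutes
  private volatile long cachedValue;
  private volatile long lastComputedMillis;    // 0 => never computed

  CachedPendingDeletionBytes(LongSupplier computeAggregate, long ttlMillis) {
    this.computeAggregate = computeAggregate;
    this.ttlMillis = ttlMillis;
  }

  synchronized long get() {
    long now = System.currentTimeMillis();
    if (now - lastComputedMillis > ttlMillis) {
      cachedValue = computeAggregate.getAsLong();
      lastComputedMillis = now;
    }
    return cachedValue;
  }
}
```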

Metrics & API exposure

  • Exposed pendingDeletionBytes as a Datanode metric (see the sketch after this list).
  • Extended:
    • Datanode-to-SCM heartbeat protobuf to include the new metric.
    • Recon REST API (/datanode) to surface the pending deletion bytes.
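
For the first bullet above, here is a hedged sketch of how a datanode-level gauge could be registered with the Hadoop metrics2 framework that Ozone already uses; the class, record, and metric names below are assumptions, not necessarily what this PR registers.

```java
// Sketch of exposing the aggregated value as a datanode metric via metrics2.
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

@Metrics(about = "Pending deletion bytes on this datanode", context = "dfs")
public class PendingDeletionMetrics {
  @Metric("Bytes waiting to be reclaimed by block deletion")
  private MutableGaugeLong pendingDeletionBytes;

  public static PendingDeletionMetrics create() {
    return DefaultMetricsSystem.instance()
        .register("PendingDeletionMetrics", "Pending deletion bytes",
            new PendingDeletionMetrics());
  }

  public void setPendingDeletionBytes(long bytes) {
    pendingDeletionBytes.set(bytes);
  }
}
```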

Upgrade path for existing clusters

  • For containers created before the upgrade:
    • During open, if pendingDeletionBytes is missing, it is computed using KeyValueContainerUtil.populateContainerMetadata() by scanning the block table (see the sketch after this list).
    • A background task reprocesses existing delete transactions and block metadata to compute accurate pending delete sizes.
  • Mixed-version compatibility:
    • Older DNs/SCMs gracefully ignore unknown fields.
    • Recon detects support and falls back to legacy logic if needed.
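
As a rough illustration of the open-time fallback mentioned above, the sketch below sums the sizes of blocks already marked for deletion; the Map-based "block table" and the "#deleting#" key prefix are simplified stand-ins for the real RocksDB schema.

```java
// Backfill computation for containers created before the upgrade.
import java.util.Map;

final class PendingDeletionBackfill {
  private static final String DELETING_PREFIX = "#deleting#";

  static long computePendingDeletionBytes(Map<String, Long> blockTable) {
    long total = 0;
    for (Map.Entry<String, Long> entry : blockTable.entrySet()) {
      if (entry.getKey().startsWith(DELETING_PREFIX)) {
        total += entry.getValue(); // value = block size in bytes
      }
    }
    return total;
  }

  private PendingDeletionBackfill() { }
}
```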

What is the link to the Apache JIRA

HDDS-13192

How was this patch tested?

Unit Tests

  • Verified counter updates via unit tests in ContainerData.
  • Added unit tests for DeleteBlocksCommandHandler flows that mark and delete blocks.
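
A hedged sketch of the kind of counter test described above, written against the simplified PendingDeletionTracker from the earlier sketch rather than the real ContainerData API.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class PendingDeletionTrackerTest {
  @Test
  void counterTracksMarkAndDelete() {
    PendingDeletionTracker tracker = new PendingDeletionTracker();
    tracker.incrPendingDeletions(4096);   // two blocks marked for deletion
    tracker.incrPendingDeletions(1024);
    assertEquals(5120, tracker.getPendingDeletionBytes());

    tracker.decrPendingDeletions(4096);   // one block physically removed
    assertEquals(1024, tracker.getPendingDeletionBytes());
  }
}
```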

Integration Testing

  • Pending

Upgrade & Backward Compatibility

  • No impacts

Performance Testing

  • Pending

priyeshkaratha marked this pull request as draft July 7, 2025 07:42
kerneltime requested a review from Copilot July 7, 2025 16:15
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces tracking and reporting of bytes pending deletion at the container and datanode levels. It updates container metadata handling, deletion command logic, storage reports, metrics, APIs, and relevant tests across Recon, SCM, and container-service modules.

  • Adds pendingDeletionBytes counters in container data and integrates them into delete command handlers.
  • Extends Datanode and storage‐location reports, SCM heartbeats, Recon endpoints, and metrics to include pending‐deletion bytes.
  • Implements a short-lived cache in HddsVolume for aggregated pending‐deletion bytes and updates unit tests accordingly.

Reviewed Changes

Copilot reviewed 39 out of 39 changed files in this pull request and generated 2 comments.

Summary per file:

  • recon/src/main/java/.../DatanodeStorageReport.java: Added pendingDeletions field, constructor arg, and getter.
  • recon/src/main/java/.../NodeEndpoint.java & ClusterStateEndpoint.java: Propagate new field into storage-report construction.
  • recon/src/test/java/.../TestNSSummaryEndpoint*.java: Updated mocks to pass extra constructor parameter.
  • server-scm/.../SCMNodeStat.java, SCMNodeMetric.java & NodeStat.java: Added pendingDeletions metric in stat/metric APIs.
  • server-scm/.../Test*.java: Updated tests to account for new stat constructor args.
  • container-service/src/.../KeyValueContainerData.java: Added methods for pending deletion bytes.
  • container-service/src/.../KeyValueContainerUtil.java: Compute blockPendingDeletionBytes on metadata population.
  • container-service/src/.../HddsVolume.java: Cached aggregation of pending-deletion bytes with TTL.
  • container-service/src/.../DeleteBlocksCommandHandler.java: Increment pending-deletion bytes when marking blocks.
  • container-service/src/.../BlockDeletingService.java & metrics: Track and expose total pending-deletion bytes in metrics.
  • common/src/main/java/.../OzoneConsts.java: Added constant for metadata key.
Comments suppressed due to low confidence (3)

hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/helpers/ContainerUtils.java:343

  • The exception message refers to getPendingDeletionBlocks but the method is getPendingDeletionBytes. Update the message for consistency.
              " not support.");

hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/HddsVolume.java:324

  • It would be beneficial to add a unit test for getPendingDeletionBytes(), verifying both correct aggregation and cache-expiration behavior.
  public long getPendingDeletionBytes() {

hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueContainer.java:1040

  • [nitpick] The inline comment was accidentally concatenated (HddsVolumetestImportV2ReplicaToV3HddsVolume). Please revert it to // create HddsVolume or clarify.
    // create HddsVolumetestImportV2ReplicaToV3HddsVolume

Contributor

errose28 left a comment

Why is the pending delete information being sent back to SCM? If SCM is not acting on this information, then pending delete byte and block count only need to be exposed as metrics and we can use prometheus + grafana to either inspect individual nodes or aggregate across the cluster. Pushing observability metrics like this through SCM and Recon is not sustainable, we already have a metrics system to handle this.

@priyeshkaratha
Contributor Author

@errose28
Thank you for raising this valid point regarding the use of SCM for exposing pending delete information.

Our main goal here is to build a dashboard in Recon to show storage usage distribution across the cluster. Recon already uses StorageReport for other usage stats, so we thought it would be a good idea to extend it to include pending deletion info as well, keeping everything in one place.

We considered using Prometheus metrics, but based on my understanding, in Ozone services these values might reset to zero and become inaccurate after service restarts. This can lead to wrong conclusions in the dashboard. By passing the data through SCM to Recon, we can ensure more reliable and consistent reporting, even after restarts.

It also avoids depending on a running Prometheus service, which most customers do not have set up when they hit issues related to deletion or space reclamation.

@errose28
Contributor

Our main goal here is to build a dashboard in Recon to show storage usage distribution across the cluster.

We should leave all dashboarding to Grafana instead of reimplementing the wheel in Recon.

Recon already uses StorageReport for other usage stats, so we thought it would be a good idea to extend it to include pending deletion info as well, keeping everything in one place

StorageReport is not a catch-all for implementing metrics collection in Recon; it is for reporting information about the disks within datanodes. In general, Recon displays current categorical information (keys, containers, pipelines, volumes/disks), while metrics + dashboards track numeric information (bytes, durations, or event counts) over time. Deletion progress falls cleanly in the second category.

We considered using Prometheus metrics, but based on my understanding, in Ozone services these values might reset to zero and become inaccurate after service restarts. This can lead to wrong conclusions in the dashboard.

Metrics collection is decoupled from persistence. Yes we will need to persist the counters in datanodes so they do not need to be recomputed on every restart, but this is independent of how the metrics are published and consumed by other services.

Also I think to avoid dependency on Prometheus service running in this case (most customers do not have it set up when they hit issues related to deletion or anything related to space reclamation).

If any Ozone users want dashboards they will need Prometheus and Grafana. We do not have the bandwidth to build and maintain our own dashboarding setup when quality ones already exist.

This should be a reasonably small change if it is done the way Ozone is designed to handle it:

  • Add metrics, with persistence underneath if necessary (this PR)
  • Add the metrics to a dashboard, like this (follow-up PR)

@priyeshkaratha
Contributor Author

priyeshkaratha commented Jul 21, 2025

I agree with your points.
But here our goal is to enhance Recon's storage observability by providing a consolidated view of used capacity, including pending deletions. Recon already consumes the StorageReport for capacity metrics; extending this structure allows us to enrich the existing model without introducing a parallel system. I have already added metrics that can be published to Prometheus as well.

If any Ozone users want dashboards they will need Prometheus and Grafana. We do not have the bandwidth to build and maintain our own dashboarding setup when quality ones already exist.

Are you saying that we should not rely on Recon for capacity distribution?

@priyeshkaratha
Contributor Author

@errose28 Followup to my previous comment.

To retrieve pending deletion data, one option is to query Prometheus metrics from each DataNode individually. Alternatively, we could introduce a new endpoint at the DataNode level, which Recon would call to gather and aggregate this information. However, since the storage report already includes other capacity-related fields, we considered extending it to include pending deletion information as well — which would simplify the implementation.

Given that this data ultimately needs to be available at the Recon level — especially to enhance the current high-level view with more granular insights (e.g., how much space can be reclaimed after deletions or where deletions are stuck) — which approach do you think is better?

@errose28
Contributor

Are you saying that we should not rely on Recon for capacity distribution?

"Capacity Distribution" will need a tighter definition in order to make this call. We should not rely on Recon for time series data, and the information added in this PR is time series data.

Alternatively, we could introduce a new endpoint at the DataNode level, which Recon would call to gather and aggregate this information.

You have just described the process of Prometheus's metrics scraping : ) This is why we should not use Recon for this since Prometheus already handles this.

since the storage report already includes other capacity-related fields, we considered extending it to include pending deletion information as well — which would simplify the implementation.

I don't see any evidence supporting this simplification claim. Making Ozone do this is far more complex to implement than leaving it to existing systems designed to handle time series data.

Given that this data ultimately needs to be available at the Recon level — especially to enhance the current high-level view with more granular insights (e.g., how much space can be reclaimed after deletions or where deletions are stuck)

I disagree that this needs to be available in Recon. Recon is not designed to store or display time-series data. The terminology "stuck" used here is an indication that we are dealing with time series data, because it translates to deletion progress over time.

which approach do you think is better?

Publish this time series data as metrics from the datanodes and let existing systems handle it much better than we can.

priyeshkaratha force-pushed the HDDS-13192 branch 2 times, most recently from 313053a to 317ba8c on August 25, 2025 16:35
@priyeshkaratha
Contributor Author

Based on the analysis, the JMX Metrics approach appears to be more efficient for exposing pending deletion information, primarily due to the reduced overhead and direct access for monitoring. While it introduces a JMX dependency, the benefits in terms of performance and data relevance outweigh the disadvantages of the Storage Report approach.

So, it will be handled as part of #8995
