
HDDS-11233. Ozone Storage Policy Support.#6989

Open
xichen01 wants to merge 4 commits into apache:master from xichen01:HDDS-11233

Conversation

@xichen01
Contributor

What changes were proposed in this pull request?

Design a storage policy for Ozone. Please comment inline on the markdown document to ask questions and post feedback. Switch to Rich Diff mode for smoother reading.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11233

How was this patch tested?

N/A

@xichen01 xichen01 changed the title HDDS-11233. Ozone Storage Policy Support HDDS-11233. Ozone Storage Policy Support. Jul 25, 2024
@xichen01 xichen01 added documentation Improvements or additions to documentation design labels Jul 25, 2024
@kerneltime
Contributor

cc @sodonnel

Comment on lines +108 to +115
| Tier | StorageType of Pipeline | One Replication Container Replicas Storage Type | Three Replication Container Replicas Storage Type | EC Container Replicas Storage Type |
| --- | --- | --- | --- | --- |
| SSD | SSD | SSD | 3 SSD | n SSD |
| DISK | DISK | DISK | 3 DISK | n DISK |
| ARCHIVE | ARCHIVE | ARCHIVE | 3 ARCHIVE | n ARCHIVE |
Contributor

Nit: The table does not seem to be rendered properly.


| Storage Policy | Storage Tier for Write | Fallback Tier for Write |
| --- | --- | --- |
| Hot | SSD | DISK |
Contributor

Is this the final list of desired/planned storage policies? Wouldn't we like to implement policies like in HDFS - https://hadoop.apache.org/docs/r3.4.1/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html ?

Contributor Author

Yes, we only support these three simple storage policies currently; we can extend them in the future.
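As a sketch of how the three policies could map to write tiers (only the Hot → SSD with DISK fallback mapping appears in the doc's table; the Warm/Cold names and fallbacks here are assumptions, not from the design doc):

```java
// Hypothetical sketch of the policy-to-tier mapping. Only "Hot -> SSD
// (fallback DISK)" is stated in the design doc; WARM and COLD below
// are illustrative assumptions.
public class StoragePolicySketch {
    public enum StorageTier { SSD, DISK, ARCHIVE }

    public enum StoragePolicy {
        HOT(StorageTier.SSD, StorageTier.DISK),      // from the table above
        WARM(StorageTier.DISK, StorageTier.ARCHIVE), // assumed
        COLD(StorageTier.ARCHIVE, null);             // assumed: no fallback

        private final StorageTier writeTier;
        private final StorageTier fallbackTier;

        StoragePolicy(StorageTier writeTier, StorageTier fallbackTier) {
            this.writeTier = writeTier;
            this.fallbackTier = fallbackTier;
        }

        public StorageTier writeTier() { return writeTier; }
        public StorageTier fallbackTier() { return fallbackTier; }
    }
}
```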

The relation of Storage Policy, Storage Type and Storage Tier

- The storage policy is the property of key/bucket/prefix (Managed by OM);
- The storage tier is the property of Pipeline and Container (Managed by SCM);
Contributor

Will we deal with the storage tier as an entry of the cluster topology?

Contributor Author

What does “deal with as a cluster topology” mean?
Storage Tier is a property of the Pipeline and Container; when we create a key, we select a Pipeline and Container with a matching Storage Tier based on the key's Storage Policy.

Contributor

I meant org.apache.hadoop.hdds.scm.net.NodeSchema. Will the Storage Tier (aka rack of specific storage volumes) become a part of the network topology?

Contributor Author

Storage Tier is more like the ReplicationConfig; it will be an independent field in ContainerInfo and Pipeline.
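A minimal sketch of that idea, with the tier as an independent field next to the replication config and block allocation filtering pipelines by it (class, field, and method names are assumptions, not the actual patch):

```java
import java.util.List;
import java.util.Optional;

// Illustrative only: the tier lives on the pipeline as an independent
// field (like ReplicationConfig), and allocation filters pipelines by
// the tier the key's policy asks for.
public class TierAwarePipelineSketch {
    public enum StorageTier { SSD, DISK, ARCHIVE }

    public static class Pipeline {
        public final String replicationConfig; // e.g. "RATIS/THREE"
        public final StorageTier tier;         // new, independent field

        public Pipeline(String replicationConfig, StorageTier tier) {
            this.replicationConfig = replicationConfig;
            this.tier = tier;
        }
    }

    /** Picks the first pipeline matching the requested tier, if any. */
    public static Optional<Pipeline> selectForTier(List<Pipeline> pipelines,
                                                   StorageTier wanted) {
        return pipelines.stream().filter(p -> p.tier == wanted).findFirst();
    }
}
```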

## SCM Pipeline / ContainerReplica Management and Block Allocation

- Pipeline needs to add tier-related fields to distinguish between different tiers of Pipelines.
- For Pipelines tier:
Contributor

The current implementation of the background pipeline creator creates a limited list of RATIS/THREE pipelines per datanode. The design doc proposes dealing with specific pipelines for different storage policies. Is there any proposal doc on how we are going to deal with the limitation mentioned earlier? (How should we deal with the distribution of pipelines of different types: 30% per storage tier or some other option? What if the pipelines of one of the storage tiers are not used?)

Contributor Author

The current Pipeline count limit will be extended to each storage tier, so in the extreme case there will be three times as many Pipelines in a cluster.

But in practice there may not be that many Pipelines, because a single Pipeline may support multiple tiers, such as DISK and SSD, when a Datanode machine contains both SSD- and DISK-type volumes.

And the background Pipeline creator will check the cluster's volume types; if a cluster has only SSD volumes, it will create Pipelines with only the SSD tier type.
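The tier-planning check described above can be sketched as follows (names are illustrative; this is not the actual BackgroundPipelineCreator code):

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Sketch of the behavior described above: the background creator looks
// at which volume types exist in the cluster and only creates pipelines
// for those tiers. A DN with both SSD and DISK volumes can serve both.
public class TierPlanningSketch {
    public enum StorageTier { SSD, DISK, ARCHIVE }

    /** Union of volume types across all datanodes = tiers worth creating pipelines for. */
    public static Set<StorageTier> tiersToCreate(List<Set<StorageTier>> volumeTypesPerDatanode) {
        Set<StorageTier> tiers = EnumSet.noneOf(StorageTier.class);
        for (Set<StorageTier> dnTypes : volumeTypesPerDatanode) {
            tiers.addAll(dnTypes);
        }
        return tiers;
    }
}
```

So an all-SSD cluster yields only the SSD tier, matching the author's description.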

Contributor

However, on a dedicated-tier cluster (16 SSD, 16 DISK nodes), the SSD tier can only form a limited number of EC 3+2 pipelines. Does the background creator have a target count? Does it back off when the tier is saturated?

Contributor Author

Ratis' Pipeline utilizes a Background service for creation (BackgroundPipelineCreator). The number of Ratis Pipelines can be controlled through configuration.

EC-type Pipelines are created only when used. Different EC configurations generate distinct Pipelines. Pipelines from different EC configurations do not interfere with each other. (WritableECContainerProvider#getContainer) SCM selects the DN with the appropriate Storage Type to create the Pipeline.

@vtutrinov
Contributor

@xichen01 is there an understanding of the time frame for the functionality to be implemented? I'd start creating the JIRA tickets and implementing them

@vtutrinov
Contributor

@xichen01 @kerneltime @sodonnel, could you help somehow to force the review of the design doc? The feature is very needed, and I would gladly start implementation.

@ivandika3
Contributor

ivandika3 commented Jun 8, 2025

@vtutrinov The implementation has been worked on internally for the past year.

The basic implementation of storage policy and storage type integration on containers, pipelines, volumes, S3 storage class, and creating a key/file with a storage policy is done, but it still needs extensive testing. Currently we are focusing on the storage policy migration implementation.

@xichen01 would know more about the approximate timeline, but we hope to have a working implementation in the next quarter (i.e. Q3 2025) or so. We are currently quite busy with internal work, so there might not be any bandwidth to backport all these patches in the short term.

@vtutrinov
Contributor

@ivandika3 thanks for the response!
Can we glance at the implementation as the first phase (maybe in a custom feature branch)? Or are there too many private details?

@ivandika3
Contributor

ivandika3 commented Jun 9, 2025

@vtutrinov The fastest way, I think, is to provide you with the diffs. However, these diffs won't apply cleanly on the master branch, since our branch is based on the 1.4.1 version with some of our internal-specific changes. I can probably provide some of it this weekend.

A feature branch in the community fork might take a while, since we need to resolve the conflicts.

@vtutrinov
Contributor

@ivandika3 it would be great!

@vtutrinov
Contributor

@ivandika3 I don't want to rush, but is there any news about the mentioned diff?

@ivandika3
Contributor

@vtutrinov Thanks for the reminder. I have attached https://issues.apache.org/jira/secure/attachment/13077025/storage-policy-diff.tar.gz for the list of diffs of the storage policy integration.

Please be reminded to attribute @xichen01 for any commits generated from these diffs.

@vtutrinov
Contributor

@ivandika3 @xichen01 the diff above has references to the following non-existent files (relative to ozone-1.4.1):

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementDataCenterAware.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementDataRecovery.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelinePlacementDataCenterAware.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/TestSCMContainerPlacementDataCenterAware.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/TestSCMContainerPlacementDataCenterAwareSpecialCase.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/TestSCMContainerPlacementDataRecovery.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/TestSCMContainerPlacementDcFlow.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/pipeline/TestPipelinePlacementDataCenterAware.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerService.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/TestSCMContainerPlacementStorageTier.java
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/client/StorageTierUtil.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/NodeUtils.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/TestSpecialCloseContainerEventHandler.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/container/TestPeriodicContainerCloser.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/fs/ozone/AbstractRootedOzoneFileSystemTest.java
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/shell/UpdateBucketOptions.java
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/client/StorageTypeUtils.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestOzoneStoragePolicy.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancerTaskDcFlow.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerUtils.java
hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/client/StorageTierUtilTest.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/dcflow/ContainerBalancerSelectionCriteriaDcFlow.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/AbstractStorageTypeChoosingPolicy.java
hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/storagepolicy/StoragePolicyCommands.java
hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/storagepolicy/UsageInfoSubCommand.java

Could you provide them too, or point me to the commit where I can fetch them?

@jojochuang jojochuang requested a review from chungen0126 August 5, 2025 17:32
@github-actions

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions Bot added the stale label Nov 12, 2025
@github-actions

Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it.

@github-actions github-actions Bot closed this Nov 19, 2025
@greenwich
Contributor

Hi, do you know if any work is planned for this ticket? AFAIK, the patch diff wasn't added to the branch and is probably out of date right now. What are the next steps here?

@errose28
Contributor

Hi @greenwich I see you have also opened #9807. If you would like to continue work on this, we should start by reaching agreement on a design doc. I'm not sure we finished that process yet. @ivandika3 @xichen01 does this doc need more updates/review? Should we continue work on it here or open a new PR?

@greenwich
Contributor

greenwich commented Feb 23, 2026

@errose28 Thanks for looking! I am very sorry for the noise my PR caused; it wasn't intended to be public (my bad, I didn't set it to Draft when I created it).

Let me explain my motivation.

  1. My team needs storage policy and tiering support in Ozone this year (ideally in H1). This capability is becoming increasingly important for us, especially as there are parallel evaluations happening within the company. I would strongly prefer to continue building on Ozone, which is why I’m investing time in helping move this forward. We are planning for this cluster to grow beyond 20-30PB, and at that scale storage tiering becomes operationally critical rather than just a nice-to-have capability.

  2. The patch attached to this pull request doesn’t seem to be up to date with master or even the 2.0 or 2.1 releases.

  3. I might be missing something, but I didn’t see recent updates on this PR since Nov 2025.

  4. I may be mistaken, but I have the impression that before starting implementation, we aim for a comprehensive and fully generic design that covers all possible use cases. I completely understand and appreciate that approach — it leads to a robust and future-proof solution. At the same time, I’m mindful of the time constraints on my side (as mentioned in point 1), so I’m hoping we can find a pragmatic way to make incremental progress while the broader design continues to evolve.

  5. My approach is slightly different — I would like to move forward incrementally. My goal is to design and implement storage tiering in small, self-contained steps (MVP-1, MVP-2, etc.), where each stage delivers a complete end-to-end capability.

This would allow my team to start using the functionality early, gather real-world feedback, and better understand practical requirements. I believe this iterative approach could also help inform and refine the broader design discussion.

I am currently working on MVP-1, and MVP-2 already includes many of the most important features we need. My plan is to validate these internally in our TEST environment (and eventually PROD if stable), and then share the results and learnings with the Ozone community for discussion and alignment. Of course, I’m very open to feedback on this approach and happy to adjust direction based on community input.

  6. I haven’t created any Jira tickets yet because my initial intention was to implement and validate the first MVP internally. However, each independent feature is pushed as a separate commit, so it can later be aligned with and retrofitted into ASF Jira tickets if needed.

  7. Finally, regarding why I created a PR at this stage: the GitHub PR workflow is very useful for triggering CI, viewing diffs against master, and keeping my branch aligned. I also hope that parts of this work may eventually be contributed back to master.

@xichen01
Contributor Author

@errose28 Thanks for noticing this PR.

@greenwich Thank you for the update.

Regarding the current status of this PR

The current PR hasn't been updated for a while, mainly because there doesn't seem to be a strong demand for this feature from other members of the community, so reviews have stagnated.
However, this feature has been fully implemented internally, including support for StoragePolicy across all S3 and Filesystem write interfaces, as well as support for StoragePolicy in ReplicationManager and ContainerBalancer.
We've basically implemented it according to this design document (some parts of the document need updating; I can update the document if needed).

We also support S3 Lifecycle (https://issues.apache.org/jira/browse/HDDS-8342), allowing you to set a Lifecycle for a Bucket to migrate specified keys to a specified StoragePolicy at a specified time (including migrating from SSD to DISK, and from THREE replication to EC), similar to HDFS SPS (Storage Policy Satisfier) functionality.

Follow-up

If you or others in the community are willing, we can continue to move forward with this PR. Of course, you can also move forward with your own proposals; we can cooperate if you need it.

@greenwich
Contributor

@xichen01, Thanks for the update. That's great news indeed! In that case, there is no point in me reinventing the wheel. I had a quick look at the patch you submitted earlier, and it looks comprehensive. I will have a detailed look today.

Would it be possible to move forward with that PR (and merge it, or at least create a working branch and sync it with master)? Also, as I am very interested in that feature, I can provide assistance on my end.

cc: @ivandika3 @chungen0126
Please let me know.

@ivandika3
Contributor

ivandika3 commented Feb 25, 2026

@xichen01 Let's reopen this patch and move it forward. I am willing to spend some time on the storage policy support development. Hopefully other community members (@errose28 @chungen0126) are also able to help push the review process and commit our implementation.

@greenwich As mentioned in the #8811 (reply in thread) discussion last year, there are mainly two phases of Ozone Storage Policy Support:

  1. Supporting storage policy and storage types on Ozone
  2. Storage Policy Migration support

Phase 1's goal is to allow clients to upload keys with different storage policies and to integrate storage policies and types into all Ozone pipelines, containers, etc. This will be our short-term focus, since it does not introduce any new subsystem. @xichen01 Let's start by backporting this to the community fork and ensure that everything works normally; we can discuss which patch to start with.

Phase 2's goal is to support Storage Policy Migration. However, our current implementation requires a separate job worker subsystem, so this will be longer term, since it introduces a new subsystem. If you don't require a separate job worker subsystem, you might need to write your own implementation of a "Storage Policy Satisfier".

That said, I hope you understand that all contributions to Ozone are purely voluntary, made by members with other, higher priorities, so we cannot guarantee that this will be done in a timely manner.

@ivandika3 ivandika3 reopened this Feb 25, 2026
@ivandika3 ivandika3 removed the stale label Feb 25, 2026
@ivandika3
Contributor

the diff above has references to the following non-existent files (relative to ozone-1.4.1):

These refer to our internal placement policies that support a multi-DC setup and can be ignored, as they do not pertain to any current functionality in community Ozone. The patch diff serves only as an overview of what the changes might look like.

@greenwich
Contributor

@ivandika3 Thanks for the clarification — that makes sense.

I fully understand that contributions are voluntary, and I really appreciate the time and effort everyone is putting into this.

I’m very interested in helping move this forward and would be glad to contribute where it makes the most sense.

Regarding Phase 2, just thinking ahead — would it make sense to initially target a simpler mover-style implementation (similar in spirit to HDFS Mover) before introducing a separate job worker subsystem? That might allow basic Storage Policy Migration functionality to be delivered earlier and iterated on over time.

Of course, I’m happy to align with the broader design direction — just exploring whether an incremental path could also work here.

Please let me know how I can best contribute.

Contributor

@chungen0126 chungen0126 left a comment


Thanks @xichen01 for the patch. Left some comments

- `ozone admin storagepolicies list`, lists all supported policies in the current cluster.
- `ozone admin storagepolicies get ${key}`, gets the key's storage policy (including the inherited storage policy)
- `ozone admin storagepolicies check ${key}`, checks whether the key's storage policy is satisfied
- `ozone admin storagepolicies satisfyStoragePolicy -bucket / -prefix / -key`, triggers migrations to satisfy the corresponding data's storage policy.
Contributor

Are there plans to introduce rate limiting or throttling for this command to prevent performance degradation for regular user operations?

Contributor Author

In our internal implementation, the actual migration task is executed by the Jobworker (a standalone service, just like S3G); configuring the number of Jobworker worker threads can limit the migration speed.
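The thread-count throttling idea can be sketched generically: a pool of N worker threads caps concurrent migrations at N. This is not the actual Jobworker code, just an illustration of the mechanism:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Generic sketch: a fixed-size pool means at most `workerThreads`
// migration tasks run at once, bounding the migration I/O pressure.
public class WorkerThrottleSketch {
    public static int runMigrations(int workerThreads, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(workerThreads);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.submit(done::incrementAndGet); // stand-in for one key migration
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```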

Contributor

Thanks for the explanation. Do we have any discussions on whether the Jobworker should be a standalone component or integrated into an existing component? Also, is it a scheduler, or the actual worker doing the heavy lifting?

In my opinion, throttling the I/O directly would be more effective and straightforward than limiting the number of worker threads. It gives us more direct control over the resource of migration.

Contributor

Regarding the Job Worker architecture, I believe integrating it directly into Ozone is a better approach than having it as a standalone component.

The workflow could involve the Ozone Manager (OM) detecting migration needs while the Storage Container Manager (SCM) orchestrates the actual data movement. The primary advantage of integration is the ability to leverage the most comprehensive and real-time cluster information.

We could approach this within an OM listener to reduce the load on the OM leader.

Contributor

@chungen0126, I agree about the Job Worker architecture. Still, from an operational perspective, it's often convenient to have a standalone tool/service that runs the actual migration based on the configured team's migration priorities.

Comment thread hadoop-hdds/docs/content/design/storage-policy.md Outdated
Comment thread hadoop-hdds/docs/content/design/storage-policy.md Outdated

| Storage Policy | Storage Tier for Write | Fallback Tier for Write |
| --- | --- | --- |
| Hot | SSD | DISK |
Contributor

Is there a plan to add more storage tiers beyond SSDs, such as NVME? So then it becomes Hot, and SSDs are Warm? And/Or give a user the ability to create their own storage policies and name their own tiers?

Contributor Author

Currently we only implement a few storage policy types, but it is easy to add new storage policies in the code. We do not support user-defined custom storage policies yet, but implementing that should not be difficult.


- Use ozone admin storagepolicies satisfyStoragePolicy to trigger the migration of corresponding changes, and mark the corresponding storage policy changes as completed.

### FSO type buckets:
Contributor

Document doesn't mention OBS buckets, but I assume they are also supported, right?

Contributor Author

Yes, support for FSO and OBS will be added. This part needs to be updated.



Lists the commands that need to be added or adapted for storage policies (may not be all commands)

- Storage policy management
Contributor

Correct me if I am wrong, but the current patch has the following limitation:

Suppose an operator changes a bucket's policy from WARM to HOT, and teams decide to create their own mover or policy-satisfier service. Then:

  1. There's no command/API to trigger bulk migration
  2. There's no way to discover which keys are on the wrong tier (except checking one key at a time)
  3. The only migration path is writing a custom script that calls RewriteKey per key
  4. There is no container-level migration either?

Contributor Author

For the key migration, our internal implementation utilizes S3 Lifecycle rules by configuring a rule for the bucket. OzoneManager's LifecycleService checks whether keys comply with the storage policy based on this rule. If non-compliant, it generates a command to be executed by Jobworker for migration. Upon receiving the command, Jobworker performs the migration via the rewriteKey API.

If Jobworker is not used, a local client must be implemented to inspect and migrate keys via the rewriteKey API.
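A minimal sketch of such a local client's core loop, assuming a hypothetical KeyInfo shape; only the rewriteKey concept comes from the discussion above:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the "local client" fallback: list keys, compare
// actual vs. desired policy, and collect the mismatches that would
// each need a rewriteKey call. KeyInfo is a hypothetical stand-in.
public class LocalSatisfierSketch {
    public static class KeyInfo {
        public final String name;
        public final String desiredPolicy;
        public final String actualPolicy;

        public KeyInfo(String name, String desiredPolicy, String actualPolicy) {
            this.name = name;
            this.desiredPolicy = desiredPolicy;
            this.actualPolicy = actualPolicy;
        }
    }

    /** Returns the keys that do not satisfy their storage policy. */
    public static List<String> keysToMigrate(List<KeyInfo> keys) {
        List<String> out = new ArrayList<>();
        for (KeyInfo k : keys) {
            if (!k.desiredPolicy.equals(k.actualPolicy)) {
                out.add(k.name); // a real client would call rewriteKey here
            }
        }
        return out;
    }
}
```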

Contributor

@greenwich greenwich Mar 3, 2026

Does LifecycleService/Jobworker have information about all keys that don't comply with the storage policy? Does it run the migration key by key iteratively?
At the moment, in the patch, I couldn't see any API that allows collecting information about all keys that don't satisfy the storage policy. Also, is there any plan to support bulk migration rather than migrating key by key using rewriteKey?


- Support replica type storage policies.
- Support EC type storage policies.
- EC types cannot support some storage policies that mix different storage media.
Contributor

  • What is the minimum node count per tier for each replication config? (EC 3+2 needs at least 5 nodes, ideally across 3+ racks)
  • Should SCM refuse to enable a tier if the node count is below the minimum?
  • How does pipeline closure cascade when a tier loses nodes — does it close all pipelines on that tier or just affected ones?
  • What happens to in-flight writes when the last pipeline on a tier closes — does fallback kick in immediately, or does the write fail?

Contributor Author

Introducing a Storage Policy does not alter the pipeline's behavioral logic; it only affects the nodes composing the pipeline.
The pipeline will add a Storage Tier attribute. For a pipeline with an SSD Storage Tier attribute, it will be composed of DNs that support SSD.
The primary logical change occurs during pipeline creation, where DNs with suitable volumes are selected to form the pipeline. Read/write behavior after pipeline creation remains unchanged.

Contributor

Understood: am I right that, in that case, datanodes with heterogeneous volume distributions are preferred (DNs that have both SSDs and DISKs)?

## Datanode Heartbeat Reporting Process

- Datanode reports the storage type information of each volume through heartbeats. SCM aggregates and manages Datanodes based on different storage types, including the remaining capacity of each storage type.
- Currently, Datanode reports volume capacity information through NodeReport. The storage type information of volume can be reported to SCM in DATA_VOLUME of getNodeReport.
Contributor

Will we have a quota or capacity awareness per tier?

Contributor Author

Currently, there is no quota mechanism similar to bucket quotas that physically limits usage per tier.

However, you can use ozone admin datanode list to view the specific usage details for each tier.
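The per-tier capacity view described in the heartbeat section can be sketched as a simple aggregation over reported volumes (types and field names here are illustrative, not the actual NodeReport schema):

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch: SCM sums remaining capacity per storage type across the
// volumes that datanodes report via heartbeats.
public class TierCapacitySketch {
    public enum StorageType { SSD, DISK, ARCHIVE }

    public static class VolumeReport {
        public final StorageType type;
        public final long remainingBytes;

        public VolumeReport(StorageType type, long remainingBytes) {
            this.type = type;
            this.remainingBytes = remainingBytes;
        }
    }

    /** Remaining capacity per storage type across all reported volumes. */
    public static Map<StorageType, Long> aggregate(Iterable<VolumeReport> reports) {
        Map<StorageType, Long> remaining = new EnumMap<>(StorageType.class);
        for (VolumeReport r : reports) {
            remaining.merge(r.type, r.remainingBytes, Long::sum);
        }
        return remaining;
    }
}
```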

@xichen01
Contributor Author

xichen01 commented Mar 2, 2026

@greenwich @chungen0126 @errose28 @vtutrinov I will update this document, adding more detailed content and incorporating some minor changes. I will try to complete it this week.

@xichen01
Contributor Author

@greenwich @chungen0126 @errose28 @vtutrinov The document has been updated. Please check it. All the content covered in the current document has been implemented in our internal version.


### Bucket, Key

A storage policy attribute is added to buckets and keys on OM.
Contributor

Is there a prefix-level policy support?

Contributor Author

This hasn't been implemented yet, but we can implement it if needed.

Contributor

@greenwich greenwich Mar 30, 2026

Yes, I think it's highly required.

- Support specifying a storage policy when writing a key.
- If a storage policy is specified when writing a key, the key storage policy is the specified storage policy.
- If a storage policy is not specified when writing a key, the storage policy of the key will depend on the bucket.
If the bucket does not have a storage policy set, the default storage policy is determined by the configuration
Contributor

As I understand it, the policy resolution order is key → bucket → config. Any plans for `key → prefix → bucket → config`?

Contributor Author

Prefix-level policies have not been implemented yet. Once they are, the resolution order will be key → prefix → bucket → config.
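The resolution order described in this thread can be sketched as a simple fallback chain. This is an illustrative sketch only; the function and parameter names are hypothetical and do not match the actual Ozone Manager API.

```python
def resolve_storage_policy(key_policy, prefix_policy, bucket_policy, config_default):
    """Resolve the effective storage policy in key -> prefix -> bucket -> config order.

    Illustrative sketch; names are hypothetical, not the real OM API.
    """
    for policy in (key_policy, prefix_policy, bucket_policy):
        if policy is not None:
            return policy
    return config_default

# A key without an explicit policy inherits the bucket's policy.
assert resolve_storage_policy(None, None, "HOT", "WARM") == "HOT"
```

The first non-null policy in the chain wins; only when none is set does the configured default apply.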


# Backward Compatibility

This section describes the backward compatibility for data created before the storage policy feature is introduced.
Contributor

What is the upgrade scenario? Does it support rolling upgrade?
My understanding is that the existing cluster should be shut down and the pipelines closed, and then the upgrade run, right?

Contributor Author

Our upgrade sequence is SCM -> OM -> Datanode.

SCM and OM: Since SCM and OM operate in HA mode, we must first upgrade the two followers, transfer leadership to the upgraded node, and then upgrade the last remaining old leader.

Datanode: Before upgrading a Datanode, it must be set to maintenance mode. Once the Datanode enters maintenance mode, all its pipelines are shut down, and the Datanode can be restarted immediately.

The upgrade process is transparent to clients.

The above upgrade process applies to most version upgrades, including storage policy version upgrades. Upgrading the storage policy version generally does not require any special steps.


message Pipeline {
//...
repeated StorageTierProto supportedStorageTier = 1;
Contributor

Great that the pipeline may support multiple storage tiers. What will happen if volumes change: volumes crash, or DN adds/removes volumes between heartbeats. What should operators do with pipelines when DN storage changes (pipelines close/create)?

Contributor Author

This works exactly the same as the original logic: if the volume changes, SCM will detect it and may recreate the pipeline.

What will happen if volumes change: volumes crash, or DN adds/removes volumes between heartbeats.

Adding or removing a volume requires a restart.
If a volume is marked as failed or an exception such as Volume DISK_OUT_OF_SPACE occurs during the heartbeat interval, KeyDataStreamOutput.handleException will be applied, causing a block to be reallocated.

What should operators do with pipelines when DN storage changes (pipelines close/create)

- Changing the storage type of a volume requires restarting the Datanode.
- If the storage type changes, SCM will update the pipeline's storage tier field.
- If SCM assigns a storage type that the Datanode doesn't actually support, KeyDataStreamOutput.handleException will trigger a block reassignment (this is practically impossible, as changing the storage type of a volume requires restarting the Datanode).

PROVIDED_TYPE = 5;
}

enum StorageTierProto {
Contributor

Seems like one more enum, why don't we use StorageType directly? It's simpler.

Contributor Author

A storage policy needs to be mapped to a set of storage types, such as HOT -> [SSD, SSD, SSD]. Using storage tiers provides a simpler way to represent a set of storage types, making it more convenient for both coding and management.
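The tier-to-types expansion described above (e.g. HOT → [SSD, SSD, SSD]) can be pictured as follows. The dictionary and function names are illustrative; the real mapping lives inside SCM.

```python
# Hypothetical policy -> tier mapping; HOT -> SSD mirrors the design's example.
POLICY_TIER = {"HOT": "SSD", "WARM": "DISK", "COLD": "ARCHIVE"}

def replica_storage_types(policy, replication_factor):
    """Expand a storage policy into one storage type per container replica,
    e.g. HOT with replication factor 3 -> [SSD, SSD, SSD]."""
    tier = POLICY_TIER[policy]
    return [tier] * replication_factor

assert replica_storage_types("HOT", 3) == ["SSD", "SSD", "SSD"]
```

Representing a policy as a single tier keeps the mapping compact while still expanding to a per-replica storage type list when a container is allocated.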

created on a volume of the storage type specified by the configuration `hdds.datanode.default.storagetype` (default
is DISK). This typically only occurs when a client from an older version that does not support the storage policy
feature writes data to the Datanode.
- The Datanode does not perform fallback when creating container replicas. If a volume of the storage type specified in
Contributor

DN may lose volume between pipeline creation and container creation, or due to a race between heartbeat reporting and allocation. Does it mean it will cause write failures?

Contributor Author

DN may lose volume between pipeline creation and container creation,

If a change in the volume causes a chunk write to fail, the client should attempt the operation again and reassign the block; this does not actually change the existing retry mechanism.

If writeChunk encounters an exception, KeyDataStreamOutput#handleException will allocate a new block.
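The retry behavior described here can be sketched as a loop that allocates a fresh block whenever a chunk write fails. This is a simplified model of the client-side path around KeyDataStreamOutput#handleException; the helper names are injected stand-ins, not the real Ozone client calls.

```python
def write_with_reallocation(data, allocate_block, write_chunk, max_retries=3):
    """Write data, allocating a new block whenever a chunk write fails.

    Simplified model of the client retry path; allocate_block and
    write_chunk are hypothetical stand-ins for the real client calls.
    """
    last_error = None
    for _ in range(max_retries):
        block = allocate_block()
        try:
            write_chunk(block, data)
            return block
        except IOError as e:
            last_error = e  # e.g. the target volume disappeared; try a new block
    raise last_error
```

The key point is that a volume change surfaces as a chunk-write failure, and recovery is simply "allocate a new block and retry", unchanged from the existing mechanism.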

- The storage tier is an attribute of the pipeline, indicating the storage tiers supported by that pipeline. A pipeline
can support multiple storage tiers. For example, if Datanodes all have both SSD and DISK type volumes, the pipeline
created on these Datanodes will have supported storage tier attributes of both SSD and DISK.
- When allocating a container through an existing pipeline, SCM filters matching pipelines based on their supported
Contributor

The document mentions filtering pipelines by storage tier, but it does not mention filtering containers within the pipeline by storage tier. Since the pipeline may handle both SSD and DISK, how do we ensure we write to the container on the correct tier?

Contributor Author

This point was omitted here; in fact, SCM also uses storage tiers to filter containers.
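The selection step discussed here can be pictured as filtering both pipelines and their open containers by the requested tier. The data model below is illustrative, not the actual SCM classes; `allow_null_tier` mirrors the `ozone.scm.container.allow.null.storage.tier` configuration mentioned later in the design.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Container:
    storage_tier: Optional[str]  # None for containers created before the upgrade

@dataclass
class Pipeline:
    supported_tiers: set
    containers: list = field(default_factory=list)

def select_container(pipelines, tier, allow_null_tier=False):
    """Pick an open container on the requested tier: the pipeline is filtered
    by its supported tiers, then containers within it by their own tier."""
    for p in pipelines:
        if tier not in p.supported_tiers:
            continue
        for c in p.containers:
            if c.storage_tier == tier or (allow_null_tier and c.storage_tier is None):
                return c
    return None
```

With `allow_null_tier=True`, pre-upgrade containers (null tier) remain eligible, so they can still be filled and eventually closed.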

- MIS_STORAGE_TYPE_WITH_VOLUME: Container replica whose storage type mismatches its volume's storage type.
  - ReplicationManager will select a matching volume and migrate the container replica for this health state.
- MIS_STORAGE_TYPE_WITH_CONTAINER: Container replica whose storage type mismatches the container's storage tier.
  - ReplicationManager will send a command to update the storage type of the container replica on the Datanode.
Contributor

Sounds great. Could you clarify how ReplicationManagers will send that command, and what command?

Contributor Author

A new command (SetContainerStorageTypeCommand): When a Datanode receives this command, it updates the container replica, including both the in-memory data and the YAML file.
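The two health states and their repair actions could be modeled like this. Apart from SetContainerStorageTypeCommand, which the author names above, the function and state-string details are illustrative.

```python
def classify_replica(replica_type, volume_type, container_tier):
    """Return the health state for one container replica, following the design:
    - replica vs. volume mismatch    -> migrate the replica to a matching volume
    - replica vs. container mismatch -> SetContainerStorageTypeCommand to the DN
      (the DN then updates both its in-memory state and the YAML file)
    """
    if replica_type != volume_type:
        return "MIS_STORAGE_TYPE_WITH_VOLUME"
    if replica_type != container_tier:
        return "MIS_STORAGE_TYPE_WITH_CONTAINER"
    return "HEALTHY"
```

The volume check comes first: a replica sitting on the wrong volume needs data movement, while a pure metadata mismatch only needs the cheaper in-place update command.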

### Ozone Filesystem

- Support specifying a storage policy when writing a key.
- If a storage policy is specified when writing a key, the key storage policy is the specified storage policy.
Contributor

Great! What is the S3/ozone CLI command for that? What is the corresponding proto change for that?

Contributor Author

S3
You can use the standard AWS S3 CLI or SDK; specify the storage-class when putting a key.

Ozone
The Ozone API has added a storagepolicy parameter.

However, the more common approach is to create a bucket and set a storage policy; then, all writes to that bucket will be subject to the specified storage policy.
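For example, a put through the Ozone S3 Gateway might look like the following. The endpoint, bucket, key, and storage-class value here are purely illustrative; `--storage-class` is the standard `aws s3api put-object` flag, and how its values map onto Ozone storage policies is not specified in this thread.

```shell
# Illustrative only: endpoint/bucket/key names and the storage-class
# value are assumptions, not documented Ozone behavior.
aws s3api put-object \
  --endpoint-url http://ozone-s3g:9878 \
  --bucket demo-bucket \
  --key example.txt \
  --body example.txt \
  --storage-class STANDARD_IA
```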

Contributor

@chungen0126 chungen0126 left a comment

Thanks @xichen01 for the update. I left some comments. Besides, I didn't see the storage policy migration command anymore. Do we still plan to approach it?

Comment thread hadoop-hdds/docs/content/design/storage-policy.md
## DN Container Replica Creation
- `ozone.scm.container.allow.null.storage.tier`
Default value: false. Whether to allow selecting containers with a null storage tier during container allocation.
This can be used during the upgrade period to allow new keys to be written to existing containers that do not yet
Contributor

Could you please clarify this configuration. I don't understand why we need it during the upgrade period.

Contributor Author

Containers created by SCM before the upgrade have a null storage tier. If those containers were never selected after the upgrade, no new data would be written to them and they would remain open indefinitely (in the normal process, a container must be filled before it can be closed).
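During the upgrade window, an operator could therefore temporarily enable this setting. A sketch of the property, assuming the Hadoop-style `ozone-site.xml` format Ozone uses (the property name comes from the design; the surrounding usage is an assumption):

```xml
<!-- ozone-site.xml: allow SCM to keep selecting pre-upgrade containers
     whose storage tier is null, so they can be filled and closed. -->
<property>
  <name>ozone.scm.container.allow.null.storage.tier</name>
  <value>true</value>
</property>
```

Once the pre-upgrade containers have been filled and closed, the setting can be reverted to its default of false.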

Comment thread hadoop-hdds/docs/content/design/storage-policy.md
@greenwich
Contributor

greenwich commented Apr 13, 2026

@ivandika3, @xichen01, As per discussion above:

Let's start by backporting this to the community fork and ensuring everything works normally. We can discuss which patch we will use.

  1. Have you started rebasing to master/2.x? If so, can we collaborate on a shared branch?
  2. If not, would you be open to us doing the rebase and contributing it back?

@ivandika3
Contributor

ivandika3 commented Apr 22, 2026

@greenwich Sorry for the late reply. Let me try to use AI agent to backport the storage policy to my fork.

@greenwich
Contributor

greenwich commented Apr 25, 2026

@greenwich Sorry for the late reply. Let me try to use an AI agent to backport the storage policy to my fork.

@ivandika3, I managed to do a similar thing because I wanted to prepare everything up-front to make things go faster for this PR.

@ivandika3
Contributor

Thanks @greenwich , @xichen01 would you mind taking a look at that?

I'm currently backporting it in my fork, but the tests are still failing (https://github.com/ivandika3/ozone/tree/refs/heads/backport-storage-policy-storage-class)

@ivandika3
Contributor

ivandika3 commented Apr 30, 2026

@greenwich @xichen01 FYI, the tests in https://github.com/ivandika3/ozone/tree/refs/heads/backport-storage-policy-storage-class passed. You can refer to the diff in ivandika3#4

@ivandika3
Contributor

Regarding Phase 2, just thinking ahead — would it make sense to initially target a simpler mover-style implementation (similar in spirit to HDFS Mover) before introducing a separate job worker subsystem? That might allow basic Storage Policy Migration functionality to be delivered earlier and iterated on over time.

@greenwich Yes, we can first implement a standalone StoragePolicySatisfier (similar to HDFS Mover or HDFS StoragePolicySatisfier) without needing to implement a separate subsystem.

@xichen01
Contributor Author

xichen01 commented May 2, 2026

@greenwich @xichen01 FYI, test in https://github.com/ivandika3/ozone/tree/refs/heads/backport-storage-policy-storage-class passed. You can refer to the diff in ivandika3#4

@ivandika3 We can create corresponding subtasks for these commits and then merge them into the HDDS-8342 branch via MRs.

@ivandika3
Contributor

@xichen01 How about we cut a new HDDS-11233 branch? The current branch can be done without the transition patch (HDDS-8342). We can add the transition after HDDS-8342 is merged to master.

@xichen01
Contributor Author

xichen01 commented May 3, 2026

@xichen01 How about we cut a new HDDS-11233 branch? The current branch can be done without the transition patch (HDDS-8342). We can add the transition after HDDS-8342 is merged to master.

I think we can create a new HDDS-11233 branch in the Apache Ozone repository, just like https://github.com/apache/ozone/tree/HDDS-8342. Then we can merge the related commits to this branch.

@ivandika3
Contributor

ivandika3 commented May 5, 2026

@xichen01 Could you help to take over? I'm currently quite busy and my GH account was recently blocked from GH actions (still waiting for support) so no CI can be triggered.

@xichen01
Contributor Author

xichen01 commented May 5, 2026

@xichen01 Could you help to take over? I'm currently quite busy and my GH account was recently blocked from GH actions (still waiting for support) so no CI can be triggered.

OK, I will

@xichen01

xichen01 commented May 5, 2026

@ivandika3 @greenwich @chungen0126
#10191 Please help to review

Contributor

@chungen0126 chungen0126 left a comment

Overall, looks good to me.

@greenwich
Contributor

@ivandika3 @greenwich @chungen0126 #10191 Please help to review

Looks good!
