
HDDS-14239. Simplify Prepare Batch looping for DeleteRange Op processing in RDBBatchOperation #9553

Draft

swamirishi wants to merge 3 commits into apache:master from swamirishi:HDDS-14239

Conversation

@swamirishi
Contributor

@swamirishi swamirishi commented Dec 24, 2025

What changes were proposed in this pull request?

The PrepareBatchOperation looping introduced in HDDS-13415 (#8774) was very convoluted. It also missed a case: a putKey/deleteKey could still be added even though a later deleteRange covered it, because each single-key op was only checked against the immediately following contiguous batch of deleteRange ops. The following example illustrates the scenario:

  1. Put Key1

  2. DeleteRange Key2 - Key5

  3. Put Key2

  4. DeleteRange Key1 - Key4

Here Op1 should ideally be cancelled by Op4, but currently both Op1 and Op4 get executed. It would be more optimal to execute only Op4, since Op1 is redundant.
While initially implementing deleteRangeWithBatch, I had planned to optimize the deleteRange search by sorting the contiguous ranges and doing a binary search to find the range matching an entry, reducing overall complexity. However, the code was getting more complex without giving us much return, since deleteRange was expected to be an infrequent op used only by snapshot create.
But deleteRange is now also going to be used by DirectoryPurgeRequest to remove entries from the file table and directory table. This optimization helps reduce RocksDB compaction work when a key is committed and also deleted within the same double buffer batch. A minimal sketch of the redundancy check follows.
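For reviewers skimming the idea, here is a minimal, hypothetical Java sketch of the redundancy check described above (not the actual RDBBatchOperation code; the Op types and String keys are stand-ins, and ranges are half-open [start, end) as in RocksDB's deleteRange):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the patch itself: drop any single-key op whose key
// is covered by ANY later delete range, not just the next contiguous run.
public class RedundantOpSketch {
  interface Op { }

  static final class SingleKeyOp implements Op {
    final String key;
    SingleKeyOp(String key) { this.key = key; }
  }

  static final class DeleteRangeOp implements Op {
    final String start, end; // deletes keys in [start, end)
    DeleteRangeOp(String start, String end) { this.start = start; this.end = end; }
  }

  static List<Op> prune(List<Op> ops) {
    List<Op> kept = new ArrayList<>();
    for (int i = 0; i < ops.size(); i++) {
      Op op = ops.get(i);
      if (op instanceof SingleKeyOp && coveredLater(ops, i, ((SingleKeyOp) op).key)) {
        continue; // e.g. "Put Key1" above is wiped out by "DeleteRange Key1-Key4"
      }
      kept.add(op);
    }
    return kept;
  }

  private static boolean coveredLater(List<Op> ops, int i, String key) {
    for (int j = i + 1; j < ops.size(); j++) {
      if (ops.get(j) instanceof DeleteRangeOp) {
        DeleteRangeOp r = (DeleteRangeOp) ops.get(j);
        if (key.compareTo(r.start) >= 0 && key.compareTo(r.end) < 0) {
          return true; // a later range deletes this key anyway
        }
      }
    }
    return false;
  }
}
```

Skipping such a covered op is safe because the later deleteRange would wipe the key regardless; only the redundant write is avoided.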

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14239

How was this patch tested?

Updated unit tests. I intend to add another unit test, as suggested in this comment:
#8774 (comment)

@github-actions

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions bot added the stale label Jan 17, 2026
@aryangupta1998
Contributor

@swamirishi, are you working on this?

@swamirishi
Contributor Author

> @swamirishi, are you working on this?

I have a branch ready locally. I will push the code in some time, after resolving some merge conflicts.

…nges and simple logic

Change-Id: I7a06f5a427ba7d1c4f760ff70fcb4093abaa3cbd
@swamirishi
Contributor Author

@szetszwo @jojochuang Can you take a look at this change?

@swamirishi
Contributor Author

@smengcl this needs to be reviewed before #9690

@szetszwo
Contributor

szetszwo commented Feb 6, 2026

@swamirishi , in our current usage, is it the case that DeleteRange won't mix with other SingleKeyOps? If so, we only have to support batch DeleteRange and batch SingleKeyOp separately, and don't have to support mixing them together.

@swamirishi
Contributor Author

swamirishi commented Feb 6, 2026

> @swamirishi , in our current usage, is it the case that DeleteRange won't mix with other SingleKeyOps? If so, we only have to support batch DeleteRange and batch SingleKeyOp separately, and don't have to support mixing them together.

No, that is not the case @szetszwo; it depends entirely on which transactions happen together in the same double buffer. This sort of intersection can happen even within the same DirectoryPurgeRequest: see #9423, where a DirectoryDeletingService purge-path request could be opening up a deleted sub-directory (moving it from the DirectoryTable to the deletedDirectoryTable) and, because of optimizeDirDeleteAndSubmitRequest, also deleting the same entry from the deletedDirectoryTable in the same transaction.
This would unnecessarily increase RocksDB writes and compactions.

@swamirishi
Contributor Author

PR #9690 improves the efficiency of finding the intersection from O(n) to O(log n).
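Roughly, that idea can be sketched like this (hypothetical code, not the #9690 change itself; it assumes the pending delete ranges have been coalesced so they do not overlap):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an O(log k) coverage check over k non-overlapping
// half-open [start, end) ranges, indexed by their start key.
class RangeIndexSketch {
  private final TreeMap<String, String> rangesByStart = new TreeMap<>();

  void addRange(String start, String end) {
    rangesByStart.put(start, end);
  }

  // The only candidate is the range with the greatest start <= key.
  boolean covers(String key) {
    Map.Entry<String, String> e = rangesByStart.floorEntry(key);
    return e != null && key.compareTo(e.getValue()) < 0;
  }
}
```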

@szetszwo
Contributor

szetszwo commented Feb 6, 2026

> ... it completely depends on the transactions that happen together in the same double buffer ...

@swamirishi , I agree. I mean that the current code does not have such transactions, and we are not going to support such code.

> ... improves the efficiency ...

Let's talk about correctness before talking about efficiency. As you mentioned in the description, the previous implementation (quoted below) indeed has a bug, so this change is very risky.

> ... It also missed a case: a putKey/deleteKey could still be added even though a later deleteRange covered it ...


@swamirishi
Contributor Author

swamirishi commented Feb 7, 2026

>> ... it completely depends on the transactions that happen together in the same double buffer ...
>
> @swamirishi , I agree. I mean that the current code does not have such transactions, and we are not going to support such code.
>
>> ... improves the efficiency ...
>
> Let's talk about correctness before talking about efficiency. As you mentioned in the description, the previous implementation (quoted below) indeed has a bug, so this change is very risky.
>
>> ... It also missed a case: a putKey/deleteKey could still be added even though a later deleteRange covered it ...

It was not a bug, just inefficient. We are fine as long as the order of operations is correct. For instance, [PUT K1, DELETE RANGE K5-K6, PUT K5, DELETE RANGE K1-K4] is equivalent to [DELETE RANGE K5-K6, PUT K5, DELETE RANGE K1-K4]. For the single-key op PUT K1, the previous implementation only looked at the very next DeleteRange operation, DELETE RANGE K5-K6, and executed the single-key op when there was no intersection. The implementation proposed in this PR optimizes further and peeks at every DeleteRange that comes after: following PUT K1 there are two, DELETE RANGE K5-K6 and DELETE RANGE K1-K4, and K1 intersects DELETE RANGE K1-K4, so PUT K1 can be skipped. Even if it is executed it is not an issue, just redundant. So the issue was never about correctness and does not constitute a bug.

We consciously did it the earlier way, since DELETE RANGE was only being used for SNAPSHOT CREATE and a DELETE RANGE could not intersect with any operation beyond one transaction, because SNAPSHOT CREATE splits the double buffer batch. The change proposed here targets more generic use cases, like @aryangupta1998's PR, and is also a safe change. The earlier implementation was O(n + k); with this PR it becomes O(n * k), where n is the number of single-key ops and k is the number of DeleteRange operations in a batch (P.S. we don't anticipate many delete ranges in one batch, so k should ideally be much smaller than n). The next PR brings this down to O(n log k).

@szetszwo
Contributor

szetszwo commented Feb 7, 2026

@swamirishi , sorry, this does not look like a safe change, since it dramatically changes RDBBatchOperation. In particular, this PR adds a new Map<Integer, Op> ops to support an unneeded case: mixing DeleteRange and SingleKeyOp in a batch.

  • I propose supporting pure batch DeleteRange (not mixed with other SingleKeyOps). The code would be much simpler.

@swamirishi
Contributor Author

swamirishi commented Feb 7, 2026

> @swamirishi , sorry, this does not look like a safe change, since it dramatically changes RDBBatchOperation. In particular, this PR adds a new Map<Integer, Op> ops to support an unneeded case: mixing DeleteRange and SingleKeyOp in a batch.
>
> • I propose supporting pure batch DeleteRange (not mixed with other SingleKeyOps). The code would be much simpler.

That is not even possible. It would mean the double buffer has to understand how the DBResponse interface is going to flush the transaction to the db. One transaction can contain puts, deletes, and delete ranges, and we can never guarantee that a transaction will have only deleteRange operations. All of them need to be flushed in the same RocksDB batch for atomicity guarantees; the sketch below illustrates this.
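To make the atomicity point concrete, a small stand-alone example against the plain RocksDB Java API (not Ozone code; the db path is arbitrary): puts, deletes, and deleteRanges from one transaction all go into a single WriteBatch and are applied by one atomic db.write() call.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class MixedBatchDemo {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(options, "/tmp/mixed-batch-demo");
         WriteBatch batch = new WriteBatch();
         WriteOptions writeOptions = new WriteOptions()) {
      batch.put("key1".getBytes(), "v1".getBytes());           // put
      batch.delete("key9".getBytes());                         // single-key delete
      batch.deleteRange("key2".getBytes(), "key5".getBytes()); // [key2, key5)
      db.write(writeOptions, batch); // all three apply atomically, or none do
    }
  }
}
```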

@szetszwo
Contributor

szetszwo commented Feb 7, 2026

> That is not even possible. ...

Today, we don't have batch DeleteRange and everything works. If we add pure batch DeleteRange tomorrow, would anything stop working?

Change-Id: Ia3db219dab70a7a2faf82009e8f0dac59c665938
Change-Id: Icae0e1299fcb7d37bd5a0f3c2f405537b1d68f3e
@swamirishi
Contributor Author

swamirishi commented Feb 7, 2026

> That is not even possible. ...
>
> Today, we don't have batch DeleteRange and everything works. If we add pure batch DeleteRange tomorrow, would anything stop working?

We don't have deleteRange today because we reverted #8774, and snapshot create was using the deleteRange operation. Without this patch we are going to kill snapshot create performance and lose the optimization that improves RocksDB performance in the case of huge deletes.

@szetszwo
Contributor

szetszwo commented Feb 8, 2026

> ... snapshot create was using the deleteRange operation. ...

I see your point. How about extending RDBBatchOperation in a subclass, say RDBBatchOperationWithDeleteRange? The new code goes into RDBBatchOperationWithDeleteRange while RDBBatchOperation stays unchanged.

In case there is a bug, people not using OM snapshot won't be affected.

@swamirishi
Contributor Author

swamirishi commented Feb 8, 2026

> I see your point. How about extending RDBBatchOperation in a subclass, say RDBBatchOperationWithDeleteRange? The new code goes into RDBBatchOperationWithDeleteRange while RDBBatchOperation stays unchanged.

Are you saying we should have separate implementations and make it configurable? Then OmResponse would have to branch its logic to understand which implementation has been passed to:

protected abstract void addToDBBatch(OMMetadataManager omMetadataManager,
    BatchOperation batchOperation) throws IOException;

@szetszwo
Contributor

szetszwo commented Feb 9, 2026

> Are you saying we should have separate implementations and make it configurable?

Configurable is a good idea:

  • People not using snapshot can use RDBBatchOperation.
  • Even people using snapshot could choose to use RDBBatchOperation (DeleteRange can be implemented by iterating Delete; see the sketch below). This is very useful in case there is a bug in the new code.
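As a sketch of that fallback (plain RocksDB Java API, not Ozone code), a half-open [start, end) deleteRange can be emulated by seeking to start and issuing single deletes; Arrays.compareUnsigned (Java 9+) matches RocksDB's default lexicographic key order:

```java
import java.util.Arrays;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

final class DeleteRangeFallback {
  // Emulates deleteRange(start, end) with single-key deletes.
  static void deleteRangeByIteration(RocksDB db, byte[] start, byte[] end)
      throws RocksDBException {
    try (RocksIterator it = db.newIterator()) {
      for (it.seek(start); it.isValid(); it.next()) {
        byte[] key = it.key();
        if (Arrays.compareUnsigned(key, end) >= 0) {
          break; // reached the exclusive end of the range
        }
        db.delete(key);
      }
    }
  }
}
```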

> Then OmResponse would have to branch its logic to understand which implementation has been passed

No. We could add a RDBBatchOperationFactory.
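For illustration only, such a factory might look like the sketch below; the method and flag names are hypothetical, no-arg constructors are assumed for brevity, and RDBBatchOperationWithDeleteRange is the subclass proposed above:

```java
// Hypothetical sketch: callers keep depending on the common supertype, so
// OmResponse code never needs to branch on the concrete implementation.
public final class RDBBatchOperationFactory {
  private RDBBatchOperationFactory() { }

  public static RDBBatchOperation create(boolean deleteRangeEnabled) {
    return deleteRangeEnabled
        ? new RDBBatchOperationWithDeleteRange() // new, opt-in code path
        : new RDBBatchOperation();               // unchanged legacy path
  }
}
```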
