
kvstorage: Add batching to the WAGTruncator #168351

Open
iskettaneh wants to merge 2 commits into cockroachdb:master from iskettaneh:rse_truncate_7

Conversation

Contributor

@iskettaneh iskettaneh commented Apr 14, 2026

This PR adds the ability to truncate multiple WAG nodes in a single batch. It adds the cluster setting kv.wag.truncator_batch_size to control the batch size.

Batch sizes benchmark:

BenchmarkWAGTruncation/batchSize=1                247114             15261 ns/op            1002 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=4                396939              5533 ns/op             608 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=8                452824              5220 ns/op             549 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=16               447286              4400 ns/op             505 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               512752              2402 ns/op             503 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               484096              3737 ns/op             481 B/op          9 allocs/op

References: #167607

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
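For readers skimming the PR, the core idea — buffer up to batchSize deletions in a single write batch and commit once — can be sketched in plain Go. Everything below (fakeBatch, truncateApplied) is a hypothetical stand-in for illustration, not the actual CockroachDB types or the PR's implementation:

```go
package main

import "fmt"

// fakeBatch is a hypothetical stand-in for a storage write batch: deletions
// are buffered in memory and applied only on Commit.
type fakeBatch struct{ pending []uint64 }

func (b *fakeBatch) Delete(idx uint64) { b.pending = append(b.pending, idx) }

// Commit applies all buffered deletions at once; commits is incremented so we
// can observe how many engine commits the loop performs.
func (b *fakeBatch) Commit(store map[uint64]bool, commits *int) {
	for _, idx := range b.pending {
		delete(store, idx)
	}
	b.pending = b.pending[:0]
	*commits++
}

// truncateApplied deletes every node with index <= applied, grouping up to
// batchSize deletions per commit (mirroring the one-commit-per-batch idea in
// this PR, not its actual code).
func truncateApplied(store map[uint64]bool, applied uint64, batchSize int) int {
	commits := 0
	b := &fakeBatch{}
	for idx := uint64(1); idx <= applied; idx++ {
		b.Delete(idx)
		if len(b.pending) == batchSize {
			b.Commit(store, &commits)
		}
	}
	if len(b.pending) > 0 {
		b.Commit(store, &commits) // flush the final partial batch
	}
	return commits
}

func main() {
	store := map[uint64]bool{}
	for i := uint64(1); i <= 100; i++ {
		store[i] = true
	}
	commits := truncateApplied(store, 100, 16)
	fmt.Println(commits, len(store)) // prints: 7 0
}
```

With batchSize=1 the same 100 deletions would cost 100 commits, which is the per-node overhead the benchmark above measures.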

@iskettaneh iskettaneh requested a review from pav-kv April 14, 2026 18:19
Contributor

trunk-io bot commented Apr 14, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@cockroach-teamcity
Member

This change is Reviewable


blathers-crl bot commented Apr 14, 2026

Detected infrastructure failure (matched: ). Automatically rerunning failed jobs. (run link)

@iskettaneh iskettaneh marked this pull request as ready for review April 15, 2026 14:08
@iskettaneh iskettaneh requested review from a team as code owners April 15, 2026 14:08
@iskettaneh iskettaneh requested a review from sumeerbhola April 15, 2026 14:08
@iskettaneh iskettaneh requested review from a team and removed request for a team and sumeerbhola April 17, 2026 20:20
Contributor Author

@iskettaneh iskettaneh left a comment


@iskettaneh made 4 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on pav-kv).


pkg/kv/kvserver/kvstorage/wag_truncator_test.go line 528 at r6 (raw file):

// thing, but it should give an idea of the improvement of different batch
// sizes.
func BenchmarkWAGTruncation(b *testing.B) {

@pav-kv I am not sure if the benchmark is really needed; it just helped me verify that the batching works and pick a default batchSize.

@iskettaneh iskettaneh requested a review from pav-kv April 17, 2026 20:24
Collaborator

@pav-kv pav-kv left a comment


The non-test code LGTM. I'll review tests a bit later, stepping off for today.

Comment on lines +175 to 178
if err = t.clearReplicaRaftLogAndSideloaded(ctx,
Raft{RO: t.eng.LogEngine(), WO: b}, event.Addr.RangeID, event.Addr.Index); err != nil {
return false, err
}
Collaborator


nit (idiomatic): use err :=, and bring err != nil to the next line:

			if err := t.clearReplicaRaftLogAndSideloaded(
				ctx, Raft{RO: t.eng.LogEngine(), WO: b}, event.Addr.RangeID, event.Addr.Index,
			); err != nil {
				return false, err
			}

}
return index, nil

if err = b.Commit(false); err != nil {
Collaborator


nit: false /* sync */

settings.SystemOnly,
"kv.wag.truncator_batch_size",
"number of WAG nodes to delete per write batch during truncation",
8,
Collaborator


How about 32 or 64? Looks 2x better than 8 according to benchmarks. Wonder if 64 being slower than 32 is a flake, or there is some actual slowdown.

Contributor Author


I reran the benchmark (5 times), and it gets kinda noisy after 16:

BenchmarkWAGTruncation/batchSize=1                100000             15413 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             17085 ns/op             959 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             16959 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             15240 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             15508 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7110 ns/op             611 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              5439 ns/op             610 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              5314 ns/op             610 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7148 ns/op             590 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7059 ns/op             612 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              3677 ns/op             539 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5388 ns/op             540 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5329 ns/op             539 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              3750 ns/op             540 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5428 ns/op             541 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              3021 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              4555 ns/op             496 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              2834 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              2827 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              4440 ns/op             501 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2374 ns/op             487 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              3944 ns/op             484 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2514 ns/op             486 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2406 ns/op             485 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              4016 ns/op             484 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              3946 ns/op             479 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2328 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2247 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              3747 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2192 ns/op             480 B/op          9 allocs/op
PASS

I think I will pick 16. There is one downside that I can think of with larger batch size. If there is some transient disk error or something of that sort, the larger batch size will be a bit more likely to encounter it.

"kv.wag.truncator_batch_size",
"number of WAG nodes to delete per write batch during truncation",
8,
settings.IntInRange(1, 1024),
Collaborator


1024 seems a reasonable cap. Not just using IntWithMinimum to make things "safe" in some sense?

It probably wouldn't be much of a win to raise it much higher anyway?

Contributor Author


Yeah I don't see a reason why we might want to increase it above 1024.
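For context, the fragments quoted in this thread appear to assemble into a registration along these lines. This is a sketch reconstructed only from the lines visible in this review (the surrounding file is not shown here), with the default of 8 as quoted in the diff at this point:

```go
// Sketch reassembled from the quoted fragments; not the exact diff.
var wagTruncatorBatchSize = settings.RegisterIntSetting(
	settings.SystemOnly,
	"kv.wag.truncator_batch_size",
	"number of WAG nodes to delete per write batch during truncation",
	8, // the benchmark discussion in this thread settles on 16
	settings.IntInRange(1, 1024),
)
```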

// their events have been applied to the state engine. For nodes containing
// EventDestroy or EventSubsume events, it also clears the corresponding raft
// log prefix from the engine and the sideloaded entries storage.
// log prefix from the engine and the sideloaded entries.
Collaborator


nit: revert "storage"? The old thing reads to me as "clears the ... raft log prefix ... from .. the sideloaded entries storage". The sentence seems broken without "storage".

if count == 0 {
return false, nil
}
if err := iter.Error(); err != nil {
Collaborator


Put this check first? count=0 is possible on an error, so we should probably prioritize returning an error in that case. Could also squash as:

if err := iter.Error(); err != nil || count == 0 {
	return false, err
}

Contributor Author


Good point!

if err := iter.Error(); err != nil {
return false, err
}
if err := b.Commit(false); err != nil {
Collaborator


false /* sync */

Contributor Author


Done.

if err := b.Commit(false); err != nil {
return false, err
}
t.truncIndex.Store(targetIndex - 1) // targetIndex is pointing at the last index truncated + 1.
Collaborator

@pav-kv pav-kv Apr 17, 2026


How about flipping the script a bit, so that we don't +- 1 so much?

truncated := t.truncIndex.Load()
for ... {
	if index != truncated+1 && index > t.initIndex {
		// We cannot ignore gaps for WAG indices > initIndex.
		break
	}
	...
	truncated = index
	count++
	...
}
...
t.truncIndex.Store(truncated)
return true, nil

"-1" needs a "proof" and relies on the "count == 0" early exit above. Whereas this way, no proof needed, and even an accidental unconditional Store would be correct in the no-op case.

Contributor Author


Yeah that makes sense

iskettaneh and others added 2 commits April 19, 2026 20:43
…learRaftState

Move batch creation, commit, and truncIndex advancement from
truncateAppliedNodes into truncateAppliedWAGNodeAndClearRaftState,
making the latter fully self-contained. This simplifies the caller
loop and makes the method signature cleaner (bool instead of uint64).

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Previously, truncateAppliedWAGNodeAndClearRaftState deleted one WAG node
per batch and committed immediately.

This commit does the following:

1) Rename truncateAppliedWAGNodeAndClearRaftState() to truncateBatch().

2) Introduce a cluster-setting that controls the batch size.

3) Try to fit up to batchSize deletions into each call to truncateBatch().

Benchmark results:

```
BenchmarkWAGTruncation/batchSize=1                100000             15413 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             17085 ns/op             959 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             16959 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             15240 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             15508 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7110 ns/op             611 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              5439 ns/op             610 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              5314 ns/op             610 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7148 ns/op             590 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7059 ns/op             612 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              3677 ns/op             539 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5388 ns/op             540 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5329 ns/op             539 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              3750 ns/op             540 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5428 ns/op             541 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              3021 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              4555 ns/op             496 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              2834 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              2827 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              4440 ns/op             501 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2374 ns/op             487 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              3944 ns/op             484 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2514 ns/op             486 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2406 ns/op             485 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              4016 ns/op             484 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              3946 ns/op             479 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2328 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2247 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              3747 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2192 ns/op             480 B/op          9 allocs/op
```

Release note: None
Epic: none
Contributor Author

@iskettaneh iskettaneh left a comment


@iskettaneh made 5 comments and resolved 3 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on pav-kv).

@iskettaneh iskettaneh requested a review from pav-kv April 20, 2026 00:45
3 participants