
kvstorage: Add batching to the WAGTruncator #168351

Open
iskettaneh wants to merge 2 commits into cockroachdb:master from iskettaneh:rse_truncate_7

Conversation

Contributor

@iskettaneh iskettaneh commented Apr 14, 2026

This PR adds the ability to truncate multiple WAG nodes in a single batch. It adds the cluster setting kv.wag.truncator_batch_size to control the batch size.

Batch sizes benchmark:

BenchmarkWAGTruncation/batchSize=1                247114             15261 ns/op            1002 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=4                396939              5533 ns/op             608 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=8                452824              5220 ns/op             549 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=16               447286              4400 ns/op             505 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               512752              2402 ns/op             503 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               484096              3737 ns/op             481 B/op          9 allocs/op

References: #167607

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
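For readers skimming the PR, the core idea — buffer up to batchSize deletions in a single write batch and commit once — can be sketched in plain Go. Everything below (fakeBatch, truncateApplied) is a hypothetical stand-in for illustration, not the actual CockroachDB types or the PR's implementation:

```go
package main

import "fmt"

// fakeBatch is a hypothetical stand-in for a storage write batch: deletions
// are buffered in memory and applied only on Commit.
type fakeBatch struct{ pending []uint64 }

func (b *fakeBatch) Delete(idx uint64) { b.pending = append(b.pending, idx) }

// Commit applies all buffered deletions at once; commits is incremented so we
// can observe how many engine commits the loop performs.
func (b *fakeBatch) Commit(store map[uint64]bool, commits *int) {
	for _, idx := range b.pending {
		delete(store, idx)
	}
	b.pending = b.pending[:0]
	*commits++
}

// truncateApplied deletes every node with index <= applied, grouping up to
// batchSize deletions per commit (mirroring the one-commit-per-batch idea in
// this PR, not its actual code).
func truncateApplied(store map[uint64]bool, applied uint64, batchSize int) int {
	commits := 0
	b := &fakeBatch{}
	for idx := uint64(1); idx <= applied; idx++ {
		b.Delete(idx)
		if len(b.pending) == batchSize {
			b.Commit(store, &commits)
		}
	}
	if len(b.pending) > 0 {
		b.Commit(store, &commits) // flush the final partial batch
	}
	return commits
}

func main() {
	store := map[uint64]bool{}
	for i := uint64(1); i <= 100; i++ {
		store[i] = true
	}
	commits := truncateApplied(store, 100, 16)
	fmt.Println(commits, len(store)) // prints: 7 0
}
```

With batchSize=1 the same 100 deletions would cost 100 commits, which is the per-node overhead the benchmark above measures.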

@iskettaneh iskettaneh requested a review from pav-kv April 14, 2026 18:19
Contributor

trunk-io bot commented Apr 14, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@cockroach-teamcity
Member

This change is Reviewable


blathers-crl bot commented Apr 14, 2026

Detected infrastructure failure (matched: ). Automatically rerunning failed jobs. (run link)

@iskettaneh iskettaneh marked this pull request as ready for review April 15, 2026 14:08
@iskettaneh iskettaneh requested review from a team as code owners April 15, 2026 14:08
@iskettaneh iskettaneh requested a review from sumeerbhola April 15, 2026 14:08
@iskettaneh iskettaneh requested review from a team and removed request for a team and sumeerbhola April 17, 2026 20:20
Contributor Author

@iskettaneh iskettaneh left a comment


@iskettaneh made 4 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on pav-kv).


pkg/kv/kvserver/kvstorage/wag_truncator_test.go line 528 at r6 (raw file):

// thing, but it should give an idea of the improvement of different batch
// sizes.
func BenchmarkWAGTruncation(b *testing.B) {

@pav-kv I am not sure if the benchmark is really needed; it just helped me verify that the batching works and pick a default batchSize.

@iskettaneh iskettaneh requested a review from pav-kv April 17, 2026 20:24
Collaborator

@pav-kv pav-kv left a comment


The non-test code LGTM. I'll review tests a bit later, stepping off for today.

Comment on lines +175 to 178
if err = t.clearReplicaRaftLogAndSideloaded(ctx,
Raft{RO: t.eng.LogEngine(), WO: b}, event.Addr.RangeID, event.Addr.Index); err != nil {
return false, err
}
Collaborator


nit (idiomatic): use err :=, and bring err != nil to the next line:

			if err := t.clearReplicaRaftLogAndSideloaded(
				ctx, Raft{RO: t.eng.LogEngine(), WO: b}, event.Addr.RangeID, event.Addr.Index,
			); err != nil {
				return false, err
			}

}
return index, nil

if err = b.Commit(false); err != nil {
Collaborator


nit: false /* sync */

settings.SystemOnly,
"kv.wag.truncator_batch_size",
"number of WAG nodes to delete per write batch during truncation",
8,
Collaborator


How about 32 or 64? Looks 2x better than 8 according to benchmarks. Wonder if 64 being slower than 32 is a flake, or there is some actual slowdown.

Contributor Author


I reran the benchmark (5 times), and it gets kinda noisy after 16:

BenchmarkWAGTruncation/batchSize=1                100000             15413 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             17085 ns/op             959 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             16959 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             15240 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             15508 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7110 ns/op             611 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              5439 ns/op             610 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              5314 ns/op             610 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7148 ns/op             590 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7059 ns/op             612 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              3677 ns/op             539 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5388 ns/op             540 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5329 ns/op             539 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              3750 ns/op             540 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5428 ns/op             541 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              3021 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              4555 ns/op             496 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              2834 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              2827 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              4440 ns/op             501 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2374 ns/op             487 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              3944 ns/op             484 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2514 ns/op             486 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2406 ns/op             485 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              4016 ns/op             484 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              3946 ns/op             479 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2328 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2247 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              3747 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2192 ns/op             480 B/op          9 allocs/op
PASS

I think I will pick 16. There is one downside that I can think of with larger batch size. If there is some transient disk error or something of that sort, the larger batch size will be a bit more likely to encounter it.

"kv.wag.truncator_batch_size",
"number of WAG nodes to delete per write batch during truncation",
8,
settings.IntInRange(1, 1024),
Collaborator


1024 seems a reasonable cap. Not just using IntWithMinimum to make things "safe" in some sense?

It probably wouldn't be much of a win to raise it much higher anyway?

Contributor Author


Yeah I don't see a reason why we might want to increase it above 1024.
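For context, the fragments quoted in this thread appear to assemble into a registration along these lines. This is a sketch reconstructed only from the lines visible in this review (the surrounding file is not shown here), with the default of 8 as quoted in the diff at this point:

```go
// Sketch reassembled from the quoted fragments; not the exact diff.
var wagTruncatorBatchSize = settings.RegisterIntSetting(
	settings.SystemOnly,
	"kv.wag.truncator_batch_size",
	"number of WAG nodes to delete per write batch during truncation",
	8, // the benchmark discussion in this thread settles on 16
	settings.IntInRange(1, 1024),
)
```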

// their events have been applied to the state engine. For nodes containing
// EventDestroy or EventSubsume events, it also clears the corresponding raft
// log prefix from the engine and the sideloaded entries storage.
// log prefix from the engine and the sideloaded entries.
Collaborator


nit: revert "storage"? The old thing reads to me as "clears the ... raft log prefix ... from .. the sideloaded entries storage". The sentence seems broken without "storage".

if count == 0 {
return false, nil
}
if err := iter.Error(); err != nil {
Collaborator


Put this check first? count=0 is possible on an error, so we should probably prioritize returning an error in that case. Could also squash as:

if err := iter.Error(); err != nil || count == 0 {
	return false, err
}

Contributor Author


Good point!

if err := iter.Error(); err != nil {
return false, err
}
if err := b.Commit(false); err != nil {
Collaborator


false /* sync */

Contributor Author


Done.

if err := b.Commit(false); err != nil {
return false, err
}
t.truncIndex.Store(targetIndex - 1) // targetIndex is pointing at the last index truncated + 1.
Collaborator

@pav-kv pav-kv Apr 17, 2026


How about flipping the script a bit, so that we don't +- 1 so much?

truncated := t.truncIndex.Load()
for ... {
	if index != truncated+1 && index > t.initIndex {
		// We cannot ignore gaps for WAG indices > initIndex.
		break
	}
	...
	truncated = index
	count++
	...
}
...
t.truncIndex.Store(truncated)
return true, nil

"-1" needs a "proof" and relies on the "count == 0" early exit above. Whereas this way, no proof needed, and even an accidental unconditional Store would be correct in the no-op case.

Contributor Author


Yeah that makes sense

iskettaneh and others added 2 commits April 19, 2026 20:43
…learRaftState

Move batch creation, commit, and truncIndex advancement from
truncateAppliedNodes into truncateAppliedWAGNodeAndClearRaftState,
making the latter fully self-contained. This simplifies the caller
loop and makes the method signature cleaner (bool instead of uint64).

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Previously, truncateAppliedWAGNodeAndClearRaftState deleted one WAG node
per batch and committed immediately.

This commit does the following:

1) Rename truncateAppliedWAGNodeAndClearRaftState() to truncateBatch().

2) Introduce a cluster-setting that controls the batch size.

3) Try to fit up to batchSize deletions into each call to truncateBatch().

Benchmark results:

```
BenchmarkWAGTruncation/batchSize=1                100000             15413 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             17085 ns/op             959 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             16959 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             15240 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=1                100000             15508 ns/op            1003 B/op         20 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7110 ns/op             611 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              5439 ns/op             610 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              5314 ns/op             610 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7148 ns/op             590 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=4                100000              7059 ns/op             612 B/op         11 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              3677 ns/op             539 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5388 ns/op             540 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5329 ns/op             539 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              3750 ns/op             540 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=8                100000              5428 ns/op             541 B/op         10 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              3021 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              4555 ns/op             496 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              2834 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              2827 ns/op             495 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=16               100000              4440 ns/op             501 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2374 ns/op             487 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              3944 ns/op             484 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2514 ns/op             486 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              2406 ns/op             485 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=32               100000              4016 ns/op             484 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              3946 ns/op             479 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2328 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2247 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              3747 ns/op             480 B/op          9 allocs/op
BenchmarkWAGTruncation/batchSize=64               100000              2192 ns/op             480 B/op          9 allocs/op
```

Release note: None
Epic: none
Contributor Author

@iskettaneh iskettaneh left a comment


@iskettaneh made 5 comments and resolved 3 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on pav-kv).

@iskettaneh iskettaneh requested a review from pav-kv April 20, 2026 00:45
3 participants