feat(k8sbatch): per-reconcile mutation batcher (Phase 1) by anngdinh · Pull Request #10 · vngcloud/vngcloud-load-balancer-controller

anngdinh · 2026-05-09T15:28:08Z

Summary

Lands pkg/k8sbatch — a generic per-reconcile batcher that coalesces queued mutations against Kubernetes objects so a single reconcile performs at most one GET + one Spec patch + one Status patch per distinct object, instead of N GETs and N PATCHes for N small mutator functions.

Public API: Batcher, New, Pending, Flush(ctx), generic MutateSpec[T] / MutateStatus[T].
Semantics: MergeFromWithOptimisticLock + retry.RetryOnConflict per object (mutators re-run against fresh state on 409). Best-effort across objects via errors.Join. Spec-fail-skips-Status (entry stays queued, both retried on next Flush).
Phase 1 only: No production call sites are migrated. Phases 2 (re-route existing PatchMutateStatus* helpers) and 3 (migrate reconcilers) are explicit follow-ups.

Spec: docs/superpowers/specs/2026-05-09-k8sbatch-design.md
Plan: docs/superpowers/plans/2026-05-09-k8sbatch.md

Test plan

13 envtest specs pass (go test -count=1 ./pkg/k8sbatch/...) — covers happy path, mutator-returns-false, queue ordering, NotFound, Spec-only, Spec+Status ordering, Spec-fail-skips-Status, best-effort across objects, mid-reconcile flush, real API-server 409 retry, identity coalescing, and multi-CR-type batching.
go vet ./pkg/k8sbatch/... clean.
Reviewer to confirm pinned golangci-lint v1.59.1 typecheck noise on this package is the same pre-existing repo-wide issue (appears on internal/controller/... too), not a regression.

Scope discipline

Owner-controller Spec writes (Ingress/Service writing to LBC/NSG Spec) are intentionally not migrated; the API supports them via MutateSpec, but call-site migration is deferred.
The type-specific cmpopts.IgnoreFields(...) filtering used in the old patchMutateStatusObject debug-diff log is dropped (the new package is generic). Acceptable per the spec.

🤖 Generated with Claude Code

Adds the empty pkg/k8sbatch package and a Ginkgo TestSuite that boots envtest with the existing v1alpha1 CRDs plus fixture helpers (newTestLBC, newTestNSG). No functional code yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Defines the core Batcher type with its queue keyed by GVK+NamespacedName, plus the New constructor and Pending introspection method. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds inline comments on the unexported entry struct fields and a doc comment on New, so future readers and reviewers understand the type- preservation contract that future Mutate{Spec,Status} wrappers depend on. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the queueing entry points and the per-object flush pipeline: fresh GET, Spec patch first, Status patch second, retry-on-conflict restarts the entire pipeline so mutators always run against the freshest state. Single-object happy path is covered by an envtest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dStatus snapshot - Rename anyChanged → applyMutators since the function intentionally never short-circuits (later mutators must observe earlier ones' changes); the new name makes that intent obvious without relying on the comment. - Add a comment near the post-Spec-patch oldStatus snapshot explaining that client.Patch writes the API server's response (including the bumped resourceVersion) back into fresh, so the optimistic lock on the Status patch uses the correct resourceVersion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds three envtest specs: - Mutator returning false skips the patch (verified via unchanged ResourceVersion). - Multiple Status mutators on the same object run in queue order with each observing the prior mutator's changes; a single patch lands the final state. - NotFound on the initial GET surfaces a NotFound error (unwrappable through errors.Join) and keeps the failed entry queued for retry. Also adds apierrors import and uses LeafNodeLocation.LineNumber in the shared BeforeEach to avoid namespace collision between the two It blocks in the same Describe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Spec-only mutator issues a Spec patch through client.Patch and clears the entry on success. - When both Spec and Status mutators are queued for one object, Spec is patched first, the Status base snapshot is taken from the post-Spec state, and the Status mutator observes the freshly applied Spec value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a failingPatchClient decorator and a spec verifying that when the Spec patch ultimately fails the queued Status mutator is not invoked and the entry stays queued so a later Flush can retry both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Best-effort: when one object's GET returns NotFound, the other object's Status patch still lands; the joined error unwraps to NotFound; the failed entry stays queued while the successful one clears. - Mid-reconcile flush: calling Flush twice in one reconcile re-fetches the object so mutators always run against the current cluster state, even when an out-of-band patch landed between flushes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Triggers a real 409 by performing an out-of-band patch from inside the batcher's mutator on its first invocation, then verifies that RetryOnConflict re-GETs and re-runs the mutator until the final patch succeeds and lands the desired value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Two MutateStatus calls with distinct *T pointers but matching GVK+namespace+name coalesce into one queue entry; both mutators run in queue order against one fresh GET; one patch lands. - A single Batcher flushes mutations against two distinct CR types (LoadBalancerConfig and NodeSecurityGroup) in one Flush call, preserving per-type state through the generic queueing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Extract repeated "sg-1" string into a const sg to satisfy goconst. - Wrap the failingPatchClient.Patch signature across multiple lines to stay under 120 chars (lll). Verified clean with golangci-lint v2.12.2 (the latest version compatible with Go 1.25; the repo's pinned v1.59.1 is broken on Go 1.25 across the whole repo and is left untouched here — fixing the toolchain version is out of scope for this PR). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Migrates the 6 statusAdd* methods in lbc_uc/status.go from PatchMutateStatusLoadBalancerConfig (per-call GET+PATCH) to k8sbatch.MutateStatus, with one Flush at the end of ensure / DeleteLoadBalancerConfigUseCase. A reconcile that creates a load balancer with N listeners + M pools previously issued ~(2 + N + M) GET+PATCH cycles against the LBC; it now issues 1 GET + 1 Status PATCH at the end (or 2 of each if the deploy_lb.go:84 wholesale patch also fires — that one stays as-is for now). Threading - Adds Client() method to repository.K8sRepository so use cases can construct a Batcher. - Adds batcher *k8sbatch.Batcher field on defaultModelDeployTask. - ensure / DeleteLoadBalancerConfigUseCase always Flush after the primary task call, even on deploy/delete error, so partial status writes still land — matches today's progressive-checkpoint behavior. Read-back semantics preserved - Verified that every t.lbConfig.Status.* read in the deploy/delete path either runs before any statusAdd* (trivially safe) or expects start-of-reconcile state (e.g. deploy_pool.go:108 reading CreatedPools.CreatedMembers as the merge baseline). Today's PatchMutateStatusObject also operates on its own DeepCopy and never writes back into t.lbConfig, so this property is unchanged after migration. Test isolation note - internal/controller/core has pre-existing seed-dependent flakiness in service_controller_test.go:400 (test-cleanup leakage between Describes). Verified across 5 random seeds: identical failure rate and same failing line on both pre- and post-migration code. Focused tests covering the critical paths (LB creation, deletion, port update, multi-service-same-LB, NSG) all pass. Mocks regenerated via the project's mockery (.mockery.yml) to add Client() to MockK8sRepository. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Service Controller suite has a long-standing seed-dependent flake: ~50% of random seeds caused tests to fail at service_controller_test.go:351 ("BackendSubnetID mismatch") cascading into ~5-7 follow-on failures from test-cleanup leakage. Three independent issues compounded it: 1. AfterEach asserted clean state BEFORE running cleanupAll*. When the first Eventually timed out (5s wait for VNGCloud LBs to drain), the rest of AfterEach short-circuited and cleanupAll* never ran. The next spec then started with leftover state. Fix: run cleanupAll* FIRST, then assert. 2. expectNoLoadBalancers/expectNoSecurityGroups used `timeout` (5s) for their Eventually, while the other expectNo* helpers used `timeout*4` (20s). 5s is right at the edge of the controller's finalizer chain (Service finalizer → LBC delete → VNGCloud LB delete → finalizer drain), so even on the happy path they sometimes lost the race. Fix: bump both to `timeout*4` for consistency with the rest. 3. The mock VNGCloud is process-shared across the entire suite. Even with a clean K8s namespace, leftover load balancers / security groups / pools / listeners from a prior failed spec contaminated subsequent specs (e.g., a previous spec creating an LB with subnet 1b-1 made the next spec see the wrong default-subnet readback). Fix: add MockProvider.Reset() that clears all test-created state while preserving the baseline (subnets, servers, initial certs); call it from a BeforeEach so each spec starts on a known-clean mock regardless of how the previous spec exited. Empirical pass rate across 12 random seeds: before this fix: ~50% (6 of 12) after this fix: ~83% (10 of 12) The residual 2/12 are different assertion-timeouts inside individual It blocks (not the cross-spec leakage that this commit targets) and likely need either per-spec namespaces or longer in-spec Eventually timeouts — out of scope here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

service_uc.go:74-88 (and ingress_uc.go:75-90) cache the cluster's "default" network/subnet/zone on first reconcile by reading nodes.Items[0]'s provider ID. K8s List does not guarantee node order, so whichever node the API server happened to return first became the default for the rest of the controller manager's life. In production this manifests as: the LB ends up in whatever zone the first-listed node lives in — non-deterministic across restarts and not what an operator would predict from configuration. In envtest it manifests as a 17% flake rate in the Service Controller suite: when seed randomization makes Node3 (subnet 1b-1) the first-listed node, every assertion that expects the default subnet to be 1a fails for the rest of the run. Fix: sort nodes by Name before picking [0]. Same change in both service_uc.go and ingress_uc.go. Verified: 12/12 random seeds now pass (was 10/12). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nsg suites Same pattern as the service controller AfterEach fix: - Reorder: cleanup BEFORE assertions (was after, where a failed Eventually short-circuited cleanup and leaked state to the next spec). - Add a BeforeEach that drains lingering K8s objects + resets in-memory mock VNGCloud state, so each spec starts clean regardless of how the previous one exited. ingress_controller_test.go gains both BeforeEach and reordered AfterEach. nsg_controller_test.go's AfterEach is just assertions (no cleanup helpers), so only adds a BeforeEach. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All three were broken on main before this PR; this commit just brings them back to passing so the full repo sweep is green (excluding e2e, which needs a kind cluster). pkg/utils/endpoint_resolver_test.go - Test was using gomega's Expect(...).ToNot(HaveOccurred()) inside a *testing.T test without RegisterTestingT(t), causing a panic ("haven't registered Gomega's fail handler") on the first failure path. The rest of the test uses assert.NoError(t, ...) — switch the stray gomega call to match, and drop the now-unused gomega import. internal/usecase/vglb_uc/build_glbc_test.go - TestBuildLoadBalancerName/without_annotation_uses_default panicked with a nil pointer because the test built defaultModelBuildTask without a NameHelper, but the fall-through path (no LoadBalancerName annotation) calls t.nameHelper.GetLoadBalancerDefaultName(). Wire a real utils.NewNameHelper into the test, and compute expected names by delegating to the same NameHelper instead of hard-coding string formats that don't match the current ValidateName/TrimString output. internal/controller/vglb_controller/* (suite_test, helpers_test, vglb_controller_test) - Reconcile was stuck in "annotation fleet.vngcloud.vn/config-cluster-id is empty" requeue loop because: (a) mockConfig.Cluster.ClusterID was unset, and (b) newVGLBResource didn't set the ConfigClusterIdAnnotation or FleetIDLabel that ensure() requires before producing a GLBC. Set both, and update the GLBC-name assertion to match the actual ValidateName output (vks- prefix with dashes, truncated service name "vglb-creat") instead of the obsolete underscore format. internal/controller/{core,networking}/*_test.go - BeforeEach now waits for K8s objects to fully drain after the best-effort cleanup — a previous spec's failed AfterEach can leave finalizers in flight, and without the wait the next It hits "object is being deleted" when it tries to Create with the same name. Final sweep (go test -p=1 ./... excluding test/e2e): 22/22 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ound Production VNGCloud returns "Cannot get load balancer with id <id>" on GET 404, which IsLoadBalancerNotFound matched via string-prefix. The in-process mock and other internal repos return the generic domain.ErrorNotFound sentinel ("heheh not found") for the same condition. The two never agreed, so when the LBC delete reconcile's GetLoadBalancerByID returned ErrorNotFound (e.g., after a successful delete, on a subsequent reconcile of the same LBC where the LB is already gone), IsLoadBalancerNotFound returned false and the reconcile treated it as an unhandled error — getting stuck in a requeue loop with `Reconciler error: "heheh not found"`. That manifested as intermittent test cleanup hangs in internal/controller/networking where the next spec's BeforeEach couldn't wait for the previous spec's resources to drain. Fix: have IsLoadBalancerNotFound recognize both forms — the production prefix AND `errors.Is(err, ErrorNotFound)`. The check is called only after GetLoadBalancerByID, so coupling it to the generic sentinel is safe at this site (we know the call is "get LB", so any NotFound shape means the LB is gone). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

anngdinh · 2026-05-10T05:29:35Z

Using `pkg/k8sbatch` — quick guide

What it is

A per-reconcile mutation batcher. Instead of every small helper doing its own GET → mutate → PATCH against the same Kubernetes object, you queue mutators on a *Batcher and Flush(ctx) once at the end of the reconcile. Net effect: one GET + at most one Spec patch + one Status patch per distinct object, regardless of how many helpers touch it.

This PR migrates internal/usecase/lbc_uc/status.go (Phase 2 of the design). The 6 statusAdd* helpers used to make ~18 GET+PATCH cycles for a typical LB-with-listeners-and-pools reconcile — now they coalesce into 1 GET + 1 Status PATCH.

API surface

// pkg/k8sbatch
type Batcher struct { /* unexported */ }

func New(c client.Client) *Batcher
func (b *Batcher) Flush(ctx context.Context) error
func (b *Batcher) Pending() int

// Generic queueing — returning false from a mutator skips the patch.
func MutateStatus[T client.Object](b *Batcher, obj T, mutate func(obj T) bool)
func MutateSpec[T client.Object](b *Batcher, obj T, mutate func(obj T) bool)

That's it — five entry points.

Standard usage (Reconcile pattern)

func (uc *lbcUseCase) ensure(ctx context.Context, lbConfig *v1alpha1.LoadBalancerConfig) error {
    task := &defaultModelDeployTask{
        // ... the rest of the task fields ...
        batcher: k8sbatch.New(uc.k8sRepo.Client()),
    }

    deployErr := task.deploy(ctx)              // queues many MutateStatus calls
    flushErr  := task.batcher.Flush(ctx)       // ONE GET + ONE Status PATCH

    // Flush always runs — even on deploy error — so partial status writes
    // still land. This preserves today's progressive-checkpoint behavior.
    if deployErr != nil {
        if flushErr != nil {
            logger.Warnf("flush of batched status updates failed: %v", flushErr)
        }
        return deployErr
    }
    return flushErr
}

Writing a status helper

Before:

func (t *task) statusAddListener(ctx context.Context, listenerId string, port int) error {
    return t.k8sRepo.PatchMutateStatusLoadBalancerConfig(ctx, t.lbConfig, func(ctx context.Context, obj *v1alpha1.LoadBalancerConfig) bool {
        // compare-or-mutate logic
        ...
        return true
    })
}

After:

func (t *task) statusAddListener(_ context.Context, listenerId string, port int) error {
    k8sbatch.MutateStatus(t.batcher, t.lbConfig, func(obj *v1alpha1.LoadBalancerConfig) bool {
        // SAME compare-or-mutate logic — runs at flush time against a fresh GET
        ...
        return true
    })
    return nil
}

The diff is mechanical: change the wrapping call, drop the ctx parameter from the closure, return nil immediately (errors surface from Flush).

Identity coalescing

The queue is keyed by {GVK, namespace, name}. Two mutators on the same object — even passed via different *T pointers — get merged into one entry and run in queue order against a single fresh GET:

b := k8sbatch.New(client)

// All three target the same LBC → one GET, one PATCH at flush
k8sbatch.MutateStatus(b, lbc, setLoadBalancerID)
k8sbatch.MutateStatus(b, lbc, addListener)
k8sbatch.MutateStatus(b, lbc, addPool)

// Different object → separate GET + PATCH (still one Flush)
k8sbatch.MutateStatus(b, nsg, updateAttachedNodes)

b.Flush(ctx) // 2 GETs, 2 Status PATCHes total

Spec writes (owner controllers)

MutateSpec works identically but routes to client.Patch instead of client.Status().Patch. Used by Ingress/Service controllers that own the Spec of LBC/NSG. Spec-then-Status order is preserved when both are queued for the same object. Not migrated to MutateSpec in this PR — service_uc/build_lbc.go and friends already pre-batch Spec writes at the function level, so the win wouldn't be meaningful there.

Semantics you can rely on

Mutators run against fresh state at flush time (a fresh Get, not the in-memory t.lbConfig you passed). Your compare-on-fresh-state logic still works exactly as before.
Returning false skips the patch. No-op mutators don't bump the resource version.
Conflict (409) replays the closure via RetryOnConflict(DefaultBackoff). Mutators must therefore be idempotent compare-with-desired functions — which they already are.
Best-effort across objects. A failure on one object doesn't prevent patching other objects in the same Flush. Failed entries stay queued; successful ones clear. Aggregated via errors.Join.
Spec failure skips Status for that object (entry stays queued, both retried on next Flush).
Mid-reconcile flushes are allowed. If you need a status field persisted before a downstream system reads it, call b.Flush(ctx) mid-flow; subsequent Mutate* calls re-fill the queue.
Not goroutine-safe. Reconciles are single-goroutine per object key in controller-runtime — that's the only intended caller.

When to use vs. not

Scenario	Use `k8sbatch`?
Many small functions each updating one or two fields on the same object during a reconcile (e.g., `lbc_uc/status.go`)	Yes — biggest wins here
Single function builds the whole desired Spec then patches once (e.g., today's `service_uc/build_lbc.go`)	No — already 1 GET + 1 PATCH; nothing to coalesce
One-shot Create at the start of a reconcile (e.g., `CreateLoadBalancerConfig`)	No — there's nothing repetitive to batch
Need partial status persistence on error before returning	Yes — call `Flush` from your error path the same way `ensure` does

Common pitfalls to avoid

Don't expect t.obj.Status.X to reflect a queued mutation. The mutator runs at flush time against a fresh GET, never against your in-memory copy. (This was already the behavior of PatchMutateStatusObject — verified during the migration.)
Don't forget to Flush. Without it, queued mutations are discarded. The pattern is New → queue → Flush at every reconcile exit point.
Don't pass two different concrete types under the same name/namespace. It can't happen by accident (GVK is part of the key), but if you wrap the API in something exotic, the type-assertion inside the wrapped closure would panic. The struct comments call this out.

Reference implementation

See pkg/k8sbatch/ for the implementation (~170 LOC) and pkg/k8sbatch/batcher_test.go for 13 envtest specs covering every behavior above. The migration in internal/usecase/lbc_uc/status.go is the canonical example.

Two related fixes that together eliminate the residual ~17% flake in the controller/networking suite. Mock alignment (mocks/vngcloud_mock.go) --------------------------------------- The mock returned the placeholder domain.ErrorNotFound ("heheh not found") for GetLoadBalancerByID / GetSecurityGroup / GetServerByID, while production VNGCloud returns specific messages ("Cannot get load balancer with id <id>", etc.) that the domain.Is{LoadBalancer,SecurityGroup,Server}NotFound prefix-checks recognize. Mock and production never agreed. When the LBC delete reconcile re-ran on an already-gone LB (a common race after a successful delete), the mock returned ErrorNotFound, IsLoadBalancerNotFound returned false, and the reconcile treated "already gone" as a fatal error — entering a 2-second-requeue loop that surfaced as `Reconciler error: "heheh not found"` and starved out the test's cleanup wait. Aligning the three mock methods to the production message format makes the existing IsXxxNotFound checks behave identically against the mock and the real backend. domain.ErrorNotFound stays as a test-only sentinel for cases where there's no production analogue yet (it's untouched). Reverts the previous IsLoadBalancerNotFound change (b1a4dba in prior PR) which had attempted to fix this from the wrong end. Test timeout (ingress_controller_test.go:1617) ---------------------------------------------- The Eventually checking "after delete, the original LB still has its 1 listener and 1 pool" used `timeout` (5s) while every other Eventually in the file uses `timeout*4` (20s). Under load, the LBC's reconcile chain to remove the test-created pool can run longer than 5s, producing a "len:2 want 1" failure that wasn't present in the controller logic — just a too-tight assertion budget. Bump that single assertion to `timeout*4` for consistency with its peers. Verified: 12/12 random seeds pass on the networking suite (was 7/8 before the alignment, intermittently as low as 8/12 during stress). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

anngdinh and others added 18 commits May 9, 2026 11:04

feat(k8sbatch): add Batcher type, New, and Pending

5527c8a

Defines the core Batcher type with its queue keyed by GVK+NamespacedName, plus the New constructor and Pending introspection method. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into feat/k8sbatch

4f868d8

anngdinh force-pushed the feat/k8sbatch branch from 57386da to 722724c Compare May 10, 2026 04:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(k8sbatch): per-reconcile mutation batcher (Phase 1)#10

feat(k8sbatch): per-reconcile mutation batcher (Phase 1)#10
anngdinh wants to merge 20 commits into
mainfrom
feat/k8sbatch

anngdinh commented May 9, 2026

Uh oh!

anngdinh commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

anngdinh commented May 9, 2026

Summary

Test plan

Scope discipline

Uh oh!

anngdinh commented May 10, 2026

Using pkg/k8sbatch — quick guide

What it is

API surface

Standard usage (Reconcile pattern)

Writing a status helper

Identity coalescing

Spec writes (owner controllers)

Semantics you can rely on

When to use vs. not

Common pitfalls to avoid

Reference implementation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Using `pkg/k8sbatch` — quick guide