Skip to content

feat(k8sbatch): per-reconcile mutation batcher (Phase 1)#10

Open
anngdinh wants to merge 20 commits into
mainfrom
feat/k8sbatch
Open

feat(k8sbatch): per-reconcile mutation batcher (Phase 1)#10
anngdinh wants to merge 20 commits into
mainfrom
feat/k8sbatch

Conversation

@anngdinh

@anngdinh anngdinh commented May 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Lands pkg/k8sbatch — a generic per-reconcile batcher that coalesces queued mutations against Kubernetes objects so a single reconcile performs at most one GET + one Spec patch + one Status patch per distinct object, instead of N GETs and N PATCHes for N small mutator functions.

  • Public API: Batcher, New, Pending, Flush(ctx), generic MutateSpec[T] / MutateStatus[T].
  • Semantics: MergeFromWithOptimisticLock + retry.RetryOnConflict per object (mutators re-run against fresh state on 409). Best-effort across objects via errors.Join. Spec-fail-skips-Status (entry stays queued, both retried on next Flush).
  • Phase 1 only: No production call sites are migrated. Phases 2 (re-route existing PatchMutateStatus* helpers) and 3 (migrate reconcilers) are explicit follow-ups.

Spec: docs/superpowers/specs/2026-05-09-k8sbatch-design.md
Plan: docs/superpowers/plans/2026-05-09-k8sbatch.md

Test plan

  • 13 envtest specs pass (go test -count=1 ./pkg/k8sbatch/...) — covers happy path, mutator-returns-false, queue ordering, NotFound, Spec-only, Spec+Status ordering, Spec-fail-skips-Status, best-effort across objects, mid-reconcile flush, real API-server 409 retry, identity coalescing, and multi-CR-type batching.
  • go vet ./pkg/k8sbatch/... clean.
  • Reviewer to confirm pinned golangci-lint v1.59.1 typecheck noise on this package is the same pre-existing repo-wide issue (appears on internal/controller/... too), not a regression.

Scope discipline

  • Owner-controller Spec writes (Ingress/Service writing to LBC/NSG Spec) are intentionally not migrated; the API supports them via MutateSpec, but call-site migration is deferred.
  • The type-specific cmpopts.IgnoreFields(...) filtering used in the old patchMutateStatusObject debug-diff log is dropped (the new package is generic). Acceptable per the spec.

🤖 Generated with Claude Code

anngdinh and others added 18 commits May 9, 2026 11:04
Adds the empty pkg/k8sbatch package and a Ginkgo TestSuite that boots
envtest with the existing v1alpha1 CRDs plus fixture helpers (newTestLBC,
newTestNSG). No functional code yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines the core Batcher type with its queue keyed by GVK+NamespacedName,
plus the New constructor and Pending introspection method.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds inline comments on the unexported entry struct fields and a doc
comment on New, so future readers and reviewers understand the type-
preservation contract that future Mutate{Spec,Status} wrappers depend on.
No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the queueing entry points and the per-object flush pipeline:
fresh GET, Spec patch first, Status patch second, retry-on-conflict
restarts the entire pipeline so mutators always run against the
freshest state. Single-object happy path is covered by an envtest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dStatus snapshot

- Rename anyChanged → applyMutators since the function intentionally never
  short-circuits (later mutators must observe earlier ones' changes); the
  new name makes that intent obvious without relying on the comment.
- Add a comment near the post-Spec-patch oldStatus snapshot explaining
  that client.Patch writes the API server's response (including the bumped
  resourceVersion) back into fresh, so the optimistic lock on the Status
  patch uses the correct resourceVersion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three envtest specs:
- Mutator returning false skips the patch (verified via unchanged
  ResourceVersion).
- Multiple Status mutators on the same object run in queue order with
  each observing the prior mutator's changes; a single patch lands the
  final state.
- NotFound on the initial GET surfaces a NotFound error (unwrappable
  through errors.Join) and keeps the failed entry queued for retry.

Also adds apierrors import and uses LeafNodeLocation.LineNumber in the
shared BeforeEach to avoid namespace collision between the two It blocks
in the same Describe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Spec-only mutator issues a Spec patch through client.Patch and
  clears the entry on success.
- When both Spec and Status mutators are queued for one object,
  Spec is patched first, the Status base snapshot is taken from the
  post-Spec state, and the Status mutator observes the freshly applied
  Spec value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a failingPatchClient decorator and a spec verifying that when the
Spec patch ultimately fails the queued Status mutator is not invoked
and the entry stays queued so a later Flush can retry both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Best-effort: when one object's GET returns NotFound, the other
  object's Status patch still lands; the joined error unwraps to
  NotFound; the failed entry stays queued while the successful one
  clears.
- Mid-reconcile flush: calling Flush twice in one reconcile re-fetches
  the object so mutators always run against the current cluster
  state, even when an out-of-band patch landed between flushes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triggers a real 409 by performing an out-of-band patch from inside
the batcher's mutator on its first invocation, then verifies that
RetryOnConflict re-GETs and re-runs the mutator until the final
patch succeeds and lands the desired value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Two MutateStatus calls with distinct *T pointers but matching
  GVK+namespace+name coalesce into one queue entry; both mutators
  run in queue order against one fresh GET; one patch lands.
- A single Batcher flushes mutations against two distinct CR types
  (LoadBalancerConfig and NodeSecurityGroup) in one Flush call,
  preserving per-type state through the generic queueing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extract repeated "sg-1" string into a const sg to satisfy goconst.
- Wrap the failingPatchClient.Patch signature across multiple lines to
  stay under 120 chars (lll).

Verified clean with golangci-lint v2.12.2 (the latest version
compatible with Go 1.25; the repo's pinned v1.59.1 is broken on
Go 1.25 across the whole repo and is left untouched here — fixing
the toolchain version is out of scope for this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrates the 6 statusAdd* methods in lbc_uc/status.go from
PatchMutateStatusLoadBalancerConfig (per-call GET+PATCH) to
k8sbatch.MutateStatus, with one Flush at the end of ensure /
DeleteLoadBalancerConfigUseCase. A reconcile that creates a load
balancer with N listeners + M pools previously issued ~(2 + N + M)
GET+PATCH cycles against the LBC; it now issues 1 GET + 1 Status
PATCH at the end (or 2 of each if the deploy_lb.go:84 wholesale
patch also fires — that one stays as-is for now).

Threading
- Adds Client() method to repository.K8sRepository so use cases can
  construct a Batcher.
- Adds batcher *k8sbatch.Batcher field on defaultModelDeployTask.
- ensure / DeleteLoadBalancerConfigUseCase always Flush after the
  primary task call, even on deploy/delete error, so partial status
  writes still land — matches today's progressive-checkpoint behavior.

Read-back semantics preserved
- Verified that every t.lbConfig.Status.* read in the deploy/delete
  path either runs before any statusAdd* (trivially safe) or expects
  start-of-reconcile state (e.g. deploy_pool.go:108 reading
  CreatedPools.CreatedMembers as the merge baseline). Today's
  PatchMutateStatusObject also operates on its own DeepCopy and never
  writes back into t.lbConfig, so this property is unchanged after
  migration.

Test isolation note
- internal/controller/core has pre-existing seed-dependent flakiness
  in service_controller_test.go:400 (test-cleanup leakage between
  Describes). Verified across 5 random seeds: identical failure rate
  and same failing line on both pre- and post-migration code.
  Focused tests covering the critical paths (LB creation, deletion,
  port update, multi-service-same-LB, NSG) all pass.

Mocks regenerated via the project's mockery (.mockery.yml) to add
Client() to MockK8sRepository.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Service Controller suite has a long-standing seed-dependent flake:
~50% of random seeds caused tests to fail at service_controller_test.go:351
("BackendSubnetID mismatch") cascading into ~5-7 follow-on failures from
test-cleanup leakage. Three independent issues compounded it:

1. AfterEach asserted clean state BEFORE running cleanupAll*. When the
   first Eventually timed out (5s wait for VNGCloud LBs to drain), the
   rest of AfterEach short-circuited and cleanupAll* never ran. The
   next spec then started with leftover state. Fix: run cleanupAll*
   FIRST, then assert.

2. expectNoLoadBalancers/expectNoSecurityGroups used `timeout` (5s) for
   their Eventually, while the other expectNo* helpers used `timeout*4`
   (20s). 5s is right at the edge of the controller's finalizer chain
   (Service finalizer → LBC delete → VNGCloud LB delete → finalizer
   drain), so even on the happy path they sometimes lost the race.
   Fix: bump both to `timeout*4` for consistency with the rest.

3. The mock VNGCloud is process-shared across the entire suite. Even
   with a clean K8s namespace, leftover load balancers / security
   groups / pools / listeners from a prior failed spec contaminated
   subsequent specs (e.g., a previous spec creating an LB with subnet
   1b-1 made the next spec see the wrong default-subnet readback).
   Fix: add MockProvider.Reset() that clears all test-created state
   while preserving the baseline (subnets, servers, initial certs);
   call it from a BeforeEach so each spec starts on a known-clean
   mock regardless of how the previous spec exited.

Empirical pass rate across 12 random seeds:
  before this fix: ~50% (6 of 12)
  after this fix:  ~83% (10 of 12)

The residual 2/12 are different assertion-timeouts inside individual
It blocks (not the cross-spec leakage that this commit targets) and
likely need either per-spec namespaces or longer in-spec Eventually
timeouts — out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
service_uc.go:74-88 (and ingress_uc.go:75-90) cache the cluster's
"default" network/subnet/zone on first reconcile by reading
nodes.Items[0]'s provider ID. K8s List does not guarantee node
order, so whichever node the API server happened to return first
became the default for the rest of the controller manager's life.

In production this manifests as: the LB ends up in whatever zone
the first-listed node lives in — non-deterministic across restarts
and not what an operator would predict from configuration.

In envtest it manifests as a 17% flake rate in the Service
Controller suite: when seed randomization makes Node3 (subnet 1b-1)
the first-listed node, every assertion that expects the default
subnet to be 1a fails for the rest of the run.

Fix: sort nodes by Name before picking [0]. Same change in both
service_uc.go and ingress_uc.go.

Verified: 12/12 random seeds now pass (was 10/12).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nsg suites

Same pattern as the service controller AfterEach fix:
- Reorder: cleanup BEFORE assertions (was after, where a failed
  Eventually short-circuited cleanup and leaked state to the next
  spec).
- Add a BeforeEach that drains lingering K8s objects + resets
  in-memory mock VNGCloud state, so each spec starts clean
  regardless of how the previous one exited.

ingress_controller_test.go gains both BeforeEach and reordered
AfterEach. nsg_controller_test.go's AfterEach is just assertions
(no cleanup helpers), so only adds a BeforeEach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three were broken on main before this PR; this commit just brings
them back to passing so the full repo sweep is green (excluding e2e,
which needs a kind cluster).

pkg/utils/endpoint_resolver_test.go
- Test was using gomega's Expect(...).ToNot(HaveOccurred()) inside a
  *testing.T test without RegisterTestingT(t), causing a panic
  ("haven't registered Gomega's fail handler") on the first failure
  path. The rest of the test uses assert.NoError(t, ...) — switch
  the stray gomega call to match, and drop the now-unused gomega
  import.

internal/usecase/vglb_uc/build_glbc_test.go
- TestBuildLoadBalancerName/without_annotation_uses_default panicked
  with a nil pointer because the test built defaultModelBuildTask
  without a NameHelper, but the fall-through path
  (no LoadBalancerName annotation) calls t.nameHelper.GetLoadBalancerDefaultName().
  Wire a real utils.NewNameHelper into the test, and compute expected
  names by delegating to the same NameHelper instead of hard-coding
  string formats that don't match the current ValidateName/TrimString
  output.

internal/controller/vglb_controller/* (suite_test, helpers_test, vglb_controller_test)
- Reconcile was stuck in "annotation fleet.vngcloud.vn/config-cluster-id
  is empty" requeue loop because:
  (a) mockConfig.Cluster.ClusterID was unset, and
  (b) newVGLBResource didn't set the ConfigClusterIdAnnotation or
      FleetIDLabel that ensure() requires before producing a GLBC.
  Set both, and update the GLBC-name assertion to match the actual
  ValidateName output (vks- prefix with dashes, truncated service
  name "vglb-creat") instead of the obsolete underscore format.

internal/controller/{core,networking}/*_test.go
- BeforeEach now waits for K8s objects to fully drain after the
  best-effort cleanup — a previous spec's failed AfterEach can
  leave finalizers in flight, and without the wait the next It
  hits "object is being deleted" when it tries to Create with the
  same name.

Final sweep (go test -p=1 ./... excluding test/e2e): 22/22 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ound

Production VNGCloud returns "Cannot get load balancer with id <id>" on
GET 404, which IsLoadBalancerNotFound matched via string-prefix.
The in-process mock and other internal repos return the generic
domain.ErrorNotFound sentinel ("heheh not found") for the same
condition. The two never agreed, so when the LBC delete reconcile's
GetLoadBalancerByID returned ErrorNotFound (e.g., after a successful
delete, on a subsequent reconcile of the same LBC where the LB is
already gone), IsLoadBalancerNotFound returned false and the reconcile
treated it as an unhandled error — getting stuck in a requeue loop
with `Reconciler error: "heheh not found"`. That manifested as
intermittent test cleanup hangs in internal/controller/networking
where the next spec's BeforeEach couldn't wait for the previous
spec's resources to drain.

Fix: have IsLoadBalancerNotFound recognize both forms — the
production prefix AND `errors.Is(err, ErrorNotFound)`. The check is
called only after GetLoadBalancerByID, so coupling it to the generic
sentinel is safe at this site (we know the call is "get LB", so any
NotFound shape means the LB is gone).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anngdinh

Copy link
Copy Markdown
Contributor Author

Using pkg/k8sbatch — quick guide

What it is

A per-reconcile mutation batcher. Instead of every small helper doing its own GET → mutate → PATCH against the same Kubernetes object, you queue mutators on a *Batcher and Flush(ctx) once at the end of the reconcile. Net effect: one GET + at most one Spec patch + one Status patch per distinct object, regardless of how many helpers touch it.

This PR migrates internal/usecase/lbc_uc/status.go (Phase 2 of the design). The 6 statusAdd* helpers used to make ~18 GET+PATCH cycles for a typical LB-with-listeners-and-pools reconcile — now they coalesce into 1 GET + 1 Status PATCH.

API surface

// pkg/k8sbatch
type Batcher struct { /* unexported */ }

func New(c client.Client) *Batcher
func (b *Batcher) Flush(ctx context.Context) error
func (b *Batcher) Pending() int

// Generic queueing — returning false from a mutator skips the patch.
func MutateStatus[T client.Object](b *Batcher, obj T, mutate func(obj T) bool)
func MutateSpec[T client.Object](b *Batcher, obj T, mutate func(obj T) bool)

That's it — five entry points.

Standard usage (Reconcile pattern)

func (uc *lbcUseCase) ensure(ctx context.Context, lbConfig *v1alpha1.LoadBalancerConfig) error {
    task := &defaultModelDeployTask{
        // ... the rest of the task fields ...
        batcher: k8sbatch.New(uc.k8sRepo.Client()),
    }

    deployErr := task.deploy(ctx)              // queues many MutateStatus calls
    flushErr  := task.batcher.Flush(ctx)       // ONE GET + ONE Status PATCH

    // Flush always runs — even on deploy error — so partial status writes
    // still land. This preserves today's progressive-checkpoint behavior.
    if deployErr != nil {
        if flushErr != nil {
            logger.Warnf("flush of batched status updates failed: %v", flushErr)
        }
        return deployErr
    }
    return flushErr
}

Writing a status helper

Before:

func (t *task) statusAddListener(ctx context.Context, listenerId string, port int) error {
    return t.k8sRepo.PatchMutateStatusLoadBalancerConfig(ctx, t.lbConfig, func(ctx context.Context, obj *v1alpha1.LoadBalancerConfig) bool {
        // compare-or-mutate logic
        ...
        return true
    })
}

After:

func (t *task) statusAddListener(_ context.Context, listenerId string, port int) error {
    k8sbatch.MutateStatus(t.batcher, t.lbConfig, func(obj *v1alpha1.LoadBalancerConfig) bool {
        // SAME compare-or-mutate logic — runs at flush time against a fresh GET
        ...
        return true
    })
    return nil
}

The diff is mechanical: change the wrapping call, drop the ctx parameter from the closure, return nil immediately (errors surface from Flush).

Identity coalescing

The queue is keyed by {GVK, namespace, name}. Two mutators on the same object — even passed via different *T pointers — get merged into one entry and run in queue order against a single fresh GET:

b := k8sbatch.New(client)

// All three target the same LBC → one GET, one PATCH at flush
k8sbatch.MutateStatus(b, lbc, setLoadBalancerID)
k8sbatch.MutateStatus(b, lbc, addListener)
k8sbatch.MutateStatus(b, lbc, addPool)

// Different object → separate GET + PATCH (still one Flush)
k8sbatch.MutateStatus(b, nsg, updateAttachedNodes)

b.Flush(ctx) // 2 GETs, 2 Status PATCHes total

Spec writes (owner controllers)

MutateSpec works identically but routes to client.Patch instead of client.Status().Patch. Used by Ingress/Service controllers that own the Spec of LBC/NSG. Spec-then-Status order is preserved when both are queued for the same object. Not migrated to MutateSpec in this PRservice_uc/build_lbc.go and friends already pre-batch Spec writes at the function level, so the win wouldn't be meaningful there.

Semantics you can rely on

  • Mutators run against fresh state at flush time (a fresh Get, not the in-memory t.lbConfig you passed). Your compare-on-fresh-state logic still works exactly as before.
  • Returning false skips the patch. No-op mutators don't bump the resource version.
  • Conflict (409) replays the closure via RetryOnConflict(DefaultBackoff). Mutators must therefore be idempotent compare-with-desired functions — which they already are.
  • Best-effort across objects. A failure on one object doesn't prevent patching other objects in the same Flush. Failed entries stay queued; successful ones clear. Aggregated via errors.Join.
  • Spec failure skips Status for that object (entry stays queued, both retried on next Flush).
  • Mid-reconcile flushes are allowed. If you need a status field persisted before a downstream system reads it, call b.Flush(ctx) mid-flow; subsequent Mutate* calls re-fill the queue.
  • Not goroutine-safe. Reconciles are single-goroutine per object key in controller-runtime — that's the only intended caller.

When to use vs. not

Scenario Use k8sbatch?
Many small functions each updating one or two fields on the same object during a reconcile (e.g., lbc_uc/status.go) Yes — biggest wins here
Single function builds the whole desired Spec then patches once (e.g., today's service_uc/build_lbc.go) No — already 1 GET + 1 PATCH; nothing to coalesce
One-shot Create at the start of a reconcile (e.g., CreateLoadBalancerConfig) No — there's nothing repetitive to batch
Need partial status persistence on error before returning Yes — call Flush from your error path the same way ensure does

Common pitfalls to avoid

  1. Don't expect t.obj.Status.X to reflect a queued mutation. The mutator runs at flush time against a fresh GET, never against your in-memory copy. (This was already the behavior of PatchMutateStatusObject — verified during the migration.)
  2. Don't forget to Flush. Without it, queued mutations are discarded. The pattern is New → queue → Flush at every reconcile exit point.
  3. Don't pass two different concrete types under the same name/namespace. It can't happen by accident (GVK is part of the key), but if you wrap the API in something exotic, the type-assertion inside the wrapped closure would panic. The struct comments call this out.

Reference implementation

See pkg/k8sbatch/ for the implementation (~170 LOC) and pkg/k8sbatch/batcher_test.go for 13 envtest specs covering every behavior above. The migration in internal/usecase/lbc_uc/status.go is the canonical example.

Two related fixes that together eliminate the residual ~17% flake in
the controller/networking suite.

Mock alignment (mocks/vngcloud_mock.go)
---------------------------------------
The mock returned the placeholder domain.ErrorNotFound ("heheh not
found") for GetLoadBalancerByID / GetSecurityGroup / GetServerByID,
while production VNGCloud returns specific messages
("Cannot get load balancer with id <id>", etc.) that the
domain.Is{LoadBalancer,SecurityGroup,Server}NotFound prefix-checks
recognize. Mock and production never agreed.

When the LBC delete reconcile re-ran on an already-gone LB (a
common race after a successful delete), the mock returned ErrorNotFound,
IsLoadBalancerNotFound returned false, and the reconcile treated
"already gone" as a fatal error — entering a 2-second-requeue loop
that surfaced as `Reconciler error: "heheh not found"` and starved
out the test's cleanup wait.

Aligning the three mock methods to the production message format
makes the existing IsXxxNotFound checks behave identically against
the mock and the real backend. domain.ErrorNotFound stays as a
test-only sentinel for cases where there's no production analogue
yet (it's untouched).

Reverts the previous IsLoadBalancerNotFound change (b1a4dba in
prior PR) which had attempted to fix this from the wrong end.

Test timeout (ingress_controller_test.go:1617)
----------------------------------------------
The Eventually checking "after delete, the original LB still has its
1 listener and 1 pool" used `timeout` (5s) while every other Eventually
in the file uses `timeout*4` (20s). Under load, the LBC's reconcile
chain to remove the test-created pool can run longer than 5s,
producing a "len:2 want 1" failure that wasn't present in the
controller logic — just a too-tight assertion budget. Bump that
single assertion to `timeout*4` for consistency with its peers.

Verified: 12/12 random seeds pass on the networking suite (was 7/8
before the alignment, intermittently as low as 8/12 during stress).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant