Skip to content

Freeze MCPServer generation per pod via downward API#5364

Draft
ChrisJBurns wants to merge 11 commits into
mainfrom
github-issue-toolhive-investigation
Draft

Freeze MCPServer generation per pod via downward API#5364
ChrisJBurns wants to merge 11 commits into
mainfrom
github-issue-toolhive-investigation

Conversation

@ChrisJBurns
Copy link
Copy Markdown
Collaborator

Summary

The apply-gate at pkg/container/kubernetes/client.go:530 (PR #5024) was silently defeated as soon as MCPServerGeneration started flowing through the /etc/runconfig/runconfig.json ConfigMap volume — that volume is mounted without subPath, so Kubernetes live-updates it inside every running pod. A restarted old-RS proxyrunner pod (OOM, drain, attach-failure c.exitFunc(1)) re-reads the file after a helm upgrade and picks up the new generation N, while still holding the old image in its CLI args. Both old and new pods then call DeployWorkload with the same ourGen=N, the strict-greater-than gate cannot tell them apart, and the stale-image apply clobbers the fresh one — manifesting as alternating ControllerRevisions and ImagePullBackOff in the bug report.

Freeze MCPServerGeneration per pod via the Kubernetes downward API. The operator stamps toolhive.stacklok.dev/mcpserver-generation: <m.Generation> onto the proxy Deployment's pod template and projects that annotation into the proxyrunner container as the THV_MCPSERVER_GENERATION env var. The proxyrunner uses the env var as an override on the file value, so two coexisting proxyrunner pods carry distinct generations again — symmetric with how the image is already frozen via the CLI positional arg.

Closes #5360.

Medium level
  • Shared contract: pkg/container/kubernetes/client.go — new exported constant EnvVarMCPServerGeneration = "THV_MCPSERVER_GENERATION" alongside the existing RunConfigMCPServerGenerationAnnotation. Doc comment on the annotation updated to describe the downward-API projection.
  • Operator: cmd/thv-operator/controllers/mcpserver_controller.go — stamps the annotation onto deploymentTemplateAnnotations (next to the existing runconfig-checksum) and injects the env var into the proxyrunner container via valueFrom.fieldRef. deploymentNeedsUpdate mirrors both fields so drift detection stays stable.
  • Proxyrunner: cmd/thv-proxyrunner/app/run.gotryLoadConfigFromFile applies the env-var override on the loaded RunConfig.MCPServerGeneration (warn + fall through on parse error).
  • Tests:
    • Lock-in regression test (pkg/container/kubernetes/client_test.go) — two DeployWorkload calls with equal gen + different images; asserts the second clobbers the first. Documents what the gate cannot defend against alone.
    • TDD red→green test (cmd/thv-proxyrunner/app/run_test.go) — confirms env value (3) wins over file value (5) when both are set. Failed before the fix.
    • Downward-API shape test (cmd/thv-operator/controllers/mcpserver_resource_overrides_test.go) — asserts the proxyrunner container declares the env var with ValueFrom.FieldRef.FieldPath pointing at the pod's mcpserver-generation annotation.
Low level
File Change
pkg/container/kubernetes/client.go Add EnvVarMCPServerGeneration constant; expand doc comment on RunConfigMCPServerGenerationAnnotation to cover the downward-API projection.
cmd/thv-proxyrunner/app/run.go Add strconv/kubernetes imports; call new applyMCPServerGenerationOverride after runner.ReadJSON; helper reads the env var, parses as int64, overrides config.MCPServerGeneration.
cmd/thv-operator/controllers/mcpserver_controller.go Add strconv import. In deploymentForMCPServer: stamp RunConfigMCPServerGenerationAnnotation on deploymentTemplateAnnotations; append the downward-API THV_MCPSERVER_GENERATION env var. Mirror both in deploymentNeedsUpdate so drift comparison stays equal.
cmd/thv-proxyrunner/app/run_test.go New file. Test TestTryLoadConfigFromFile_MCPServerGenerationEnvOverride.
pkg/container/kubernetes/client_test.go New test TestDeployWorkload_EqualGenerationDifferentImageClobbers.
cmd/thv-operator/controllers/mcpserver_resource_overrides_test.go New test TestDeploymentForMCPServer_MCPServerGenerationDownwardAPI. Update existing annotation-count and env-var assertions to account for the new annotation + env var (added at the same relative position in both build and drift paths).

Type of change

  • Bug fix
  • New feature
  • Breaking change
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other

Test plan

  • task test passes (exit 0)
  • task build passes
  • New unit tests: TestTryLoadConfigFromFile_MCPServerGenerationEnvOverride, TestDeploymentForMCPServer_MCPServerGenerationDownwardAPI, TestDeployWorkload_EqualGenerationDifferentImageClobbers
  • Existing drift-detection tests (TestMCPServerDeploymentNeedsUpdate_*) updated and pass — confirms deploymentNeedsUpdate stays in sync with deploymentForMCPServer so reconcile doesn't churn.
  • Manual live-cluster reproduction recommended before merge (kind cluster + helm upgrade with maxSurge=1, maxUnavailable=0 + induced container restart on old-RS pod; see issue shouldSkipStatefulSetApply generation gate defeated by live-updating RunConfig ConfigMap during proxy runner rolling update (regression in v0.28.1) #5360 for the exact procedure).

Special notes for reviewers

  • Upgrade ordering: the operator and proxyrunner changes are mutually independent during a rolling upgrade. Operator-only or proxyrunner-only upgrades leave the race window open until both sides are at this version — same wire-contract caveat as PR Gate proxyrunner StatefulSet apply by MCPServer generation #5024. Worth calling out in release notes.
  • Annotation value "0": when reading existing deployments built by older operators, the pod template won't carry the annotation and the downward-API env var resolves to an empty string. The proxyrunner's applyMCPServerGenerationOverride returns without modifying the config in that case, so the file value wins (existing behavior). After a single reconcile, the operator backfills the annotation and the override kicks in.
  • Position of the env var in the container env slice mattersdeploymentNeedsUpdate uses equality.Semantic.DeepEqual on the slice. I positioned the new entry in both build sites at the same relative location (after Redis password / user overrides, before embedded auth env vars).
  • Out of scope for this PR: the spec.replicas clobber dimension (tracked in [BUG] Improve MCPServer Scaling Coordination #4484) — same root cause family, different field, separate fix.
  • The fix path was: investigation → lock-in test for the broken invariant → TDD red test for the env-var override → operator/proxyrunner wiring → drift-detection updates. Each commit compiles and task test passes independently.

Generated with Claude Code

ChrisJBurns and others added 3 commits May 21, 2026 21:25
Lock-in test in pkg/container/kubernetes/client_test.go documenting
the gate's blind spot when two callers pass equal MCPServerGeneration
with different images.

TDD test in cmd/thv-proxyrunner/app/run_test.go for the env-var override
(THV_MCPSERVER_GENERATION) that the fix in #5360 will introduce. Fails
today; goes green when the override is implemented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /etc/runconfig ConfigMap volume mounts live (no subPath), so a
restarted old-RS proxyrunner pod re-reads runconfig.json after a helm
upgrade and picks up the new MCPServer generation. Both old and new
pods then call DeployWorkload with the same ourGen, defeating the
strict-greater-than gate at shouldSkipStatefulSetApply.

Honor THV_MCPSERVER_GENERATION (sourced via downward API in a
follow-up operator change) as an override on the file value, freezing
the generation per pod at creation time. Parallel to how the image is
already frozen via the CLI positional arg.

Part of #5360.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stamp the MCPServer generation as a pod-template annotation on the
proxy Deployment, and project that annotation into the proxyrunner
container as the THV_MCPSERVER_GENERATION env var via the downward
API. The env var is bound to the pod's own annotations at creation
time, so a restarted old-RS pod cannot acquire the new generation by
re-reading the live-mounted RunConfig ConfigMap.

The proxyrunner already honors this env var as an override on the
file value (see prior commit). With both sides wired, two coexisting
proxyrunner pods during a rolling update carry distinct generations
again and the strict-greater-than gate at shouldSkipStatefulSetApply
fires correctly.

Mirror the new env var and annotation in deploymentNeedsUpdate so
drift detection stays stable.

Closes #5360.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label May 21, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 21, 2026
Two new integration specs assert the operator side of the freeze
end-to-end against a real apiserver:

- On initial reconcile, the proxy Deployment's pod template carries
  the mcpserver-generation annotation set to MCPServer.metadata.generation,
  and the proxyrunner container declares THV_MCPSERVER_GENERATION via
  a downward-API FieldRef pointing at that exact annotation key.
- On a Spec.Image patch the controller bumps the annotation value to
  the new generation and keeps the FieldRef wired correctly. This is
  the rolling-update path: new pods get a strictly-greater frozen
  generation than any restarted old-RS pod.

Catches typos in either the annotation key or the FieldRef path that
would silently produce an empty env var and let the proxyrunner fall
through to the (live-mounted) file value. Complements the unit-level
TestDeploymentForMCPServer_MCPServerGenerationDownwardAPI and
TestTryLoadConfigFromFile_MCPServerGenerationEnvOverride.

Part of #5360.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.73%. Comparing base (e37e336) to head (971bbbd).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5364   +/-   ##
=======================================
  Coverage   68.72%   68.73%           
=======================================
  Files         626      626           
  Lines       63495    63519   +24     
=======================================
+ Hits        43637    43658   +21     
- Misses      16610    16616    +6     
+ Partials     3248     3245    -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator Author

@ChrisJBurns ChrisJBurns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice surgical fix to #5360 — the diagnosis (live-mounted ConfigMap defeating the strict-greater-than gate) is sharp, and the symmetry with how the image is already frozen via the CLI positional arg is the right framing.

The testing pyramid is excellent: lock-in regression (TestDeployWorkload_EqualGenerationDifferentImageClobbers) pinning what the gate cannot defend against, a TDD red→green for the override (TestTryLoadConfigFromFile_MCPServerGenerationEnvOverride), the downward-API shape unit test, and the envtest integration test covering both initial reconcile and spec-bump drift. The defaultAnnotations-takes-precedence merge guarantees user overrides can't accidentally clobber the new annotation.

A couple of small points inline — one defensive-validation nit on the proxyrunner override, and a heads-up that the position-mirror comment in deploymentNeedsUpdate papers over a pre-existing Redis-password drift bug that's worth tracking separately.

Comment thread cmd/thv-proxyrunner/app/run.go
Comment thread cmd/thv-operator/controllers/mcpserver_controller.go
Comment thread cmd/thv-operator/controllers/mcpserver_resource_overrides_test.go
- Defend applyMCPServerGenerationOverride against negative env values.
  metadata.generation is monotonic non-negative per K8s convention, so a
  negative override cannot come from a legitimate downward-API projection
  and must not be allowed to silently disable the apply-gate stamp.
  Add a Debug log on the success path so the override is observable in
  pod logs when diagnosing future occurrences.
- Add TestApplyMCPServerGenerationOverride exercising the empty,
  valid, zero, unparseable, and negative branches in isolation.
- Soften the position-mirror comment in deploymentNeedsUpdate. The
  prior text referenced the Redis password slot, but
  deploymentNeedsUpdate does not include the Redis password env var
  (pre-existing latent drift bug, tracked separately).
- Add a header comment to TestResourceOverrides explaining that the
  "0" mcpserver-generation annotation value comes from the fake client
  not auto-incrementing metadata.generation. Realistic generation
  tracking is exercised by the envtest in
  mcpserver_generation_freeze_integration_test.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 21, 2026
- "unparseable" → "unparsable" (codespell) in run.go, run_test.go.
- Realign env-var map literals to gofmt's preferred column for the
  longest key (THV_MCPSERVER_GENERATION).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 21, 2026
ObjectFieldSelector.APIVersion is defaulted to "v1" by the API server on
persistence. The drift comparator at deploymentNeedsUpdate was rebuilding
the expected env var with an empty APIVersion, so equality.Semantic.DeepEqual
returned false on every reconcile and the controller perpetually rewrote
the proxy Deployment. That constant churn re-triggered the exact rolling-
update race the freeze was supposed to close — confirmed by the failing
VirtualMCPServer Optimizer + Circuit Breaker recovery spec in
test-e2e-lifecycle: stale-image proxyrunner pods kept clobbering the new
StatefulSet apply, leaving the backend mcp container in ImagePullBackOff.

Explicitly set APIVersion: "v1" on both the construction site and the
drift comparator so they match the API-server-defaulted value.

Add unit-level assertion in TestDeploymentForMCPServer_MCPServerGenerationDownwardAPI
and a new envtest spec "Does not flag spurious drift on a no-op reconcile"
that watches the Deployment's resourceVersion across multiple reconciles
and fails fast if drift detection misfires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 21, 2026
Temporary diagnostic for #5360 v1.33/v1.34 E2E failure. Will be reverted
before merge.
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 22, 2026
The debug commit (8ed6e39) confirmed the override + gate are working
correctly end-to-end on a real cluster:

  pod env=3 (latest): file=3 env=3, apply proceeds — writes GOOD image
  pod env=2 (stale): file=3 env=2, gate skips (theirs=3 > ours=2)

Revert the INFO promotion now that diagnosis is complete; both logs go
back to Debug.

The remaining E2E flake is a pre-existing race in
virtualmcp_optimizer_circuit_breaker_test.go: the test deletes the
Pending StatefulSet pod immediately after patching Spec.Image without
waiting for the proxyrunner to apply the new template, so the pod can
be recreated against the stale (bad) template and the StatefulSet
controller may not re-roll an already-unhealthy pod afterwards. Mirrors
the same guard added to virtualmcp_circuit_breaker_test.go in #5079.

Before the apply-gate, the race "resolved" by the stale-image apply
periodically resetting the template, letting the test catch a lucky
good window. With the gate in place the apply order is deterministic,
which makes the missing wait visible — hence the test fix lands here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 22, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

shouldSkipStatefulSetApply generation gate defeated by live-updating RunConfig ConfigMap during proxy runner rolling update (regression in v0.28.1)

1 participant