Skip to content

Add ClusterIP service with SessionAffinity for MCP server backend#3992

Open
JAORMX wants to merge 4 commits intomainfrom
feat/mcp-backend-session-affinity
Open

Add ClusterIP service with SessionAffinity for MCP server backend#3992
JAORMX wants to merge 4 commits intomainfrom
feat/mcp-backend-session-affinity

Conversation

@JAORMX
Copy link
Collaborator

@JAORMX JAORMX commented Mar 4, 2026

Summary

  • Adds a new regular ClusterIP service (mcp-<name>) with SessionAffinity: ClientIP for MCP server StatefulSets, alongside the existing headless service (mcp-<name>-headless)
  • The proxy-runner now targets this new service instead of the headless one, providing proper load balancing with session stickiness when the StatefulSet scales beyond 1 replica
  • Adds finalizer cleanup for the new service and Chainsaw e2e test assertions for all SSE and streamable-http scenarios

Context

The headless service bypasses kube-proxy, so DNS returns raw pod IPs with no load balancing or session affinity. If the MCP server StatefulSet scales beyond 1 replica, sessions break because requests may hit different pods. The new ClusterIP service solves this with SessionAffinity: ClientIP.

Changes

File Change
pkg/container/kubernetes/client.go New createMCPService() function, called from DeployWorkload() after createHeadlessService(). Moves SSEHeadlessServiceName assignment to the new service. Extracts serviceFieldManager constant.
cmd/thv-operator/controllers/mcpserver_controller.go Adds cleanup of mcp-<name> service in finalizeMCPServer()
test/e2e/chainsaw/operator/**/assert-mcpserver-svc.yaml 4 new assertion files verifying the service exists with sessionAffinity: ClientIP
test/e2e/chainsaw/operator/**/chainsaw-test.yaml 4 updated test files adding the new assertion step

Test plan

  • task lint passes (0 issues)
  • task test passes (unit tests)
  • Deploy operator, create MCPServer with SSE transport, verify both services exist:
    • kubectl get svc mcp-<name>-headless (headless, unchanged)
    • kubectl get svc mcp-<name> (new service with sessionAffinity: ClientIP)
  • Chainsaw e2e tests pass for single-tenancy and multi-tenancy SSE/streamable-http scenarios

🤖 Generated with Claude Code

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 4, 2026
@codecov
Copy link

codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 65.06024% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.74%. Comparing base (c5fdadc) to head (049784a).

Files with missing lines Patch % Lines
...d/thv-operator/controllers/mcpserver_controller.go 0.00% 16 Missing ⚠️
pkg/container/kubernetes/client.go 83.07% 8 Missing and 3 partials ⚠️
pkg/runtime/setup.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3992      +/-   ##
==========================================
+ Coverage   68.61%   68.74%   +0.12%     
==========================================
  Files         439      439              
  Lines       44867    44897      +30     
==========================================
+ Hits        30786    30863      +77     
+ Misses      11686    11627      -59     
- Partials     2395     2407      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments for specific observations and suggestions.

Pre-existing observation: RemoveWorkload in pkg/container/kubernetes/client.go only deletes the StatefulSet, not the associated services. This was already an issue for the headless service before this PR. The operator path is covered by finalizeMCPServer. Worth tracking separately if the CLI-to-K8s path is a supported workflow.

JAORMX and others added 2 commits March 5, 2026 08:17
The proxy-runner connects to MCP server pods through a headless service
which bypasses kube-proxy and provides no load balancing or session
affinity. If the StatefulSet scales beyond 1 replica, sessions break
because requests may hit different pods.

Add a new regular ClusterIP service (mcp-<name>) with SessionAffinity:
ClientIP alongside the existing headless service. The proxy-runner now
uses this new service as its target host, getting proper load balancing
with session stickiness. The headless service is retained for DNS
discovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract shared applyService helper to reduce duplication between
  createHeadlessService and createMCPService
- Add explicit SessionAffinityConfig.ClientIP.TimeoutSeconds (1800s)
  instead of relying on Kubernetes 3h default
- Rename SSEHeadlessServiceName to MCPServiceName across types, setup,
  and client packages
- Rename transportTypeRequiresHeadlessService to
  transportTypeRequiresBackendServices
- Extract deleteIfExists helper in operator finalizer, removing
  nolint:gocyclo
- Fix headless+NodePort conflict: skip NodePort type promotion for
  headless services (Kubernetes rejects clusterIP=None + type=NodePort)
- Use slog.Debug instead of slog.Info for service creation success paths
- Add spec.type and sessionAffinityConfig assertions to e2e tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JAORMX JAORMX force-pushed the feat/mcp-backend-session-affinity branch from 8cf2377 to 7a0f7be Compare March 5, 2026 06:40
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 5, 2026
Set the StatefulSet as owner of both the headless and MCP ClusterIP
services so Kubernetes garbage-collects them when the StatefulSet is
deleted. This closes the resource leak gap flagged in review.

Add table-driven unit test covering both SSE and streamable-http
transports, verifying service creation, session affinity config,
and owner references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 5, 2026
Assert UID, APIVersion, Controller, and BlockOwnerDeletion fields
on owner references, not just Kind and Name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 5, 2026

// DeployWorkload implements runtime.Runtime.
//
//nolint:gocyclo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would have prefered to refactor and avoid the nolint...

func (r *MCPServerReconciler) deleteIfExists(ctx context.Context, obj client.Object, name, namespace, kind string) error {
err := r.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, obj)
if err == nil {
if delErr := r.Delete(ctx, obj); delErr != nil && !errors.IsNotFound(delErr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have lost the logging here, is that intentional?

Copy link
Contributor

@yrobla yrobla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving with a pair of comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants