Add tracing entry span with W3C propagation to EPP handler #2057

sallyom · 2026-01-05T19:09:24Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add tracing entry span with W3C propagation to EPP handler

Which issue(s) this PR fixes:

See #1520

Does this PR introduce a user-facing change?:

The EPP request handler now includes a distributed tracing entry span. When tracing is enabled via the existing --tracing flag, trace spans are created and the W3C trace context is propagated to downstream services, enabling end-to-end request tracing. Tracing remains opt-in, and no breaking changes are introduced.

netlify · 2026-01-05T19:09:31Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`deba8b2`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/695d4f70b8751f00089314f9
😎 Deploy Preview	https://deploy-preview-2057--gateway-api-inference-extension.netlify.app

k8s-ci-robot · 2026-01-05T19:09:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sallyom
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: sallyom <somalley@redhat.com>

sallyom · 2026-01-07T13:20:08Z

For example, in llm-d, with the GAIE entry span & propagation, a trace looks like:

Drilling down shows the GAIE plugins and the vLLM end-to-end trace (with other llm-d components instrumented):

Without this PR (tracing enabled in GAIE, but no entry span or propagation), spans in individual components aren't connected:

shmuelk · 2026-01-07T15:31:46Z

pkg/epp/handlers/request.go

+	// Inject trace context headers for propagation to downstream services
+	traceHeaders := make(map[string]string)
+	propagator := otel.GetTextMapPropagator()
+	propagator.Inject(ctx, propagation.MapCarrier(traceHeaders))
+	for key, value := range traceHeaders {
+		headers = append(headers, &configPb.HeaderValueOption{
+			Header: &configPb.HeaderValue{
+				Key:      key,
+				RawValue: []byte(value),
+			},
+		})
+	}
+


I think this should only be done if the user requested tracing. We need either a command-line argument to enable tracing or a setting in the EPP configuration.

You shouldn't need to propagate context manually like this. As long as the Go context.Context is passed around correctly, the OTel SDK will handle propagation for you.

Thanks, @damemi! I wasn't sure about this. I'll remove it and retest to be sure. Thanks again!

I'll remove the manual propagation, then verify with llm-d:

Does vllm:llm_request span show up as a child of gateway.request?

Does the trace ID remain consistent end-to-end?

If there's an upstream traceparent, is it continued correctly?

The entry point of request handling is: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/handlers/server.go#L128C49-L128C80

There, the Go context is wrapped in the stream argument, srv extProcPb.ExternalProcessor_ProcessServer. Does OTel need the context to be passed explicitly in the function signature?

ref - https://pkg.go.dev/google.golang.org/grpc#ServerStream

I did some testing with context propagation. Because GAIE runs as an Envoy External Processor, it doesn't make HTTP requests directly, so we need to propagate the trace headers manually. Without manual propagation, the trace context doesn't reach downstream services: we see separate spans for the gateway-api-inference-extension and vllm services, rather than the vllm span appearing as a child that carries the propagated context. I'll leave the manual propagation in.

@sallyom ah, that's interesting. I didn't think about how this works with Envoy, so there could be some work you need to do there. It's not something I've worked with before, but testing tells the truth.
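
The verification questions in this thread (does the trace ID stay the same end-to-end, is an upstream traceparent continued) come down to how the W3C `traceparent` header is built: `version-traceid-parentid-flags`. Below is a minimal standard-library sketch of that format. It is not the OpenTelemetry propagator and not code from this PR; `TraceParent`, `ParseTraceParent`, `Child` and `NewSpanID` are hypothetical names, and only version `00` is handled. The sample header value is the one used in the W3C Trace Context specification.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent models the four dash-separated fields of a version-00
// W3C traceparent header. Hypothetical helper type, for illustration only.
type TraceParent struct {
	Version  string // 2 hex chars, "00" here
	TraceID  string // 32 hex chars, shared by every span in one trace
	ParentID string // 16 hex chars, span ID of the sender
	Flags    string // 2 hex chars, "01" means sampled
}

// isHex reports whether s is exactly n lowercase hex characters.
func isHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// allZero reports whether s consists only of '0' characters.
func allZero(s string) bool { return strings.Trim(s, "0") == "" }

// ParseTraceParent validates and splits a version-00 traceparent value.
func ParseTraceParent(h string) (TraceParent, error) {
	p := strings.Split(h, "-")
	if len(p) != 4 || p[0] != "00" {
		return TraceParent{}, errors.New("traceparent: want 4 fields with version 00")
	}
	if !isHex(p[1], 32) || allZero(p[1]) {
		return TraceParent{}, errors.New("traceparent: invalid trace ID")
	}
	if !isHex(p[2], 16) || allZero(p[2]) {
		return TraceParent{}, errors.New("traceparent: invalid parent ID")
	}
	if !isHex(p[3], 2) {
		return TraceParent{}, errors.New("traceparent: invalid flags")
	}
	return TraceParent{Version: p[0], TraceID: p[1], ParentID: p[2], Flags: p[3]}, nil
}

// Child continues the trace: same trace ID and flags, with the given
// span ID as the new parent ID for the next hop.
func (tp TraceParent) Child(spanID string) TraceParent {
	tp.ParentID = spanID
	return tp
}

// String renders the header value.
func (tp TraceParent) String() string {
	return fmt.Sprintf("%s-%s-%s-%s", tp.Version, tp.TraceID, tp.ParentID, tp.Flags)
}

// NewSpanID returns 8 random bytes as 16 hex characters.
func NewSpanID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no sensible fallback if the system RNG fails
	}
	return hex.EncodeToString(b)
}

func main() {
	upstream := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	tp, err := ParseTraceParent(upstream)
	if err != nil {
		fmt.Println("invalid upstream traceparent:", err)
		return
	}
	out := tp.Child(NewSpanID())
	fmt.Println("incoming:", tp)
	fmt.Println("outgoing:", out)
	fmt.Println("same trace:", out.TraceID == tp.TraceID)
}
```

In these terms, a correctly continued trace means the header that reaches the downstream service keeps the upstream trace ID and carries the entry span's ID as its parent ID. A downstream span with a different trace ID is the disconnected case described above.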

shmuelk · 2026-01-07T15:32:39Z

pkg/epp/handlers/server.go

+
+	// Start tracing span for the request
+	tracer := otel.Tracer("gateway-api-inference-extension")
+	ctx, span := tracer.Start(ctx, "gateway.request", trace.WithSpanKind(trace.SpanKindServer))
+	defer span.End()
+


I think this should only be done if the user requested tracing. We need either a command-line argument to enable tracing or a setting in the EPP configuration.

These calls are a zero-overhead no-op unless a TracerProvider is configured. So the only thing you need to gate on the user enabling tracing is the creation of the TracerProvider itself.

For reference, this is the same way that Kubernetes components implement tracing. They actually set up a no-op TracerProvider, but having no TracerProvider configured should be effectively the same.

Either way, it's not about feature-gating the tracer.Start() calls; it's about the TracerProvider.

Thanks, @damemi! I'll leave it as is, but I'm still open to other opinions.

Currently, trace initialization is only invoked if tracing is enabled:

https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/common/telemetry.go#L46

https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/cmd/epp/runner/runner.go#L184

If InitTracing is not invoked, a default no-op provider is used (correct me if I'm wrong here), so it should be fine to keep this the way the PR implements it.

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 5, 2026

k8s-ci-robot requested review from elevran and nirrozenbaum January 5, 2026 19:09

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 5, 2026

sallyom marked this pull request as draft January 5, 2026 19:18

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2026

sallyom force-pushed the tracing-spans branch 2 times, most recently from 3843677 to ee6df62 Compare January 5, 2026 19:37

sallyom marked this pull request as ready for review January 6, 2026 17:44

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 6, 2026

k8s-ci-robot requested review from ahg-g and robscott January 6, 2026 17:44

Add tracing entry span with W3C propagation to EPP handler

deba8b2

Signed-off-by: sallyom <somalley@redhat.com>

sallyom force-pushed the tracing-spans branch from ee6df62 to deba8b2 Compare January 6, 2026 18:07

sallyom mentioned this pull request Jan 7, 2026

Add otel tracing instrumentation llm-d/llm-d-inference-scheduler#506

Draft

shmuelk suggested changes Jan 7, 2026

5 participants
