Skip to content

Conversation

@sallyom
Copy link
Contributor

@sallyom sallyom commented Jan 5, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Add tracing entry span with W3C propagation to EPP handler
See #1520

Does this PR introduce a user-facing change?:

EPP request handler now includes distributed tracing entry span. When enabled via the existing --tracing flag, trace spans are created and W3C trace context is propagated to downstream services, enabling end-to-end request tracing. Tracing remains opt-in with no breaking changes introduced.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 5, 2026
@netlify
Copy link

netlify bot commented Jan 5, 2026

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit deba8b2
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/695d4f70b8751f00089314f9
😎 Deploy Preview https://deploy-preview-2057--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sallyom
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 5, 2026
@sallyom sallyom marked this pull request as draft January 5, 2026 19:18
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2026
@sallyom sallyom force-pushed the tracing-spans branch 2 times, most recently from 3843677 to ee6df62 Compare January 5, 2026 19:37
@sallyom sallyom marked this pull request as ready for review January 6, 2026 17:44
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 6, 2026
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and robscott January 6, 2026 17:44
Signed-off-by: sallyom <somalley@redhat.com>
@sallyom
Copy link
Contributor Author

sallyom commented Jan 7, 2026

For example, in llm-d, with the GAIE entry span & propagation, a trace looks like:
gaie-entry-span-overview
and drilldown see the GAIE plugins & vLLM end-to-end trace (with other llm-d components instrumented):
gaie-entry-trace

Without this PR, without the entry span & propagation but with tracing enabled in GAIE, spans in individual components aren't connected:
gaie-no-entry-span

Comment on lines +137 to +149
// Inject trace context headers for propagation to downstream services
traceHeaders := make(map[string]string)
propagator := otel.GetTextMapPropagator()
propagator.Inject(ctx, propagation.MapCarrier(traceHeaders))
for key, value := range traceHeaders {
headers = append(headers, &configPb.HeaderValueOption{
Header: &configPb.HeaderValue{
Key: key,
RawValue: []byte(value),
},
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should only be done if the user requested tracing. I think we need to add either a command line argument to enable tracing or to add something in the EPP Configuration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't need to manually propagate context like this, as long as the go context.Context is correctly passed around then the otel sdk will handle propagation for you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @damemi! I wasn't sure about this, I will remove this and retest to be sure. TY again!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove the manual propagation, then will verify with llm-d:

  1. Does vllm:llm_request span show up as a child of gateway.request?
  2. Does the trace ID remain consistent end-to-end?
  3. If there's an upstream traceparent, is it continued correctly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entry point of request handling is: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/handlers/server.go#L128C49-L128C80

Where the context in Go is wrapped in the srv extProcPb.ExternalProcessor_ProcessServer. Does OTel need the context to be explicitly defined in function interface?

ref - https://pkg.go.dev/google.golang.org/grpc#ServerStream

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some testing with the context propagation - it seems with GAIE's architecture we need to manually propagate the trace headers. With GAIE's architecture as an Envoy External Processor it doesn't make HTTP requests directly. Without manual propagation, trace context doesn't reach downstream services. I have confirmed this with some testing. Without the manual trace propagation we see separate spans for gateway-api-inference-extension and vllm services, not the vllm child span with the propagated context headers. I'll leave the manual propagation in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sallyom ah that's interesting, I didn't think about how this was working with envoy so there could be some work you need to do there. Not something I've worked with before but testing tells the truth

Comment on lines +131 to +136

// Start tracing span for the request
tracer := otel.Tracer("gateway-api-inference-extension")
ctx, span := tracer.Start(ctx, "gateway.request", trace.WithSpanKind(trace.SpanKindServer))
defer span.End()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should only be done if the user requested tracing. I think we need to add either a command line argument to enable tracing or to add something in the EPP Configuration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these calls are a zero-overhead no-op unless a TracerProvider is configured. So, all you should need to gate on the user enabling is the creation of the TracerProvider itself.

For reference, this is the same way that Kubernetes components implement tracing. They actually set up a no-op tracerprovider, but having no TracerProvider configured should be effectively the same.

Either way, it's not about feature gating the tracer.Start() calls, it's about the tracerprovider

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @damemi! I'll leave as/is but still open to other opinions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the trace initialization is only invoked if the tracing is enabled:

If InitTracing is not invoked, a default noop provider will be used (Correct me if I was wrong here). So it should be fine to keep it the way the PR implements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants