LCORE-1591: Observability for Lightspeed Core #1482
asimurka wants to merge 1 commit into lightspeed-core:main
Conversation
Walkthrough

A new design specification document adds OpenTelemetry-based distributed tracing for Lightspeed Core, defining traces/spans, YAML-driven enablement and OTLP export configuration, W3C trace-context propagation controls, instrumentation points (FastAPI auto-instrumentation plus manual spans for LLM, RAG, tools, moderation, streaming), and lifecycle/resilience rules.
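A minimal configuration sketch matching this model; the `observability.otel.*` field names below are taken from the review comments on this PR, so exact keys in the final document may differ:

```yaml
observability:
  otel:
    enabled: true
    export:
      endpoint: http://otel-collector:4317   # OTLP collector endpoint
      protocol: grpc                         # or http/protobuf
      service: lightspeed-core               # service.name resource attribute
    propagation:
      incoming: true    # continue inbound W3C trace context
      outgoing: true    # propagate context on downstream calls
```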
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes. Pre-merge checks: ✅ 3 passed.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/observability-opentelemetry/observability-opentelemetry.md`:
- Around line 193-243: Fix markdown lint failures MD022 and MD031 by ensuring
there's exactly one blank line before and after each heading and that fenced
code blocks are preceded and followed by a blank line; specifically adjust
spacing around the "Configuration model:", "Pros:", "Cons:", "Important
constraints:", and the "Option 2 — Manual SDK initialization with YAML-driven
hybrid configuration" heading blocks in the observability-opentelemetry.md
content (lines shown in the review) so headings have a single blank line
separation and add/remove blank lines around the fenced code block at line ~297
to ensure one blank line before and after the triple-backtick fence.
- Line 107: Add a new "Verification / Test Plan" section in
observability-opentelemetry.md (to satisfy R10) that contains explicit
step-by-step validation checks and expected results: include subsections for
"Startup validation" (checks that tracing initializes, config/env var checks,
sample commands and expected logs), "Inbound continuation" (send an incoming
request and verify trace IDs are continued with expected span relationships),
"Outbound propagation" (make an outbound call and verify propagation headers and
downstream spans), and "Exporter failure behavior" (simulate exporter downtime
and verify retry/drop behavior and observable error logs); for each subsection
list exact commands or curl examples, environment variables to set, acceptance
criteria, and sample evidence (log snippets or trace UI screenshots).
- Line 298: Fix the typos in the documentation text: change "initializaion" to
"initialization" in the section heading "Enrichment of Llama Stack server-mode
client initializaion" and change "primarly" to "primarily" where it appears
(line referenced as 582). Update those two words in the markdown content so the
normative spec uses correct spelling.
- Around line 56-57: Change the Setext-style line so it is treated as a normal
paragraph: replace the standalone line "When tracing is disabled, no spans are
created and no tracing-related processing is performed." so it is followed by a
blank line before the thematic break (the sequence of hyphens) or remove the
hyphen-only line immediately after; ensure the hyphen separator remains a proper
horizontal rule by having an empty line above it so markdownlint MD003 no longer
interprets that sentence as a Setext heading.
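Concretely, the fix is just a blank line between the sentence and the hyphen rule (content abbreviated):

```markdown
When tracing is disabled, no spans are created and no tracing-related processing is performed.

---
```

Without that blank line, the hyphen run sitting directly under the sentence is parsed as a setext heading underline, which is what triggers MD003.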
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 96b8e5e7-b660-45b6-8800-46ea7d3f5849
📒 Files selected for processing (1)
docs/design/observability-opentelemetry/observability-opentelemetry.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: list_outdated_dependencies
- GitHub Check: radon
- GitHub Check: unit_tests (3.13)
- GitHub Check: unit_tests (3.12)
- GitHub Check: integration_tests (3.12)
- GitHub Check: integration_tests (3.13)
- GitHub Check: build-pr
- GitHub Check: mypy
- GitHub Check: bandit
- GitHub Check: Pyright
- GitHub Check: Pylinter
- GitHub Check: E2E: server mode / ci
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: E2E: library mode / ci
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
🪛 LanguageTool
docs/design/observability-opentelemetry/observability-opentelemetry.md
[grammar] ~298-~298: Ensure spelling is correct
Context: "...hment of Llama Stack server-mode client initializaion:" followed by:

```python
client_kwargs: dict[str, Any] = {
    "base_url": base_url,
    "api_key": api_key,
    "timeout": config.timeout,
}
if distributed_tracing_to_llama_enabled() and base_url is not None:
    client_kwargs["http_client"] = llama_stack_httpx_async_client(
        base_url=base_url,
        timeout=config.timeout,
    )
self._lsc = AsyncLlamaStackClient(**client_kwargs)
```

"Library mode: This wiring applies only..."
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.22.0)
docs/design/observability-opentelemetry/observability-opentelemetry.md
[warning] 56: MD003 (heading-style) — Expected: atx; Actual: setext
[warning] 193, 198, 204, 209, 223, 231, 237, 243: MD022 (blanks-around-headings) — Headings should be surrounded by blank lines (Expected: 1; Actual: 0; Below)
[warning] 297: MD031 (blanks-around-fences) — Fenced code blocks should be surrounded by blank lines
Force-pushed from 67f9919 to e0e6b4f.
Actionable comments posted: 2
♻️ Duplicate comments (2)
docs/design/observability-opentelemetry/observability-opentelemetry.md (2)
194-244: ⚠️ Potential issue | 🟡 Minor — Fix markdownlint blockers (MD022/MD031) in heading/fence spacing.
Line 194/199/205/210/224/232/238/244 need blank lines around headings, and Line 298 needs a blank line after the fenced code block before prose on Line 299.
Minimal formatting fix pattern
````diff
 ### Configuration model:
+
 - All tracing configuration is provided via `OTEL_*` environment variables ...
 ...
-```
-Enrichment of Llama Stack server-mode client initialization:
+```
+
+Enrichment of Llama Stack server-mode client initialization:
````

Also applies to: 298-299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 194 - 244, The markdown has lint errors MD022/MD031 due to missing blank lines around several headings and a missing blank line after a fenced code block; add a single blank line before and after each top-level and subheading such as "Configuration model:", "Pros:", "Cons:", "Important constraints:", "Option 2 — Manual SDK initialization with YAML-driven hybrid configuration" and the other listed headings, and ensure there is an empty line after any fenced code block before the following paragraph (e.g., the fenced block that precedes "Enrichment of Llama Stack server-mode client initialization:"); this will satisfy markdownlint spacing rules without changing heading text or content.
107-108: ⚠️ Potential issue | 🟠 Major — Add an explicit verification/test plan section to satisfy R10.
The requirement on Line 108 says docs must describe verification, but there is no concrete, stepwise validation procedure with expected evidence.
Suggested structure to add
```diff
+## Verification / Test Plan
+
+### 1) Startup validation
+- Configure `observability.otel.enabled=true` with valid `export.endpoint/protocol/service`.
+- Expected: app starts, tracer initializes, no config validation errors.
+
+### 2) Missing mandatory export fields
+- Remove one mandatory field (for example `service`).
+- Expected: startup fails with validation error.
+
+### 3) Inbound continuation
+- Send request with `traceparent`.
+- Expected: server span continues same trace ID.
+
+### 4) Outbound propagation (service mode)
+- Enable `propagation.outgoing`.
+- Expected: downstream call carries W3C trace headers and links in same trace.
+
+### 5) Exporter failure behavior
+- Point endpoint to unreachable collector.
+- Expected: user-facing responses unaffected; tracing errors observable in logs.
+
+### 6) Sensitive data check
+- Inspect span attributes/events.
+- Expected: no raw prompts/retrieved content.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/observability-opentelemetry/observability-opentelemetry.md` around lines 107 - 108, Add a new "Verification / Test Plan" section to the observability-opentelemetry.md doc under the existing "R10 – Documentation" heading that provides a concise, stepwise validation procedure: list required and optional env vars to set (e.g., OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_METRICS_EXPORTER), commands to run the service and otel-collector/test backends, exact actions to generate traces (e.g., sample HTTP requests or workflows calling instrumented functions), how to confirm traces/spans appear in the selected backend (Jaeger/OTLP/console) and sample log/trace evidence to look for, and pass/fail criteria (e.g., span counts, presence of service.name, correct attributes); include minimal example verification commands and expected outputs so reviewers can follow and mark R10 satisfied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/observability-opentelemetry/observability-opentelemetry.md`:
- Around line 178-180: The document currently gives conflicting normative
guidance: Step 1 endorses Option 3 (hybrid YAML+env) while later text (around
"Line 250") favors env-driven auto-instrumentation (Option 2) causing ambiguity;
pick one normative path and update the doc so Step 1, the summary paragraph (the
sentence mentioning Option 1/2/3), and the later lines referencing env-driven
auto-instrumentation consistently state the chosen approach (e.g., if keeping
Option 3, change the "env-driven auto-instrumentation is generally preferred"
sentence to explain that env-vars are supported but the canonical approach is
configuration-driven via LCORE and requires explicit tracing fields; if choosing
Option 2, change Step 1 and the summary to promote env-driven
auto-instrumentation and note when hybrid YAML+env is acceptable), and ensure
references to "opentelemetry-instrument", "LCORE", "Option 2", and "Option 3"
are updated for consistency.
- Around line 175-176: The doc contains contradictory guidance about missing
mandatory OTEL fields: one place says startup should fail when enabled:true is
missing endpoint/protocol/service name, another suggests the app may instead
“refuse tracing”; pick one normative behavior (e.g., "startup MUST fail" or
"startup MAY continue but disable tracing") and make both occurrences consistent
by replacing the ambiguous phrase with the chosen policy; update the paragraphs
that mention "If `enabled: true`, YAML must contain mandatory endpoint,
protocol, and service name; startup should fail if missing" and the later "or
refuse tracing" sentence so they both state the exact chosen action, and ensure
any precedence wording about YAML vs OTEL_* env vars remains consistent with
that decision.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a79fc6b4-a3ec-477e-b931-22b2a9fabdad
📒 Files selected for processing (1)
docs/design/observability-opentelemetry/observability-opentelemetry.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build-pr
- GitHub Check: E2E: server mode / ci
- GitHub Check: E2E: library mode / ci
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
🪛 markdownlint-cli2 (0.22.0)
docs/design/observability-opentelemetry/observability-opentelemetry.md
[warning] 194, 199, 205, 210, 224, 232, 238, 244: MD022 (blanks-around-headings) — Headings should be surrounded by blank lines (Expected: 1; Actual: 0; Below)
[warning] 298: MD031 (blanks-around-fences) — Fenced code blocks should be surrounded by blank lines
tisnik left a comment:
It looks ok!
Q. for discussion: what happens if we don't include any Llama Stack traces? IMHO not much harm will be done, as the only relevant and visible part is LCORE. In Llama Stack the flow can change in the future, so tracing will be a bit unstable from the point of view of (for example) an engineer who wants to improve performance.
Context: almost all LS calls are pretty low level, so even LCORE traces might have enough granularity.
But I'm open for discussion.
> When both YAML and env vars define the same concern, **YAML mandatory fields take precedence** for endpoint, protocol, and service name; `OTEL_*` variables control optional advanced settings (sampling, headers, TLS files, instrumentor-only options).
>
> **Recommendation:** Option 1 is rejected for maintainability. Option 2 remains viable if LCORE is run with `opentelemetry-instrument`. Option 3 leverages LCORE's configuration-driven design to ensure mandatory fields are always explicit when tracing is enabled.
Option #2 seems promising.
Q: will those options be visible in the config endpoint? If not, how do we help admins identify tracing-related problems?
Q. for the future - how to handle configuration changes in this situation?
No, these options won't be visible in the config endpoint by default. We can possibly implement some "injection" into the v1/config response model if desired. The problem is that there are a lot of configurable options for OTEL (see this link), but it is possible to collect all environment variables with the OTEL_ prefix and inject them into the config.
To the best of my knowledge, OTEL_* environment variables are collected during process startup. As a result, any change to these variables requires the service process to be restarted/redeployed to apply changes.
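If such injection were implemented, collecting the variables is straightforward; a hedged sketch (the helper name and any v1/config wiring are hypothetical, and a real implementation would need redaction, since some variables can carry credentials):

```python
import os

def collect_otel_env() -> dict[str, str]:
    """Gather all OTEL_-prefixed environment variables for diagnostics.

    Note: values such as OTEL_EXPORTER_OTLP_HEADERS may contain secrets,
    so a real config endpoint should redact them before returning.
    """
    return {
        key: value
        for key, value in sorted(os.environ.items())
        if key.startswith("OTEL_")
    }
```

Because these are read once at process startup, the snapshot returned here also documents exactly which settings the running process was started with.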
> **Recommendation:**
> - Continue using the `/metrics` endpoint for all operational metrics.
> - Expand the set of Prometheus metrics as needed to cover additional components (e.g., LLM calls, RAG retrieval, tool execution) as product needs evolve.
yes, it would be very valuable
> **Recommendation:**
> - Continue using the `/metrics` endpoint for all operational metrics.
> - Expand the set of Prometheus metrics as needed to cover additional components (e.g., LLM calls, RAG retrieval, tool execution) as product needs evolve.
> - Ensure metrics align with existing naming conventions and maintain low cardinality to avoid high memory cost in Prometheus servers.
Is it a real problem now, or just a warning/note for the future? Because now (IIRC) the cardinality selectors are just provider+model...
I think it isn't a real problem rn.
I have an impression that the general expectation in distributed systems is that when you have multi-service integration, spans from downstream services should be part of the same trace to preserve end-to-end observability. From what I've seen, Llama Stack already provides fairly rich and detailed tracing, and it would be valuable to leverage that (at least in server mode) rather than losing that level of visibility. Even without LLS spans, LCORE traces would remain complete and consistent on their own, but LLS spans would provide optional deeper insight into request-level behavior (especially latency breakdowns for more complex calls).
Tbh, I don't think so. From what I can see in Jaeger, Llama Stack traces are pretty granular, with a lot of internal spans exposed per request.
I agree, but my point is who will benefit from it. Admins/devops who look for perf problems probably won't be interested in LS internals. More interesting is (IMHO) "is the RAG what caused the slowdown?" or "which MCP call is the most time-expensive?" etc. OTOH we might be interested in tracing for LS providers, that would be good, indeed.
Yeah, I agree, but exactly for this we need LLS traces, right? Otherwise this would be completely hidden under a single LCORE responses request span.
@asimurka just wondering why the spike document is not included in this PR?
Description
Adds the feature architecture document for OpenTelemetry tracing in Lightspeed Core. Feature Specification only; no code changes.
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit