OTel omissions/gaps: chat content events, gen_ai.conversation.id, prompt templates, error status #1035

@planetf1

Summary

Mellea's backend-scope spans emit the core OTel GenAI semantic-convention metadata (gen_ai.usage.*, gen_ai.request.model, gen_ai.response.finish_reasons, gen_ai.operation.name) — which is great. While adding LLM-interaction debugging features to otelite (a local OTel receiver and dashboard for LLM development), we noticed several in-spec or de facto standard attributes/events that don't appear to be emitted today. Filling these gaps would improve debuggability in any OTel backend — Jaeger, Honeycomb, Grafana Tempo, Datadog, Arize, Phoenix, otelite — not just otelite.

⚠️ Requires verification. Observations are based on Mellea's public telemetry/tracing docs and inspection of instrumented output. A walk of mellea/telemetry.py and the backend span wrappers is needed to confirm what is / isn't emitted.

Observed vs expected

1. Chat content — gen_ai.*.message events

Observed: no transcript of the conversation is visible alongside Mellea-traced LLM calls in OTel backends. Metadata (tokens, model, latency) is present, but prompts and responses are not.

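Expected (per OTel GenAI semconv): content capture via per-message events, each carrying structured role + content + optional tool_calls. The spec is still moving here: these event names come from earlier semconv revisions, and newer revisions favour gen_ai.input.messages / gen_ai.output.messages, so the target revision needs confirming. Event names: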

  • gen_ai.system.message
  • gen_ai.user.message
  • gen_ai.assistant.message
  • gen_ai.tool.message
  • gen_ai.choice

Without these, the single most common LLM-debug question ("what did the model actually say?") can only be answered by adding bespoke logging, which defeats the purpose of OTel instrumentation.

For prior art on the privacy/off-by-default pattern: OpenLLMetry's TRACELOOP_TRACE_CONTENT gates content capture behind an env var. A MELLEA_TRACE_CONTENT opt-in would fit naturally alongside the existing MELLEA_TRACE_APPLICATION / MELLEA_TRACE_BACKEND.
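
To make the opt-in pattern concrete, here is a minimal sketch of content capture gated by an env var. It is not Mellea code: MELLEA_TRACE_CONTENT is the flag proposed above, message_events and content_capture_enabled are hypothetical helper names, and the attribute keys (role, content, tool_calls) are an assumption about how the structured fields would be flattened.

```python
import json
import os

# Event names as listed in this issue, keyed by chat role.
_EVENT_BY_ROLE = {
    "system": "gen_ai.system.message",
    "user": "gen_ai.user.message",
    "assistant": "gen_ai.assistant.message",
    "tool": "gen_ai.tool.message",
}


def content_capture_enabled(environ=os.environ):
    """True only when the proposed MELLEA_TRACE_CONTENT flag is set to a truthy value."""
    value = environ.get("MELLEA_TRACE_CONTENT", "")
    return value.strip().lower() in {"1", "true", "yes"}


def message_events(messages, environ=os.environ):
    """Map chat messages to (event_name, attributes) pairs.

    Returns an empty list when capture is off, so prompts never leave
    the process by default. Messages with an unknown role are skipped.
    """
    if not content_capture_enabled(environ):
        return []
    events = []
    for msg in messages:
        name = _EVENT_BY_ROLE.get(msg["role"])
        if name is None:
            continue
        attrs = {"role": msg["role"], "content": msg["content"]}
        if msg.get("tool_calls"):
            # Span/event attribute values must be primitives, so serialise to JSON.
            attrs["tool_calls"] = json.dumps(msg["tool_calls"])
        events.append((name, attrs))
    return events
```

With the OpenTelemetry Python API, each pair could then be attached to the backend span via span.add_event(name, attributes=attrs).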

2. gen_ai.conversation.id on session spans

Observed: sessions are scoped via start_session() as a context manager, but no conversation identifier appears to propagate onto child backend spans. Cross-trace correlation requires joining by trace_id or service_name, which fragments at process / retry boundaries.

Expected: emit gen_ai.conversation.id (or the evolving gen_ai.request.conversation.id — spec is moving) on every span inside a session, using the session's identifier. OTel backends can then group LLM activity per conversation without instrumentation changes.
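
One possible shape for the propagation, sketched with the standard library only. conversation_scope and conversation_attributes are hypothetical names, not Mellea's start_session() API; the assumption is that the session context manager could bind an id in a similar way and that backend span wrappers would merge the returned attributes into each span they create.

```python
import contextvars
import uuid
from contextlib import contextmanager

# Holds the active conversation id for the current execution context.
_conversation_id = contextvars.ContextVar("conversation_id", default=None)


@contextmanager
def conversation_scope(conversation_id=None):
    """Bind a conversation id for the duration of a session-like block."""
    token = _conversation_id.set(conversation_id or str(uuid.uuid4()))
    try:
        yield _conversation_id.get()
    finally:
        # Restore whatever was bound before, so nested scopes unwind cleanly.
        _conversation_id.reset(token)


def conversation_attributes():
    """Attributes to merge into every span created inside the scope."""
    cid = _conversation_id.get()
    return {"gen_ai.conversation.id": cid} if cid is not None else {}
```

Because contextvars follow asyncio tasks, the id would stay attached to concurrent sessions independently; crossing a process boundary would still need the id passed explicitly.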

3. Prompt template attributes

Observed: Mellea has explicit template surfaces — @generative (docstring becomes prompt) and m.instruct() (template + user_variables). Currently there's no way to group calls by template in a downstream tool; every call looks like a one-off unless users manually tag them.

Expected: emit on LLM spans:

  • llm.prompt_template.template — the raw template / docstring
  • llm.prompt_template.variables — the substitution dict as JSON
  • llm.prompt_template.version (optional)

These are OpenInference attributes, honoured widely (Traceloop, Langfuse, Arize, Phoenix, otelite). They let users answer "which template is most expensive?" and "which template hits max_tokens most often?" in any tool that understands them.
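
A small sketch of how the three attributes could be assembled for both template surfaces. The helper names are hypothetical, and reading the docstring with inspect.getdoc is an assumption about how the @generative path might expose its prompt, not a description of Mellea internals.

```python
import inspect
import json


def prompt_template_attributes(template, variables, version=None):
    """Build the OpenInference prompt-template attributes for one LLM span."""
    attrs = {
        "llm.prompt_template.template": template,
        # Attribute values must be primitives, so the dict is stored as JSON.
        # default=str keeps non-JSON-serialisable values from raising.
        "llm.prompt_template.variables": json.dumps(
            variables, sort_keys=True, default=str
        ),
    }
    if version is not None:
        attrs["llm.prompt_template.version"] = version
    return attrs


def docstring_template(func):
    """Return a function's cleaned docstring, or an empty string if it has none."""
    return inspect.getdoc(func) or ""
```

Since variables may contain user data, the variables attribute probably belongs behind the same opt-in as message content.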

4. Error status on failed API calls

Observed: failed LLM requests don't consistently map to OTel's standard ERROR span status, which means generic error-rate widgets in observability backends can't count them.

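Expected: on API failure, set span status to ERROR (standard OTel) and optionally emit an error attribute. error.type is the one the OTel semantic conventions define for this; gen_ai.response.error would need checking against the spec before use. This makes Mellea LLM failures visible in any generic "error rate by model" or alerting rule without Mellea-specific code.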
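
The failure-path mapping can be sketched as a context manager around the backend call. RecordingSpan is a stand-in written for this example so it runs without the OpenTelemetry SDK; mark_errors is a hypothetical name, and a real wrapper would call the span API noted in the comment.

```python
from contextlib import contextmanager


class RecordingSpan:
    """Stand-in for an OTel span; records what a real span would be told."""

    def __init__(self):
        self.status = "UNSET"
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


@contextmanager
def mark_errors(span):
    """Mark the span as failed if the wrapped block raises, then re-raise."""
    try:
        yield span
    except Exception as exc:
        # With the real API this would be:
        #   span.set_status(Status(StatusCode.ERROR, str(exc)))
        #   span.record_exception(exc)
        span.status = "ERROR"
        span.set_attribute("error.type", type(exc).__qualname__)
        raise
```

Re-raising keeps caller behaviour unchanged; the wrapper only guarantees that every failure path leaves the same status on the span.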

5. Sampling / retry as standard attributes

Observed: strategy_type and num_generate_logs are framework-specific attrs that generic tools don't know how to interpret.

Expected: alongside the existing Mellea-specific attrs, emit standard fallbacks on each backend span:

  • gen_ai.request.attempt (or the widely-used attempt) per call, starting at 1 and incrementing across sampling retries
  • gen_ai.request.choice.count (the semconv spelling; not choice_count) when the strategy requests N candidates

Generic retry/sampling aggregation then works without code changes in the observability backend.
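
As an illustration of the numbering, here is a sketch of a retry loop that produces one attribute set per backend call. sample_with_retries and attempt_attributes are hypothetical names and do not mirror Mellea's sampling strategies; gen_ai.request.attempt is the name proposed in this issue rather than an established semconv attribute, while the candidate count uses the semconv spelling gen_ai.request.choice.count.

```python
def attempt_attributes(attempt, choice_count=None):
    """Attributes for one backend call within a sampling/retry loop."""
    if attempt < 1:
        raise ValueError("attempt numbering starts at 1")
    attrs = {"gen_ai.request.attempt": attempt}
    if choice_count is not None:
        attrs["gen_ai.request.choice.count"] = choice_count
    return attrs


def sample_with_retries(generate, is_valid, max_attempts, choice_count=None):
    """Call generate() until is_valid() accepts the result.

    Returns (result, emitted), where emitted holds the attribute dict that
    would be set on each attempt's span. result is None if every attempt fails.
    """
    emitted = []
    for attempt in range(1, max_attempts + 1):
        emitted.append(attempt_attributes(attempt, choice_count))
        result = generate()
        if is_valid(result):
            return result, emitted
    return None, emitted
```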

Verification checklist

  • Walk mellea/telemetry.py / backend wrappers to confirm current emission coverage for each of the five items above
  • Decide default on/off for content capture (privacy vs debuggability) — recommend off, opt-in via MELLEA_TRACE_CONTENT
  • Pick the conversation-ID attribute name (spec is evolving — gen_ai.conversation.id vs newer variants)
  • Confirm prompt templates can be reliably captured for both @generative (function docstring) and instruct (template string) paths, including any prompt-template versioning
  • Audit failure paths for consistent ERROR span status mapping

Why this matters for users

Each item is in-spec (OTel GenAI semconv) or de facto standard (OpenInference llm.prompt_template.*). None requires Mellea to deviate from its standards-focused approach. Collectively they unlock:

  • Transcript visibility in any OTel backend (not just framework-specific UIs)
  • Conversation-level queries across process and retry boundaries
  • Template-level cost and outcome aggregation
  • Generic error-rate / alerting on LLM failures
  • Retry and sampling observability without bespoke parsing

What the extra telemetry would unlock in tooling

Any OTel backend benefits from these attributes/events — Jaeger, Honeycomb, Grafana Tempo, Datadog, Arize Phoenix, Langfuse — but to make the debugging experience concrete, here's what a Mellea user would gain in otelite (the local OTel receiver + LLM-focused dashboard where these gaps surfaced), using features that already ship today:

  • gen_ai.*.message events → full chat transcripts inline on the trace, rendered as role-coloured chat bubbles without any backend adapter. Without these, users see token counts but can't answer "what did the model actually say?".
  • gen_ai.conversation.id → one-click "show all signals (logs + traces + metrics) for this conversation" — the kind of filter that makes multi-turn debugging tractable.
  • llm.prompt_template.template / .variables → group aggregate cost, latency, error rate, and truncation by template — "which prompt hits max_tokens most often?" becomes a single click.
  • Standard ERROR span status → feeds generic per-model error-rate widgets and alerting rules; today a failed Mellea LLM call has to be found by hand.
  • gen_ai.request.attempt / gen_ai.request.choice.count → retry-rate gauges and per-attempt cost breakdown work without framework-specific parsing.

otelite (a personal-project local observability tool for LLM developers) already supports all of these attributes and events, alongside cost estimation against a Claude 4.x pricing table, cache-tier breakdown (5m vs 1h), finish-reason / truncation analytics, per-model latency percentiles (P50/P95/P99), tool-use success rates, server-side attribute filtering, and chat transcript rendering. It takes OTel GenAI attributes/events as they are, with no adapter layer, which is why gaps in Mellea's current emission show up directly in its dashboard. Pointing Mellea at otelite takes two short steps (pip install "mellea[telemetry]", then set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317), both documented on its Setup tab, so the round-trip to verify any of the items above is short.

Source of observations

Gaps identified while building otelite. otelite already handles each of these attributes/events when emitted; they just don't seem to be coming from Mellea today. The verification checklist above should be walked against Mellea's telemetry source to confirm.

Labels: enhancement (new feature or request)
