Summary
Mellea's backend-scope spans emit the core OTel GenAI semantic-convention metadata (gen_ai.usage.*, gen_ai.request.model, gen_ai.response.finish_reasons, gen_ai.operation.name) — which is great. While adding LLM-interaction debugging features to otelite (a local OTel receiver and dashboard for LLM development), we noticed several in-spec or de facto standard attributes/events that don't appear to be emitted today. Filling these gaps would improve debuggability in any OTel backend — Jaeger, Honeycomb, Grafana Tempo, Datadog, Arize, Phoenix, otelite — not just otelite.
⚠️ Requires verification. Observations are based on Mellea's public telemetry/tracing docs and inspection of instrumented output. A walk of mellea/telemetry.py and the backend span wrappers is needed to confirm what is / isn't emitted.
Observed vs expected
1. Chat content — gen_ai.*.message events
Observed: no transcript of the conversation is visible alongside Mellea-traced LLM calls in OTel backends. Metadata (tokens, model, latency) is present, but prompts and responses are not.
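Expected (per OTel GenAI semconv): content capture via span events with the names below, each carrying structured role + content + optional tool_calls. Newer semconv revisions are moving message content to the gen_ai.input.messages / gen_ai.output.messages attributes, so the exact form should be checked against the revision Mellea targets: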
- gen_ai.system.message
- gen_ai.user.message
- gen_ai.assistant.message
- gen_ai.tool.message
- gen_ai.choice
Without these, the single most common LLM-debug question ("what did the model actually say?") can only be answered by adding bespoke logging, which defeats the purpose of OTel instrumentation.
For prior art on the privacy/off-by-default pattern: OpenLLMetry's TRACELOOP_TRACE_CONTENT gates content capture behind an env var. A MELLEA_TRACE_CONTENT opt-in would fit naturally alongside the existing MELLEA_TRACE_APPLICATION / MELLEA_TRACE_BACKEND.
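A minimal sketch of what opt-in content capture could look like. It assumes the MELLEA_TRACE_CONTENT variable proposed above (not an existing Mellea setting) and any span object that exposes OpenTelemetry's add_event(name, attributes=...) method. The helper names, the RecordingSpan stub, and the attribute keys placed on each event are illustrative assumptions, not Mellea's or the spec's API.

```python
import json
import os

# Event names from the list above, keyed by chat message role.
EVENT_BY_ROLE = {
    "system": "gen_ai.system.message",
    "user": "gen_ai.user.message",
    "assistant": "gen_ai.assistant.message",
    "tool": "gen_ai.tool.message",
}


def content_capture_enabled(env=None):
    """Off by default; MELLEA_TRACE_CONTENT is the hypothetical opt-in flag."""
    env = os.environ if env is None else env
    return env.get("MELLEA_TRACE_CONTENT", "").strip().lower() in {"1", "true", "yes"}


def emit_message_events(span, messages, env=None):
    """Attach one span event per chat message; return how many were emitted."""
    if not content_capture_enabled(env):
        return 0
    emitted = 0
    for msg in messages:
        name = EVENT_BY_ROLE.get(msg.get("role"))
        if name is None:
            continue  # unknown role: skip rather than guess an event name
        attrs = {"role": msg["role"], "content": msg.get("content", "")}
        if msg.get("tool_calls"):
            # Span attribute values must be primitives, so serialize to JSON.
            attrs["tool_calls"] = json.dumps(msg["tool_calls"])
        span.add_event(name, attributes=attrs)
        emitted += 1
    return emitted


class RecordingSpan:
    """Stand-in for an OTel span, used only to make this sketch runnable."""

    def __init__(self):
        self.events = []

    def add_event(self, name, attributes=None):
        self.events.append((name, dict(attributes or {})))
```

With the flag unset, emit_message_events is a no-op, which keeps prompts and responses out of traces unless a user explicitly opts in. The model's reply could be recorded the same way under gen_ai.choice.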
2. gen_ai.conversation.id on session spans
Observed: sessions are scoped via start_session() as a context manager, but no conversation identifier appears to propagate onto child backend spans. Cross-trace correlation requires joining by trace_id or service_name, which fragments at process / retry boundaries.
Expected: emit gen_ai.conversation.id (or the evolving gen_ai.request.conversation.id — spec is moving) on every span inside a session, using the session's identifier. OTel backends can then group LLM activity per conversation without instrumentation changes.
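One way this could work in-process, sketched under assumptions: a context variable holds the session's identifier, and a span processor copies it onto every span started while the session is active. conversation_scope is a hypothetical stand-in for whatever start_session() does internally, and ConversationIdProcessor is duck-typed here so the sketch runs without the SDK; in a real integration it would presumably subclass opentelemetry.sdk.trace.SpanProcessor and be registered via TracerProvider.add_span_processor.

```python
import contextvars
import uuid
from contextlib import contextmanager

# Active conversation id for the current task/thread; None outside a session.
_conversation_id = contextvars.ContextVar("conversation_id", default=None)


@contextmanager
def conversation_scope(conversation_id=None):
    """Hypothetical session scope: sets the id on entry, restores it on exit."""
    cid = conversation_id or uuid.uuid4().hex
    token = _conversation_id.set(cid)
    try:
        yield cid
    finally:
        _conversation_id.reset(token)


class ConversationIdProcessor:
    """Stamps gen_ai.conversation.id on every span started inside a scope."""

    def on_start(self, span, parent_context=None):
        cid = _conversation_id.get()
        if cid is not None:
            span.set_attribute("gen_ai.conversation.id", cid)

    def on_end(self, span):
        pass  # nothing to do when a span finishes

    def shutdown(self):
        pass

    def force_flush(self, timeout_millis=30000):
        return True
```

Because the attribute is set at span start, backend spans created by nested calls pick it up without each call site having to pass the identifier along.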
3. Prompt template attributes
Observed: Mellea has explicit template surfaces — @generative (docstring becomes prompt) and m.instruct() (template + user_variables). Currently there's no way to group calls by template in a downstream tool; every call looks like a one-off unless users manually tag them.
Expected: emit on LLM spans:
- llm.prompt_template.template: the raw template / docstring
- llm.prompt_template.variables: the substitution dict as JSON
- llm.prompt_template.version (optional)
These are OpenInference attributes, widely honoured (Traceloop, Langfuse, Arize, Phoenix, otelite). They let users answer "which template is most expensive?" and "which template hits max_tokens most often?" in any of those tools.
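The three attributes above could be assembled by a small helper like the following. This is a sketch, not Mellea code: prompt_template_attributes is a hypothetical name, and the choice to serialize variables with sorted keys is an assumption made so that identical inputs produce identical attribute values.

```python
import json


def prompt_template_attributes(template, variables=None, version=None):
    """Build OpenInference-style prompt-template span attributes."""
    attrs = {"llm.prompt_template.template": template}
    if variables is not None:
        # sort_keys keeps the JSON stable so identical inputs group together;
        # default=str keeps non-JSON values (e.g. dates) from raising.
        attrs["llm.prompt_template.variables"] = json.dumps(
            variables, sort_keys=True, default=str
        )
    if version is not None:
        attrs["llm.prompt_template.version"] = str(version)
    return attrs
```

The resulting dict can be applied in one call with an OTel span's set_attributes(attrs). For the @generative path the template argument would be the function docstring; for m.instruct() it would be the template string and user_variables.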
4. Error status on failed API calls
Observed: failed LLM requests don't consistently map to OTel's standard ERROR span status, which means generic error-rate widgets in observability backends can't count them.
Expected: on API failure, set span status to ERROR (standard OTel) and optionally emit gen_ai.response.error. This makes Mellea LLM failures visible in any generic "error rate by model" or alerting rule without Mellea-specific code.
5. Sampling / retry as standard attributes
Observed: strategy_type and num_generate_logs are framework-specific attrs that generic tools don't know how to interpret.
Expected: alongside the existing Mellea-specific attrs, emit standard fallbacks on each backend span:
- gen_ai.request.attempt (or the widely-used attempt) per call, starting at 1 and incrementing across sampling retries
- gen_ai.request.choice.count (the spelling used in the OTel GenAI attribute registry) when the strategy requests N candidates
Generic retry/sampling aggregation then works without code changes in the observability backend.
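As a rough illustration, a sampling loop could report these attributes once per attempt. Everything here is hypothetical scaffolding rather than Mellea's sampling API: run_with_retries, its generate / validate / on_attempt callbacks, and the decision to omit the choice count when it equals 1. The attribute name gen_ai.request.attempt follows the proposal above and is treated as an assumption; gen_ai.request.choice.count follows the OTel GenAI attribute registry.

```python
def attempt_attributes(attempt, choice_count=None):
    """Per-call attributes: 1-based attempt number plus optional choice count."""
    if attempt < 1:
        raise ValueError("attempt numbering starts at 1")
    attrs = {"gen_ai.request.attempt": attempt}
    if choice_count is not None and choice_count != 1:
        attrs["gen_ai.request.choice.count"] = choice_count
    return attrs


def run_with_retries(generate, validate, max_attempts, on_attempt):
    """Retry until validate() accepts a result, reporting attributes each time.

    on_attempt receives the attribute dict; in real instrumentation it would
    set them on the backend span for that call.
    """
    for attempt in range(1, max_attempts + 1):
        on_attempt(attempt_attributes(attempt))
        result = generate()
        if validate(result):
            return result, attempt
    return None, max_attempts
```

Keeping the attribute construction separate from the loop means the existing strategy_type and num_generate_logs attributes can stay as they are, with the standard ones emitted alongside.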
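Verification checklist
- Walk mellea/telemetry.py / backend wrappers to confirm current emission coverage for each of the five items above
- Content-capture opt-in (MELLEA_TRACE_CONTENT)
- Conversation identifier attribute name (gen_ai.conversation.id vs newer variants)
- Template attributes on the @generative (function docstring) and instruct (template string) paths, including any prompt-template versioning
- ERROR span status mapping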
Why this matters for users
Each item is in-spec (OTel GenAI semconv) or de facto standard (OpenInference llm.prompt_template.*). None requires Mellea to deviate from its standards-focused approach. Collectively they unlock:
- Transcript visibility in any OTel backend (not just framework-specific UIs)
- Conversation-level queries across process and retry boundaries
- Template-level cost and outcome aggregation
- Generic error-rate / alerting on LLM failures
- Retry and sampling observability without bespoke parsing
What the extra telemetry would unlock in tooling
Any OTel backend benefits from these attributes/events — Jaeger, Honeycomb, Grafana Tempo, Datadog, Arize Phoenix, Langfuse — but to make the debugging experience concrete, here's what a Mellea user would gain in otelite (the local OTel receiver + LLM-focused dashboard where these gaps surfaced), using features that already ship today:
- gen_ai.*.message events → full chat transcripts inline on the trace, rendered as role-coloured chat bubbles without any backend adapter. Without these, users see token counts but can't answer "what did the model actually say?".
- gen_ai.conversation.id → one-click "show all signals (logs + traces + metrics) for this conversation", the kind of filter that makes multi-turn debugging tractable.
- llm.prompt_template.template / .variables → cost, latency, error rate, and truncation aggregated by template; "which prompt hits max_tokens most often?" becomes a single click.
- Standard ERROR span status → feeds generic per-model error-rate widgets and alerting rules; today a failed Mellea LLM call has to be found by hand.
- gen_ai.request.attempt / gen_ai.request.choice.count → retry-rate gauges and per-attempt cost breakdown work without framework-specific parsing.
otelite (a personal-project local observability tool for LLM developers) ships with first-class support for all of these today — cost estimation against a Claude 4.x pricing table, cache-tier breakdown (5m vs 1h), finish-reason / truncation analytics, per-model latency percentiles (P50/P95/P99), tool-use success rates, server-side attribute filtering, and chat transcript rendering. It takes OTel GenAI attributes/events as they are — no adapter layer — which is why gaps in Mellea's current emission show up directly in its dashboard. Pointing Mellea at otelite is a one-liner (pip install "mellea[telemetry]" + OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317) documented on its Setup tab, so the round-trip to verify any of the items above is short.
Source of observations
Gaps identified while building otelite. otelite already handles each of these attributes/events when emitted; they just don't seem to be coming from Mellea today. The verification checklist above should be walked against Mellea's telemetry source to confirm.