OTel omissions/gaps: chat content events, gen_ai.conversation.id, prompt templates, error status #1035

## Summary

Mellea's backend-scope spans emit the core OTel GenAI semantic-convention metadata (`gen_ai.usage.*`, `gen_ai.request.model`, `gen_ai.response.finish_reasons`, `gen_ai.operation.name`) — which is great. While adding LLM-interaction debugging features to [otelite](https://github.com/planetf1/otelite) (a local OTel receiver and dashboard for LLM development), we noticed several **in-spec or de facto standard** attributes/events that don't appear to be emitted today. Filling these gaps would improve debuggability in any OTel backend — Jaeger, Honeycomb, Grafana Tempo, Datadog, Arize, Phoenix, otelite — not just otelite.

> **⚠️ Requires verification.** Observations are based on Mellea's public telemetry/tracing docs and inspection of instrumented output. A walk of `mellea/telemetry.py` and the backend span wrappers is needed to confirm what is / isn't emitted.

## Observed vs expected

### 1. Chat content — `gen_ai.*.message` events

**Observed:** no transcript of the conversation is visible alongside Mellea-traced LLM calls in OTel backends. Metadata (tokens, model, latency) is present, but prompts and responses are not.

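**Expected (per OTel GenAI semconv):** content capture via events with these names, each carrying structured `role` + `content` + optional `tool_calls`. Earlier semconv revisions defined these as log-based events rather than span events, and newer revisions appear to deprecate them in favour of the `gen_ai.input.messages` / `gen_ai.output.messages` / `gen_ai.system_instructions` attributes, so the exact form should be checked against the semconv version Mellea targets: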
- `gen_ai.system.message`
- `gen_ai.user.message`
- `gen_ai.assistant.message`
- `gen_ai.tool.message`
- `gen_ai.choice`

Without these, the single most common LLM-debug question ("what did the model actually say?") can only be answered by adding bespoke logging, which defeats the purpose of OTel instrumentation.

For prior art on the privacy/off-by-default pattern: [OpenLLMetry's `TRACELOOP_TRACE_CONTENT`](https://github.com/traceloop/openllmetry/tree/main/packages/opentelemetry-instrumentation-openai) gates content capture behind an env var. A `MELLEA_TRACE_CONTENT` opt-in would fit naturally alongside the existing `MELLEA_TRACE_APPLICATION` / `MELLEA_TRACE_BACKEND`.
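As a rough illustration of the opt-in pattern, the sketch below gates message events behind the proposed `MELLEA_TRACE_CONTENT` variable. It is not Mellea's implementation: `content_capture_enabled` and `record_messages` are hypothetical names, the message dict shape is assumed, and the span is any object exposing an `add_event(name, attributes=...)` method (which OpenTelemetry's Python `Span` does).

```python
import json
import os
from typing import Any, Mapping, Sequence

# Values treated as "on" for the proposed MELLEA_TRACE_CONTENT flag (assumption).
_TRUTHY = {"1", "true", "yes", "on"}
# Roles that map onto the gen_ai.<role>.message event names listed above.
_KNOWN_ROLES = {"system", "user", "assistant", "tool"}


def content_capture_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Content capture is off unless the flag is explicitly set to a truthy value."""
    return env.get("MELLEA_TRACE_CONTENT", "").strip().lower() in _TRUTHY


def record_messages(
    span: Any,
    messages: Sequence[Mapping[str, Any]],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Attach one event per chat message to `span`; return how many were recorded."""
    if not content_capture_enabled(env):
        return 0
    recorded = 0
    for msg in messages:
        role = msg.get("role")
        if role not in _KNOWN_ROLES:
            continue  # skip roles with no corresponding event name
        attrs = {"role": role, "content": str(msg.get("content") or "")}
        if msg.get("tool_calls"):
            # Span attribute values must be primitives, so JSON-encode structures.
            attrs["tool_calls"] = json.dumps(msg["tool_calls"])
        span.add_event(f"gen_ai.{role}.message", attributes=attrs)
        recorded += 1
    return recorded
```

With the flag unset the function records nothing, which keeps prompts and responses out of exported telemetry by default.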

### 2. `gen_ai.conversation.id` on session spans

**Observed:** sessions are scoped via `start_session()` as a context manager, but no conversation identifier appears to propagate onto child backend spans. Cross-trace correlation requires joining by trace_id or service_name, which fragments at process / retry boundaries.

**Expected:** emit `gen_ai.conversation.id` (or the evolving `gen_ai.request.conversation.id` — spec is moving) on every span inside a session, using the session's identifier. OTel backends can then group LLM activity per conversation without instrumentation changes.
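One possible shape for the propagation, sketched with a `contextvars` variable so the identifier follows the session across nested calls and async tasks. `conversation_scope` and `stamp_conversation` are hypothetical helpers, not Mellea APIs; in practice the stamping could equally live in `start_session()` plus a span processor. The span is any object with a `set_attribute(key, value)` method.

```python
import contextlib
import contextvars
import uuid
from typing import Any, Iterator, Optional

# Holds the active conversation id for the current execution context.
_conversation_id = contextvars.ContextVar("conversation_id", default=None)


@contextlib.contextmanager
def conversation_scope(conversation_id: Optional[str] = None) -> Iterator[str]:
    """Bind a conversation id for the duration of a session-like block."""
    cid = conversation_id or uuid.uuid4().hex
    token = _conversation_id.set(cid)
    try:
        yield cid
    finally:
        # Restore whatever id (if any) was active before this scope.
        _conversation_id.reset(token)


def stamp_conversation(span: Any) -> bool:
    """Copy the active conversation id onto `span`; return False if none is set."""
    cid = _conversation_id.get()
    if cid is None:
        return False
    span.set_attribute("gen_ai.conversation.id", cid)
    return True
```

Calling `stamp_conversation(span)` wherever a backend span is created would tag every span inside the scope with the same identifier, and leave spans outside any session untouched.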

### 3. Prompt template attributes

**Observed:** Mellea has explicit template surfaces — `@generative` (docstring becomes prompt) and `m.instruct()` (template + `user_variables`). Currently there's no way to group calls by template in a downstream tool; every call looks like a one-off unless users manually tag them.

**Expected:** emit on LLM spans:
- `llm.prompt_template.template` — the raw template / docstring
- `llm.prompt_template.variables` — the substitution dict as JSON
- `llm.prompt_template.version` (optional)

These are OpenInference attributes, honoured widely (Traceloop, Langfuse, Arize, Phoenix, otelite). They let users answer "which template is most expensive?" and "which template hits `max_tokens` most often?" in any of those tools.
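A minimal sketch of how the three attributes could be assembled before being set on an LLM span. `prompt_template_attributes` is a hypothetical helper, and the template syntax in the usage note is illustrative only. The variables are JSON-encoded because span attribute values must be primitives.

```python
import json
from typing import Any, Dict, Mapping, Optional


def prompt_template_attributes(
    template: str,
    variables: Mapping[str, Any],
    version: Optional[str] = None,
) -> Dict[str, str]:
    """Build OpenInference-style prompt-template attributes for one LLM call."""
    attrs = {
        "llm.prompt_template.template": template,
        # sort_keys keeps the encoding stable so identical inputs group together;
        # default=str is a fallback for values JSON cannot encode natively.
        "llm.prompt_template.variables": json.dumps(
            dict(variables), sort_keys=True, default=str
        ),
    }
    if version is not None:
        attrs["llm.prompt_template.version"] = version
    return attrs
```

For example, `prompt_template_attributes("Summarise {{text}} in {{n}} words", {"text": "abc", "n": 5})` yields the raw template plus the variables as a JSON string, and each key/value pair can then be passed to `span.set_attribute`.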

### 4. Error status on failed API calls

**Observed:** failed LLM requests don't consistently map to OTel's standard `ERROR` span status, which means generic error-rate widgets in observability backends can't count them.

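**Expected:** on API failure, set span status to `ERROR` (standard OTel) and record the failure class in `error.type`, the attribute the GenAI semconv uses for failed operations (`gen_ai.response.error` does not appear to be a registered attribute). This makes Mellea LLM failures visible in any generic "error rate by model" widget or alerting rule without Mellea-specific code.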
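The mapping could look roughly like the sketch below: a context manager that marks the span as failed and re-raises. `mark_errors` is a hypothetical name, and the status value is passed in so the sketch stays dependency-free; with the OpenTelemetry Python API it would be `opentelemetry.trace.StatusCode.ERROR`. The span only needs `set_attribute`, `record_exception`, and `set_status`, which the real `Span` provides. Recording `error.type` follows the general OTel error convention and is an assumption about what Mellea would choose.

```python
import contextlib
from typing import Any, Iterator


@contextlib.contextmanager
def mark_errors(span: Any, error_status: Any) -> Iterator[Any]:
    """Mark `span` as failed if the wrapped backend call raises, then re-raise."""
    try:
        yield span
    except Exception as exc:
        # error.type carries the exception class so backends can group failures.
        span.set_attribute("error.type", type(exc).__qualname__)
        # record_exception adds an exception event with message and stack trace.
        span.record_exception(exc)
        # The ERROR status is what generic error-rate widgets count.
        span.set_status(error_status, str(exc))
        raise
```

This matters most on paths where the exception is caught and handled (for example, retried) before it can propagate out of the span, since those failures otherwise leave the span looking successful.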

### 5. Sampling / retry as standard attributes

**Observed:** `strategy_type` and `num_generate_logs` are framework-specific attrs that generic tools don't know how to interpret.

**Expected:** alongside the existing Mellea-specific attrs, emit standard fallbacks on each backend span:
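- `gen_ai.request.attempt` (or the widely-used `attempt`) per call, starting at 1 and incrementing across sampling retries; neither name appears to be in the semconv registry, so this would be a shared convention rather than an in-spec attribute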
- `gen_ai.request.choice.count` (the semconv name for the number of requested completions) when the strategy requests N candidates

Generic retry/sampling aggregation then works without code changes in the observability backend.

## Verification checklist

- [ ] Walk `mellea/telemetry.py` / backend wrappers to confirm current emission coverage for each of the five items above
- [ ] Decide default on/off for content capture (privacy vs debuggability) — recommend off, opt-in via `MELLEA_TRACE_CONTENT`
- [ ] Pick the conversation-ID attribute name (spec is evolving — `gen_ai.conversation.id` vs newer variants)
- [ ] Confirm prompt templates can be reliably captured for both `@generative` (function docstring) and `instruct` (template string) paths, including any prompt-template versioning
- [ ] Audit failure paths for consistent `ERROR` span status mapping

## Why this matters for users

Each item is in-spec (OTel GenAI semconv) or de facto standard (OpenInference `llm.prompt_template.*`). None requires Mellea to deviate from its standards-focused approach. Collectively they unlock:

- **Transcript visibility** in any OTel backend (not just framework-specific UIs)
- **Conversation-level queries** across process and retry boundaries
- **Template-level cost and outcome aggregation**
- **Generic error-rate / alerting** on LLM failures
- **Retry and sampling observability** without bespoke parsing

## What the extra telemetry would unlock in tooling

Any OTel backend benefits from these attributes/events — Jaeger, Honeycomb, Grafana Tempo, Datadog, Arize Phoenix, Langfuse — but to make the debugging experience concrete, here's what a Mellea user would gain in [otelite](https://github.com/planetf1/otelite) (the local OTel receiver + LLM-focused dashboard where these gaps surfaced), using features that already ship today:

- **`gen_ai.*.message` events** → full chat transcripts inline on the trace, rendered as role-coloured chat bubbles without any backend adapter. Without these, users see token counts but can't answer "what did the model actually say?".
- **`gen_ai.conversation.id`** → one-click "show all signals (logs + traces + metrics) for this conversation" — the kind of filter that makes multi-turn debugging tractable.
- **`llm.prompt_template.template` / `.variables`** → group aggregate cost, latency, error rate, and truncation *by template* — "which prompt hits `max_tokens` most often?" becomes a single click.
- **Standard `ERROR` span status** → feeds generic per-model error-rate widgets and alerting rules; today a failed Mellea LLM call has to be found by hand.
- **`gen_ai.request.attempt` / `gen_ai.request.choice.count`** → retry-rate gauges and per-attempt cost breakdowns work without framework-specific parsing.

otelite (a personal-project local observability tool for LLM developers) ships with first-class support for all of these today — cost estimation against a Claude 4.x pricing table, cache-tier breakdown (5m vs 1h), finish-reason / truncation analytics, per-model latency percentiles (P50/P95/P99), tool-use success rates, server-side attribute filtering, and chat transcript rendering. It takes OTel GenAI attributes/events as they are — no adapter layer — which is why gaps in Mellea's current emission show up directly in its dashboard. Pointing Mellea at otelite is a one-liner (`pip install "mellea[telemetry]"` + `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317`) documented on its Setup tab, so the round-trip to verify any of the items above is short.

## Source of observations

Gaps identified while building [otelite](https://github.com/planetf1/otelite). otelite already handles each of these attributes/events when emitted; they just don't seem to be coming from Mellea today. The verification checklist above should be walked against Mellea's telemetry source to confirm.

