AgentV produces `output_messages` arrays containing tool calls, assistant responses, and timestamps during evaluation runs. This data is valuable for debugging and monitoring but currently stays within AgentV's result files.
Industry frameworks (LangWatch, Mastra, Google ADK, Azure SDK) have adopted OpenTelemetry semantic conventions for LLM observability. Langfuse is an open-source platform that accepts traces in a compatible format.
Stakeholders:
- AgentV users who need to debug agent behavior
- Teams integrating AgentV into existing LLMOps workflows
- Developers comparing agent configurations across runs
Goals:
- Export `output_messages` to Langfuse as structured traces
- Follow OpenTelemetry GenAI semantic conventions where applicable
- Provide opt-in content capture for privacy-sensitive environments
- Keep export logic decoupled from core evaluation flow
Non-Goals:
- Full OpenTelemetry SDK integration (deferred)
- Real-time streaming of traces during execution
- Bi-directional sync with Langfuse (import traces)
- Support for other observability platforms in this change (extensible design only)
What: Import the `langfuse` npm package and use its native trace/span API.
Why:
- Langfuse SDK handles authentication, batching, and flush automatically
- Avoids complexity of OTEL collector setup
- Direct mapping to Langfuse concepts (traces, generations, spans)
- Can add OTEL exporter later as separate capability
Alternatives considered:
- Full OTEL SDK + OTLP exporter: More portable but requires collector infrastructure
- Custom HTTP calls: Fragile, no batching, reinvents SDK features
Mapping:

| AgentV Concept | Langfuse Concept | Notes |
|---|---|---|
| Evaluation run | Trace | One trace per eval case |
| `eval_id` | `trace.name` | Identifies the test case |
| `target` | `trace.metadata.target` | Which provider was used |
| Assistant message with content | Generation | LLM response |
| Tool call | Span (type: "tool") | Individual tool invocation |
| `score` | Score | Attached to trace |
Langfuse Trace Structure:

```
Trace: eval_id="case-001"
├── Generation: "assistant response"
│   ├── input: [user messages]
│   ├── output: "response text"
│   └── usage: { input_tokens, output_tokens }
├── Span: tool="search" (type: tool)
│   ├── input: { query: "..." }
│   └── output: "results..."
├── Span: tool="read_file" (type: tool)
│   └── ...
└── Score: name="eval_score", value=0.85
```
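The tree above could be produced by first partitioning `output_messages` into generations and tool spans with a pure mapping function. The `OutputMessage` type below is an assumption about AgentV's message schema for illustration, not the actual definition:

```typescript
// Assumed shape of AgentV output messages; the real schema may differ.
type OutputMessage =
  | { role: 'user' | 'assistant'; content: string; usage?: { input_tokens: number; output_tokens: number } }
  | { role: 'tool'; tool: string; id: string; input: unknown; output: unknown };

interface MappedTrace {
  generations: { name: string; output: string; usage?: { input_tokens: number; output_tokens: number } }[];
  toolSpans: { name: string; type: 'tool' }[];
}

// Partition messages into the Langfuse-shaped pieces shown in the tree above.
function mapMessages(messages: OutputMessage[]): MappedTrace {
  const mapped: MappedTrace = { generations: [], toolSpans: [] };
  for (const msg of messages) {
    if (msg.role === 'assistant') {
      mapped.generations.push({ name: 'assistant response', output: msg.content, usage: msg.usage });
    } else if (msg.role === 'tool') {
      mapped.toolSpans.push({ name: msg.tool, type: 'tool' });
    }
  }
  return mapped;
}
```

Keeping the mapping pure makes it unit-testable without a Langfuse connection; the exporter then walks the mapped structure and emits SDK calls.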
Use `gen_ai.*`-prefixed attributes where applicable:

```typescript
// Generation attributes
'gen_ai.request.model': target.model,
'gen_ai.usage.input_tokens': usage?.input_tokens,
'gen_ai.usage.output_tokens': usage?.output_tokens,

// Tool span attributes
'gen_ai.tool.name': toolCall.tool,
'gen_ai.tool.call.id': toolCall.id,

// Trace metadata
'agentv.eval_id': evalCase.id,
'agentv.target': target.name,
'agentv.dataset': evalCase.dataset,
```

Default: Do not capture message content or tool inputs/outputs.
Opt-in: Set LANGFUSE_CAPTURE_CONTENT=true to include:
- User message content
- Assistant response content
- Tool call inputs and outputs
Rationale: Traces may contain PII, secrets, or proprietary data. Following Azure SDK and Google ADK patterns of opt-in content capture.
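One way to enforce this default is a single gate that every input and output passes through before being attached to a trace. A minimal sketch (function and placeholder names are illustrative, not AgentV's API):

```typescript
// Replace content with a placeholder unless capture is explicitly enabled.
function gateContent<T>(value: T, captureContent: boolean): T | '[redacted]' {
  return captureContent ? value : '[redacted]';
}

// Example: building a tool span's payload through the gate.
function toolSpanPayload(input: unknown, output: unknown, captureContent: boolean) {
  return {
    input: gateContent(input, captureContent),
    output: gateContent(output, captureContent),
  };
}
```

Routing all content through one gate keeps the opt-in check in a single place instead of scattering `if` checks across the exporter.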
Approach: Flush traces after each eval case completes (not batched across cases).
Why:
- Ensures traces are visible in Langfuse promptly
- Avoids data loss if process crashes
- Trade-off: Slightly higher network overhead (acceptable for eval workloads)
Configuration: No user-facing config in v1. Can add --langfuse-batch later if needed.
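The per-case flush could look like the loop below; `Exporter` and `runCase` are hypothetical stand-ins for the real interfaces:

```typescript
interface Exporter {
  export(result: unknown): Promise<void>;
  flush(): Promise<void>;
}

// Run cases sequentially, flushing after each one so a crash loses at most one trace.
async function runWithExport<C, R>(
  cases: C[],
  runCase: (c: C) => Promise<R>,
  exporter?: Exporter,
): Promise<R[]> {
  const results: R[] = [];
  for (const evalCase of cases) {
    const result = await runCase(evalCase);
    results.push(result);
    if (exporter) {
      await exporter.export(result);
      await exporter.flush(); // per-case flush, not batched across cases
    }
  }
  return results;
}
```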
```
┌─────────────────────────────────────────────────────────────────┐
│                           agentv run                            │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐   │
│  │   Provider   │───▶│ Orchestrator │───▶│ EvaluationResult │   │
│  │   Response   │    │              │    │ + outputMessages │   │
│  └──────────────┘    └──────────────┘    └────────┬─────────┘   │
│                                                   │             │
│                          ┌────────────────────────▼──────┐      │
│                          │       LangfuseExporter        │      │
│                          │    (if --langfuse enabled)    │      │
│                          └────────────────┬──────────────┘      │
│                                           │                     │
└───────────────────────────────────────────┼─────────────────────┘
                                            │
                                            ▼
                                   ┌─────────────────┐
                                   │    Langfuse     │
                                   │    Platform     │
                                   └─────────────────┘
```
```bash
# Enable Langfuse export
agentv run eval.yaml --langfuse

# With custom host (self-hosted Langfuse)
LANGFUSE_HOST=https://langfuse.mycompany.com agentv run eval.yaml --langfuse

# With content capture
LANGFUSE_CAPTURE_CONTENT=true agentv run eval.yaml --langfuse
```

| Variable | Required | Description |
|---|---|---|
| `LANGFUSE_PUBLIC_KEY` | Yes (if `--langfuse`) | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | Yes (if `--langfuse`) | Langfuse project secret key |
| `LANGFUSE_HOST` | No | Custom Langfuse host (default: cloud) |
| `LANGFUSE_CAPTURE_CONTENT` | No | Enable content capture (default: false) |
```typescript
import { LangfuseExporter } from '@agentv/core/observability';

const exporter = new LangfuseExporter({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  host: process.env.LANGFUSE_HOST,
  captureContent: process.env.LANGFUSE_CAPTURE_CONTENT === 'true',
});

// Export a single result
await exporter.export(evaluationResult, outputMessages);

// Flush pending traces
await exporter.flush();
```

| Risk | Mitigation |
|---|---|
| Langfuse SDK version churn | Pin to stable version, document upgrade path |
| Network failures during export | Log warning, don't fail evaluation; traces are optional |
| Large traces with many tool calls | Langfuse handles batching internally; monitor payload sizes |
| Content capture leaking secrets | Default to off; document clearly in CLI help |
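The "don't fail evaluation" mitigation amounts to a wrapper like the following sketch (the export callback and log function are assumptions, not the actual AgentV internals):

```typescript
// Export errors are downgraded to warnings; an evaluation never fails
// because of a trace-export failure.
async function safeExport(
  doExport: () => Promise<void>,
  warn: (msg: string) => void,
): Promise<boolean> {
  try {
    await doExport();
    return true;
  } catch (err) {
    warn(`Langfuse export failed (traces are optional): ${String(err)}`);
    return false;
  }
}
```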
No migration required. This is a new optional feature. Existing users are unaffected unless they enable --langfuse.
- Should we support `--langfuse-session-id` to group multiple eval runs? (Defer to user feedback)
- Should token usage be estimated if the provider doesn't return it? (Defer; not all providers report usage)
- Should we add a `--dry-run-langfuse` flag to preview traces without sending? (Nice to have, not v1)