Integrate Langfuse observability into VQA evaluation and update tracing replacing Opik as default by aravind-3105 · Pull Request #34 · VectorInstitute/interpretability-llms-agents

aravind-3105 · 2026-03-18T21:13:29Z

Summary

This pull request migrates the observability and tracing integration in the Agentic VQA Evaluation framework from Opik to Langfuse. Documentation, code references, environment variables, and integration logic now use Langfuse as the LLM observability platform. The migration covers user instructions, dependency management, file structure, and all agent classes.

Clickup Ticket(s): Link(s) if applicable.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test improvements
🔒 Security fix

Changes Made

Migration from Opik to Langfuse:

All references to Opik in the documentation (README.md) have been replaced with Langfuse, including setup instructions, feature descriptions, and environment variables.
The observability integration folder has been renamed from opik_integration to langfuse_integration, and all code references in agent classes (planner_agent.py, vision_agent.py, verifier_agent.py) have been updated accordingly.
Function and argument names in agent classes have been changed from opik_trace to lf_trace to reflect the new integration, ensuring all tracing and span operations now use Langfuse.
Environment variable setup and usage have been updated: LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY now control observability, replacing the previous Opik variables. Documentation and code reflect that Langfuse is optional and the framework runs unchanged if these variables are not set.
Example MEP artifacts and references in the documentation now use lf_trace_id instead of opik_trace_id.
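The optional-observability behaviour described above can be pictured as a lazily created client that exists only when both keys are set. This is a minimal sketch, not the PR's `langfuse_integration/client.py`. The names `langfuse_enabled` and `get_client` are hypothetical, and the sketch assumes the Langfuse SDK's `Langfuse()` constructor reads `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` from the environment when called without arguments.

```python
import os
from functools import lru_cache
from typing import Any, Optional

# Both keys must be present for observability to switch on.
REQUIRED_ENV_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def langfuse_enabled() -> bool:
    """Return True only when every required Langfuse key is set and non-empty."""
    return all(os.environ.get(name) for name in REQUIRED_ENV_VARS)


@lru_cache(maxsize=1)
def get_client() -> Optional[Any]:
    """Return one shared Langfuse client, or None when observability is off.

    The result (including None) is cached, so keys set after the first call
    are ignored until get_client.cache_clear() is called.
    """
    if not langfuse_enabled():
        return None
    # Imported lazily so the pipeline still runs without the langfuse package.
    from langfuse import Langfuse

    # Assumption: with no arguments, Langfuse() reads its keys (and optional
    # host) from the environment.
    return Langfuse()
```

Call sites can then branch on `client is None`. Because that path never touches the SDK, the pipeline runs unchanged when the keys are absent.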

Dependency and setup updates:

The dependency group for installation has been renamed from agentic-xai-eval to ref6-agentic-xai-eval in the setup instructions.
The dependency table now lists langfuse instead of opik.
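A later commit in this PR (335b468) makes the integration require Langfuse version 4. One way to guard that at startup is to check the installed distribution's major version using the standard library. The helper names below are hypothetical; this is a sketch, not the PR's code.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_major_version(version: Optional[str], required_major: int = 4) -> bool:
    """True when `version` (e.g. "4.0.1") has exactly the required major number."""
    if version is None:
        return False
    major = version.split(".", 1)[0]
    return major.isdigit() and int(major) == required_major


if __name__ == "__main__":
    found = installed_version("langfuse")
    if not has_major_version(found):
        print(f"Expected langfuse 4.x, found {found!r}")
```

An exact major-version match is stricter than a `>=` check. The stricter check fails loudly on a future major release instead of silently running against an API the wrappers were not written for.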

With these changes, Langfuse becomes the framework's LLM observability backend for tracing and experiment comparison.

Testing

Tests pass locally (uv run pytest tests/)
Type checking passes (uv run mypy <src_dir>)
Linting passes (uv run ruff check <src_dir>)
Manual testing performed (describe below)

Manual testing details:

Screenshots/Recordings

Related Issues

Deployment Notes

Checklist

Code follows the project's style guidelines
Self-review of code completed
Documentation updated (if applicable)
No sensitive information (API keys, credentials) exposed


Copilot

Pull request overview

This PR migrates the Agentic ChartQAPro VQA evaluation framework’s observability layer from Opik to Langfuse, updating tracing, score logging, retroactive ingestion, and documentation/notebooks to reflect the new integration.

Changes:

Replaced Opik tracing + clients with Langfuse equivalents across pipeline runners, agents, tools, and evaluation passes.
Added Langfuse integration modules (client, tracing wrappers, ingestion) and removed the Opik integration package.
Updated docs and notebooks to reference Langfuse setup, environment variables, and lf_trace_id in MEPs.
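The agents now take an `lf_trace` argument, and tracing must remain optional. The sketch below shows that pattern as a context manager that opens a child span when a trace handle is passed and yields `None` otherwise. `optional_span` and `PlannerAgent.plan` are hypothetical stand-ins, not the PR's `open_llm_span`/`close_span` wrappers. The `start_span(name=..., input=...)`, `update(output=...)` and `end()` calls assume a Langfuse v3-style span interface.

```python
from contextlib import contextmanager
from typing import Any, Iterator, Optional


@contextmanager
def optional_span(lf_trace: Optional[Any], name: str, **inputs: Any) -> Iterator[Optional[Any]]:
    """Yield a child span of `lf_trace`, or None when tracing is disabled."""
    if lf_trace is None:
        yield None
        return
    # Assumption: the trace handle exposes start_span(name=..., input=...).
    span = lf_trace.start_span(name=name, input=inputs or None)
    try:
        yield span
    finally:
        span.end()  # always close the span, even if the agent raises


class PlannerAgent:
    """Hypothetical agent showing the `lf_trace=None` keyword convention."""

    def plan(self, question: str, lf_trace: Optional[Any] = None) -> str:
        with optional_span(lf_trace, "planner.plan", question=question) as span:
            plan = f"Locate the chart values needed to answer: {question}"
            if span is not None:
                span.update(output=plan)
            return plan
```

Keeping the `None` check inside the helper means each agent method needs only one `with` statement. The agent does not need to know whether Langfuse is configured.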

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| `uv.lock` | Removes Opik and its transitive deps; locks Langfuse and related deps. |
| `pyproject.toml` | Removes Opik from the `agentic-xai-eval` dependency group. |
| `implementations/mechanistic_interpretability/README.md` | Trims trailing whitespace. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/tools/vision_qa_tool.py` | Switches tracing import + tool trace handle field to Langfuse. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/tools/ocr_reader_tool.py` | Switches tracing import + tool trace handle field to Langfuse. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/runner/run_generate_meps.py` | Uses Langfuse client/dataset/prompts/tracing; records `lf_trace_id`; updates trace output. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/opik_integration/tracing.py` | Deleted (replaced by Langfuse tracing wrappers). |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/opik_integration/ingest.py` | Deleted (replaced by Langfuse ingestion). |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/opik_integration/client.py` | Deleted (replaced by Langfuse client). |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/opik_integration/__init__.py` | Deleted (package removed). |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/mep/schema.py` | Renames `opik_trace_id` to `lf_trace_id` in the MEP schema. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/tracing.py` | Adds Langfuse v4 tracing wrappers (`sample_trace`, `open_llm_span`, `close_span`, scoring). |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/prompts.py` | Updates prompt management from Opik to Langfuse APIs + CLI instructions. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/ingest.py` | Adds retroactive ingestion from MEP JSON into Langfuse traces + scoring. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/dataset.py` | Updates dataset registration to Langfuse dataset/items APIs. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/client.py` | Adds Langfuse client singleton keyed off `LANGFUSE_PUBLIC_KEY/SECRET_KEY`. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/__init__.py` | Adds Langfuse integration package marker. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/eval/eval_topk.py` | Logs top-k hit metrics back to Langfuse trace scores. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/eval/eval_outputs.py` | Logs accuracy/judge/latency scores back to Langfuse trace scores. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/eval/error_taxonomy.py` | Logs taxonomy failure category scores back to Langfuse trace scores. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/eval/dashboard.py` | Updates default MEP directory path in Streamlit UI. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/agents/vision_agent.py` | Renames trace parameter to `lf_trace` and swaps tracing imports to Langfuse. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/agents/verifier_agent.py` | Renames trace parameter to `lf_trace` and swaps tracing imports to Langfuse. |
| `implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/agents/planner_agent.py` | Renames trace parameter to `lf_trace` and swaps tracing imports to Langfuse. |
| `implementations/agentic_vqa_eval/run_pipeline.ipynb` | Updates “health check” and pipeline cells to use Langfuse client + env vars + `lf_trace_id`. |
| `implementations/agentic_vqa_eval/analysis.ipynb` | Makes prerequisite commands robust to execution from any repo directory; fixes empty-failure edge case. |
| `implementations/agentic_vqa_eval/README.md` | Rewrites observability docs from Opik → Langfuse and updates example commands/IDs. |
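The eval passes listed above attach scores to the Langfuse trace recorded in each MEP's `lf_trace_id`. Below is a minimal sketch of that flow. `log_accuracy_scores`, `exact_match`, and the `prediction`/`ground_truth` fields are hypothetical (the real fields are defined in `mep/schema.py`). The `create_score(name=..., value=..., trace_id=...)` call assumes a Langfuse v3-style client.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the answers match after trimming and lower-casing, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def log_accuracy_scores(client: Any, mep_paths: Iterable[Path]) -> int:
    """Post one accuracy score per MEP that carries an lf_trace_id; return the count."""
    logged = 0
    for path in mep_paths:
        mep = json.loads(Path(path).read_text(encoding="utf-8"))
        trace_id = mep.get("lf_trace_id")
        if not trace_id:  # MEP produced without Langfuse: no trace to score
            continue
        # Hypothetical field names; the real MEP schema lives in mep/schema.py.
        value = exact_match(mep["prediction"], mep["ground_truth"])
        # Assumption: v3-style create_score(name=..., value=..., trace_id=...).
        client.create_score(name="accuracy", value=value, trace_id=trace_id)
        logged += 1
    return logged
```

Skipping MEPs whose `lf_trace_id` is empty lets the same eval pass run over outputs produced with and without Langfuse enabled.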

Comments suppressed due to low confidence (2)

implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/dataset.py:50

The list comprehension here is used only for side effects (creating dataset items) and builds an unnecessary list. Prefer a simple for loop so failures can be handled/continued per item and to avoid extra memory usage.
implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/dataset.py:46
register_dataset() always calls client.create_dataset(name=...). If the dataset already exists (e.g., you re-run an experiment with the same split), this will likely raise and prevent runs from starting. Consider making this idempotent by checking for existence first or by catching/ignoring the specific "already exists" error.
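Both suppressed comments on `dataset.py` can be addressed together. The approach is to check for the dataset before creating it, then add items in a plain loop that logs and skips failures. This is a sketch under assumptions. `client.get_dataset(name)` is assumed to raise when the dataset is missing. `create_dataset(name=...)` and `create_dataset_item(dataset_name=..., input=..., expected_output=...)` are modelled on the Langfuse Python SDK. The item dict layout is hypothetical.

```python
import logging
from typing import Any, Iterable, Mapping

logger = logging.getLogger(__name__)


def register_dataset(client: Any, name: str, items: Iterable[Mapping[str, Any]]) -> int:
    """Create dataset `name` only if missing, then add items; return how many succeeded."""
    try:
        client.get_dataset(name)  # assumption: raises when the dataset is absent
    except Exception:  # narrow to the SDK's not-found error in real code
        client.create_dataset(name=name)

    added = 0
    for item in items:  # plain loop instead of a side-effect list comprehension
        try:
            client.create_dataset_item(
                dataset_name=name,
                input=item["input"],
                expected_output=item.get("expected_output"),
            )
            added += 1
        except Exception as exc:
            logger.warning("Skipping dataset item %r: %s", item.get("id"), exc)
    return added
```

This sketch makes only dataset creation idempotent. Re-running the same split still appends its items again.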


implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/tools/ocr_reader_tool.py

implementations/agentic_vqa_eval/README.md

implementations/agentic_vqa_eval/src/agentic_chartqapro_eval/langfuse_integration/tracing.py


aravind-3105 · 2026-03-19T21:48:46Z

Added the missing instrumentation for the model providers (Google GenAI and OpenAI) so that detailed traces are captured, based on @fcogidi's feedback.

Integrate Langfuse observability into agentic VQA evaluation, replacing Opik references and updating tracing mechanisms

d3ea391

aravind-3105 self-assigned this Mar 18, 2026

aravind-3105 added the enhancement label (New feature or request) Mar 18, 2026

aravind-3105 added 3 commits March 18, 2026 18:36

Refactor observability integration: remove Opik references, streamline Langfuse integration, and tidy up code formatting

baaac24

Update README and notebooks from opik to langfuse.

1df04a8

Update MEP directory path in dashboard.py for consistency with new structure

40d7e95

aravind-3105 requested a review from Copilot March 19, 2026 13:55

Copilot AI reviewed Mar 19, 2026


aravind-3105 added 3 commits March 19, 2026 11:11

Rename Opik references to Langfuse in agentic VQA evaluation agents and update tracing parameters for observability.

4928ce4

Add integration test instructions to README for API key validation

01f6051

Update Langfuse integration to require version 4 and add fallback for missing attributes propagation

335b468

aravind-3105 force-pushed the langchain-modification branch from ffab700 to 335b468 March 19, 2026 15:43

aravind-3105 added 2 commits March 19, 2026 12:33

Fix integration test command path in README for consistency

1035709

Add Google GenAI and OpenAI instrumentation support in Langfuse integration

926c088

aravind-3105 force-pushed the langchain-modification branch from a860acf to 926c088 March 19, 2026 21:45

aravind-3105 merged commit 74423d6 into main Mar 19, 2026
2 checks passed


aravind-3105 merged 9 commits into main from langchain-modification

2 participants
