feat: Add Anamnesis provider for claude-mem memory systems by gene-jelly · Pull Request #30 · supermemoryai/memorybench

gene-jelly · 2026-03-04T16:19:58Z

Summary

Adds a new anamnesis provider that benchmarks claude-mem memory systems — the observation-based memory layer used by Claude Code.

Ingest: Transforms benchmark conversations into claude-mem observations (raw text or LLM-extracted via Claude CLI)
Index: Embeds observations into ChromaDB via direct Python script (avoids version mismatches with chroma-mcp)
Search: Hybrid semantic (ChromaDB vectors) + keyword (SQLite LIKE) search with namespace isolation
Clear: Cleans both SQLite observations and ChromaDB embeddings
CLI model support: Added opus-cli, sonnet-cli, haiku-cli model aliases for answering via Claude CLI subprocess

Benchmark Results (LoCoMo)

Answering Model	Judge	Accuracy	Hit@10	MRR
GPT-4o (conservative prompt)	GPT-4o	40% (8/20)	85%	0.515
GPT-4o (assertive prompt)	GPT-4o	70% (14/20)	90%	0.515
GPT-4.1-mini	GPT-4.1-mini	70% (14/20)	100%	0.612
GPT-5.2	GPT-4.1-mini	80% (16/20)	100%	0.594
GPT-5.2	GPT-5.2	75% (15/20)	100%	0.594

By question type (GPT-5.2 / gpt-4.1-mini judge):

Single-hop: 87.5% (7/8)
Multi-hop: 80% (8/10)
Temporal: 50% (1/2)

Evaluation uses binary LLM-as-Judge scoring (stricter than token F1). Comparable methodology to Mem0's published LoCoMo results (66.9%).

Architecture

LoCoMo conversations → SQLite observations → ChromaDB embeddings
                                                    ↓
                        Search query → ChromaDB semantic + SQLite keyword → RRF-style merge

Key design decisions

Direct ChromaDB access via Python scripts (embed.py, search.py) instead of HTTP API calls, since claude-mem's worker doesn't expose an embedding endpoint
CHROMA_PYTHON env var lets users point to the same Python environment as their chroma-mcp process (important for ChromaDB version compatibility)
Namespace isolation: Each benchmark question gets its own namespace, preventing cross-contamination between questions and with the user's real observations
Extraction mode (ANAMNESIS_EXTRACTION=true): Uses Claude CLI in print mode to extract structured observations, matching how claude-mem processes conversations in production
Assertive answer prompt: Targeted instructions for complete extraction, temporal date conversion, and counterfactual reasoning

Environment variables

Variable	Default	Description
`ANAMNESIS_WORKER_URL`	`http://localhost:37777`	claude-mem worker URL
`ANAMNESIS_DB`	`~/.claude-mem/claude-mem.db`	SQLite database path
`ANAMNESIS_EXTRACTION`	`false`	Enable LLM extraction mode
`CHROMA_PYTHON`	`python3`	Python with matching ChromaDB version
`CHROMA_PATH`	`~/.claude-mem/vector-db`	ChromaDB persistent storage path

Files

src/providers/anamnesis/index.ts — Provider implementation (612 lines)
src/providers/anamnesis/prompts.ts — Custom answer prompts optimized for observation format
src/providers/anamnesis/embed.py — ChromaDB embedding script
src/providers/anamnesis/search.py — ChromaDB semantic search script
src/types/provider.ts — Added "anamnesis" to ProviderName
src/providers/index.ts — Registered AnamnesisProvider
src/utils/config.ts — Added anamnesis config
src/utils/models.ts — Added CLI model aliases + GPT-5.2 config
src/orchestrator/phases/answer.ts — CLI subprocess support + 10min timeout

Test plan

20-question LoCoMo run: 80% accuracy (GPT-5.2), 100% Hit@10
Prompt iteration: 40% → 70% → 80% through targeted prompt engineering
Multi-model comparison: GPT-4o, GPT-4.1-mini, GPT-5.2
Multi-judge comparison: gpt-4.1-mini vs gpt-5.2 (95% agreement)
Embedding verified: all documents embedded successfully
Hybrid search verified: semantic + keyword fallback working
Full 50-question LoCoMo run (in progress)

🤖 Generated with Claude Code

- Add AnamnesisProvider for testing claude-mem memory system - Include session date prefix in observation narratives (e.g., [Conversation Date: 8 May, 2023]) - This enables resolving relative temporal references ("yesterday" → specific date) - Multi-hop temporal questions now pass (q0: 0.11 MRR → 1.00 MRR) - Batched indexing for efficient ChromaDB embedding 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Matches production claude-mem observation structure: - XML output format with <facts> array for discrete details - Explicit emphasis on preserving EXACT DATES and temporal info - Subtitle, concepts, and structured facts fields - 100% accuracy on 3-question temporal test (vs 66.67% with old JSON format) The key insight: dates embedded in narrative get lost during summarization, but dates as discrete facts in an array remain searchable. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Filter semantic search to memorybench project to avoid cross-project result pollution. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Pass containerTag as namespace to isolate each benchmark question's observations, preventing cross-contamination in semantic search results. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Infrastructure to run MemoryBench entirely via Claude CLI subprocesses, eliminating API key requirements for Anthropic: - Add "cli" provider type to ModelConfig with sonnet-cli, haiku-cli, opus-cli aliases - Create CliJudge class using subprocess for evaluation - Add generateTextViaCli() helper in answer phase - Implement parallel extraction with 5-way concurrency - Fix budget limits ($0.05 → $1.00) to account for CLI overhead - Add manual timeout handler (spawn timeout doesn't kill process) Note: Benchmark runs still hang at extraction phase - root cause undiagnosed. This commit preserves the infrastructure for future debugging. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The anamnesis provider's awaitIndexing called /api/sync/observations which never existed on the worker. Search went through the worker API which couldn't filter by namespace. Both now use Python scripts that call ChromaDB directly (using the same Python env as chroma-mcp for version compatibility). - embed.py: Reads observations from SQLite, upserts into ChromaDB - search.py: Semantic vector search with namespace filtering - index.ts: Updated awaitIndexing, search, and clear methods - clear() now removes embeddings from ChromaDB alongside SQLite Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Replace machine-specific uv Python path with CHROMA_PYTHON env var - Remove personal name reference from provider docstring - Falls back to system python3 when CHROMA_PYTHON is not set Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merges upstream changes (filesystem/rag providers, ConcurrentExecutor, memorybench skill) with our anamnesis + CLI provider additions. - Adopt upstream's ConcurrentExecutor for indexing phase (replaces our batch hack) - Add all provider types to union: anamnesis + filesystem + rag - Fix CliJudge.getModel() return type annotation - Fix spawn args in anamnesis clear() — pass as positional, not options - Fix remaining hardcoded CHROMA_PYTHON path to use env var Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sentry · 2026-03-04T16:31:12Z

src/providers/anamnesis/index.ts

+
+        // Get IDs before deleting (needed for ChromaDB cleanup)
+        const ids = db.query(
+            `SELECT id FROM observations WHERE namespace = ? OR project = 'memorybench'`
+        ).all(containerTag) as Array<{ id: number }>
+
+        // Delete from SQLite
+        const result = db.run(
+            `DELETE FROM observations WHERE namespace = ? OR project = 'memorybench'`,
+            [containerTag]
+        )


Bug: The clear() method's SQL query uses OR project = 'memorybench', which will cause all benchmark data to be deleted, not just data for the specified namespace.
_{Severity: HIGH}

Suggested Fix

Remove the OR project = 'memorybench' condition from the DELETE statement in the clear() method. The query should only use WHERE namespace = ? to ensure data deletion is isolated to the specified container.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/providers/anamnesis/index.ts#L578-L590 Potential issue: The `clear()` method in the `anamnesis` provider uses the SQL condition `WHERE namespace = ? OR project = 'memorybench'`. Since all benchmark observations are stored with `project = 'memorybench'`, invoking this method for any container will delete not only that container's data but all observations from all containers. While the method is not currently called in the main execution flow, its existence poses a significant risk of accidental mass data deletion if used by future code or external cleanup scripts.

_{Did we get this right? 👍 / 👎 to inform future reviews.}

src/orchestrator/phases/answer.ts

+async function generateTextViaCli(prompt: string, modelAlias: string): Promise<string> {
+  return new Promise((resolve, reject) => {
+    const claude = spawn("claude", [
+      "-p", prompt,
+      "--output-format", "json",
+      "--model", modelAlias,
+      "--max-budget-usd", "1.00",
+    ], {
+      timeout: 180000,
+      cwd: process.cwd(),


The original prompt told the answering model to say "I don't have enough information" when uncertain, causing 67% of failures. The new prompt instructs the model to extract ALL available information and only refuse when observations are completely irrelevant. Benchmark result: 70% accuracy on LoCoMo 20-question sample (was 40%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Enhanced answer prompt with targeted instructions for complete extraction, temporal date conversion, and counterfactual reasoning (80% accuracy on LoCoMo 20q with GPT-5.2, up from 40% with conservative prompt) - Added GPT-5.2 model config (reasoning model, no temperature) - Increased CLI subprocess timeout from 3min to 10min for larger models Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sentry · 2026-03-05T17:15:05Z

src/providers/anamnesis/embed.py

+
+    # Connect to ChromaDB
+    client = chromadb.PersistentClient(path=VECTOR_PATH)
+    col = client.get_collection(COLLECTION)


Bug: The use of get_collection() can raise a ValueError if the ChromaDB collection is missing, causing semantic search to fail silently and fall back to keyword search.
_{Severity: MEDIUM}

Suggested Fix

Replace the call to client.get_collection(COLLECTION) with client.get_or_create_collection(COLLECTION). This ensures the collection is created if it does not already exist, preventing the ValueError and ensuring semantic search functionality is robust.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/providers/anamnesis/embed.py#L44 Potential issue: The code calls `client.get_collection(COLLECTION)` in `embed.py` and `search.py`, which will raise a `ValueError` if the ChromaDB collection does not exist. This can occur in scenarios like a fresh installation or a corrupted state. The current error handling catches this exception but leads to silent failures. In the embedding phase, the error is logged as a warning, and embeddings are skipped. In the search phase, it falls back to a keyword-only search. This results in a silent degradation of the semantic search functionality, potentially impacting benchmark accuracy without clear indication to the user.

Increase search results from 10 to 19 (full session coverage) and improve answer prompt with stronger extraction, date precision, and exact-term matching instructions. Results on 50-question LoCoMo benchmark (conv-26): v1 (search=10): 74.0% (37/50) v2 (search=15): 82.0% (41/50) v3 (search=19): 86.0% (43/50) By question type: multi-hop: 87.5% (21/24) temporal: 100.0% (7/7) single-hop: 78.9% (15/19) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sentry · 2026-03-05T20:14:29Z

src/providers/anamnesis/search.py

+    results = col.query(
+        query_texts=[query],
+        n_results=min(limit * 3, 100),  # Overfetch to account for filtering
+        where={"namespace": namespace},
+        include=["documents", "distances", "metadatas"],
+    )


Bug: The semantic search script can crash when requesting more results from ChromaDB than are available, causing a silent fallback to a less accurate keyword-only search.
_{Severity: HIGH}

Suggested Fix

In search.py, wrap the ChromaDB query in a try-except block to handle the NotEnoughElementsException. Alternatively, before querying, determine the number of available documents and request min(limit * 3, num_available_documents).

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/providers/anamnesis/search.py#L28-L33 Potential issue: The `search.py` script requests a fixed multiple of the search limit (`limit * 3`, which is 57) from ChromaDB. However, ChromaDB can raise a `NotEnoughElementsException` if the number of available documents in the filtered namespace is less than the requested amount. This is a common scenario as each question has its own namespace with a relatively small number of documents. The unhandled exception causes the script to crash. The calling process catches this failure and silently falls back to a keyword-only search, significantly degrading search quality without any indication to the user or developer.

Add BLIP image captions in concise [shared image: ...] format and improve answer prompt with: earliest event matching, explicit date resolution, image description term extraction. Ingest/indexing/search phases complete for 50q run (gpt52-50q-v5-mar5) but answer/evaluate blocked by OpenAI quota. Resume with same run ID. Results so far: - v1 (search=10, baseline prompt): 74% (37/50) - v2 (search=15, +rescan/dates): 82% (41/50) - v3 (search=19, +exact terms): 86% (43/50) ← current best - v4 (search=19, +targeted extraction): 86% (different dist, reverted) - captions (full BLIP captions): 82% (41/50, captions hurt) - v5 (selective captions + prompt v5): PENDING Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gene-jelly and others added 8 commits December 29, 2025 15:29

fix: Add project filter to anamnesis provider search

1edf7a0

Filter semantic search to memorybench project to avoid cross-project result pollution. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

sentry bot reviewed Mar 4, 2026

View reviewed changes

gene-jelly and others added 2 commits March 4, 2026 12:12

sentry bot reviewed Mar 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add Anamnesis provider for claude-mem memory systems#30

feat: Add Anamnesis provider for claude-mem memory systems#30
gene-jelly wants to merge 12 commits intosupermemoryai:mainfrom
gene-jelly:feat/anamnesis-provider

gene-jelly commented Mar 4, 2026 •

edited

Loading

Uh oh!

sentry bot Mar 4, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

sentry bot Mar 5, 2026

Uh oh!

sentry bot Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gene-jelly commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark Results (LoCoMo)

Architecture

Key design decisions

Environment variables

Files

Test plan

Uh oh!

sentry bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

sentry bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

sentry bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gene-jelly commented Mar 4, 2026 •

edited

Loading