This repository contains:
- Industry Benchmarks: LongMemEval and LoComo benchmark results for the Hindsight memory system
- Model Leaderboard: Comparative performance metrics for LLMs on fact extraction tasks
- Visualization Tools: Interactive web interface to explore results
hindsight-benchmarks/
├── benchmark-runner/ # Python CLI tools for benchmarking
│ ├── src/hindsight_benchmark/ # Fast benchmark (speed/cost/reliability)
│ ├── quality_benchmark/ # Quality benchmark (accuracy via Hindsight)
│ │ ├── run_quality_benchmark.py
│ │ ├── locomo_quality.json
│ │ └── README.md
│ ├── datasets/
│ ├── benchmark_models.json
│ └── pyproject.toml
├── visualizer/ # Next.js web application
│ ├── app/
│ │ ├── page.tsx # Home page with both sections
│ │ ├── longmemeval/ # Industry benchmark pages
│ │ ├── locomo/ # Industry benchmark pages
│ │ └── leaderboard/ # Model leaderboard pages
│ ├── components/
│ └── lib/
└── results/ # Shared results directory
├── longmemeval.json.gz # Industry benchmark
├── locomo.json.gz # Industry benchmark
├── model-results/ # Fast benchmark results
└── quality/ # Quality benchmark results
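The longmemeval.json.gz and locomo.json.gz files are gzipped JSON and can be inspected directly with the Python standard library. Here is a minimal sketch (the internal JSON schema is not documented here, so it only prints the top-level keys; run it from the repository root):

```python
# Minimal sketch for peeking at the shared result files; the JSON structure
# inside is not documented here, so we only print the top-level keys.
import gzip
import json
from pathlib import Path

def load_gzipped_results(name: str, results_dir: Path = Path("results")) -> object:
    """Load a gzipped JSON results file such as results/longmemeval.json.gz."""
    with gzip.open(results_dir / f"{name}.json.gz", "rt", encoding="utf-8") as f:
        return json.load(f)

data = load_gzipped_results("longmemeval")
print(list(data.keys()) if isinstance(data, dict) else type(data))
```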
cd visualizer
npm install
npm run dev
# Open http://localhost:9998
cd benchmark-runner
uv run hindsight-benchmark --dataset simple
# Results saved to ../results/model-results/
The Model Leaderboard compares LLM performance using two complementary benchmarks:
The Fast Benchmark tests models directly for operational metrics:
- Speed (25%): Response latency and throughput
- Cost (20%): Pricing per million tokens
- Reliability (15%): Schema conformance rate
The Quality Benchmark tests model performance within Hindsight using LoComo conversations:
- Quality (40%): Answer accuracy on conversation recall tasks
- Measures real-world memory system performance
- Runs through the Hindsight API to test the model in context
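Together the four weights sum to 100%. As a rough sketch of how they could combine into a single leaderboard score (a hypothetical helper, not the repository's actual scoring code, which may normalize each dimension differently):

```python
# Hypothetical sketch of combining the four weighted dimensions into one
# leaderboard score; the real scoring lives in benchmark-runner and may
# normalize or penalize differently.
WEIGHTS = {"quality": 0.40, "speed": 0.25, "cost": 0.20, "reliability": 0.15}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each already normalized to 0-100."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Example: strong quality and reliability, middling speed and cost efficiency.
print(composite_score({"quality": 85.0, "speed": 70.0, "cost": 60.0, "reliability": 95.0}))
# -> 77.75
```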
In the Fast Benchmark, models are tested on 20 diverse conversation scenarios. Each test requires:
- Extracting structured facts from a conversation
- Returning results in a specific JSON schema format
- Producing valid JSON output with the correct schema structure
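For intuition only, a schema-conformant response might look like the sketch below; the field names are assumptions, not the benchmark's actual schema, which is defined in benchmark-runner:

```python
# Illustrative only: the field names are assumptions, not the benchmark's
# real schema. The point is the reliability check itself - the raw output
# must parse as JSON and expose the expected top-level structure.
import json

raw_model_output = """
{
  "facts": [
    {"subject": "Alice", "predicate": "adopted", "object": "a golden retriever"},
    {"subject": "Alice", "predicate": "lives_in", "object": "Portland"}
  ]
}
"""

parsed = json.loads(raw_model_output)        # raises -> output is not valid JSON
assert isinstance(parsed.get("facts"), list)  # fails -> wrong schema structure
print(f"Extracted {len(parsed['facts'])} facts")
```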
Running:
cd benchmark-runner
uv run hindsight-benchmark --dataset simple
# Results saved to ../results/model-results/
Model configurations are defined in benchmark-runner/benchmark_models.json.
The Quality Benchmark measures accuracy by running a LoComo conversation through Hindsight with the model under test configured.
Prerequisites:
- Hindsight API running with the model to test
- hindsight-client Python package installed
- OpenAI API key for LLM judge (recommended)
Running:
cd benchmark-runner/quality_benchmark
# Start Hindsight with your model
cd /path/to/hindsight-wt1
export HINDSIGHT_API_LLM_MODEL=gpt-4o-mini
python -m hindsight_api
# Run quality benchmark
cd /path/to/hindsight-benchmarks/benchmark-runner/quality_benchmark
python run_quality_benchmark.py \
--api-url http://localhost:8888 \
--model-id gpt-4o-mini \
--provider-id openai \
--judge-api-key $OPENAI_API_KEY
See benchmark-runner/quality_benchmark/README.md for detailed instructions.
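For intuition, LLM-as-judge grading works roughly along the lines below; the prompt and pass/fail rule are simplified assumptions, not the actual logic in run_quality_benchmark.py:

```python
# Simplified illustration of LLM-as-judge grading. The real prompt and scoring
# logic live in run_quality_benchmark.py; this sketch only conveys the idea.
# Requires the openai package and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, expected: str, actual: str) -> bool:
    """Ask a judge model whether the system's answer matches the ground truth."""
    prompt = (
        "You are grading a memory system's answer.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"System answer: {actual}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")

print(judge_answer("Where did Alice move?", "Portland", "She moved to Portland."))
```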
Navigate to /leaderboard in the visualizer to see:
- Interactive sortable table of all models
- Score breakdowns by dimension
- Non-viable models section
- Detailed metrics for each model
Explore the results yourself on the Benchmarks Visualizer
LongMemEval is a comprehensive benchmark designed to evaluate long-term memory capabilities in conversational AI systems. It tests the system's ability to retrieve and reason about information across multiple conversation sessions.
Explore the Dataset: You can explore the LongMemEval dataset using the LongMemEval Inspector.
The table below shows performance across different memory systems on the LongMemEval benchmark (S setting, 500 questions):
| Method | Single-session User | Single-session Assistant | Single-session Preference | Knowledge Update | Temporal Reasoning | Multi-session | Overall |
|---|---|---|---|---|---|---|---|
| Full-context (GPT-4o) | 81.4% | 94.6% | 20.0% | 78.2% | 45.1% | 44.3% | 60.2% |
| Full-context (OSS-20B) | 38.6% | 80.4% | 20.0% | 60.3% | 31.6% | 21.1% | 39.0% |
| Zep (GPT-4o) | 92.9% | 80.4% | 56.7% | 83.3% | 62.4% | 57.9% | 71.2% |
| Supermemory (GPT-4o) | 97.1% | 96.4% | 70.0% | 88.5% | 76.7% | 71.4% | 81.6% |
| Supermemory (GPT-5) | 97.1% | 100.0% | 76.7% | 87.2% | 81.2% | 75.2% | 84.6% |
| Supermemory (Gemini-3) | 98.6% | 98.2% | 70.0% | 89.7% | 82.0% | 76.7% | 85.2% |
| Hindsight (OSS-20B) | 95.7% | 94.6% | 66.7% | 84.6% | 79.7% | 79.7% | 83.6% |
| Hindsight (OSS-120B) | 100.0% | 98.2% | 86.7% | 92.3% | 85.7% | 81.2% | 89.0% |
| Hindsight (Gemini-3) | 97.1% | 96.4% | 80.0% | 94.9% | 91.0% | 87.2% | 91.4% |
Key Highlights:
- Hindsight with Gemini-3 Pro achieves 91.4% overall accuracy, the best result across all systems and model backbones
- Hindsight with OSS-120B achieves 89.0%, outperforming Supermemory with GPT-4o (81.6%) and GPT-5 (84.6%)
- +44.6 percentage point improvement: Hindsight with OSS-20B (83.6%) vs Full-context OSS-20B baseline (39.0%) demonstrates that the memory architecture, not model size, drives performance
- The largest gains appear in long-horizon categories: multi-session improves from 21.1% to 79.7%, temporal reasoning from 31.6% to 79.7%
- Even with a smaller open-source 20B model, Hindsight (83.6%) surpasses both Full-context GPT-4o (60.2%) and Supermemory with GPT-4o (81.6%)
Cost Efficiency: Exceptionally low costs are achieved through token reduction techniques in the Retain pipeline and LLM-free memory recall - retrieving memories incurs zero LLM cost, enabling unlimited recall operations in production.
Infrastructure: Local MacBook with PostgreSQL - no specialized cloud infrastructure required
To reproduce these results, visit the main Hindsight repository:
github.com/vectorize-io/hindsight
Follow the benchmark instructions in the repository documentation.
LoComo (Long Conversation Memory) is a benchmark designed to test memory systems on long, multi-turn conversations with questions requiring recall of specific details from earlier in the dialogue.
The table below shows accuracy (%) by question type and overall for prior memory systems and Hindsight with different backbone models:
| Method | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|---|---|---|---|---|---|
| Backboard | 89.36 | 75.00 | 91.20 | 91.90 | 90.00 |
| Memobase (v0.0.37) | 70.92 | 46.88 | 77.17 | 85.05 | 75.78 |
| Zep | 74.11 | 66.04 | 67.71 | 79.79 | 75.14 |
| Mem0-Graph | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |
| Mem0 | 67.13 | 51.15 | 72.93 | 55.51 | 66.88 |
| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
| Hindsight (OSS-20B) | 74.11 | 64.58 | 90.96 | 76.32 | 83.18 |
| Hindsight (OSS-120B) | 76.79 | 62.50 | 93.68 | 79.44 | 85.67 |
| Hindsight (Gemini-3) | 86.17 | 70.83 | 95.12 | 83.80 | 89.61 |
Key Highlights:
- Across all backbone sizes, Hindsight consistently outperforms prior open memory systems such as Memobase, Zep, Mem0, and LangMem
- Hindsight raises overall accuracy from 75.78% (Memobase) to 83.18% with OSS-20B and 85.67% with OSS-120B
- Hindsight with Gemini-3 attains 89.61% overall accuracy and the highest Open Domain score (95.12%), closely matching Backboard's 90.00% overall performance
- These results demonstrate that the gains from Hindsight's memory architecture on LongMemEval transfer to realistic, multi-session human conversations
Note: We skipped the Adversarial category as it is almost impossible to evaluate reliably due to the subjective and ambiguous nature of the questions in that category.
While Hindsight achieves solid performance on LoComo, we do not consider this benchmark to be a reliable indicator of memory system quality due to significant flaws in the dataset design and evaluation methodology.
Known Issues with LoComo:
- Missing and Flawed Ground Truth - Some categories have missing ground truth answers, speaker attribution errors, and inconsistencies in what is marked as correct
- Ambiguous Questions - Many questions have multiple valid interpretations and lack sufficient specificity to have a single correct answer
- Insufficient Challenge - Conversations are too short (16k-26k tokens), fitting within modern LLM context windows, failing to genuinely test memory retrieval capabilities
- Limited Evaluation Scope - Lacks critical tests for knowledge updates and temporal reasoning that are essential for real-world memory systems
- Data Quality Issues - Multimodal errors (image references without descriptions), poor conversation design, and unrealistic dialogue patterns
References:
- https://blog.getzep.com/lies-damn-lies-statistics-is-mem0-really-sota-in-agent-memory/
- https://www.kdjingpai.com/en/ai-zhinengtijiyiban/
For these reasons, we recommend focusing on LongMemEval as a more reliable indicator of memory system performance. LongMemEval provides better-quality ground truth, more realistic conversation scenarios, and a broader evaluation of memory capabilities.
To reproduce these results, visit the main Hindsight repository:
github.com/vectorize-io/hindsight
To visualize the benchmark results:
cd visualizer
npm install
npm run dev
Then open http://localhost:9998 in your browser.
The visualizer provides:
- 📊 Interactive benchmark overview with category breakdowns
- 🔍 Advanced filtering (by category, correctness, item ID)
- 📝 Detailed question-level analysis with reasoning and retrieved memories
- 🎯 Beautiful, responsive UI built with Next.js and Tailwind CSS
For deployment options and more details, see visualizer/README.md.
