Add BEAM benchmark dataset (ICLR 2026 — long-term memory evaluation) #12
Open
Labels: enhancement
Description
Summary
Add BEAM as a benchmark dataset alongside LoCoMo and LongMemEval.
BEAM is from the ICLR 2026 paper "Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs" (University of Alberta + UMass Amherst).
Why BEAM
- Scale: 100 conversations ranging from 128K to 10M tokens (vs LoCoMo's ~10K tokens / 81 QA pairs)
- Breadth: Tests 10 distinct memory abilities — abstention, contradiction resolution, event ordering, info extraction, instruction following, knowledge update, multi-hop reasoning, preference following, summarization, temporal reasoning
- Rigor: 2,000 human-validated probing questions with nugget-based evaluation
- Multi-domain: Coding, math, health, finance, personal — not just casual/personal conversations
- Key finding: Even 1M-token context window LLMs degrade substantially on long conversations. RAG alone doesn't fix it. Structured external memory (their LIGHT framework) improves 3.5-12.7%.
This validates the core thesis behind Basic Memory — structured knowledge graphs beat raw context windows.
Resources
- Paper: https://arxiv.org/abs/2510.27246
- Code + data: https://github.com/mohammadtavakoli78/BEAM
- Dataset (HuggingFace): https://huggingface.co/datasets/Mohammadta/BEAM
- 10M subset: https://huggingface.co/datasets/Mohammadta/BEAM-10M
Implementation Notes
- Dataset has 4 size tiers: 128K (20 chats), 500K (35), 1M (35), 10M (10)
- Evaluation uses nugget scoring (atomic semantic units) + Kendall tau-b for event ordering
- Will need a converter similar to locomo_to_corpus.py
- Consider starting with the 128K tier for fast iteration, then scaling up
- Their eval scripts are in the repo and could be adapted or wrapped
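The nugget + Kendall tau-b scoring described above can be sketched roughly as below. This is a minimal illustration, not BEAM's actual implementation: the function names are placeholders, and the substring check in `nugget_recall` stands in for the semantic (LLM-judged) nugget matching the paper uses. Verify details against their eval scripts before adapting.

```python
import math
from itertools import combinations

def nugget_recall(answer: str, nuggets: list[str]) -> float:
    """Fraction of gold nuggets (atomic semantic units) covered by the answer.
    Placeholder: substring matching instead of semantic/LLM-judged matching."""
    if not nuggets:
        return 0.0
    hits = sum(1 for n in nuggets if n.lower() in answer.lower())
    return hits / len(nuggets)

def kendall_tau_b(pred: list[int], gold: list[int]) -> float:
    """Kendall tau-b rank correlation between predicted and gold event orders.
    Tau-b corrects the denominator for ties in either ranking."""
    assert len(pred) == len(gold)
    concordant = discordant = ties_pred_only = ties_gold_only = 0
    for i, j in combinations(range(len(pred)), 2):
        dp = pred[i] - pred[j]
        dg = gold[i] - gold[j]
        if dp == 0 and dg == 0:
            continue          # tied in both: excluded from every term
        elif dp == 0:
            ties_pred_only += 1
        elif dg == 0:
            ties_gold_only += 1
        elif dp * dg > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_pred_only)
        * (concordant + discordant + ties_gold_only)
    )
    return (concordant - discordant) / denom if denom else 0.0
```

A perfectly ordered prediction scores 1.0, a fully reversed one -1.0; scipy's `kendalltau` computes the same tau-b statistic if we'd rather not hand-roll it.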
Context
Drew is working on a fork of supermemory's memorybench (which covers LoCoMo, LongMemEval, ConvoMem). BEAM would give us a much more comprehensive evaluation at scale in our own repo.
Related competitive intel: ByteRover and OpenViking are both positioning themselves as "memory for agents"; strong benchmark numbers across multiple datasets strengthen Basic Memory's story.
Suggested labels
enhancement, benchmarks