Add BEAM benchmark dataset (ICLR 2026 — long-term memory evaluation) #12
Open
Labels: enhancement
Description
Summary
Add BEAM as a benchmark dataset alongside LoCoMo and LongMemEval.
BEAM is from the ICLR 2026 paper "Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs" (University of Alberta + UMass Amherst).
Why BEAM
- Scale: 100 conversations ranging from 128K to 10M tokens (vs LoCoMo's ~10K tokens / 81 QA pairs)
- Breadth: Tests 10 distinct memory abilities — abstention, contradiction resolution, event ordering, info extraction, instruction following, knowledge update, multi-hop reasoning, preference following, summarization, temporal reasoning
- Rigor: 2,000 human-validated probing questions with nugget-based evaluation
- Multi-domain: Coding, math, health, finance, personal — not just casual/personal conversations
- Key finding: Even 1M-token context window LLMs degrade substantially on long conversations. RAG alone doesn't fix it. Structured external memory (their LIGHT framework) improves 3.5-12.7%.
This validates the core thesis behind Basic Memory — structured knowledge graphs beat raw context windows.
Resources
- Paper: https://arxiv.org/abs/2510.27246
- Code + data: https://github.com/mohammadtavakoli78/BEAM
- Dataset (HuggingFace): https://huggingface.co/datasets/Mohammadta/BEAM
- 10M subset: https://huggingface.co/datasets/Mohammadta/BEAM-10M
Implementation Notes
- Dataset has 4 size tiers: 128K (20 chats), 500K (35), 1M (35), 10M (10)
- Evaluation uses nugget scoring (atomic semantic units) + Kendall tau-b for event ordering
- Will need a converter similar to locomo_to_corpus.py
- Consider starting with the 128K tier for fast iteration, then scaling up
- Their eval scripts are in the repo and could be adapted or wrapped
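The nugget + Kendall tau-b scoring described above can be sketched roughly as below. This is a minimal illustration, not BEAM's actual implementation: the function names are placeholders, and the substring check in `nugget_recall` stands in for the semantic (LLM-judged) nugget matching the paper uses. Verify details against their eval scripts before adapting.

```python
import math
from itertools import combinations

def nugget_recall(answer: str, nuggets: list[str]) -> float:
    """Fraction of gold nuggets (atomic semantic units) covered by the answer.
    Placeholder: substring matching instead of semantic/LLM-judged matching."""
    if not nuggets:
        return 0.0
    hits = sum(1 for n in nuggets if n.lower() in answer.lower())
    return hits / len(nuggets)

def kendall_tau_b(pred: list[int], gold: list[int]) -> float:
    """Kendall tau-b rank correlation between predicted and gold event orders.
    Tau-b corrects the denominator for ties in either ranking."""
    assert len(pred) == len(gold)
    concordant = discordant = ties_pred_only = ties_gold_only = 0
    for i, j in combinations(range(len(pred)), 2):
        dp = pred[i] - pred[j]
        dg = gold[i] - gold[j]
        if dp == 0 and dg == 0:
            continue          # tied in both: excluded from every term
        elif dp == 0:
            ties_pred_only += 1
        elif dg == 0:
            ties_gold_only += 1
        elif dp * dg > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_pred_only)
        * (concordant + discordant + ties_gold_only)
    )
    return (concordant - discordant) / denom if denom else 0.0
```

A perfectly ordered prediction scores 1.0, a fully reversed one -1.0; scipy's `kendalltau` computes the same tau-b statistic if we'd rather not hand-roll it.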
Context
Drew is working on a fork of supermemory's memorybench (which covers LoCoMo, LongMemEval, ConvoMem). BEAM would give us a much more comprehensive evaluation at scale in our own repo.
Related competitive intel: ByteRover and OpenViking are both positioning themselves as "memory for agents"; strong benchmark numbers across multiple datasets strengthen Basic Memory's story.
Suggested labels
enhancement, benchmarks