This document is the canonical operator runbook for benchmark execution in
basic-memory-benchmarks.
It covers:
- current benchmark workflows and commands,
- current manual commit-to-commit comparison workflow,
- planned (not yet implemented) revision matrix workflow.
| Area | Status |
|---|---|
| Single run execution (`run retrieval`, `run full`, `run judge`) | Implemented |
| `just` one-command pipelines (`bench-full`, `bench-full-judge`) | Implemented |
| Artifact generation and publish/compare commands | Implemented |
| Manual BM revision comparison via worktrees + `--bm-local-path` | Implemented workflow, manual orchestration |
| `bm-bench run revision-matrix` | Planned, not implemented yet |
Goals:
- deterministic retrieval evaluation for BM and comparator providers,
- reproducible latency and quality tracking over time,
- publishable artifacts with provenance.
- Official headline: LoCoMo categories 1-4 (`official_headline` in summaries).
- Adversarial breakout: LoCoMo category 5 (`adversarial_breakout`).
- Same query set for all providers in the same run.
- Same `top_k` for all providers in the same run.
- No provider-specific query rewriting for headline runs.
- Provider failures/skips must be explicit in artifacts (`provider-status.json`).
- Benchmark repo: clone of `basicmachines-co/basic-memory-benchmarks`.
- BM local repo: set the `BM_LOCAL_PATH` env var (or set it in `.env`) to your local `basic-memory` checkout.
- `.env` is auto-loaded by `just` (`set dotenv-load := true`).
- For `mem0-local`, set `OPENAI_API_KEY`.
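With those prerequisites in mind, a minimal environment setup can be sketched as follows; the path and key value are placeholders, not values from this repo.

```shell
# Placeholder environment setup for local runs (values are illustrative only).
export BM_LOCAL_PATH="$HOME/src/basic-memory"  # your local basic-memory checkout
export OPENAI_API_KEY="replace-me"             # required only for mem0-local
```

Equivalent `KEY=value` lines (without `export`) can live in `.env`, which `just` auto-loads.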
```
cd /path/to/basic-memory-benchmarks
just sync
```

If you plan to run judge metrics:

```
just sync-judge
```

LoCoMo source and converted outputs are created by:

```
just bench-prepare-short
just bench-prepare-long
```

`just` recipes:
- `bench-full`
- `bench-full-judge`
- `bench-prepare-short`
- `bench-prepare-long`
- `bench-run-short`
- `bench-run-long`
- `bench-run-full`
- `bench-judge`
- `bench-validate`
- `bench-publish`
- `bench-compare`
- `bench-latest-run`
Top-level commands:
- `datasets fetch`
- `convert locomo`
- `run retrieval`
- `run full`
- `run judge`
- `compare`
- `validate-artifacts`
- `publish`
```
cd /path/to/basic-memory-benchmarks
just bench-full
```

This runs:

```
just sync
just bench-prepare-long
just bench-run-full
```
```
cd /path/to/basic-memory-benchmarks
just bench-full-judge
```

This runs:

```
just sync-judge
just bench-prepare-long
just bench-run-full-judge
```
Short (quick25):

```
just bench-prepare-short
just bench-run-short
```

Long (full LoCoMo):

```
just bench-prepare-long
just bench-run-long
```

Strict provider mode (fail the run if any provider fails or skips):

```
just bench-run-short-strict
just bench-run-long-strict
```

The `bm-local` provider executes these steps per run:
- Resolve the BM command:
  - default: `bm`
  - local override: `uv run --project <bm_local_path> basic-memory`
- Create/reuse the benchmark project: `basic-memory project add bm-bench-<run_id> <corpus_dir>`
- Reindex:
  - prefer `reindex --search --embeddings`
  - fall back to `reindex --search`
- Wait for readiness:
  - if supported, poll `status --json --project <name> --local`
- Start a warm MCP stdio session:
  - one long-lived `basic-memory mcp` process per provider run
- Execute `search_notes` calls over MCP for each query.
- Clean up the MCP session.
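The command-resolution step above can be sketched as a small shell helper. This illustrates the documented fallback order only; it is not the harness's actual code.

```shell
# Resolve the BM command the way the steps above describe: use the local
# checkout when BM_LOCAL_PATH is set, else the installed `bm` binary.
resolve_bm_cmd() {
  if [ -n "${BM_LOCAL_PATH:-}" ]; then
    echo "uv run --project $BM_LOCAL_PATH basic-memory"
  else
    echo "bm"
  fi
}

BM_CMD=$(BM_LOCAL_PATH=/tmp/bm-checkout resolve_bm_cmd)
echo "$BM_CMD"
```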
The `mem0-local` provider:
- Requires `OPENAI_API_KEY`.
- Uses a deterministic user namespace: `bm-bench-<run_id>-mem0`.
- Ingests the markdown corpus with metadata: `source_doc_id`, `source_path`, `conversation_id`, `dataset_id`.
- Calls `Memory.search` for each query.
- Cleans provider state via `delete_all(user_id=...)`.
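The deterministic namespace makes cleanup targetable: the same string is derived for ingest, search, and `delete_all`. Sketched per the convention above:

```shell
# Derive the mem0 user namespace for a given run ID.
RUN_ID="fusion-long-r1"
MEM0_USER="bm-bench-${RUN_ID}-mem0"
echo "$MEM0_USER"
```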
Each run writes to `benchmarks/runs/<run_id>/`.
Required files:
- `manifest.json`
- `provider-status.json`
- `per-query-retrieval.jsonl`
- `retrieval-summary.json`
- `summary.md`

Optional judge files:
- `per-query-judge.jsonl`
- `judge-summary.json`
Provenance fields from `manifest.json`:
- `benchmark_git_sha`
- `bm_source`
- `bm_resolved_sha`
- `bm_local_path`
- `provider_versions`
- `dataset.checksum_sha256`
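These fields can be pulled out for quick provenance checks. The sketch below uses a fabricated sample manifest (real manifests contain more fields), and `python3` stands in for `jq`.

```shell
# Extract the BM revision recorded in a manifest. For real runs, point the
# script at benchmarks/runs/<run_id>/manifest.json instead of this sample.
cat > /tmp/manifest.json <<'EOF'
{"benchmark_git_sha": "abc1234", "bm_source": "local",
 "bm_resolved_sha": "f5a0e94", "bm_local_path": "/tmp/wt/fusion",
 "provider_versions": {"bm-local": "0.0.0"},
 "dataset": {"checksum_sha256": "deadbeef"}}
EOF
BM_SHA=$(python3 -c "import json; print(json.load(open('/tmp/manifest.json'))['bm_resolved_sha'])")
echo "BM revision under test: $BM_SHA"
```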
Get the latest run:

```
just bench-latest-run
```

Validate artifacts:

```
just bench-validate run_dir="$(just bench-latest-run)"
```

Publish a bundle:

```
just bench-publish run_dir="$(just bench-latest-run)"
```

Use this workflow today to compare BM revisions while keeping the benchmark tooling fixed.
```
BM_REPO=/path/to/basic-memory
WT_ROOT=/path/to/basic-memory-benchmarks/benchmarks/worktrees/basic-memory
mkdir -p "$WT_ROOT"
git -C "$BM_REPO" worktree add "$WT_ROOT/pre_fusion" f5a0e942^
git -C "$BM_REPO" worktree add "$WT_ROOT/fusion" f5a0e942
git -C "$BM_REPO" worktree add "$WT_ROOT/context_step1" f9b2a075
git -C "$BM_REPO" worktree add "$WT_ROOT/context_step2" 9331126b
git -C "$BM_REPO" worktree add "$WT_ROOT/current" HEAD
```

Then prepare the datasets:

```
cd /path/to/basic-memory-benchmarks
just sync
just bench-prepare-short
just bench-prepare-long
```

Run ID convention:
- short: `<revision>-short-r1`
- long: `<revision>-long-r1`
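The convention expands mechanically. A quick loop can enumerate every run ID used later in this runbook:

```shell
# Enumerate run IDs for the five revisions and two datasets (replicate r1).
RUN_IDS=""
for rev in pre_fusion fusion context_step1 context_step2 current; do
  for ds in short long; do
    RUN_IDS="$RUN_IDS ${rev}-${ds}-r1"
  done
done
echo "$RUN_IDS"
N_RUNS=$(echo "$RUN_IDS" | wc -w | tr -d ' ')
```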
Example for one revision (fusion) and the long dataset:

```
uv run bm-bench run retrieval \
  --run-id fusion-long-r1 \
  --dataset-id locomo \
  --dataset-path benchmarks/datasets/locomo/locomo10.json \
  --corpus-dir benchmarks/generated/locomo/docs \
  --queries-path benchmarks/generated/locomo/queries.json \
  --providers bm-local \
  --bm-local-path "$WT_ROOT/fusion" \
  --strict-providers
```

Example for one revision (fusion) and the short dataset:

```
uv run bm-bench run retrieval \
  --run-id fusion-short-r1 \
  --dataset-id locomo-c1-quick25 \
  --dataset-path benchmarks/datasets/locomo/locomo10.json \
  --corpus-dir benchmarks/generated/locomo-c1/docs \
  --queries-path benchmarks/generated/locomo-c1/queries.quick25.json \
  --providers bm-local \
  --bm-local-path "$WT_ROOT/fusion" \
  --strict-providers
```

Repeat for:
- `pre_fusion` (f5a0e942^)
- `fusion` (f5a0e942)
- `context_step1` (f9b2a075)
- `context_step2` (9331126b)
- `current` (HEAD)
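Rather than typing each invocation by hand, the repeat can be scripted. The loop below echoes each command (a dry run), so it is safe to paste and inspect; drop the leading `echo` to execute for real.

```shell
# Dry-run loop over all revision worktrees for the long dataset.
WT_ROOT=/tmp/worktrees   # set to your actual worktree root
N_CMDS=0
for rev in pre_fusion fusion context_step1 context_step2 current; do
  echo uv run bm-bench run retrieval \
    --run-id "${rev}-long-r1" \
    --dataset-id locomo \
    --dataset-path benchmarks/datasets/locomo/locomo10.json \
    --corpus-dir benchmarks/generated/locomo/docs \
    --queries-path benchmarks/generated/locomo/queries.json \
    --providers bm-local \
    --bm-local-path "$WT_ROOT/$rev" \
    --strict-providers
  N_CMDS=$((N_CMDS + 1))
done
```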
Long anchor:

```
uv run bm-bench run retrieval \
  --run-id mem0-anchor-long-r1 \
  --dataset-id locomo \
  --dataset-path benchmarks/datasets/locomo/locomo10.json \
  --corpus-dir benchmarks/generated/locomo/docs \
  --queries-path benchmarks/generated/locomo/queries.json \
  --providers mem0-local \
  --allow-provider-skip
```

Short anchor:

```
uv run bm-bench run retrieval \
  --run-id mem0-anchor-short-r1 \
  --dataset-id locomo-c1-quick25 \
  --dataset-path benchmarks/datasets/locomo/locomo10.json \
  --corpus-dir benchmarks/generated/locomo-c1/docs \
  --queries-path benchmarks/generated/locomo-c1/queries.quick25.json \
  --providers mem0-local \
  --allow-provider-skip
```

Compare two runs:

```
just bench-compare \
  "benchmarks/runs/pre_fusion-long-r1/retrieval-summary.json" \
  "benchmarks/runs/fusion-long-r1/retrieval-summary.json" \
  bm-local \
  recall_at_5
```

Recommended metrics to compare:
- `recall_at_5`
- `recall_at_10`
- `mrr`
- `mean_latency_ms`
- `p95_latency_ms`
Use a summary table with baseline deltas, for example:
| Revision | Dataset | Recall@5 | Recall@10 | MRR | Delta R@5 vs pre_fusion | Delta MRR vs pre_fusion |
|---|---|---|---|---|---|---|
| pre_fusion | long | ... | ... | ... | 0.000 | 0.000 |
| fusion | long | ... | ... | ... | ... | ... |
| context_step1 | long | ... | ... | ... | ... | ... |
| context_step2 | long | ... | ... | ... | ... | ... |
| current | long | ... | ... | ... | ... | ... |
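Baseline deltas for a table like this can be computed from the summary files directly. The sketch below uses fabricated sample summaries with an assumed flat `{"bm-local": {"recall_at_5": ...}}` shape; match the key path to the real `retrieval-summary.json` layout before relying on it.

```shell
# Compute a baseline delta for one metric from two retrieval summaries.
cat > /tmp/base.json <<'EOF'
{"bm-local": {"recall_at_5": 0.62}}
EOF
cat > /tmp/cand.json <<'EOF'
{"bm-local": {"recall_at_5": 0.68}}
EOF
DELTA=$(python3 -c "
import json
base = json.load(open('/tmp/base.json'))['bm-local']['recall_at_5']
cand = json.load(open('/tmp/cand.json'))['bm-local']['recall_at_5']
print(f'{cand - base:+.3f}')
")
echo "Delta R@5 vs baseline: $DELTA"
```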
Status: planned.
Planned defaults:
- worktree-based BM revision execution,
- parallel workers: `2`,
- datasets: `both` (short + long),
- replicates: `1`,
- BM per revision + a fixed mem0 anchor.

Planned output root: `benchmarks/matrices/<matrix_id>/`
Planned command shape:

```
uv run bm-bench run revision-matrix \
  --bm-repo-path /path/to/basic-memory \
  --revisions pre_fusion=f5a0e942^ \
  --revisions fusion=f5a0e942 \
  --revisions context_step1=f9b2a075 \
  --revisions context_step2=9331126b \
  --revisions current=HEAD \
  --baseline pre_fusion \
  --datasets both \
  --workers 2 \
  --replicates 1 \
  --providers-mode bm-only-mem0-anchor
```

Troubleshooting: `bm-local` fails during project creation.
Symptoms:
- `provider-status.json` shows `bm-local` state `error`
- the reason contains `basic-memory project add ... returned non-zero exit status`
Checks:
- verify the path exists and is a BM repo: `ls /path/to/basic-memory/pyproject.toml`
- verify the command works directly: `uv run --project /path/to/basic-memory basic-memory --version`
- rerun with an explicit local path: `--bm-local-path /path/to/basic-memory`
- use strict mode while debugging: `--strict-providers`
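The first check above can be wrapped in a tiny preflight helper; this is a sketch of the manual check, not part of the harness.

```shell
# Preflight a --bm-local-path candidate: reject paths that do not look like
# a Python project checkout (no pyproject.toml).
check_bm_path() {
  [ -f "$1/pyproject.toml" ] || { echo "no pyproject.toml in $1"; return 1; }
  echo "plausible BM checkout: $1"
}

mkdir -p /tmp/fake-bm && touch /tmp/fake-bm/pyproject.toml
check_bm_path /tmp/fake-bm && PREFLIGHT_OK=1 || PREFLIGHT_OK=0
```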
- Some BM environments support `status --json`; some older ones do not.
- The provider auto-detects support.
- If unsupported, the benchmark still runs after reindex, without JSON readiness polling.

Provider status semantics:
- `skipped`: an expected gate was not met (for example, missing `OPENAI_API_KEY` for mem0).
- `error`: the provider attempted execution and failed.
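A post-run check can enforce this distinction: tolerate `skipped`, fail on `error`. The sketch below uses a fabricated sample with an assumed `{"providers": [{"name": ..., "state": ...}]}` shape; align it with the real `provider-status.json` before relying on it.

```shell
# Count providers that ended in `error` (a `skipped` state is tolerated).
cat > /tmp/provider-status.json <<'EOF'
{"providers": [{"name": "bm-local", "state": "ok"},
               {"name": "mem0-local", "state": "skipped"}]}
EOF
N_ERRORS=$(python3 -c "
import json
providers = json.load(open('/tmp/provider-status.json'))['providers']
print(sum(1 for p in providers if p['state'] == 'error'))
")
echo "providers in error state: $N_ERRORS"
```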
- Full LoCoMo runs are expected to take significantly longer than quick25 runs.
- Use `bench-run-short` for quick checks before full runs.
BM-only rerun with an explicit revision worktree:

```
uv run bm-bench run retrieval \
  --providers bm-local \
  --bm-local-path "$WT_ROOT/fusion" \
  --run-id fusion-long-r1-retry \
  --dataset-id locomo \
  --dataset-path benchmarks/datasets/locomo/locomo10.json \
  --corpus-dir benchmarks/generated/locomo/docs \
  --queries-path benchmarks/generated/locomo/queries.json
```

`uv` solves environment/dependency reproducibility. Worktrees solve source isolation. For commit-to-commit benchmarking, worktrees make each revision explicit, auditable, and safe to run in parallel.
Use strict mode (`--strict-providers`) for regression investigations and CI gates where silent skips are unacceptable. Use allow-skip mode (`--allow-provider-skip`) for exploratory local runs where external credentials may be missing.
```
just bench-publish run_dir="$(just bench-latest-run)"
```

Or target a specific run directory:

```
just bench-publish run_dir="benchmarks/runs/<run_id>"
```

Command surface checks:

```
just --list
uv run bm-bench --help
uv run bm-bench run --help
```

Dry-run checks:

```
just --dry-run bench-run-short
just --dry-run bench-run-long
just --dry-run bench-full
just --dry-run bench-full-judge
```

Artifact field checks:

```
latest=$(just bench-latest-run)
cat "$latest/manifest.json"
cat "$latest/provider-status.json"
```

Comparison check:

```
just bench-compare \
  "benchmarks/runs/<baseline_run>/retrieval-summary.json" \
  "benchmarks/runs/<candidate_run>/retrieval-summary.json" \
  bm-local \
  recall_at_5
```