🔌 Hermes plugin → github.com/Jnocode/recall-memory-hermes — 安裝指引與 Hermes Agent 整合設定
Better contextual retrieval for AI agents. Three-path RRF retrieval (ANN + keyword SQL JOIN + FTS5) in pure SQLite. No LLM at query time. ~80ms latency. 1400 real memories.
from recall import retrieve_relevant
store.add("User prefers docker-compose over Dockerfile")
results = retrieve_relevant("How should I deploy?", store)
# → "User prefers docker-compose over Dockerfile"recall. uses nomic-embed-text-v1.5 (768-dim) running in LM Studio. No LLM — embedding models are tiny (~150MB), fast, and cost zero tokens.
Download from lmstudio.ai.
| Step | Screenshot / Cmd |
|---|---|
| Open LM Studio → Models tab | — |
Search nomic-embed-text-v1.5 → Download |
~150MB |
| Switch to Local Inference Server tab | — |
Select nomic-embed-text-v1.5 in the model dropdown |
— |
| Click Start Server | port defaults to 1234 |
| Verify it's working: | curl http://127.0.0.1:1234/v1/models |
Expected response:
{"object":"list","data":[{"id":"nomic-embed-text-v1.5","object":"model",...}]}That's it. No API keys, no cloud services, no GPU required beyond what LM Studio needs (~2GB VRAM, also runs on CPU).
Default port is 1234. To change it, set EMBED_PORT in src/recall/embed.py:
EMBED_PORT = 1234 # change to match your LM Studio portGraceful degradation kicks in automatically:
- retrieval (
mcp_recall_recall/recall query) falls back to keyword + FTS5 search — no crash, just no ANN path - storage (
mcp_recall_store_memory/recall add) saves memories without embeddings — still findable via keywords - CLI (
recall add/stats/delete) unaffected — doesn't use embeddings at all
No error, no crash, no data loss. Just slightly less precise results.
pip install numpy
pip install -e .
recall add "User prefers docker-compose for local dev"
recall query "How to deploy?"Or via MCP server for Hermes Agent / Antigravity IDE / Gemini CLI:
If recall is installed in the same Python env as Hermes:
# ~/.hermes/config.yaml
mcp_servers:
recall:
command: "python"
args: ["-m", "recall.recall_mcp"]
timeout: 30
cwd: "/path/to/recall-memory" # optional, needed for DB path resolution# ~/.hermes/config.yaml
mcp_servers:
recall:
command: docker
args:
- run
- -i
- --rm
- --network=host
- -v
- recall-data:/data
- recall-memory:latest
timeout: 30Build the image first:
cd /path/to/recall-memory
docker compose buildstore.py — SQLite backend + tier management
embed.py — Nomic Embed via LM Studio REST API (768-dim)
retrieve.py — Three-path RRF retrieval + tier router
cli.py — Typer CLI (add / query / stats / delete / gc)
recall_mcp.py — MCP server for agent integration
Memories are split into three tiers to reduce compute and memory:
| Tier | Capacity | Retrieval | Compute Cost |
|---|---|---|---|
| Hot | ~500 | ANN + keywords + FTS5 (3-path RRF) | Highest |
| Warm | ~5000 | keywords + FTS5 only (2-path RRF) | Medium |
| Cold | Unlimited | Not indexed, fill-gap fallback only | ~Zero |
- Hot: full vectors in ANN index. Fastest search.
- Warm: keyword/FTS5 only, no vectors. 66–99% less ANN work.
- Cold: doesn't participate in normal queries. Only searched when hot+warm results are insufficient.
Promotion/demotion is automatic based on access frequency. Cold memories are sampled every N queries for keyword overlap—if relevant, they're promoted back to warm. No cron, no UI, no configuration needed.
Three parallel retrieval paths, fused via RRF (Reciprocal Rank Fusion):
Path V: Vector search (ANN) — sqlite-vec cosine similarity (hot tier only)
Path K: Keyword SQL JOIN — multi-hop keyword expansion (all tiers)
Path F: FTS5 full-text search — porter tokenizer + unicode61 (all tiers)
Tier router → hot 3-path → warm 2-path → cold fill-gap
No LLM calls at query time. No vector database. Just SQLite.
| Dependency | Required? | Notes |
|---|---|---|
| Python ≥3.10 | ✅ | — |
| numpy | ✅ | Cosine similarity + vector ops |
| typer | ✅ | CLI interface |
| sqlite-vec | ✅ | SQLite extension for ANN |
| LM Studio (port 1234) | ✅ | Runs nomic-embed-text-v1.5. See Prerequisites above. |
| pytest | ❌ | Only needed for development (pip install -e ".[dev]") |
| sentence-transformers | ❌ | Not used. The actual embedding calls go through LM Studio's HTTP API. |
pip install numpy
pip install -e . # installs recall-memory package + pulls sqlite-vecrecall stats
# → Memories: 0 Keywords: 0recall add "content" # Store a memory
recall query "question" # Retrieve relevant memories (tiered)
recall query "question" --include-cold # Force search cold tier too
recall stats # Store statistics
recall stats --verbose # + tier distribution
recall gc --dry-run # Preview eviction candidates
recall gc # Run garbage collection
recall delete <id> # Remove a memory| Tool | Parameters | Returns |
|---|---|---|
recall |
query: str (required), k: int (default 5), include_cold: bool (default false) |
{memories: [...], count: int} |
store_memory |
content: str (required), session_id: str, tag: str |
{id: str, status: "stored"} |
memory_stats |
(none) | {memories: int, keywords: int, tiers: {hot, warm, cold}} |
gc_memory |
dry_run: bool (default false) |
{evicted/ candidates: int, db_size_mb: float} |
v0.2.0 introduced tiered storage to reduce compute and memory. Here's what happens under the hood — you don't need to configure anything.
Query flow:
You: recall query "docker compose"
→ Hot tier (3-path RRF: ANN + keywords + FTS5) ← ~500 fastest memories
→ Warm tier (2-path RRF: keywords + FTS5 only) ← ~5000 fallback
→ Cold tier (keywords + FTS5, promoted on hit) ← everything else
→ Results returned
- Hot: memories with vector embeddings. ANN search runs here. ~80ms.
- Warm: no vectors, but keyword + FTS5 still work. Slightly lower relevance.
- Cold: doesn't participate in normal queries. Only used if hot+warm results are insufficient.
Promotion/demotion happens automatically:
- A memory you frequently query gets promoted to higher tiers
- Unused memories gradually shift to lower tiers over time
- Cold tier is sampled every 20 queries — if a cold memory's keywords match your query, it gets promoted back to warm
When to use --include-cold:
If you're searching for something very old or obscure that didn't appear in results, add this flag to force a full scan.
When to run gc:
Never, unless you care about disk space. Auto-triggers at 80MB DB size.
recall gc --dry-run previews what would be deleted.
recall gc actually deletes low-score memories (score < 0.5, rarely accessed).
What tiered storage does NOT change:
- Query syntax is identical
- No configuration files to edit
- No cron jobs or background processes
- Schema migration is automatic on
pip install --upgrade
No. Three layers of protection:
- 24h cooldown — a demoted memory cannot be promoted back for 24 hours
- Lifetime threshold —
access_count ≥ 3required before promotion triggers - Batch operation —
replenish_hot()runs during writes, not on the query path
SQLite WAL mode guarantees every reader sees a complete snapshot of the transaction as it began. There is no "tier updated but vector not yet written" intermediate state.
However, promote is not a single atomic operation:
UPDATE tier='hot'→ commitINSERT vec_embedding→ commit
A crash between step 1 and step 2 leaves a tier=hot memory with no vector. This memory is still retrievable via keyword+FTS5 — it just won't appear in ANN search results until reindexed.
These are two different numbers:
- 80MB is the eviction threshold for the main DB file (auto-delete low-score memories), not the WAL size
- WAL is a temporary journal; auto-checkpoint (~4MB default) flushes it back to the main DB and clears it
Promote/demote does not fire on every write. It triggers in two cases:
- A cold memory is hit during query (cold→warm promote)
- The main DB exceeds 80MB (GC demotion)
Each promote is 1-2 INSERT/DELETE statements, not hundreds of rows. The current DB is 32MB — far below the 80MB threshold.
Each memory write (including all indexes) is ~9KB. At ~17 new memories per day, that's ~56MB per year. Modern SSDs are rated for hundreds of TBW — this amount is below the noise floor.
Currently no. GC checks DB size (_gc_if_needed()) after store() commits, and at 32MB < 80MB threshold it's just a stat() call (<1ms).
If the DB eventually exceeds 80MB, GC runs eviction before store() returns. At that point you can raise the threshold or disable auto-GC and run recall gc manually.
24h is a conservative default to prevent thrashing. A memory demoted to cold was likely not accessed for a long time — if it becomes relevant again within 24 hours, lazy sampling (every 20 queries) will catch it and promote it back. Adjust COOLDOWN_HOURS in store.py to change.
| Aspect | recall-sqlite | Mem0 | Honcho |
|---|---|---|---|
| Query-time LLM | Zero | Every call | Every call |
| Forgetting mechanism | ✅ Auto tier demotion | ❌ None | ❌ None |
| Vector DB | None (SQLite) | Qdrant/PGVector | PostgreSQL |
| API Key required | No | Yes | Yes |
| Offline capable | ✅ (graceful fallback) | ❌ | ❌ |
| Data storage | Single SQLite file | Self-hosted | Cloud/self-hosted |
| p50 latency | ~80ms | ~890ms | ~1,420ms |
Not currently on the roadmap. recall-sqlite is designed as a local-first, single-device memory layer. The SQLite backend is intentionally simple — no conflict resolution, no cloud sync, no distributed locking.
Sync could theoretically be layered on top (SQLite files are portable), but it would require careful handling of concurrent writes across devices. If you need multi-device memory, Honcho or Supermemory are better fits today.
Production-ready MVP with tiered storage (v0.2.0). Tested against AIngram (tied on 1400 memories × 40 queries).
Memories: 1400 (from Honcho)
Keywords: 10560
Latency: ~80ms/query (hot), ~60ms/query (warm fill-gap)
ANN scan: -66% (now) → -99% (at 50K memories)
Memory: ~1.5MB fixed for hot tier vs linear growth
Eval: recall@5 comparable to AIngram with full extractor
pip install --upgrade recall-sqliteSchema migration is automatic — SQLite ALTER TABLE runs on first start.
No manual steps needed. Your existing memories are preserved and will start
in the "hot" tier.
To verify:
recall stats --verbose
# Should show the same memory count with tier distributionpip install recall-sqlite==0.1.0| Decision | Rationale |
|---|---|
| Three-path RRF | ANN + SQL JOIN + FTS5 covers different failure modes |
| No LLM re-rank | Extra latency + cost; not needed for retrieval quality |
| SQLite first | Zero-deployment, portable, git-committable |
| Nomic embed via LM Studio | 768-dim, better than MiniLM, no Python packaging hell |
| RRF fusion | No weight tuning needed; standard IR technique |
| System | R@5 (40 mems) | R@5 (1400 mems) | Latency |
|---|---|---|---|
| recall. | 0.579 | ~0.58 | ~80ms |
| AIngram | 0.583 | ~0.58 | ~27ms |
Both systems tied on identical embedding model. recall.'s advantage: three-path architecture (AIngram uses two-path when extractor is unavailable).
Apache 2.0