Four-layer hybrid search and knowledge graph for AI coding assistants.
BM25 + vector embeddings + RAPTOR directory summaries + graph expansion — fused into a single MCP tool that gives Claude, Copilot, and Cursor a real understanding of your codebase.
Works with Claude Code, GitHub Copilot (VS Code 1.99+), Cursor, Windsurf, and Claude Desktop.
Zero configuration — indexes on first use, stays in sync automatically.
AI assistants are powerful editors, but they navigate code like a tourist:
- Grep finds text, not meaning: "find authentication logic" returns every file containing the word "auth"
- File reads are isolated: Claude sees a file but not its dependencies, callers, or the patterns your team established
- No memory of your project: every session starts from scratch
CodeSeeker fixes this. It indexes your codebase once and gives AI assistants a queryable knowledge graph they can use on every turn.
A 4-stage pipeline runs on every query:
Query: "find JWT refresh token logic"
│
▼ Stage 1 — Hybrid retrieval
┌─────────────────────────────────────────────────────┐
│ BM25 (exact symbols, camelCase tokenized) │
│ + │
│ Vector search (384-dim Xenova embeddings) │
│ ↓ │
│ Reciprocal Rank Fusion: score = Σ 1/(60 + rank_i) │
│ Top-30 results, including RAPTOR directory nodes │
└─────────────────────────────────────────────────────┘
│
▼ Stage 2 — RAPTOR cascade (conditional)
┌─────────────────────────────────────────────────────┐
│ IF best directory-summary score ≥ 0.5: │
│ → narrow results to that directory automatically │
│ ELSE: all 30 results pass through unchanged │
│ Effect: "what does auth/ do?" scopes to auth/ │
│ "jwt.ts decode function" bypasses this │
└─────────────────────────────────────────────────────┘
│
▼ Stage 3 — Scoring and deduplication
┌─────────────────────────────────────────────────────┐
│ Dedup: keep highest-score chunk per file │
│ Source files: +0.10 (definition sites matter) │
│ Test files: −0.15 (prevent test dominance) │
│ Symbol boost: +0.20 (query token in filename) │
│ Multi-chunk: up to +0.30 (file has many hits) │
└─────────────────────────────────────────────────────┘
│
▼ Stage 4 — Graph expansion
┌─────────────────────────────────────────────────────┐
│ Top-10 results → follow IMPORTS/CALLS/EXTENDS edges │
│ Structural neighbors scored at source × 0.7 │
│ Avg graph connectivity: 20.8 edges/node │
└─────────────────────────────────────────────────────┘
│
▼
auth/jwt.ts (0.94), auth/refresh.ts (0.89), ...
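The Stage 1 fusion formula, score = Σ 1/(60 + rank_i), can be illustrated with a small sketch. This is not CodeSeeker's implementation: the function name `reciprocalRankFusion` and the sample rankings are hypothetical, and it assumes each retriever returns file paths ordered best-first with 1-based ranks.

```typescript
// Reciprocal Rank Fusion: each ranker contributes 1 / (k + rank) per document,
// with 1-based ranks and k = 60 as in the Stage 1 diagram.
// Illustrative sketch only; names and data are hypothetical.
function reciprocalRankFusion(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, index) => {
      const rank = index + 1;
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Hypothetical outputs of the BM25 and vector retrievers for one query.
const bm25Ranking = ["auth/jwt.ts", "auth/refresh.ts", "tests/jwt.test.ts"];
const vectorRanking = ["auth/refresh.ts", "auth/session.ts", "auth/jwt.ts"];
const fused = reciprocalRankFusion([bm25Ranking, vectorRanking]);
```

A file that ranks well in both lists (here `auth/refresh.ts`) rises above one that ranks first in only a single list, without any per-retriever weight tuning.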
The knowledge graph is built from AST-parsed imports at index time. It's what powers analyze dependencies, dead-code detection, and graph expansion in every search.
| Approach | Strengths | Limitations |
|---|---|---|
| Grep / ripgrep | Fast, universal | No semantic understanding |
| Vector search only | Finds similar code | Misses structural relationships |
| Serena | Precise LSP symbol navigation, 30+ languages | No semantic search, no cross-file reasoning |
| Codanna | Fast symbol lookup, good call graphs | Semantic search needs JSDoc — undocumented code gets no embeddings; no BM25, no RAPTOR, Windows experimental |
| CodeSeeker | BM25 + embedding fusion + RAPTOR + graph + coding standards + multi-language AST | Requires initial indexing (30s–5min) |
What LSP tools can't do:
- "Find code that handles errors like this" → semantic pattern search
- "What validation approach does this project use?" → auto-detected coding standards
- "Show me everything related to authentication" → graph traversal across indirect dependencies
What vector-only search misses:
- Direct import/export chains
- Class inheritance hierarchies
- Which files actually depend on which
The standard way to configure any MCP server — no global install required:
{
"mcpServers": {
"codeseeker": {
"command": "npx",
"args": ["-y", "codeseeker", "serve", "--mcp"]
}
}
}
Add this to your MCP config file (see below for per-client locations) and restart your editor.
npm install -g codeseeker
codeseeker install --vscode # or --cursor, --windsurf
For Claude Code CLI users (adds auto-sync hooks and slash commands):
/plugin install codeseeker@github:jghiringhelli/codeseeker#plugin
Slash commands: /codeseeker:init, /codeseeker:reindex
{
"name": "My Project",
"image": "mcr.microsoft.com/devcontainers/javascript-node:18",
"postCreateCommand": "npm install -g codeseeker && codeseeker install --vscode"
}
Ask your AI assistant: "What CodeSeeker tools do you have?"
You should see: search, analyze, index — CodeSeeker's three tools.
📋 MCP Configuration by client
The MCP config JSON is the same for all clients — only the file location differs:
| Client | Config file |
|---|---|
| VS Code (Claude Code / Copilot) | .vscode/mcp.json in your project, or ~/.vscode/mcp.json globally |
| Cursor | .cursor/mcp.json in your project |
| Claude Desktop | ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows) |
| Windsurf | .windsurf/mcp.json in your project |
{
"mcpServers": {
"codeseeker": {
"command": "npx",
"args": ["-y", "codeseeker", "serve", "--mcp"]
}
}
}
🖥️ CLI Standalone Usage (without AI assistant)
npm install -g codeseeker
cd your-project
codeseeker init
codeseeker -c "how does authentication work in this project?"
Once configured, Claude has access to these MCP tools (used automatically):
| Tool | Actions / Usage | What It Does |
|---|---|---|
| search | {query} | Hybrid search: vector + BM25 text + path-match, fused with RRF; RAPTOR directory summaries surface for abstract queries |
| search | {query, search_type: "graph"} | Hybrid search + Graph RAG: follows import/call/extends edges to surface structurally connected files |
| search | {query, search_type: "vector"} | Pure embedding cosine-similarity search (no BM25 or path scoring) |
| search | {query, search_type: "fts"} | Pure BM25 text search with CamelCase tokenisation and synonym expansion |
| search | {query, read: true} | Search + read file contents in one step |
| search | {filepath} | Read a file with its related code automatically included |
| analyze | {action: "dependencies", filepath} | Traverse the knowledge graph (imports, calls, extends) |
| analyze | {action: "standards"} | Your project's detected patterns (validation, error handling) |
| analyze | {action: "duplicates"} | Find duplicate/similar code blocks across your codebase |
| analyze | {action: "dead_code"} | Detect unused exports, functions, and classes |
| index | {action: "init", path} | Manually trigger indexing (rarely needed) |
| index | {action: "sync", changes} | Update index for specific files |
| index | {action: "exclude", paths} | Dynamically exclude/include files from the index |
| index | {action: "status"} | List indexed projects with file/chunk counts |
You don't invoke these manually—Claude uses them automatically when searching code or analyzing relationships.
You don't need to manually index. When Claude uses any CodeSeeker tool, the tool automatically checks if the project is indexed. If not, it indexes on first use.
User: "Find the authentication logic"
│
▼
┌─────────────────────────────────────┐
│ Claude calls search({query: ...}) │
│ │ │
│ ▼ │
│ Project indexed? ──No──► Index now │
│ │ (auto) │
│ Yes │ │
│ │◀───────────────────┘ │
│ ▼ │
│ Return search results │
└─────────────────────────────────────┘
First search on a new project takes 30 seconds to several minutes (depending on size). Subsequent searches are instant.
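The index-on-first-use flow above amounts to lazy initialization keyed by project. The sketch below is a toy model under stated assumptions: `buildIndex`, `search`, and the in-memory `indexes` map are hypothetical names, and a plain substring match stands in for the real hybrid search.

```typescript
// Toy model of "index on first use": the index is built once per project
// and reused by later searches. All names here are hypothetical.
type ProjectIndex = Map<string, string>; // file path -> lowercased content

const indexes = new Map<string, ProjectIndex>();
let indexBuilds = 0; // counts how often the expensive step runs

function buildIndex(files: Map<string, string>): ProjectIndex {
  indexBuilds++;
  const index: ProjectIndex = new Map();
  files.forEach((content, path) => index.set(path, content.toLowerCase()));
  return index;
}

function search(projectId: string, files: Map<string, string>, query: string): string[] {
  let index = indexes.get(projectId);
  if (index === undefined) {
    // Project not indexed yet: index now, then continue with the search.
    index = buildIndex(files);
    indexes.set(projectId, index);
  }
  const needle = query.toLowerCase();
  const matches: string[] = [];
  index.forEach((content, path) => {
    if (content.indexOf(needle) !== -1) matches.push(path);
  });
  return matches;
}

const demoFiles = new Map<string, string>([
  ["auth/jwt.ts", "export function verifyToken() {}"],
  ["billing/invoice.ts", "export function createInvoice() {}"],
]);
const firstSearch = search("demo", demoFiles, "verifyToken");
const secondSearch = search("demo", demoFiles, "createInvoice");
```

Only the first call pays the indexing cost; the second reuses the stored index.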
📊 Component ablation study (v2.0.0) — measured impact of each retrieval layer
18 hand-labelled queries across two real-world codebases:
| Corpus | Language | Files | Queries | Query types |
|---|---|---|---|---|
| Conclave | TypeScript (pnpm monorepo) | 201 | 10 | Symbol lookup, cross-file chains, out-of-scope |
| ImperialCommander2 | C# / Unity | 199 | 8 | Class lookup, controller wiring, file I/O |
Each query has one or more mustFind targets (exact file basenames) and optional mustNotFind targets (scope leak check). Queries were run on a real index built from source — real Xenova embeddings, real graph, real RAPTOR L2 nodes — to reflect production conditions.
Metrics: MRR (Mean Reciprocal Rank), P@1 and P@3 (Precision at 1 and 3), R@5 (Recall at 5), F1@3 (F1 score at 3).
| Configuration | MRR | P@1 | P@3 | R@5 | F1@3 | Notes |
|---|---|---|---|---|---|---|
| Hybrid baseline (BM25 + embed + RAPTOR, no graph) | 75.2% | 61.1% | 29.6% | 91.7% | 44.4% | Production default |
| + graph 1-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | ±0% ranking, adds structural neighbors |
| + graph 2-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | Scope leaks on unrelated queries |
| No RAPTOR (graph 1-hop) | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | RAPTOR contributes +0.3% |
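For readers unfamiliar with the metrics in the table, here is a minimal sketch of the standard definitions of MRR, P@1, and recall at k. It is not the code in scripts/real-bench.js; the `QueryResult` shape and the sample queries are assumptions made for illustration.

```typescript
// Standard ranking metrics over labelled queries. Hypothetical data shape.
interface QueryResult {
  ranked: string[];   // returned file basenames, best first
  mustFind: string[]; // labelled relevant files for the query
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Reciprocal rank of the first relevant result (0 if none is returned).
function reciprocalRank(q: QueryResult): number {
  const idx = q.ranked.findIndex((f) => q.mustFind.indexOf(f) !== -1);
  return idx === -1 ? 0 : 1 / (idx + 1);
}

function meanReciprocalRank(qs: QueryResult[]): number {
  return mean(qs.map(reciprocalRank));
}

// Fraction of queries whose top result is relevant.
function precisionAt1(qs: QueryResult[]): number {
  return mean(qs.map((q) => (q.mustFind.indexOf(q.ranked[0]) !== -1 ? 1 : 0)));
}

// Per-query fraction of mustFind targets present in the top k, averaged.
function recallAt(qs: QueryResult[], k: number): number {
  return mean(
    qs.map((q) => {
      const top = q.ranked.slice(0, k);
      const found = q.mustFind.filter((f) => top.indexOf(f) !== -1).length;
      return found / q.mustFind.length;
    }),
  );
}

// Two made-up queries: target at rank 1, and target at rank 4.
const sampleQueries: QueryResult[] = [
  { ranked: ["jwt.ts", "refresh.ts"], mustFind: ["jwt.ts"] },
  { ranked: ["a.ts", "b.ts", "c.ts", "orchestrator.ts"], mustFind: ["orchestrator.ts"] },
];
const mrr = meanReciprocalRank(sampleQueries); // (1 + 1/4) / 2
const p1 = precisionAt1(sampleQueries);
const r5 = recallAt(sampleQueries, 5);
```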
BM25 + embedding fusion (RRF)
The workhorse. Handles ~94% of ranking quality on its own. BM25 catches exact symbol names and camelCase tokens; vector embeddings catch semantic similarity when names differ. Fused with Reciprocal Rank Fusion to combine both signals without manual weight tuning.
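To show why camelCase-aware tokenization matters for the BM25 side, here is a small tokenizer sketch. The regular expressions and the `tokenize` name are assumptions for illustration; CodeSeeker's actual tokenizer (which also does synonym expansion) may differ.

```typescript
// Split identifiers and paths into lowercase tokens so that a query for
// "jwt token" can match the symbol "refreshJWTToken". Hypothetical sketch.
function tokenize(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")    // fooBar   -> foo Bar
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // JWTToken -> JWT Token
    .split(/[^A-Za-z0-9]+/)                    // break on /, _, ., spaces
    .filter((t) => t.length > 0)
    .map((t) => t.toLowerCase());
}

const symbolTokens = tokenize("refreshJWTToken");
const pathTokens = tokenize("auth/jwt_utils.ts");
```

Without the split, `refreshJWTToken` would be a single opaque term and a keyword query for "jwt" would not match it.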
RAPTOR (hierarchical directory summaries)
Generates per-directory embedding nodes by mean-pooling all file embeddings in a folder. Acts as a post-filter: when a directory summary scores ≥ 0.5 against the query, results are narrowed to that directory's files. Measured contribution: +0.3% MRR on symbol queries. Fires conservatively — only when the directory is an obvious match. Its real value is on abstract queries ("what does the payments module do?") which don't appear in this benchmark; for those queries it prevents broad scattering across the entire codebase.
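A minimal sketch of the two RAPTOR steps described here: mean-pooling file embeddings into a directory node, then narrowing results when the best directory clears the 0.5 threshold. It assumes the directory score is cosine similarity against the query embedding, which the text does not state explicitly; all names and the 3-dimensional toy vectors are hypothetical.

```typescript
type Vec = number[];
interface SearchHit { path: string; score: number }

// Directory summary = element-wise mean of the file embeddings in a folder.
// Expects at least one vector.
function meanPool(vectors: Vec[]): Vec {
  const dim = vectors[0].length;
  const sum: number[] = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) sum[i] += v[i];
  }
  return sum.map((x) => x / vectors.length);
}

function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Post-filter: keep only hits under the best-matching directory, but only
// when that directory scores at or above the threshold.
function raptorCascade(
  hits: SearchHit[],
  query: Vec,
  dirEmbeddings: Map<string, Vec>,
  threshold = 0.5,
): SearchHit[] {
  let bestDir: string | undefined;
  let bestScore = -Infinity;
  for (const [dir, embedding] of Array.from(dirEmbeddings.entries())) {
    const s = cosine(query, embedding);
    if (s > bestScore) {
      bestScore = s;
      bestDir = dir;
    }
  }
  if (bestDir === undefined || bestScore < threshold) return hits;
  const prefix = bestDir;
  return hits.filter((h) => h.path.startsWith(prefix));
}

// Toy data: two directories with two file embeddings each.
const dirNodes = new Map<string, Vec>([
  ["auth/", meanPool([[1, 0, 0], [0.8, 0.2, 0]])],
  ["billing/", meanPool([[0, 1, 0], [0, 0.8, 0.2]])],
]);
const allHits: SearchHit[] = [
  { path: "auth/jwt.ts", score: 0.9 },
  { path: "billing/invoice.ts", score: 0.6 },
  { path: "auth/refresh.ts", score: 0.5 },
];
const narrowed = raptorCascade(allHits, [1, 0, 0], dirNodes);   // auth/ matches strongly
const unchanged = raptorCascade(allHits, [0, 0, 1], dirNodes);  // no directory clears 0.5
```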
Knowledge graph (import/dependency edges)
Average connectivity: 20.8 file→file edges per node across both TS and C# codebases. Measured ranking impact: ±0% MRR for 1-hop expansion. The graph doesn't move MRR because the semantic layer already finds the right files — the graph's neighbors are usually already in the top-15. Its value is structural: the analyze dependencies action and explicit graph search type give Claude traversable import chains, inheritance hierarchies, and dependency paths that embeddings alone cannot provide.
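The 1-hop expansion can be sketched as follows: take the top results, follow their outgoing edges, and score each neighbor at 0.7 times the source score. How CodeSeeker merges a neighbor that is already in the result list is not stated; keeping the higher of the two scores is an assumption of this sketch, as are the names and sample data.

```typescript
interface RankedFile { path: string; score: number }

// 1-hop graph expansion: neighbors of the top-N files enter the result set
// at sourceScore * neighborFactor. Hypothetical sketch.
function expandWithGraph(
  ranked: RankedFile[],
  edges: Map<string, string[]>, // file -> files it imports/calls/extends
  topN = 10,
  neighborFactor = 0.7,
): RankedFile[] {
  const scores = new Map<string, number>();
  for (const r of ranked) scores.set(r.path, r.score);

  const seeds = ranked.slice().sort((a, b) => b.score - a.score).slice(0, topN);
  for (const seed of seeds) {
    for (const neighbor of edges.get(seed.path) ?? []) {
      const candidate = seed.score * neighborFactor;
      // Assumption: an existing, higher score is never lowered.
      if (candidate > (scores.get(neighbor) ?? -Infinity)) {
        scores.set(neighbor, candidate);
      }
    }
  }
  return Array.from(scores.entries())
    .map(([path, score]) => ({ path, score }))
    .sort((a, b) => b.score - a.score);
}

const semanticHits: RankedFile[] = [
  { path: "auth/jwt.ts", score: 0.94 },
  { path: "auth/refresh.ts", score: 0.89 },
];
const importEdges = new Map<string, string[]>([
  ["auth/jwt.ts", ["auth/token-store.ts", "auth/refresh.ts"]],
  ["auth/refresh.ts", ["auth/token-store.ts"]],
]);
const expanded = expandWithGraph(semanticHits, importEdges);
```

In this toy run `auth/token-store.ts` is added below both semantic hits, which mirrors the measured result above: expansion adds structural neighbors without reordering the top of the list.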
Type boost / penalty scoring
Source files get +0.10 score boost; test files get −0.15 penalty; lock files and docs get −0.05 penalty. Without this, integration.test.ts would rank above dag-engine.ts for exact symbol queries because test files import and exercise every symbol in the source. The penalty corrects this without eliminating test files from results.
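The dedup and boost/penalty step can be sketched like this. The adjustment values come from the text; the file-type detection regexes, the `rescore` name, and the sample scores are assumptions, and the multi-chunk bonus is left out because its formula is not given here.

```typescript
interface Chunk { file: string; score: number }

// Assumed heuristics for classifying files; the real rules may differ.
const isTestFile = (file: string): boolean => /\.(test|spec)\.[a-z]+$/i.test(file);
const isDocOrLock = (file: string): boolean => /(\.md|\.lock|-lock\.json)$/i.test(file);

function rescore(chunks: Chunk[], query: string): Chunk[] {
  // Dedup: keep the highest-scoring chunk per file.
  const bestPerFile = new Map<string, number>();
  for (const c of chunks) {
    bestPerFile.set(c.file, Math.max(c.score, bestPerFile.get(c.file) ?? -Infinity));
  }

  const tokens = query.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
  const rescored: Chunk[] = [];
  bestPerFile.forEach((score, file) => {
    const name = (file.split("/").pop() ?? file).toLowerCase();
    let adjusted = score;
    if (isTestFile(file)) adjusted -= 0.15;       // prevent test dominance
    else if (isDocOrLock(file)) adjusted -= 0.05; // docs and lock files
    else adjusted += 0.10;                        // source files
    // Symbol boost: any query token appears in the filename.
    if (tokens.some((t) => name.indexOf(t) !== -1)) adjusted += 0.20;
    rescored.push({ file, score: adjusted });
  });
  return rescored.sort((a, b) => b.score - a.score);
}

// The test file starts ahead on raw score, as in the example above.
const rawChunks: Chunk[] = [
  { file: "tests/integration.test.ts", score: 0.80 },
  { file: "src/dag-engine.ts", score: 0.70 },
  { file: "src/dag-engine.ts", score: 0.55 },
];
const reranked = rescore(rawChunks, "dag engine");
```

After rescoring, `src/dag-engine.ts` moves above the test file, which stays in the results at a lower score.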
Monorepo directory exclusion fix
The single highest-impact change in v1.12.0: removing packages/ from the default exclusion list. For pnpm/yarn/lerna monorepos where all source lives under packages/, this exclusion was silently dropping all source files. Effect: 10% → 72% MRR on the Conclave monorepo benchmark.
| Query | Target | Issue | Root cause |
|---|---|---|---|
| cv-prompts | orchestrator.ts | rank 97+ even with 2-hop graph | prompt-builder.test.ts outscores prompt-builder.ts semantically; source file never enters top-10, so we can't graph-walk from it to orchestrator.ts. Test-file dominance on cross-file queries. |
| cv-exec-mode | types.ts | rank 11–12 | types.ts is a pure type-export file; low keyword density. Found within R@5 (rank ≤ 15). |
Reproduce with:
npm run build
node scripts/real-bench.js
Requires C:\workspace\claude\conclave and C:\workspace\ImperialCommander2 to be present locally (or update paths in scripts/real-bench.js).
CodeSeeker analyzes your codebase and extracts patterns:
{
"validation": {
"email": {
"preferred": "z.string().email()",
"usage_count": 12,
"files": ["src/auth.ts", "src/user.ts"]
}
},
"react-patterns": {
"state": {
"preferred": "useState<T>()",
"usage_count": 45
}
}
}
Detected pattern categories:
- validation: Zod, Yup, Joi, validator.js, custom regex
- error-handling: API error responses, try-catch patterns, custom Error classes
- logging: Console, Winston, Bunyan, structured logging
- testing: Jest/Vitest setup, assertion patterns
- react-patterns: Hooks (useState, useEffect, useMemo, useCallback, useRef)
- state-management: Redux Toolkit, Zustand, React Context, TanStack Query
- api-patterns: Fetch, Axios, Express routes, Next.js API routes
When Claude writes new code, it follows your existing conventions instead of inventing new ones.
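One way to picture how a "preferred" pattern and its usage count could be derived is a simple frequency count per category. This is a sketch under assumptions, not the detector itself: it presumes some earlier step has already produced `Observation` records (a hypothetical shape), and it picks the most frequently observed pattern.

```typescript
// Hypothetical input: one record per detected pattern occurrence.
interface Observation { category: string; pattern: string; file: string }
interface Standard { preferred: string; usage_count: number; files: string[] }

function detectStandards(observations: Observation[]): Map<string, Standard> {
  // category -> pattern -> files where it was seen (one entry per occurrence)
  const byCategory = new Map<string, Map<string, string[]>>();
  for (const o of observations) {
    const patterns = byCategory.get(o.category) ?? new Map<string, string[]>();
    const files = patterns.get(o.pattern) ?? [];
    files.push(o.file);
    patterns.set(o.pattern, files);
    byCategory.set(o.category, patterns);
  }

  const standards = new Map<string, Standard>();
  byCategory.forEach((patterns, category) => {
    let best: Standard | undefined;
    for (const [pattern, files] of Array.from(patterns.entries())) {
      if (best === undefined || files.length > best.usage_count) {
        best = {
          preferred: pattern,
          usage_count: files.length,
          files: Array.from(new Set(files)), // unique file list
        };
      }
    }
    if (best !== undefined) standards.set(category, best);
  });
  return standards;
}

const observed: Observation[] = [
  { category: "validation.email", pattern: "z.string().email()", file: "src/auth.ts" },
  { category: "validation.email", pattern: "z.string().email()", file: "src/user.ts" },
  { category: "validation.email", pattern: "z.string().email()", file: "src/user.ts" },
  { category: "validation.email", pattern: "custom regex", file: "src/legacy.ts" },
];
const detected = detectStandards(observed);
```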
If Claude notices files that shouldn't be indexed (like Unity's Library folder, build outputs, or generated files), it can dynamically exclude them:
// Exclude Unity Library folder and generated files
index({
action: "exclude",
project: "my-unity-game",
paths: ["Library/**", "Temp/**", "*.generated.cs"],
reason: "Unity build artifacts"
})
Exclusions are persisted in .codeseeker/exclusions.json and automatically respected during reindexing.
CodeSeeker helps you maintain a clean codebase by finding duplicate code and detecting dead code.
Ask Claude to find similar code blocks that could be consolidated:
"Find duplicate code in my project"
"Are there any similar functions that could be merged?"
"Show me copy-pasted code that should be refactored"
CodeSeeker uses vector similarity to find semantically similar code—not just exact matches. It detects:
- Copy-pasted functions with minor variations
- Similar validation logic across files
- Repeated patterns that could be extracted into utilities
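Similarity-based duplicate detection can be sketched as a pairwise cosine comparison of code-block embeddings. The 0.9 threshold, the `Block` shape, and the toy 3-dimensional vectors are assumptions; real embeddings are 384-dimensional, and a production version would avoid the quadratic all-pairs loop.

```typescript
interface Block { id: string; embedding: number[] }
interface SimilarPair { a: string; b: string; similarity: number }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Report every pair of blocks whose similarity meets the threshold.
function findSimilarPairs(blocks: Block[], threshold = 0.9): SimilarPair[] {
  const pairs: SimilarPair[] = [];
  for (let i = 0; i < blocks.length; i++) {
    for (let j = i + 1; j < blocks.length; j++) {
      const similarity = cosineSimilarity(blocks[i].embedding, blocks[j].embedding);
      if (similarity >= threshold) {
        pairs.push({ a: blocks[i].id, b: blocks[j].id, similarity });
      }
    }
  }
  return pairs.sort((x, y) => y.similarity - x.similarity);
}

// Toy embeddings: the two validators point in nearly the same direction.
const codeBlocks: Block[] = [
  { id: "auth.ts:validateEmail", embedding: [0.9, 0.1, 0.0] },
  { id: "user.ts:validateEmail", embedding: [0.85, 0.15, 0.0] },
  { id: "date.ts:formatDate", embedding: [0.0, 0.2, 0.9] },
];
const duplicatePairs = findSimilarPairs(codeBlocks);
```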
Ask Claude to identify unused code that can be safely removed:
"Find dead code in this project"
"What functions are never called?"
"Show me unused exports"
CodeSeeker analyzes the knowledge graph to find:
- Exported functions/classes that are never imported
- Internal functions with no callers
- Orphaned files with no incoming dependencies
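The orphaned-file check in the last bullet reduces to finding nodes with no incoming import edge. The sketch below assumes a hypothetical importer-to-imported edge map and an explicit list of entry points (which legitimately have no importers); it checks direct edges only, not full reachability.

```typescript
// Files that nothing imports and that are not declared entry points.
// Hypothetical sketch over an importer -> imported-files edge map.
function findOrphanedFiles(
  allFiles: string[],
  imports: Map<string, string[]>,
  entryPoints: string[],
): string[] {
  const imported = new Set<string>();
  imports.forEach((targets) => targets.forEach((t) => imported.add(t)));
  return allFiles.filter((f) => !imported.has(f) && entryPoints.indexOf(f) === -1);
}

const projectFiles = ["src/index.ts", "src/auth.ts", "src/utils/legacy-helper.ts"];
const importGraph = new Map<string, string[]>([
  ["src/index.ts", ["src/auth.ts"]],
]);
const orphans = findOrphanedFiles(projectFiles, importGraph, ["src/index.ts"]);
```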
Example workflow:
User: "Use CodeSeeker to clean up this project"
Claude: I'll analyze your codebase for cleanup opportunities.
Found 3 duplicate code blocks:
- validateEmail() in auth.ts and user.ts (92% similar)
- formatDate() appears in 4 files with minor variations
- Error handling pattern repeated in api/*.ts
Found 2 dead code files:
- src/utils/legacy-helper.ts (0 imports)
- src/services/unused-service.ts (exported but never imported)
Would you like me to:
1. Consolidate the duplicate validators into a shared utility?
2. Remove the dead code files?
| Language | Parser | Relationship Extraction |
|---|---|---|
| TypeScript/JavaScript | Babel AST | Excellent |
| Python | Tree-sitter | Excellent |
| Java | Tree-sitter | Excellent |
| C# | Regex | Good |
| Go | Regex | Good |
| Rust, C/C++, Ruby, PHP | Regex | Basic |
Tree-sitter parsers install automatically when needed.
The plugin installs hooks that automatically update the index:
| Event | What Happens |
|---|---|
| Claude edits a file | Index updated automatically |
| Claude runs git pull/checkout/merge | Full reindex triggered |
| You run /codeseeker:reindex | Manual full reindex |
You don't need to do anything—the plugin handles sync automatically.
- Claude-initiated changes: Claude can call the index({action: "sync"}) tool
- Manual changes: not automatically detected; ask Claude to reindex periodically
| Setup | Claude Edits | Git Operations | Manual Edits |
|---|---|---|---|
| Plugin (Claude Code) | Auto | Auto | Manual |
| MCP (Cursor, Desktop) | Ask Claude | Ask Claude | Ask Claude |
| CLI | Auto | Auto | Manual |
Good fit:
- Large codebases (10K+ files) where Claude struggles to find relevant code
- Projects with established patterns you want Claude to follow
- Complex dependency chains across multiple files
- Teams wanting consistent AI-generated code
Less useful:
- Greenfield projects with little existing code
- Single-file scripts
- Projects where you're actively changing architecture
┌──────────────────────────────────────────────────────────┐
│ Claude Code │
│ │ │
│ MCP Protocol │
│ │ │
│ ┌──────────────────────▼──────────────────────────┐ │
│ │ CodeSeeker MCP Server │ │
│ │ ┌─────────────┬─────────────┬────────────────┐ │ │
│ │ │ Vector │ Knowledge │ Coding │ │ │
│ │ │ Search │ Graph │ Standards │ │ │
│ │ │ (SQLite) │ (SQLite) │ (JSON) │ │ │
│ │ └─────────────┴─────────────┴────────────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
All data stored locally in .codeseeker/. No external services required.
For large teams (100K+ files, shared indexes), server mode supports PostgreSQL + Neo4j. See Storage Documentation.
For the complete technical internals — exact scoring formulas, MCP tool schema, graph edge types, RAPTOR threshold logic, pipeline stages, analysis confidence tiers — see the Technical Architecture Manual.
- Verify npm and npx work: npx -y codeseeker --version
- Check MCP config file syntax (valid JSON, no trailing commas)
- Restart your editor/Claude application completely
- Check that Node.js is installed: node --version (need v18+)
First-time indexing of large projects (50K+ files) can take 5+ minutes. Subsequent uses are instant.
- Ask Claude: "What CodeSeeker tools do you have?"
- If no tools appear, check MCP config file exists and has correct syntax
- Restart your IDE completely (not just reload window)
- Check Claude/Copilot MCP connection status in IDE
Open an issue: GitHub Issues
- Integration Guide - How all components connect
- Architecture - Technical deep dive
- CLI Commands - Full command reference
| Client | MCP Support | Config |
|---|---|---|
| Claude Code (VS Code) | ✅ | .vscode/mcp.json or plugin |
| GitHub Copilot (VS Code 1.99+) | ✅ | .vscode/mcp.json |
| Cursor | ✅ | .cursor/mcp.json |
| Windsurf | ✅ | .windsurf/mcp.json |
| Claude Desktop | ✅ | claude_desktop_config.json |
| Visual Studio | ✅ | codeseeker install --vs |
Claude Code and GitHub Copilot share the same .vscode/mcp.json: configure once, works for both.
If CodeSeeker is useful to you, consider sponsoring the project.
MIT License. See LICENSE.
CodeSeeker gives Claude the code understanding that grep and embeddings alone can't provide.