
CodeSeeker

Four-layer hybrid search and knowledge graph for AI coding assistants.
BM25 + vector embeddings + RAPTOR directory summaries + graph expansion — fused into a single MCP tool that gives Claude, Copilot, and Cursor a real understanding of your codebase.


Works with Claude Code, GitHub Copilot (VS Code 1.99+), Cursor, Windsurf, and Claude Desktop.
Zero configuration — indexes on first use, stays in sync automatically.

The Problem

AI assistants are powerful editors, but they navigate code like a tourist:

  • Grep finds text — not meaning. "find authentication logic" returns every file containing the word "auth"
  • File reads are isolated — Claude sees a file but not its dependencies, callers, or the patterns your team established
  • No memory of your project — every session starts from scratch

CodeSeeker fixes this. It indexes your codebase once and gives AI assistants a queryable knowledge graph they can use on every turn.

How It Works

A 4-stage pipeline runs on every query:

Query: "find JWT refresh token logic"
        │
        ▼  Stage 1 — Hybrid retrieval
   ┌─────────────────────────────────────────────────────┐
   │ BM25 (exact symbols, camelCase tokenized)           │
   │   +                                                 │
   │ Vector search (384-dim Xenova embeddings)           │
   │   ↓                                                 │
   │ Reciprocal Rank Fusion: score = Σ 1/(60 + rank_i)  │
   │ Top-30 results, including RAPTOR directory nodes    │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 2 — RAPTOR cascade (conditional)
   ┌─────────────────────────────────────────────────────┐
   │ IF best directory-summary score ≥ 0.5:              │
   │   → narrow results to that directory automatically  │
   │ ELSE: all 30 results pass through unchanged         │
   │ Effect: "what does auth/ do?" scopes to auth/       │
   │         "jwt.ts decode function" bypasses this      │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 3 — Scoring and deduplication
   ┌─────────────────────────────────────────────────────┐
   │ Dedup: keep highest-score chunk per file            │
   │ Source files:  +0.10  (definition sites matter)     │
   │ Test files:    −0.15  (prevent test dominance)      │
   │ Symbol boost:  +0.20  (query token in filename)     │
   │ Multi-chunk:   up to +0.30  (file has many hits)    │
   └─────────────────────────────────────────────────────┘
        │
        ▼  Stage 4 — Graph expansion
   ┌─────────────────────────────────────────────────────┐
   │ Top-10 results → follow IMPORTS/CALLS/EXTENDS edges │
   │ Structural neighbors scored at source × 0.7        │
   │ Avg graph connectivity: 20.8 edges/node             │
   └─────────────────────────────────────────────────────┘
        │
        ▼
   auth/jwt.ts (0.94), auth/refresh.ts (0.89), ...

The knowledge graph is built from AST-parsed imports at index time. It's what powers analyze dependencies, dead-code detection, and graph expansion in every search.
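The Stage 1 fusion step is small enough to sketch directly. The following is a simplified illustration of Reciprocal Rank Fusion with the k = 60 constant from the diagram; the function name and inputs are hypothetical, not CodeSeeker's actual API:

```typescript
// Reciprocal Rank Fusion: combine ranked lists without manual weight tuning.
// Each item's fused score is the sum of 1/(k + rank) over every list that
// contains it, with k = 60 as shown in the pipeline diagram.
function reciprocalRankFusion(
  rankings: string[][], // e.g. [bm25Results, vectorResults], best first
  k = 60,
): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // 1-based rank
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// A file ranked #1 by BM25 and #3 by vector search outscores a file
// that appears at #2 in only a single list:
const fused = reciprocalRankFusion([
  ["jwt.ts", "refresh.ts", "auth.ts"], // BM25 order
  ["auth.ts", "session.ts", "jwt.ts"], // vector order
]);
```

Because RRF works on ranks rather than raw scores, the BM25 and embedding signals combine without normalizing their incompatible score scales.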

What Makes It Different

| Approach | Strengths | Limitations |
|---|---|---|
| Grep / ripgrep | Fast, universal | No semantic understanding |
| Vector search only | Finds similar code | Misses structural relationships |
| Serena | Precise LSP symbol navigation, 30+ languages | No semantic search, no cross-file reasoning |
| Codanna | Fast symbol lookup, good call graphs | Semantic search needs JSDoc (undocumented code gets no embeddings); no BM25, no RAPTOR; Windows experimental |
| CodeSeeker | BM25 + embedding fusion + RAPTOR + graph + coding standards + multi-language AST | Requires initial indexing (30s–5min) |

What LSP tools can't do:

  • "Find code that handles errors like this" → semantic pattern search
  • "What validation approach does this project use?" → auto-detected coding standards
  • "Show me everything related to authentication" → graph traversal across indirect dependencies

What vector-only search misses:

  • Direct import/export chains
  • Class inheritance hierarchies
  • Which files actually depend on which

Installation

Recommended: npx (no install needed)

The standard way to configure any MCP server — no global install required:

```json
{
  "mcpServers": {
    "codeseeker": {
      "command": "npx",
      "args": ["-y", "codeseeker", "serve", "--mcp"]
    }
  }
}
```

Add this to your MCP config file (see below for per-client locations) and restart your editor.

npm global install

```bash
npm install -g codeseeker
codeseeker install --vscode      # or --cursor, --windsurf
```

🔌 Claude Code Plugin

For Claude Code CLI users — adds auto-sync hooks and slash commands:

/plugin install codeseeker@github:jghiringhelli/codeseeker#plugin

Slash commands: /codeseeker:init, /codeseeker:reindex

☁️ Devcontainers / GitHub Codespaces

```json
{
  "name": "My Project",
  "image": "mcr.microsoft.com/devcontainers/javascript-node:18",
  "postCreateCommand": "npm install -g codeseeker && codeseeker install --vscode"
}
```

✅ Verify

Ask your AI assistant: "What CodeSeeker tools do you have?"

You should see: search, analyze, index — CodeSeeker's three tools.

Advanced Installation Options

📋 MCP Configuration by client

The MCP config JSON is the same for all clients — only the file location differs:

| Client | Config file |
|---|---|
| VS Code (Claude Code / Copilot) | `.vscode/mcp.json` in your project, or `~/.vscode/mcp.json` globally |
| Cursor | `.cursor/mcp.json` in your project |
| Claude Desktop | `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows) |
| Windsurf | `.windsurf/mcp.json` in your project |

```json
{
  "mcpServers": {
    "codeseeker": {
      "command": "npx",
      "args": ["-y", "codeseeker", "serve", "--mcp"]
    }
  }
}
```
🖥️ CLI Standalone Usage (without AI assistant)
```bash
npm install -g codeseeker
cd your-project
codeseeker init
codeseeker -c "how does authentication work in this project?"
```

What You Get

Once configured, Claude has access to these MCP tools (used automatically):

| Tool | Action / Usage | What it does |
|---|---|---|
| `search` | `{query}` | Hybrid search: vector + BM25 text + path-match, fused with RRF; RAPTOR directory summaries surface for abstract queries |
| `search` | `{query, search_type: "graph"}` | Hybrid search + Graph RAG: follows import/call/extends edges to surface structurally connected files |
| `search` | `{query, search_type: "vector"}` | Pure embedding cosine-similarity search (no BM25 or path scoring) |
| `search` | `{query, search_type: "fts"}` | Pure BM25 text search with camelCase tokenization and synonym expansion |
| `search` | `{query, read: true}` | Search and read file contents in one step |
| `search` | `{filepath}` | Read a file with its related code automatically included |
| `analyze` | `{action: "dependencies", filepath}` | Traverse the knowledge graph (imports, calls, extends) |
| `analyze` | `{action: "standards"}` | Report your project's detected patterns (validation, error handling) |
| `analyze` | `{action: "duplicates"}` | Find duplicate/similar code blocks across your codebase |
| `analyze` | `{action: "dead_code"}` | Detect unused exports, functions, and classes |
| `index` | `{action: "init", path}` | Manually trigger indexing (rarely needed) |
| `index` | `{action: "sync", changes}` | Update the index for specific files |
| `index` | `{action: "exclude", paths}` | Dynamically exclude/include files from the index |
| `index` | `{action: "status"}` | List indexed projects with file/chunk counts |

You don't invoke these manually—Claude uses them automatically when searching code or analyzing relationships.

How Indexing Works

You don't need to manually index. When Claude uses any CodeSeeker tool, the tool automatically checks if the project is indexed. If not, it indexes on first use.

User: "Find the authentication logic"
        │
        ▼
┌─────────────────────────────────────┐
│ Claude calls search({query: ...})  │
│         │                           │
│         ▼                           │
│ Project indexed? ──No──► Index now  │
│         │                  (auto)   │
│        Yes                   │      │
│         │◀───────────────────┘      │
│         ▼                           │
│ Return search results               │
└─────────────────────────────────────┘

First search on a new project takes 30 seconds to several minutes (depending on size). Subsequent searches are instant.


Search Quality Research

📊 Component ablation study (v2.0.0) — measured impact of each retrieval layer

Setup

18 hand-labelled queries across two real-world codebases:

| Corpus | Language | Files | Queries | Query types |
|---|---|---|---|---|
| Conclave | TypeScript (pnpm monorepo) | 201 | 10 | Symbol lookup, cross-file chains, out-of-scope |
| ImperialCommander2 | C# / Unity | 199 | 8 | Class lookup, controller wiring, file I/O |

Each query has one or more mustFind targets (exact file basenames) and optional mustNotFind targets (scope leak check). Queries were run on a real index built from source — real Xenova embeddings, real graph, real RAPTOR L2 nodes — to reflect production conditions.

Metrics: MRR (Mean Reciprocal Rank), P@1 (Precision at 1), R@5 (Recall at 5), F1@3.
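As a concrete illustration of how two of these metrics are computed (a sketch, not the actual benchmark script; the `QueryResult` shape is invented for this example):

```typescript
// MRR: mean over queries of 1/rank of the first relevant result.
// R@K: fraction of queries with a relevant file in the top K.
type QueryResult = { ranked: string[]; mustFind: string[] };

function mrr(results: QueryResult[]): number {
  const sum = results.reduce((acc, r) => {
    const idx = r.ranked.findIndex((f) => r.mustFind.includes(f));
    return acc + (idx === -1 ? 0 : 1 / (idx + 1)); // 0 if never found
  }, 0);
  return sum / results.length;
}

function recallAtK(results: QueryResult[], k: number): number {
  const hits = results.filter((r) =>
    r.ranked.slice(0, k).some((f) => r.mustFind.includes(f)),
  ).length;
  return hits / results.length;
}

const sample: QueryResult[] = [
  { ranked: ["jwt.ts", "auth.ts"], mustFind: ["jwt.ts"] },            // hit at rank 1
  { ranked: ["a.ts", "b.ts", "target.ts"], mustFind: ["target.ts"] }, // hit at rank 3
];
// mrr(sample) = (1 + 1/3) / 2 ≈ 0.667
```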

Ablation results

| Configuration | MRR | P@1 | P@3 | R@5 | F1@3 | Notes |
|---|---|---|---|---|---|---|
| Hybrid baseline (BM25 + embed + RAPTOR, no graph) | 75.2% | 61.1% | 29.6% | 91.7% | 44.4% | Production default |
| + graph 1-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | ±0% ranking, adds structural neighbors |
| + graph 2-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | Scope leaks on unrelated queries |
| No RAPTOR (graph 1-hop) | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | RAPTOR contributes +0.3% |

What each layer actually does

BM25 + embedding fusion (RRF)
The workhorse. Handles ~94% of ranking quality on its own. BM25 catches exact symbol names and camelCase tokens; vector embeddings catch semantic similarity when names differ. Fused with Reciprocal Rank Fusion to combine both signals without manual weight tuning.

RAPTOR (hierarchical directory summaries)
Generates per-directory embedding nodes by mean-pooling all file embeddings in a folder. Acts as a post-filter: when a directory summary scores ≥ 0.5 against the query, results are narrowed to that directory's files. Measured contribution: +0.3% MRR on symbol queries. Fires conservatively — only when the directory is an obvious match. Its real value is on abstract queries ("what does the payments module do?") which don't appear in this benchmark; for those queries it prevents broad scattering across the entire codebase.

Knowledge graph (import/dependency edges)
Average connectivity: 20.8 file→file edges per node across both TS and C# codebases. Measured ranking impact: ±0% MRR for 1-hop expansion. The graph doesn't move MRR because the semantic layer already finds the right files — the graph's neighbors are usually already in the top-15. Its value is structural: the analyze dependencies action and explicit graph search type give Claude traversable import chains, inheritance hierarchies, and dependency paths that embeddings alone cannot provide.

Type boost / penalty scoring
Source files get +0.10 score boost; test files get −0.15 penalty; lock files and docs get −0.05 penalty. Without this, integration.test.ts would rank above dag-engine.ts for exact symbol queries because test files import and exercise every symbol in the source. The penalty corrects this without eliminating test files from results.

Monorepo directory exclusion fix
The single highest-impact change in v1.12.0: removing packages/ from the default exclusion list. For pnpm/yarn/lerna monorepos where all source lives under packages/, this exclusion was silently dropping all source files. Effect: 10% → 72% MRR on the Conclave monorepo benchmark.

Known limitations

| Query | Target | Issue | Root cause |
|---|---|---|---|
| cv-prompts | `orchestrator.ts` | Rank 97+ even with 2-hop graph | `prompt-builder.test.ts` outscores `prompt-builder.ts` semantically; the source file never enters the top 10, so we can't graph-walk from it to `orchestrator.ts`. Test-file dominance on cross-file queries. |
| cv-exec-mode | `types.ts` | Rank 11–12 | `types.ts` is a pure type-export file with low keyword density. Found within R@5 (rank ≤ 15). |

Benchmark script

Reproduce with:

```bash
npm run build
node scripts/real-bench.js
```

Requires C:\workspace\claude\conclave and C:\workspace\ImperialCommander2 to be present locally (or update paths in scripts/real-bench.js).

Auto-Detected Coding Standards

CodeSeeker analyzes your codebase and extracts patterns:

```json
{
  "validation": {
    "email": {
      "preferred": "z.string().email()",
      "usage_count": 12,
      "files": ["src/auth.ts", "src/user.ts"]
    }
  },
  "react-patterns": {
    "state": {
      "preferred": "useState<T>()",
      "usage_count": 45
    }
  }
}
```

Detected pattern categories:

  • validation: Zod, Yup, Joi, validator.js, custom regex
  • error-handling: API error responses, try-catch patterns, custom Error classes
  • logging: Console, Winston, Bunyan, structured logging
  • testing: Jest/Vitest setup, assertion patterns
  • react-patterns: Hooks (useState, useEffect, useMemo, useCallback, useRef)
  • state-management: Redux Toolkit, Zustand, React Context, TanStack Query
  • api-patterns: Fetch, Axios, Express routes, Next.js API routes

When Claude writes new code, it follows your existing conventions instead of inventing new ones.

Managing Index Exclusions

If Claude notices files that shouldn't be indexed (like Unity's Library folder, build outputs, or generated files), it can dynamically exclude them:

```typescript
// Exclude Unity Library folder and generated files
index({
  action: "exclude",
  project: "my-unity-game",
  paths: ["Library/**", "Temp/**", "*.generated.cs"],
  reason: "Unity build artifacts"
})
```

Exclusions are persisted in .codeseeker/exclusions.json and automatically respected during reindexing.

Code Cleanup Tools

CodeSeeker helps you maintain a clean codebase by finding duplicate code and detecting dead code.

Finding Duplicate Code

Ask Claude to find similar code blocks that could be consolidated:

"Find duplicate code in my project"
"Are there any similar functions that could be merged?"
"Show me copy-pasted code that should be refactored"

CodeSeeker uses vector similarity to find semantically similar code—not just exact matches. It detects:

  • Copy-pasted functions with minor variations
  • Similar validation logic across files
  • Repeated patterns that could be extracted into utilities
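The pairwise similarity scan behind this can be sketched as follows (an illustration with invented names; it assumes unit-normalized embeddings, so the dot product equals cosine similarity, and the 0.9 threshold is only an example):

```typescript
// Report every pair of chunks whose embeddings are nearly identical.
type Chunk = { file: string; embedding: number[] };

function findDuplicates(chunks: Chunk[], threshold = 0.9) {
  const pairs: { a: string; b: string; similarity: number }[] = [];
  for (let i = 0; i < chunks.length; i++) {
    for (let j = i + 1; j < chunks.length; j++) {
      // dot product of (assumed) unit vectors = cosine similarity
      const sim = chunks[i].embedding.reduce(
        (dot, v, k) => dot + v * chunks[j].embedding[k], 0);
      if (sim >= threshold) {
        pairs.push({ a: chunks[i].file, b: chunks[j].file, similarity: sim });
      }
    }
  }
  return pairs;
}

// Two near-identical validateEmail() chunks embed almost identically:
const dupes = findDuplicates([
  { file: "auth.ts", embedding: [0.6, 0.8] },
  { file: "user.ts", embedding: [0.62, 0.785] }, // slight variation
  { file: "math.ts", embedding: [1, 0] },        // unrelated code
]);
```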

Finding Dead Code

Ask Claude to identify unused code that can be safely removed:

"Find dead code in this project"
"What functions are never called?"
"Show me unused exports"

CodeSeeker analyzes the knowledge graph to find:

  • Exported functions/classes that are never imported
  • Internal functions with no callers
  • Orphaned files with no incoming dependencies
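For exported-but-never-imported symbols, the graph check reduces to a set difference. A sketch under simplified assumptions (named imports only; the real analysis also tracks call edges within files):

```typescript
// An export is dead if no other file's import edges reference it.
type FileNode = {
  path: string;
  exports: string[];
  imports: { from: string; names: string[] }[];
};

function findDeadExports(files: FileNode[]): { file: string; name: string }[] {
  // Collect every imported (file, symbol) pair across the project.
  const used = new Set<string>();
  for (const f of files) {
    for (const imp of f.imports) {
      for (const name of imp.names) used.add(`${imp.from}#${name}`);
    }
  }
  // Any export with no matching import is a dead-code candidate.
  const dead: { file: string; name: string }[] = [];
  for (const f of files) {
    for (const name of f.exports) {
      if (!used.has(`${f.path}#${name}`)) dead.push({ file: f.path, name });
    }
  }
  return dead;
}

const dead = findDeadExports([
  { path: "utils.ts", exports: ["formatDate", "legacyHelper"], imports: [] },
  { path: "app.ts", exports: [], imports: [{ from: "utils.ts", names: ["formatDate"] }] },
]);
// legacyHelper is exported but never imported anywhere
```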

Example workflow:

User: "Use CodeSeeker to clean up this project"

Claude: I'll analyze your codebase for cleanup opportunities.

Found 3 duplicate code blocks:
- validateEmail() in auth.ts and user.ts (92% similar)
- formatDate() appears in 4 files with minor variations
- Error handling pattern repeated in api/*.ts

Found 2 dead code files:
- src/utils/legacy-helper.ts (0 imports)
- src/services/unused-service.ts (exported but never imported)

Would you like me to:
1. Consolidate the duplicate validators into a shared utility?
2. Remove the dead code files?

Language Support

| Language | Parser | Relationship extraction |
|---|---|---|
| TypeScript/JavaScript | Babel AST | Excellent |
| Python | Tree-sitter | Excellent |
| Java | Tree-sitter | Excellent |
| C# | Regex | Good |
| Go | Regex | Good |
| Rust, C/C++, Ruby, PHP | Regex | Basic |

Tree-sitter parsers install automatically when needed.

Keeping the Index in Sync

With Claude Code Plugin

The plugin installs hooks that automatically update the index:

| Event | What happens |
|---|---|
| Claude edits a file | Index updated automatically |
| Claude runs `git pull`/`checkout`/`merge` | Full reindex triggered |
| You run `/codeseeker:reindex` | Manual full reindex |

You don't need to do anything—the plugin handles sync automatically.

With MCP Server Only (Cursor, Claude Desktop)

  • Claude-initiated changes: Claude can call index({action: "sync"}) tool
  • Manual changes: Not automatically detected—ask Claude to reindex periodically

Sync Summary

| Setup | Claude edits | Git operations | Manual edits |
|---|---|---|---|
| Plugin (Claude Code) | Auto | Auto | Manual |
| MCP (Cursor, Desktop) | Ask Claude | Ask Claude | Ask Claude |
| CLI | Auto | Auto | Manual |

When CodeSeeker Helps Most

Good fit:

  • Large codebases (10K+ files) where Claude struggles to find relevant code
  • Projects with established patterns you want Claude to follow
  • Complex dependency chains across multiple files
  • Teams wanting consistent AI-generated code

Less useful:

  • Greenfield projects with little existing code
  • Single-file scripts
  • Projects where you're actively changing architecture

Architecture

┌──────────────────────────────────────────────────────────┐
│                     Claude Code                          │
│                         │                                │
│                    MCP Protocol                          │
│                         │                                │
│  ┌──────────────────────▼──────────────────────────┐    │
│  │              CodeSeeker MCP Server               │    │
│  │  ┌─────────────┬─────────────┬────────────────┐ │    │
│  │  │   Vector    │  Knowledge  │    Coding      │ │    │
│  │  │   Search    │    Graph    │   Standards    │ │    │
│  │  │  (SQLite)   │  (SQLite)   │   (JSON)       │ │    │
│  │  └─────────────┴─────────────┴────────────────┘ │    │
│  └─────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘

All data stored locally in .codeseeker/. No external services required.

For large teams (100K+ files, shared indexes), server mode supports PostgreSQL + Neo4j. See Storage Documentation.

For the complete technical internals — exact scoring formulas, MCP tool schema, graph edge types, RAPTOR threshold logic, pipeline stages, analysis confidence tiers — see the Technical Architecture Manual.

Troubleshooting

MCP server not connecting

  1. Verify npm and npx work: npx -y codeseeker --version
  2. Check MCP config file syntax (valid JSON, no trailing commas)
  3. Restart your editor/Claude application completely
  4. Check that Node.js is installed: node --version (need v18+)

Indexing seems slow

First-time indexing of large projects (50K+ files) can take 5+ minutes. Subsequent uses are instant.

Tools not appearing in Claude

  1. Ask Claude: "What CodeSeeker tools do you have?"
  2. If no tools appear, check MCP config file exists and has correct syntax
  3. Restart your IDE completely (not just reload window)
  4. Check Claude/Copilot MCP connection status in IDE

Still stuck?

Open an issue: GitHub Issues

Documentation

Supported Platforms

| Client | Config |
|---|---|
| Claude Code (VS Code) | `.vscode/mcp.json` or plugin |
| GitHub Copilot (VS Code 1.99+) | `.vscode/mcp.json` |
| Cursor | `.cursor/mcp.json` |
| Windsurf | `.windsurf/mcp.json` |
| Claude Desktop | `claude_desktop_config.json` |
| Visual Studio | `codeseeker install --vs` |

Claude Code and GitHub Copilot share the same .vscode/mcp.json — configure once, works for both.

Support

If CodeSeeker is useful to you, consider sponsoring the project.

License

MIT License. See LICENSE.


CodeSeeker gives Claude the code understanding that grep and embeddings alone can't provide.
