CodeRAG runs fully locally with no API key out of the box. Everything below is
optional — reach for it only when you want to change a default, scale up, or turn on the
optional LLM answer. Settings come from CODERAG_* environment variables (or a .env
file — copy example.env and edit). A few CLI flags
(--watched-dir, --store-dir, --provider, --model) override the matching variable
for a single command.
CodeRAG uses models in two independent places. Keeping them straight removes most of the confusion about "OpenAI or Anthropic":
| Role | What it does | Default | Required? |
|---|---|---|---|
| Embedding model | Turns code + your query into vectors — this is the search. | Local ONNX bge-small via fastembed. |
Always used. Local, no key. |
| Answer LLM | Optional. Writes a grounded, cited prose answer from the retrieved chunks (coderag search --answer, the UI's answer box). |
Off. | Only if you want generated answers. |
Search works with only the embedding model. The answer LLM is a separate, optional layer — and it, too, can be a local model. You never need a cloud account to use CodeRAG.
Local embeddings, no API key, no answer LLM. Just index and search:
pip install -e .
coderag index --watched-dir /path/to/repo
coderag search "where is retry/backoff handled" --watched-dir /path/to/repoNothing to configure. Want a stronger (still local, still keyless) embedding model? See Local embedding models below.
Add a --answer that's written by a local LLM. Any OpenAI-compatible server works —
you point CodeRAG at it with OPENAI_BASE_URL and name the model with
CODERAG_CHAT_MODEL. The answer backend stays openai (it means "the OpenAI protocol",
not the OpenAI company); no API key is needed for a local server.
# 1. Run a local model server, e.g. Ollama:
ollama serve
ollama pull llama3.1
# 2. Point CodeRAG at it (in your shell or .env):
export OPENAI_BASE_URL=http://localhost:11434/v1 # Ollama's OpenAI-compatible endpoint
export CODERAG_CHAT_MODEL=llama3.1 # the model name your server serves
# 3. Search with a locally-generated answer:
coderag search "how is the vector index persisted" --answerOther local servers expose the same OpenAI-compatible API — only the base URL and model name change:
| Server | Typical OPENAI_BASE_URL |
Notes |
|---|---|---|
| Ollama | http://localhost:11434/v1 |
ollama pull <model>; key ignored. |
| LM Studio | http://localhost:1234/v1 |
Start the local server from the app. |
| vLLM | http://localhost:8000/v1 |
vllm serve <model>; OpenAI-compatible. |
| LocalAI | http://localhost:8080/v1 |
Drop-in OpenAI replacement. |
A local server usually ignores the API key, so you can leave
OPENAI_API_KEYunset — CodeRAG sends a harmless placeholder. Set it only if your server enforces one.
pip install -e ".[openai]"
export OPENAI_API_KEY=sk-...
# Cloud embeddings (optional — the local default is fine for most repos):
coderag index --provider openai # uses CODERAG_OPENAI_MODEL (text-embedding-3-small)
# Cloud answer:
coderag search "where is auth handled" --answer # uses CODERAG_CHAT_MODEL (gpt-4o-mini)The answer LLM can be Claude instead of an OpenAI-compatible model. (Embeddings still come from the local default or OpenAI — Anthropic is answer-only here.)
pip install -e ".[anthropic]"
export CODERAG_LLM_PROVIDER=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
coderag search "where is auth handled" --answer # uses CODERAG_ANTHROPIC_MODELBeyond the built-in fastembed models, you can serve embeddings from a local
OpenAI-compatible endpoint (e.g. Ollama embedding models,
text-embeddings-inference). Use the openai provider pointed at your local URL:
export OPENAI_BASE_URL=http://localhost:11434/v1
export CODERAG_OPENAI_MODEL=nomic-embed-text
coderag index --provider openaiChanging the embedding model (its dimension) triggers a one-time index rebuild — that's expected and safe (the LanceDB store is rebuildable — re-indexing recreates it from source).
The default BAAI/bge-small-en-v1.5 is the smallest/fastest; larger local models score
better on code retrieval. List the curated, keyless options:
coderag eval --list-modelsThen select one (downloaded once, cached under CODERAG_CACHE_DIR):
export CODERAG_MODEL=jinaai/jina-embeddings-v2-base-code # best out-of-the-box local code retriever
coderag index --fullWant sharper top results without a cloud call? Turn on the optional local cross-encoder
reranker (no key, ~30 ms/query): export CODERAG_RERANK=1.
Grouped by area. Default is what you get with the variable unset; everything here is optional.
| Variable | Default | Meaning |
|---|---|---|
CODERAG_PROVIDER |
fastembed |
Embedding backend: fastembed (local) · openai (OpenAI API or any OpenAI-compatible/local server) · fake (deterministic, tests). |
CODERAG_MODEL |
BAAI/bge-small-en-v1.5 |
Local fastembed model id (coderag eval --list-models). |
CODERAG_OPENAI_MODEL |
text-embedding-3-small |
Embedding model when provider=openai (cloud or local server). |
CODERAG_CACHE_DIR |
~/.cache/coderag |
Where downloaded local models are cached. |
CODERAG_EMBED_BATCH |
64 |
Texts embedded per batch during indexing. |
| Variable | Default | Meaning |
|---|---|---|
CODERAG_LLM_PROVIDER |
openai |
Answer backend: openai (OpenAI API or any OpenAI-compatible/local server) · anthropic. |
CODERAG_CHAT_MODEL |
gpt-4o-mini |
Chat model for the openai backend — set to your local model name (e.g. llama3.1) when using OPENAI_BASE_URL. |
OPENAI_API_KEY |
– | OpenAI key. Needed for the OpenAI cloud (embeddings or answers); optional for a local server. |
OPENAI_BASE_URL |
– | Point the OpenAI client at a self-hosted/local server (Ollama, vLLM, LM Studio, LocalAI). Enables local embeddings and local answers. |
ANTHROPIC_API_KEY |
– | Anthropic (Claude) key — required when CODERAG_LLM_PROVIDER=anthropic. |
CODERAG_ANTHROPIC_MODEL |
claude-opus-4-8 |
Claude model for the anthropic answer backend. |
CODERAG_ANSWER_MAX_TOKENS |
1024 |
Max tokens generated per answer. |
| Variable | Default | Meaning |
|---|---|---|
CODERAG_WATCHED_DIR |
cwd | Codebase to index/search. |
CODERAG_STORE_DIR |
<watched-dir>/.coderag |
Where the LanceDB store lives. Derived from the watched dir (not the cwd) so the index is found no matter where you run coderag from. |
CODERAG_INDEX_ALL_TEXT |
false |
Index any UTF-8 text file (docs/config/extensionless), not just code. Binary files are always skipped. |
| Variable | Default | Meaning |
|---|---|---|
CODERAG_TOP_K |
8 |
Results returned. |
CODERAG_FETCH_K |
50 |
Candidates pulled from each retriever before fusion. |
CODERAG_RRF_K |
60 |
Reciprocal Rank Fusion constant. |
CODERAG_DENSE_WEIGHT |
1.0 |
Weight of dense (vector) results in fusion. |
CODERAG_LEXICAL_WEIGHT |
1.0 |
Weight of BM25 (keyword) results in fusion. |
CODERAG_RERANK |
false |
Enable the optional local cross-encoder reranker (no key; ~30 ms/query). |
CODERAG_RERANK_MODEL |
Xenova/ms-marco-MiniLM-L-12-v2 |
Local reranker model (coderag eval --list-models). |
CODERAG_RERANK_CANDIDATES |
50 |
Fused hits reranked before trimming to top_k. |
CODERAG_ADAPTIVE_FUSION |
false |
Tilt fusion weights by query type (dense-up for NL, BM25-up for code). |
CODERAG_NL_DENSE_WEIGHT / CODERAG_NL_LEXICAL_WEIGHT |
1.0 / 0.4 |
Adaptive weights for natural-language queries. |
CODERAG_CODE_DENSE_WEIGHT / CODERAG_CODE_LEXICAL_WEIGHT |
1.0 / 1.0 |
Adaptive weights for identifier/code queries. |
CODERAG_GRAPH_EXPANSION |
false |
Enrich the pool with 1-hop call-graph neighbors of the top hits — the definitions of what each hit calls (its callees). No key, no schema change. Small, consistent symbol-level lift across flask/requests/click. |
CODERAG_GRAPH_SEEDS |
5 |
Top fused hits to expand callees from. |
CODERAG_GRAPH_NEIGHBORS |
5 |
Max new callee definitions pulled per seed. |
CODERAG_GRAPH_WEIGHT |
0.15 |
Fusion weight of the neighbor list (down-weighted). 0.15 was a strict Pareto improvement across the eval repos; higher trades rank precision for marginal recall. |
| Variable | Default | Meaning |
|---|---|---|
CODERAG_WORKERS |
4 |
Worker threads for chunking + embedding (1 = serial; a big lever for remote/OpenAI embeddings). |
| Variable | Default | Meaning |
|---|---|---|
CODERAG_API_KEY |
– | If set, the HTTP API requires it (Authorization: Bearer <key> or X-API-Key). Always set it when the server is reachable beyond localhost. |
CODERAG_CORS_ORIGINS |
– | Comma-separated CORS allowlist (never *). Empty ⇒ no cross-origin browser access. |
| Variable | Default | Meaning |
|---|---|---|
CODERAG_MCP_AUTO_INDEX |
true |
Index the watched dir on startup (in the background). |
CODERAG_MCP_WATCH |
true |
Keep the index live via the filesystem watcher. |
CODERAG_MCP_SNIPPET_LINES |
12 |
Lines of a chunk returned in a search_code snippet by default. |
| Variable | Default | Meaning |
|---|---|---|
CODERAG_DEMO_MODE |
false |
Show a notice, hide Reindex, cap results, and rate-limit answers per browser session. |
CODERAG_DEMO_MAX_ANSWERS |
5 |
LLM answers allowed per browser session. |
CODERAG_DEMO_COOLDOWN_SECONDS |
20 |
Minimum seconds between answers in a session. |
The same options are fields on the immutable Config object, so a
library caller doesn't need environment variables:
from coderag import CodeRAG, Config
# Local embeddings + a local LLM for answers, no env vars:
cfg = Config.from_env(
watched_dir="/path/to/repo",
openai_base_url="http://localhost:11434/v1",
chat_model="llama3.1",
)
cr = CodeRAG(cfg)
cr.index()Config.from_env(**overrides) reads the environment/.env first, then applies the keyword
overrides last — handy for tests and embedding CodeRAG in your own tools.