Configuration & local models

CodeRAG runs fully locally, with no API key, out of the box. Everything below is optional: reach for it only when you want to change a default, scale up, or turn on the optional LLM answer. Settings come from `CODERAG_*` environment variables (or a `.env` file; copy `example.env` and edit it). A few CLI flags (`--watched-dir`, `--store-dir`, `--provider`, `--model`) override the matching variable for a single command.

The two model roles (read this first)

CodeRAG uses models in two independent places. Keeping them straight removes most of the confusion about "OpenAI or Anthropic":

| Role | What it does | Default | Required? |
|---|---|---|---|
| Embedding model | Turns code and your query into vectors; this is the search. | Local ONNX `bge-small` via fastembed. | Always used. Local, no key. |
| Answer LLM | Optional. Writes a grounded, cited prose answer from the retrieved chunks (`coderag search --answer`, the UI's answer box). | Off. | Only if you want generated answers. |

Search works with only the embedding model. The answer LLM is a separate, optional layer — and it, too, can be a local model. You never need a cloud account to use CodeRAG.
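To make the split concrete, here is a minimal sketch of the dense half of search: the query and every chunk become vectors, and chunks are ranked by cosine similarity. No LLM is involved at any point. `toy_embed` is a hypothetical stand-in (a bag-of-words count vector) used only so the example runs. It shows the shape of the pipeline, not the semantic matching a learned model such as fastembed's `bge-small` provides.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0


def search(query: str, chunks: list[str], top_k: int = 8) -> list[str]:
    """Embed the query, compare it with every chunk vector, return the top_k chunks."""
    q = toy_embed(query)
    # A real index embeds chunks once at index time and stores the vectors;
    # only the query is embedded per search.
    ranked = sorted(chunks, key=lambda chunk: cosine(q, toy_embed(chunk)), reverse=True)
    return ranked[:top_k]


chunks = [
    "def retry_with_backoff(fn, attempts): sleep between retry attempts",
    "class VectorIndex: persist vectors to disk",
    "def parse_args(argv): build the command line parser",
]
print(search("where is retry/backoff handled", chunks, top_k=1))
```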

Recipes — pick the one that matches you

1. Fully local, zero config (default)

Local embeddings, no API key, no answer LLM. Just index and search:

```bash
pip install -e .
coderag index --watched-dir /path/to/repo
coderag search "where is retry/backoff handled" --watched-dir /path/to/repo
```

Nothing to configure. Want a stronger (still local, still keyless) embedding model? See Local embedding models below.

2. Fully local with generated answers (Ollama / LM Studio / vLLM)

Get `--answer` output written by a local LLM. Any OpenAI-compatible server works: point CodeRAG at it with `OPENAI_BASE_URL` and name the model with `CODERAG_CHAT_MODEL`. The answer backend stays `openai` (meaning "the OpenAI protocol", not the OpenAI company), and a local server needs no API key.

```bash
# 1. Run a local model server, e.g. Ollama
#    (`ollama serve` keeps running; run the pull from another terminal):
ollama serve
ollama pull llama3.1

# 2. Point CodeRAG at it (in your shell or .env):
export OPENAI_BASE_URL=http://localhost:11434/v1   # Ollama's OpenAI-compatible endpoint
export CODERAG_CHAT_MODEL=llama3.1                  # the model name your server serves

# 3. Search with a locally-generated answer:
coderag search "how is the vector index persisted" --answer
```

Other local servers expose the same OpenAI-compatible API — only the base URL and model name change:

| Server | Typical `OPENAI_BASE_URL` | Notes |
|---|---|---|
| Ollama | `http://localhost:11434/v1` | `ollama pull <model>`; key ignored. |
| LM Studio | `http://localhost:1234/v1` | Start the local server from the app. |
| vLLM | `http://localhost:8000/v1` | `vllm serve <model>`; OpenAI-compatible. |
| LocalAI | `http://localhost:8080/v1` | Drop-in OpenAI replacement. |

A local server usually ignores the API key, so you can leave OPENAI_API_KEY unset — CodeRAG sends a harmless placeholder. Set it only if your server enforces one.
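Before running `--answer`, it can help to confirm that the server really speaks the OpenAI protocol at that base URL. The sketch below uses only the standard library and assumes the server implements the standard `GET /models` route under the base URL, as most OpenAI-compatible servers do. The `placeholder` key mirrors the behavior described above but is an illustrative value, not necessarily the string CodeRAG sends.

```python
from __future__ import annotations

import json
import urllib.request


def models_request(base_url: str, api_key: str | None = None) -> urllib.request.Request:
    """Build a GET {base_url}/models request in the OpenAI protocol style."""
    url = base_url.rstrip("/") + "/models"
    # Local servers usually ignore the key; send an illustrative placeholder when none is set.
    headers = {"Authorization": f"Bearer {api_key or 'placeholder'}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def list_models(base_url: str, api_key: str | None = None, timeout: float = 5.0) -> list[str]:
    """Return the model ids the server reports (candidates for CODERAG_CHAT_MODEL)."""
    with urllib.request.urlopen(models_request(base_url, api_key), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

With Ollama running, `list_models("http://localhost:11434/v1")` should include the model you pulled, possibly with a tag suffix such as `:latest`. A connection error here means CodeRAG will not reach the server either.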

3. OpenAI cloud (embeddings and/or answers)

```bash
pip install -e ".[openai]"
export OPENAI_API_KEY=sk-...
# Cloud embeddings (optional; the local default is fine for most repos).
# Queries must be embedded by the same model as the index, so setting
# CODERAG_PROVIDER=openai once is safer than passing the flag per command.
coderag index --provider openai            # uses CODERAG_OPENAI_MODEL (text-embedding-3-small)
# Cloud answer:
coderag search "where is auth handled" --answer    # uses CODERAG_CHAT_MODEL (gpt-4o-mini)
```

4. Anthropic (Claude) answers

The answer LLM can be Claude instead of an OpenAI-compatible model. (Embeddings still come from the local default or OpenAI — Anthropic is answer-only here.)

```bash
pip install -e ".[anthropic]"
export CODERAG_LLM_PROVIDER=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
coderag search "where is auth handled" --answer    # uses CODERAG_ANTHROPIC_MODEL
```

5. Local embeddings via an OpenAI-compatible server (advanced)

Beyond the built-in fastembed models, you can serve embeddings from a local OpenAI-compatible endpoint (e.g. Ollama embedding models, text-embeddings-inference). Use the openai provider pointed at your local URL:

```bash
export OPENAI_BASE_URL=http://localhost:11434/v1
export CODERAG_OPENAI_MODEL=nomic-embed-text   # e.g. after `ollama pull nomic-embed-text`
coderag index --provider openai
```

Changing the embedding model (and with it the vector dimension) triggers a one-time index rebuild. That is expected and safe: the LanceDB store is rebuildable, and re-indexing recreates it from source.
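One way such a check can work, sketched under assumptions: the index records which model built it and the vector dimension, and a mismatch with the current settings means stored vectors can no longer be compared with new query vectors. `IndexMeta` and `needs_rebuild` are hypothetical; CodeRAG's actual store metadata is not described in this document.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    """Hypothetical metadata saved alongside the vectors when the index is built."""
    provider: str
    model: str
    dim: int


def needs_rebuild(stored: IndexMeta | None, current: IndexMeta) -> bool:
    """Rebuild when no index exists yet or it was built by a different embedding setup."""
    if stored is None:
        return True
    # A dimension change makes old and new vectors incomparable outright; this sketch
    # also treats any model change as a mismatch, since each model has its own vector space.
    return stored != current


built = IndexMeta("fastembed", "BAAI/bge-small-en-v1.5", 384)
now = IndexMeta("fastembed", "jinaai/jina-embeddings-v2-base-code", 768)
print(needs_rebuild(built, now))
```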

Local embedding models

The default `BAAI/bge-small-en-v1.5` is the smallest and fastest; larger local models score better on code retrieval. List the curated, keyless options:

```bash
coderag eval --list-models
```

Then select one (downloaded once, cached under CODERAG_CACHE_DIR):

```bash
export CODERAG_MODEL=jinaai/jina-embeddings-v2-base-code   # best out-of-the-box local code retriever
coderag index --full
```

Want sharper top results without a cloud call? Turn on the optional local cross-encoder reranker (no key, ~30 ms/query): export CODERAG_RERANK=1.
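Conceptually, reranking rescores a bounded pool of fused hits and then trims it, which is what `CODERAG_RERANK_CANDIDATES` (default `50`) and `CODERAG_TOP_K` (default `8`) in the settings reference control. A minimal sketch, assuming the cross-encoder can be treated as a plain `score(query, text) -> float` function; `overlap_score` is a hypothetical stand-in for the real model, used only so the example runs.

```python
from __future__ import annotations

from typing import Callable, Sequence


def rerank(query: str, fused_hits: Sequence[str], score: Callable[[str, str], float],
           candidates: int = 50, top_k: int = 8) -> list[str]:
    """Rescore the first `candidates` fused hits against the query, then keep `top_k`."""
    pool = list(fused_hits[:candidates])
    # A cross-encoder reads query and chunk together, so every pair costs one model call;
    # bounding the pool is what keeps the added latency small.
    scores = [score(query, hit) for hit in pool]
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in order[:top_k]]


def overlap_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in the text."""
    words = query.lower().split()
    return sum(w in text.lower() for w in words) / len(words)


hits = ["load config from env", "retry loop with exponential backoff", "backoff constants"]
print(rerank("retry backoff", hits, overlap_score, top_k=2))
```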

Full settings reference

Grouped by area. The Default column is the value used when the variable is unset; every setting here is optional.

Embeddings (the search model)

| Variable | Default | Meaning |
|---|---|---|
| `CODERAG_PROVIDER` | `fastembed` | Embedding backend: `fastembed` (local) · `openai` (OpenAI API or any OpenAI-compatible/local server) · `fake` (deterministic, tests). |
| `CODERAG_MODEL` | `BAAI/bge-small-en-v1.5` | Local fastembed model id (`coderag eval --list-models`). |
| `CODERAG_OPENAI_MODEL` | `text-embedding-3-small` | Embedding model when `provider=openai` (cloud or local server). |
| `CODERAG_CACHE_DIR` | `~/.cache/coderag` | Where downloaded local models are cached. |
| `CODERAG_EMBED_BATCH` | `64` | Texts embedded per batch during indexing. |

Answers (the optional LLM)

| Variable | Default | Meaning |
|---|---|---|
| `CODERAG_LLM_PROVIDER` | `openai` | Answer backend: `openai` (OpenAI API or any OpenAI-compatible/local server) · `anthropic`. |
| `CODERAG_CHAT_MODEL` | `gpt-4o-mini` | Chat model for the `openai` backend. Set it to your local model name (e.g. `llama3.1`) when using `OPENAI_BASE_URL`. |
| `OPENAI_API_KEY` | – | OpenAI key. Needed for the OpenAI cloud (embeddings or answers); optional for a local server. |
| `OPENAI_BASE_URL` | – | Point the OpenAI client at a self-hosted/local server (Ollama, vLLM, LM Studio, LocalAI). Enables local embeddings and local answers. |
| `ANTHROPIC_API_KEY` | – | Anthropic (Claude) key; required when `CODERAG_LLM_PROVIDER=anthropic`. |
| `CODERAG_ANTHROPIC_MODEL` | `claude-opus-4-8` | Claude model for the `anthropic` answer backend. |
| `CODERAG_ANSWER_MAX_TOKENS` | `1024` | Max tokens generated per answer. |

Locations

| Variable | Default | Meaning |
|---|---|---|
| `CODERAG_WATCHED_DIR` | cwd | Codebase to index/search. |
| `CODERAG_STORE_DIR` | `<watched-dir>/.coderag` | Where the LanceDB store lives. Derived from the watched dir (not the cwd) so the index is found no matter where you run `coderag` from. |
| `CODERAG_INDEX_ALL_TEXT` | `false` | Index any UTF-8 text file (docs/config/extensionless), not just code. Binary files are always skipped. |
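The two location rules above can be sketched as follows. `resolve_store_dir`, `is_text_bytes` and `is_indexable_text` are hypothetical helpers written for illustration: the store path follows the documented default (`<watched-dir>/.coderag`), while the text check is one common heuristic (no NUL bytes, valid UTF-8), not necessarily the one CodeRAG uses.

```python
from __future__ import annotations

from pathlib import Path
from typing import Mapping


def resolve_store_dir(env: Mapping[str, str], cwd: str) -> Path:
    """CODERAG_STORE_DIR if set, else <watched-dir>/.coderag (watched dir defaults to cwd)."""
    watched = Path(env.get("CODERAG_WATCHED_DIR") or cwd).resolve()
    explicit = env.get("CODERAG_STORE_DIR")
    # Anchoring on the watched dir, not the cwd, lets any shell location find the same index.
    return Path(explicit).resolve() if explicit else watched / ".coderag"


def is_text_bytes(data: bytes) -> bool:
    """Heuristic text check: no NUL bytes and the content decodes as UTF-8."""
    if b"\x00" in data:
        return False  # NUL bytes almost always mean a binary file
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_indexable_text(path: Path) -> bool:
    """Reads the whole file for simplicity; a real indexer might only sample its head."""
    return is_text_bytes(path.read_bytes())
```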

Retrieval & quality

| Variable | Default | Meaning |
|---|---|---|
| `CODERAG_TOP_K` | `8` | Results returned. |
| `CODERAG_FETCH_K` | `50` | Candidates pulled from each retriever before fusion. |
| `CODERAG_RRF_K` | `60` | Reciprocal Rank Fusion constant. |
| `CODERAG_DENSE_WEIGHT` | `1.0` | Weight of dense (vector) results in fusion. |
| `CODERAG_LEXICAL_WEIGHT` | `1.0` | Weight of BM25 (keyword) results in fusion. |
| `CODERAG_RERANK` | `false` | Enable the optional local cross-encoder reranker (no key; ~30 ms/query). |
| `CODERAG_RERANK_MODEL` | `Xenova/ms-marco-MiniLM-L-12-v2` | Local reranker model (`coderag eval --list-models`). |
| `CODERAG_RERANK_CANDIDATES` | `50` | Fused hits reranked before trimming to `top_k`. |
| `CODERAG_ADAPTIVE_FUSION` | `false` | Tilt fusion weights by query type (dense-up for NL, BM25-up for code). |
| `CODERAG_NL_DENSE_WEIGHT` / `CODERAG_NL_LEXICAL_WEIGHT` | `1.0` / `0.4` | Adaptive weights for natural-language queries. |
| `CODERAG_CODE_DENSE_WEIGHT` / `CODERAG_CODE_LEXICAL_WEIGHT` | `1.0` / `1.0` | Adaptive weights for identifier/code queries. |
| `CODERAG_GRAPH_EXPANSION` | `false` | Enrich the pool with 1-hop call-graph neighbors of the top hits: the definitions of what each hit calls (its callees). No key, no schema change. Small, consistent symbol-level lift across flask/requests/click. |
| `CODERAG_GRAPH_SEEDS` | `5` | Top fused hits to expand callees from. |
| `CODERAG_GRAPH_NEIGHBORS` | `5` | Max new callee definitions pulled per seed. |
| `CODERAG_GRAPH_WEIGHT` | `0.15` | Fusion weight of the neighbor list (down-weighted). `0.15` was a strict Pareto improvement across the eval repos; higher trades rank precision for marginal recall. |
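A minimal sketch of how these fusion settings fit together, assuming weighted Reciprocal Rank Fusion: each retriever adds `weight / (k + rank)` for every chunk it returns, with 1-based ranks and `k = CODERAG_RRF_K`. The RRF formula itself is standard; applying the weights as a multiplier, feeding graph neighbors in as a third list, and the `looks_like_code` heuristic for adaptive fusion are assumptions made for illustration, not CodeRAG's exact code.

```python
from __future__ import annotations

import re
from collections import defaultdict


def rrf_fuse(ranked_lists: dict[str, list[str]], weights: dict[str, float],
             k: int = 60) -> list[str]:
    """Weighted RRF: score(d) = sum over retrievers r of w_r / (k + rank_r(d))."""
    scores: dict[str, float] = defaultdict(float)
    for name, ranking in ranked_lists.items():
        w = weights.get(name, 0.0)
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += w / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


def looks_like_code(query: str) -> bool:
    """Hypothetical query classifier: snake_case, camelCase, dotted names or calls => code."""
    return bool(re.search(r"[A-Za-z]+_[A-Za-z]|[a-z][A-Z]|\w\.\w|\w\(", query))


def fusion_weights(query: str, adaptive: bool = True) -> dict[str, float]:
    """Pick dense/lexical weights from the documented defaults; graph neighbors stay at 0.15."""
    if not adaptive:
        dense, lexical = 1.0, 1.0   # CODERAG_DENSE_WEIGHT / CODERAG_LEXICAL_WEIGHT
    elif looks_like_code(query):
        dense, lexical = 1.0, 1.0   # CODERAG_CODE_DENSE_WEIGHT / CODERAG_CODE_LEXICAL_WEIGHT
    else:
        dense, lexical = 1.0, 0.4   # CODERAG_NL_*: BM25 counts less for prose queries
    return {"dense": dense, "lexical": lexical, "graph": 0.15}  # CODERAG_GRAPH_WEIGHT


lists = {
    "dense": ["a", "b", "c"],
    "lexical": ["c", "a", "d"],
    "graph": ["e"],  # callee definitions of the top hits, added by graph expansion
}
print(rrf_fuse(lists, fusion_weights("how are sessions persisted")))
```

Because RRF only uses ranks, scores from BM25 and from vector similarity never need to be put on a common scale; the weights simply decide how much a given rank in each list is worth.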

Scale & throughput

| Variable | Default | Meaning |
|---|---|---|
| `CODERAG_WORKERS` | `4` | Worker threads for chunking + embedding (`1` = serial; a big lever for remote/OpenAI embeddings). |
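`CODERAG_WORKERS` and `CODERAG_EMBED_BATCH` plausibly interact as in the sketch below: texts are cut into batches, and batches are embedded concurrently on a thread pool. This is a sketch under assumptions. `fake_embed_batch` is a hypothetical stand-in for a provider call, and threads pay off most when that call waits on the network, which is why the setting matters most for remote embeddings.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def batched(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` (cf. CODERAG_EMBED_BATCH)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def embed_all(texts: list[str], embed_batch: Callable[[list[str]], list[list[float]]],
              batch_size: int = 64, workers: int = 4) -> list[list[float]]:
    """Embed batches on `workers` threads; the output order matches the input order."""
    batches = batched(texts, batch_size)
    if workers <= 1:
        results = [embed_batch(b) for b in batches]  # serial path, like CODERAG_WORKERS=1
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(embed_batch, batches))  # map() yields in submission order
    return [vec for batch in results for vec in batch]


def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    """Hypothetical deterministic embedder: one-dimensional 'vectors' from text length."""
    return [[float(len(t))] for t in batch]


texts = [f"chunk {i}" for i in range(150)]
vectors = embed_all(texts, fake_embed_batch, batch_size=64, workers=4)
print(len(batched(texts, 64)), len(vectors))
```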

HTTP API server (`coderag serve`)

| Variable | Default | Meaning |
|---|---|---|
| `CODERAG_API_KEY` | – | If set, the HTTP API requires it (`Authorization: Bearer <key>` or `X-API-Key`). Always set it when the server is reachable beyond localhost. |
| `CODERAG_CORS_ORIGINS` | – | Comma-separated CORS allowlist (never `*`). Empty ⇒ no cross-origin browser access. |
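When `CODERAG_API_KEY` is set, every request has to carry the key in one of the two headers listed above. Here is a client-side sketch using only the standard library. The URL is left to the caller because the routes and port of `coderag serve` are not documented in this section; the JSON content type is likewise an assumption.

```python
from __future__ import annotations

import urllib.request


def api_request(url: str, api_key: str, use_bearer: bool = True,
                body: bytes | None = None) -> urllib.request.Request:
    """Attach the key as `Authorization: Bearer <key>` or as `X-API-Key: <key>`."""
    headers = {"Authorization": f"Bearer {api_key}"} if use_bearer else {"X-API-Key": api_key}
    if body is not None:
        headers["Content-Type"] = "application/json"  # assumption: the API accepts JSON bodies
    # urllib sends GET without a body and POST with one.
    return urllib.request.Request(url, data=body, headers=headers)

# Send with: urllib.request.urlopen(api_request(your_server_url, key))
```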

MCP server (`coderag mcp`)

| Variable | Default | Meaning |
|---|---|---|
| `CODERAG_MCP_AUTO_INDEX` | `true` | Index the watched dir on startup (in the background). |
| `CODERAG_MCP_WATCH` | `true` | Keep the index live via the filesystem watcher. |
| `CODERAG_MCP_SNIPPET_LINES` | `12` | Lines of a chunk returned in a `search_code` snippet by default. |

Demo mode (public, untrusted UI)

| Variable | Default | Meaning |
|---|---|---|
| `CODERAG_DEMO_MODE` | `false` | Show a notice, hide Reindex, cap results, and rate-limit answers per browser session. |
| `CODERAG_DEMO_MAX_ANSWERS` | `5` | LLM answers allowed per browser session. |
| `CODERAG_DEMO_COOLDOWN_SECONDS` | `20` | Minimum seconds between answers in a session. |
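The two demo limits combine as a per-session quota plus a cooldown. A minimal sketch, assuming sessions are identified by an opaque string and the caller passes in timestamps, which keeps it deterministic and testable. `AnswerLimiter` is hypothetical, not CodeRAG's class.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AnswerLimiter:
    """Allow at most `max_answers` per session, spaced at least `cooldown_s` seconds apart."""
    max_answers: int = 5        # cf. CODERAG_DEMO_MAX_ANSWERS
    cooldown_s: float = 20.0    # cf. CODERAG_DEMO_COOLDOWN_SECONDS
    _used: dict[str, int] = field(default_factory=dict)
    _last: dict[str, float] = field(default_factory=dict)

    def try_answer(self, session: str, now: float) -> bool:
        """Record and allow an answer, or refuse it if the quota or cooldown would be violated."""
        if self._used.get(session, 0) >= self.max_answers:
            return False
        last = self._last.get(session)
        if last is not None and now - last < self.cooldown_s:
            return False
        self._used[session] = self._used.get(session, 0) + 1
        self._last[session] = now
        return True


limiter = AnswerLimiter()
print([limiter.try_answer("s1", t) for t in (0, 5, 20, 40, 60, 80, 100)])
```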

Setting configuration from Python

The same options are fields on the immutable Config object, so a library caller doesn't need environment variables:

```python
from coderag import CodeRAG, Config

# Local embeddings + a local LLM for answers, no env vars:
cfg = Config.from_env(
    watched_dir="/path/to/repo",
    openai_base_url="http://localhost:11434/v1",
    chat_model="llama3.1",
)
cr = CodeRAG(cfg)
cr.index()
```

`Config.from_env(**overrides)` reads the environment and `.env` first, then applies the keyword overrides on top, which is handy for tests and for embedding CodeRAG in your own tools.
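The precedence rule (defaults, then environment, then keyword overrides) can be illustrated with a frozen dataclass. `MiniConfig` is a hypothetical, much-reduced stand-in: its field names map onto the documented `CODERAG_*` variables, but it is not the real `Config` class, it skips `.env` loading, and its boolean parsing rule is an assumption.

```python
from __future__ import annotations

import os
from dataclasses import dataclass, fields, replace
from typing import Mapping


def _env_bool(value: str) -> bool:
    """Hypothetical parsing rule: 1/true/yes/on (any case) mean True."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class MiniConfig:
    top_k: int = 8                    # CODERAG_TOP_K
    rerank: bool = False              # CODERAG_RERANK
    chat_model: str = "gpt-4o-mini"   # CODERAG_CHAT_MODEL

    @classmethod
    def from_env(cls, env: Mapping[str, str] | None = None, **overrides) -> "MiniConfig":
        env = os.environ if env is None else env
        parsed = {}
        for f in fields(cls):
            raw = env.get(f"CODERAG_{f.name.upper()}")
            if raw is not None:
                kind = type(f.default)  # convert using the type of the field's default
                parsed[f.name] = _env_bool(raw) if kind is bool else kind(raw)
        cfg = replace(cls(), **parsed)    # defaults, then environment values
        return replace(cfg, **overrides)  # keyword overrides win; unknown names raise TypeError
```

For example, `MiniConfig.from_env({"CODERAG_TOP_K": "12"}, top_k=3).top_k` is `3`: the keyword override beats the environment.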
