A VS Code extension that integrates llama-server LLMs as language model chat providers, enabling local AI-powered coding assistance directly in VS Code.
Note: This extension has no affiliation with llama.cpp or its maintainers. It is an independent third-party extension that provides integration with llama-server.
Before using this extension, you need to install and run llama-server from the llama.cpp project. Follow the quick start guide to get started.
You can install llama.cpp in several ways:
- Using package managers: Install using `brew`, `nix`, or `winget`
- Docker: Run with Docker - see the Docker documentation
- Pre-built binaries: Download from the releases page
- Build from source: Clone the repository and build - check out the build guide
Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.
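If you want a quick sanity check before writing a full configuration, a minimal sketch looks like this. It assumes Homebrew is available and that your llama-server build supports the `-hf` flag for pulling GGUF models from Hugging Face; the model repository is only an example, taken from the configuration further below.

```bash
# Install llama.cpp (includes llama-server) via Homebrew
brew install llama.cpp

# Quick single-model test run: download a GGUF from Hugging Face and serve it.
# Pick a model and quantization that fit your hardware.
llama-server -hf unsloth/Qwen3-4B-128K-GGUF:Q8_K_XL --port 8013
```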
After installing llama.cpp, you need to start llama-server with your models configured. Here's an example startup script (see examples/start-llms):
```bash
#!/bin/bash
llama-server --port 8013 --models-preset ./models.ini --timeout 3600
```

- `--port`: Port on which the server listens (default: 8080)
- `--models-preset`: Path to your models configuration file (INI format)
- `--timeout`: How long, in seconds, processing can take without any output to the client. Increase this along with the Request timeout extension setting if you get timeouts.
The server will start and load models according to your configuration. Make sure the server is running before configuring the VS Code extension.
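To confirm the server is reachable before configuring the extension, you can query it from the command line. This is a minimal sketch assuming the startup script above; `/health` and `/v1/models` are standard llama-server routes, but adjust the port to match your setup.

```bash
# Returns a small JSON status object once the server is ready
curl http://localhost:8013/health

# Lists the models the server has loaded (OpenAI-compatible route)
curl http://localhost:8013/v1/models
```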
Models are configured using an INI file format. See examples/models.ini for a complete example. Here's an example for a MacBook with 128GB RAM:
```ini
[nemotron-3-nano-30b]
jinja = true
ctx-size = 256000
temp = 1.0
top-p = 1.00
fit = on
hf = unsloth/Nemotron-3-Nano-30B-A3B-GGUF:BF16

[qwen3-4b]
jinja = true
ctx-size = 32768
temp = 0.6
min-p = 0.0
top-p = 0.95
top-k = 20
hf = unsloth/Qwen3-4B-128K-GGUF:Q8_K_XL

[glm-4-7-flash]
jinja = true
ctx-size = 202752
temp = 0.7
top-p = 1.0
min-p = 0.01
repeat-penalty = 1.0
hf = unsloth/GLM-4.7-Flash-GGUF:BF16
```

Please look at the model card accompanying your model for the recommended settings. All available settings can be found in the llama-server readme.
Make sure to pick models and context sizes that work with your machine.
- Context size: Larger context sizes require more RAM
  - 1,000,000 tokens ≈ 133GB RAM
  - 256,000 tokens ≈ 33GB RAM
  - 128,000 tokens ≈ 17GB RAM
  - 32,768 tokens ≈ 4GB RAM
- Quantization: Smaller quantizations use less RAM
  - BF16: 2× model size (a 30B model needs about 60GB)
  - Q8_0: 1× model size
  - Q4_0: 0.5× model size
  - Q1_0: 0.125× model size
Configure the extension by adding endpoint settings to your VS Code settings (File → Preferences → Settings, or edit settings.json directly).
```json
{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013"
    }
  }
}
```

Each endpoint has an identifier (e.g., "local"). Models from that endpoint are displayed with the suffix `@identifier` (e.g., `my-model@local`). This allows you to:
- Connect to multiple llama-server instances
- Distinguish between models from different endpoints
- Configure different settings per endpoint
You can configure multiple endpoints:
```json
{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013"
    },
    "remote": {
      "url": "http://192.168.1.100:8080",
      "apiToken": "your-api-token-here"
    }
  }
}
```

You can override generation parameters (temperature, top_p, etc.) at both the endpoint and the model level. These overrides are merged into the request body sent to llama-server.
Apply parameters to all models on an endpoint:
```json
{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "requestBody": {
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,
        "min_p": 0.01,
        "repeat_penalty": 1.1,
        "max_tokens": 2048
      }
    }
  }
}
```

Override parameters for specific models. Model-level `requestBody` properties override endpoint-level properties:
```json
{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "requestBody": {
        "temperature": 0.7,
        "top_p": 0.95
      },
      "models": {
        "my-model": {
          "requestBody": {
            "temperature": 0.6,
            "top_p": 0.9,
            "top_k": 40
          }
        }
      }
    }
  }
}
```

In this example, `my-model` will use temperature: 0.6, top_p: 0.9, and top_k: 40, while other models on the local endpoint will use temperature: 0.7 and top_p: 0.95.
- `temperature` (number): Controls randomness (0.0 = deterministic, 2.0 = very creative)
- `top_p` (number): Nucleus sampling threshold (0.0 to 1.0)
- `top_k` (number): Top-k sampling (number of tokens to consider)
- `min_p` (number): Minimum probability threshold
- `repeat_penalty` (number): Penalty for repeating tokens (1.0 = no penalty, >1.0 = penalty)
- `max_tokens` (number): Maximum number of tokens to generate
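For reference, the merged parameters end up in the body of the OpenAI-compatible chat completions request the extension sends to llama-server. A rough command-line equivalent of the example above (the model name and values are illustrative, and the extra sampling fields such as top_k are accepted by llama-server but not part of the core OpenAI schema):

```bash
# Roughly what reaches llama-server after endpoint- and model-level overrides are merged
curl http://localhost:8013/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 40
  }'
```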
Add custom headers for authentication or other purposes:
```json
{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "headers": {
        "X-Custom-Header": "value"
      },
      "models": {
        "my-model": {
          "headers": {
            "X-Model-Specific": "value"
          }
        }
      }
    }
  }
}
```

Model-level headers override endpoint-level headers.
For authenticated endpoints:
```json
{
  "llamaCopilot.endpoints": {
    "secure": {
      "url": "https://api.example.com",
      "apiToken": "your-bearer-token-here"
    }
  }
}
```

The token will be sent as `Authorization: Bearer <token>` in all requests.
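If your server requires a key (for example when llama-server is started with `--api-key`), you can verify the token outside the extension with a request that sends the same header; the URL and token below are placeholders:

```bash
# Same Authorization header the extension sends; replace URL and token with your own values
curl https://api.example.com/v1/models \
  -H "Authorization: Bearer your-bearer-token-here"
```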
The extension's Request timeout setting (Settings → Llama Copilot → Request timeout (seconds)) controls how long it waits for the server to respond. It should be at least as large as the `--timeout` you pass to llama-server. If you see proxy or stream timeouts, increase the extension timeout and ensure llama-server is started with `--timeout` (e.g. `--timeout 3600`).
Override the context size for a specific model:
```json
{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "models": {
        "large-model": {
          "contextSize": 256000
        }
      }
    }
  }
}
```

Override the maximum output tokens:
```json
{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "models": {
        "my-model": {
          "maxOutputTokens": 4096
        }
      }
    }
  }
}
```

Configure model capabilities:
```json
{
  "llamaCopilot.endpoints": {
    "local": {
      "url": "http://localhost:8013",
      "models": {
        "multimodal-model": {
          "capabilities": {
            "imageInput": true,
            "toolCalling": true
          }
        },
        "tool-model": {
          "capabilities": {
            "toolCalling": 10
          }
        }
      }
    }
  }
}
```

- `imageInput` (boolean): Whether the model supports image input
- `toolCalling` (boolean | number): Whether the model supports tool calling. Can be a boolean or a number (maximum number of tools)
The extension can show inline (ghost) completions in the editor using the llama-server /infill endpoint. This requires a FIM-capable model (fill-in-the-middle), such as the Sweep next-edit models.
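Under the hood this uses llama-server's /infill route, which takes the text before and after the cursor. A minimal sketch of such a request, assuming the field names used by recent llama-server versions (input_prefix, input_suffix, n_predict), looks like this:

```bash
# Fill-in-the-middle request: send the text before and after the cursor
curl http://localhost:8013/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "def add(a, b):\n    ",
    "input_suffix": "\n\nprint(add(1, 2))",
    "n_predict": 64
  }'
```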
- Configure an endpoint and ensure llama-server is running with a FIM-capable model loaded (see below).
- Set Inline completion model in Settings → Llama Copilot to a model ID including the endpoint, e.g. `sweep-next-edit-1.5b@local`. If this setting is empty, inline completions are disabled.
Your server must be running with a model that supports FIM tokens. Add one of the following to your models.ini and load it (e.g. with llama-server --port 8013 --models-preset ./models.ini --timeout 3600):
sweep-next-edit-1.5b:

```ini
[sweep-next-edit-1.5b]
jinja = true
ctx-size = 0
temp = 0.7
top-p = 0.8
top-k = 20
hf = sweepai/sweep-next-edit-1.5B:latest
```

sweep-next-edit-0.5b:

```ini
[sweep-next-edit-0.5b]
jinja = true
ctx-size = 0
temp = 0.7
top-p = 0.8
top-k = 20
hf = sweepai/sweep-next-edit-0.5B:Q8_0
```

| Setting | Description |
|---|---|
| Inline completion model | Model ID (e.g. sweep-next-edit-1.5b@local). Empty = disabled. |
| Inline completion timeout (ms) | Request timeout; no suggestion is shown on timeout. |
| Inline completion debounce (ms) | Delay before sending an automatic (as-you-type) request. |
| Max input bytes | Maximum total input size (prefix + suffix + context) sent to the server. |
| Include context | When enabled, include content from other open tabs to improve suggestions. |
| Debug: Inline completion | Log requests, cancellations, and errors to the "LLaMA Server API" output. |
The extension includes a built-in tool that gives the LLM access to your project's cursor rules from .cursor/rules/. This allows the model to access project-specific guidelines, coding standards, and best practices automatically.
- Rule Discovery: The extension automatically reads all `.md` and `.mdc` files from `.cursor/rules/` in your workspace
- Glob Matching: Rules with glob patterns in their frontmatter are matched against:
  - File attachments (e.g., `@src/logger.ts:32`)
  - User messages
  - Assistant messages
  - Tool call parameters (first 1024 bytes)
- Session Scoping: Available rules are tracked per chat session. When a glob matches, that rule becomes available for that conversation
- Tool Exposure: When rules are available, a `get-project-rule` tool is automatically exposed to the LLM
Rules can be simple markdown files (.md) or markdown files with frontmatter (.mdc):
Simple rule (.md):

```md
# Coding Guidelines
Always use TypeScript strict mode.
Prefer async/await over promises.
```

Rule with frontmatter (.mdc):
```md
---
description: "TypeScript coding standards"
globs: ["**/*.ts", "**/*.tsx"]
alwaysApply: false
---
# TypeScript Guidelines
- Use strict mode
- Prefer interfaces over types for object shapes
- Use const assertions where appropriate
```

Glob patterns are converted to regex patterns:

- `*` matches any characters except path separators: `[a-zA-Z0-9.~@+=_|-]`
- `**` matches any characters including path separators: `[a-zA-Z0-9.~@+=_|\/-]`
- Both Windows (`\`) and Unix (`/`) path separators are supported
When rules are available, the LLM can call the `get-project-rule` tool:

```
get-project-rule(rule: "coding-guidelines.md,style/markdown.mdc")
```

The tool supports:

- Comma-separated rule names
- Optional `rule:` prefix (e.g., `rule:style.md` or just `style.md`)
- Fuzzy matching: If a rule isn't found exactly, the closest match (within Levenshtein distance 8) is used
- Returns `<empty file>` if no matching rule is found
Enable or disable the cursor rules feature in settings:
```json
{
  "llamaCopilot.enableCursorRules": true
}
```

When disabled:
- Rules are not parsed
- The tool is not exposed to the LLM
- No performance overhead from rule matching
- Create a rule file `.cursor/rules/typescript.md`:

```md
# TypeScript Rules
Always use explicit return types for functions.
Prefer `interface` over `type` for object shapes.
```

- Create a rule with glob matching `.cursor/rules/react-components.mdc`:

```md
---
description: "React component guidelines"
globs: ["**/*.tsx", "src/components/**"]
---
# React Components
- Use functional components with hooks
- Extract complex logic into custom hooks
- Use React.memo for expensive components
```

- When you mention a file matching the glob (e.g., `@src/components/Button.tsx`), the rule becomes available to the LLM automatically
- Open the Command Palette (Ctrl+Shift+P / Cmd+Shift+P)
- Type "Chat: Start Session" or use the chat interface
- Select a model from the list (models appear as `model-name@endpoint-id`)
- Start a chat session with a selected model
- The extension supports tool calling if the model supports it
- Models with image input capability can process images
Use the command "Open Endpoint Settings" to quickly access the configuration, or navigate to Settings and search for "llamaCopilot".
- Ensure `llama-server` is running
- Check that the URL in your configuration matches the server's address and port
- Verify the server is accessible (try opening the URL in a browser)
- Check that models are loaded in `llama-server` (visit the `/models` endpoint, e.g. with the command shown after this list)
- Ensure models don't have "/" in their ID (these are filtered out)
- Verify the endpoint URL is correct
- Check the VS Code output panel for error messages (View → Output → "LLaMA Server API")
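A quick way to see exactly which model IDs the server reports is to query it directly; a minimal sketch, adjusting host and port to your endpoint URL:

```bash
# List the model IDs the server exposes; in VS Code they appear with the @endpoint suffix
curl -s http://localhost:8013/v1/models | python3 -m json.tool
```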
- Validate your JSON syntax in `settings.json`
- Ensure required fields (`url`) are present
- Check that endpoint identifiers don't contain special characters
- Verify the parameter names match llama-server's API (check llama-server documentation)
- Remember that model-level `requestBody` overrides endpoint-level `requestBody`
- Check the VS Code output panel for API request/response logs