This project provides a local server that acts as a backend for the Claude Code command-line coding assistant. It allows you to use open-source models running on your local machine via Apple's MLX framework. Instead of sending your code to Anthropic's servers, you can use powerful models like Llama 3, GLM-4.5-Air, DeepSeek, and more, all running on your Apple Silicon Mac.
This server implements the Claude Messages API format that Claude Code communicates with, redirecting all requests to a local model of your choice.
- Total Privacy: Your code, prompts, and conversations never leave your local machine.
- Use Any Model: Experiment with thousands of open-source models from the MLX Community on Hugging Face.
- Work Offline: Get code completions and chat with your local model without an internet connection.
- No API Keys or Costs: Run powerful models without needing to manage API keys or pay for usage.
- Full Customization: You have complete control over model parameters and generation settings.
There are two parts: running the local server, and configuring Claude Code to use it.
First, get the proxy server running on your machine.
- Clone the repository:

  ```bash
  git clone https://github.com/chand1012/claude-code-mlx-proxy.git
  cd claude-code-mlx-proxy
  ```

- Set up the environment: copy the example `.env` file:

  ```bash
  cp .env.example .env
  ```

  You can edit the `.env` file to customize the model, port, and other settings (see the Configuration section below and the example right after these steps).

- Install dependencies: this project uses `uv` for fast package management.

  ```bash
  uv sync
  ```

- Start the server:

  ```bash
  uv run main.py
  ```

  The server will start on `http://localhost:8888` (or as configured in your `.env`) and begin downloading and loading the specified MLX model. This may take some time on the first run.
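For example, to try a different model or port before starting the server, you might edit `.env` like this. The model name below is purely illustrative; any MLX-converted model from the mlx-community organization on Hugging Face should work, and all available variables are listed in the Configuration table further down.

```env
# Illustrative overrides; see the Configuration table below for all options.
MODEL_NAME=mlx-community/Llama-3.2-3B-Instruct-4bit
PORT=8888
```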
Next, tell your Claude Code extension to send requests to your local server instead of the official Anthropic API.
As described in the official Claude Code documentation, you do this by setting the `ANTHROPIC_BASE_URL` environment variable.
The most reliable way to do this is to launch your IDE from a terminal where the variable has been set:

```bash
# Set the environment variable to point to your local server
export ANTHROPIC_BASE_URL=http://localhost:8888

# Now, launch Claude Code from this same terminal window
claude
```

Once your IDE is running, Claude Code will automatically use your local MLX backend. You can now chat with it or use its code completion features, and all requests will be handled by your local model.
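If you only want to point a single session at the proxy, you can also set the variable inline for one invocation instead of exporting it:

```bash
# One-off: point just this Claude Code session at the local proxy
ANTHROPIC_BASE_URL=http://localhost:8888 claude
```

To make the setting permanent, add the `export` line to your shell profile (for example `~/.zshrc`).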
Before configuring Claude Code, you can verify the server is working correctly by sending it a curl request from your terminal:
```bash
curl -X POST http://localhost:8888/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-sonnet-20250514",
    "max_tokens": 100,
    "messages": [
      {"role": "user", "content": "Explain what MLX is in one sentence."}
    ]
  }'
```

This will return a Claude-style response:
```json
{
  "id": "msg_12345678",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "MLX is Apple's machine learning framework optimized for efficient training and inference on Apple Silicon chips."
    }
  ],
  "model": "claude-4-sonnet-20250514",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 12,
    "output_tokens": 18
  }
}
```

You can also test the token counting endpoint:
```bash
curl -X POST http://localhost:8888/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-sonnet-20250514",
    "messages": [
      {"role": "user", "content": "Explain what MLX is in one sentence."}
    ]
  }'
```

This returns the token count:
```json
{
  "input_tokens": 12
}
```

The server also supports streaming responses using Server-Sent Events (SSE), just like the real Claude API:
```bash
curl -X POST http://localhost:8888/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-4-sonnet-20250514",
    "max_tokens": 100,
    "messages": [
      {"role": "user", "content": "Explain what MLX is in one sentence."}
    ],
    "stream": true
  }'
```

This will return a stream of events following the Claude streaming format.
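Because the proxy speaks the Claude Messages API, you should also be able to point the official `anthropic` Python SDK at it. The sketch below is illustrative rather than part of this project: it assumes you have installed the SDK (`pip install anthropic`) and that the proxy does not validate the API key, so any placeholder string works.

```python
import anthropic

# Point the SDK at the local proxy instead of api.anthropic.com.
# The api_key is a placeholder; the local server presumably ignores it.
client = anthropic.Anthropic(
    base_url="http://localhost:8888",
    api_key="not-needed-locally",
)

# Non-streaming request, mirroring the curl example above.
message = client.messages.create(
    model="claude-4-sonnet-20250514",
    max_tokens=100,
    messages=[{"role": "user", "content": "Explain what MLX is in one sentence."}],
)
print(message.content[0].text)

# Streaming request; the SDK consumes the SSE events for you.
with client.messages.stream(
    model="claude-4-sonnet-20250514",
    max_tokens=100,
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()
```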
The server implements the following Claude-compatible endpoints:
- `POST /v1/messages` - Create a message (supports both streaming and non-streaming)
- `POST /v1/messages/count_tokens` - Count tokens in a message
- `GET /` - Root endpoint with server status
- `GET /health` - Health check endpoint
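The two GET endpoints are handy for quick liveness checks, for example (the exact response payloads depend on the server version):

```bash
curl http://localhost:8888/
curl http://localhost:8888/health
```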
All server settings are managed through the `.env` file.
| Variable | Default | Description |
|---|---|---|
| `HOST` | `0.0.0.0` | The host address for the server. |
| `PORT` | `8888` | The port for the server. |
| `MODEL_NAME` | `mlx-community/GLM-4.5-Air-3bit` | The MLX model to load from Hugging Face. Find more at the MLX Community. |
| `API_MODEL_NAME` | `claude-4-sonnet-20250514` | The model name that the API will report. Set this to a known Claude model to ensure client compatibility. |
| `TRUST_REMOTE_CODE` | `false` | Set to `true` if the model tokenizer requires trusting remote code. |
| `EOS_TOKEN` | None | The end-of-sequence token, required for some models like Qwen. |
| `DEFAULT_MAX_TOKENS` | `4096` | The default maximum number of tokens to generate in a response. |
| `DEFAULT_TEMPERATURE` | `1.0` | The default temperature for generation (creativity). |
| `DEFAULT_TOP_P` | `1.0` | The default top-p for generation. |
| `VERBOSE` | `false` | Set to `true` to enable verbose logging from the MLX generate function. |
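For reference, a complete `.env` that simply restates the documented defaults looks like this (values taken from the table above; `EOS_TOKEN` is left unset because it has no default):

```env
HOST=0.0.0.0
PORT=8888
MODEL_NAME=mlx-community/GLM-4.5-Air-3bit
API_MODEL_NAME=claude-4-sonnet-20250514
TRUST_REMOTE_CODE=false
# EOS_TOKEN is unset by default; only some models (e.g. Qwen) need it
# EOS_TOKEN=
DEFAULT_MAX_TOKENS=4096
DEFAULT_TEMPERATURE=1.0
DEFAULT_TOP_P=1.0
VERBOSE=false
```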
This project is licensed under the MIT License.