This repository provides a robust template for creating local LLM sandboxes. It is designed for Red Teaming: it lets you mimic production environments without external dependencies or API costs.
This project serves as a "Local OpenAI API Mirror". It tricks applications into believing they are communicating with the real OpenAI API, while actually routing requests to a local LLM backend (defaulting to Ollama).
Why use this for Red Teaming?
- Controlled Environment: Test attacks and defenses in a safe, isolated container.
- No Cost: Run extensive fuzzing or automated scans without burning API credits.
- Offline Capable: Work in air-gapped or restricted network environments.
- Model Agnostic: Swap between different model families (Llama, Mistral, Gemma, etc.) to test model-specific vulnerabilities.
The template includes a FastAPI-based mock server, modular service implementations, automated testing, client scripts, and container orchestration using Podman.
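From a client's point of view, the sandbox is a drop-in stand-in for the real API: only the base URL and key change. Below is a minimal sketch using the official openai Python SDK (assuming it is installed in your client environment); the endpoint, key, and model name match the defaults documented later in this README.

```python
# Minimal sketch: point the openai SDK at the local mock instead of api.openai.com.
from openai import OpenAI

# The mock accepts "sk-mock-key" as its test bearer token (see Notes below).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-mock-key")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```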
Production architecture (what the sandbox mimics):

```mermaid
graph TB
    subgraph "Client Environment"
        Client[Client Application]
    end

    subgraph "Application Server"
        API[LLM API Gateway]
        AppLogic[Application Logic]
    end

    subgraph "External Services"
        LLM[Language Model Service<br/>OpenAI/Anthropic/etc.]
    end

    Client -->|HTTPS| API
    API --> AppLogic
    AppLogic -->|API Call| LLM
    LLM -->|Response| AppLogic
    AppLogic --> API
    API -->|Response| Client

    style Client fill:#e1f5ff
    style API fill:#fff4e1
    style AppLogic fill:#fff4e1
    style LLM fill:#ffe1f5
```
Local sandbox architecture:

```mermaid
graph LR
    subgraph "Client Environment (Local)"
        GradioUI[Gradio Web UI<br/>:7860]
        TestClient[Automated Test Client<br/>config/prompts.toml]
    end

    subgraph "Application Server (Container)"
        MockAPI[Mock API Gateway<br/>FastAPI :8000]
        MockLogic[Mock App Logic<br/>app/mocks/openai.py]
    end

    subgraph "External Services (Local Host)"
        Ollama[Ollama Server<br/>:11434]
        Model[gpt-oss:20b Model<br/>config/model.toml]
    end

    GradioUI -->|HTTP| MockAPI
    TestClient -->|HTTP| MockAPI
    MockAPI --> MockLogic
    MockLogic -->|HTTP| Ollama
    Ollama --> Model
    Model --> Ollama
    Ollama -->|Response| MockLogic
    MockLogic --> MockAPI
    MockAPI -->|Response| GradioUI
    MockAPI -->|Response| TestClient

    style GradioUI fill:#e1f5ff
    style TestClient fill:#e1f5ff
    style MockAPI fill:#fff4e1
    style MockLogic fill:#fff4e1
    style Ollama fill:#ffe1f5
    style Model fill:#ffe1f5
```
Mapping to Production:
- Client Environment → Local browser/scripts (instead of remote client)
- Application Server → Containerized mock API (instead of cloud deployment)
- External Services → Local Ollama + model (instead of cloud LLM/VectorDB)
The threat model for this local LLM architecture is available in the `threat_model/` directory. It includes:
- Diagram: `LLM_TM_diagram.json` (ThreatCanvas compatible)
- Report: `LLM_TM_report.md` and `LLM_TM_report.pdf`
- uv – Python package manager (`pip install uv` if not already installed)
- Podman (or Docker – replace `podman` with `docker` in the Makefile if desired)
- Ollama (local LLM runner)
- Install Ollama.
- Pull a model: `make ollama-pull` (pulls the default `gpt-oss:20b`; use `ollama pull <model_name>` for another model, e.g., Llama 3)
- Start the Ollama server (usually runs automatically): `ollama serve`
- Note: the containerized app accesses Ollama on the host via `host.containers.internal:11434`
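Before building the container, it can be handy to confirm that Ollama is reachable and the expected model is pulled. A small sketch against Ollama's model-listing endpoint (`/api/tags`), using only the Python standard library:

```python
# Sketch: list the models Ollama currently has available.
import json
import urllib.request

# From inside the container, "localhost" becomes "host.containers.internal".
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]

print("Available models:", models)  # expect e.g. "gpt-oss:20b" after `make ollama-pull`
```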
Because this template uses Ollama as the default backend, you can use any model supported by Ollama from its library. This includes a wide range of open-weights models perfect for testing different capabilities and safety filters:
- Llama 3 (Meta)
- Mistral / Mixtral (Mistral AI)
- Gemma (Google)
- Qwen (Alibaba)
- DeepSeek (DeepSeek)
- Phi-3 (Microsoft)
- GPT-OSS (OpenAI)
The default configuration of this sandbox uses the `gpt-oss:20b` model, a 20-billion-parameter (20B) model with 4-bit (Q4) quantization. To ensure low-latency performance and prevent resource exhaustion, the following specifications are recommended:
- Dedicated GPU Memory: 16 GB.
- System Memory: 32 GB.
- Storage: 14 GB available space.
For Apple Silicon Macs, you can use the gpt-oss:20b model with the following specifications or better:
- Chip: Apple M4 Pro.
- Memory: 24 GB.
- Storage: 14 GB available space.
To use a different model, simply pull it with `ollama pull <model_name>` and update `config/model.toml` (see next subsection).
`config/model.toml` controls which LLM model to use:

```toml
[default]
model = "gpt-oss:20b"  # Change to switch models

[ollama]
base_url = "http://host.containers.internal:11434/v1"
```

`config/prompts.toml` defines automated test prompts organized by category:
- `basic` - Simple functionality tests
- `custom` - Your own test prompts
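Both files are plain TOML, so they can be inspected or reused from your own tooling. A minimal sketch using Python's built-in `tomllib` (Python 3.11+); the exact key layout inside each prompt category is defined by the file itself and only assumed here:

```python
# Sketch: read the model and prompt configuration with the standard library.
import tomllib

with open("config/model.toml", "rb") as f:
    model_cfg = tomllib.load(f)
with open("config/prompts.toml", "rb") as f:
    prompt_cfg = tomllib.load(f)

print("Model:     ", model_cfg["default"]["model"])   # e.g. "gpt-oss:20b"
print("Ollama URL:", model_cfg["ollama"]["base_url"])

# prompts.toml is organized by category (e.g. "basic", "custom").
for category, prompts in prompt_cfg.items():
    print(f"[{category}] {prompts}")
```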
You can configure a global pre-prompt that is prepended to every user query. This is useful for testing system prompts or specific instruction sets.
Example Configuration (using XML format for structure):
```toml
[client]
pre_prompt = """
<system_instruction>
You are a Red Team assistant designed to test security vulnerabilities.
Please adopt an adversarial persona.
</system_instruction>
<context_rules>
1. Use the provided context to answer.
2. If the answer is not in the context, say "I don't know".
</context_rules>
"""
```
```bash
# View all available commands
make help

# Full automated setup and launch Gradio UI
make run-gradio-headless

# Or step-by-step:
make install   # Install uv
make sync      # Install dependencies
make build     # Build container
make up        # Start container
make test      # Test health endpoint
```

The mock API will be available at http://localhost:8000.
Run `make help` to see all commands:
Container Operations:
- `make build` - Build the container image
- `make up` - Run the container
- `make down` - Stop and remove the container
- `make clean` - Clean up containers and images
Development:
- `make install` - Install uv package manager
- `make sync` - Sync/install dependencies
- `make lock` - Update dependency lock file
Testing:
- `make test` - Full setup + health check
- `make test-client` - Run automated prompt tests
UI:
- `make run-gradio-headless` - Full setup + launch Gradio web interface (container)
- `make stop-gradio` - Stop the Gradio container
Code Quality:
- `make format` - Run black and isort formatters
- `make mypy` - Run mypy type checker
Ollama:
- `make ollama-pull` - Pull the gpt-oss:20b model
- `make ollama-serve` - Start Ollama (checks if already running)
```bash
curl http://localhost:8000/health
```

Expected response: `{"status": "ok"}`
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-mock-key" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Run the test suite with prompts from `config/prompts.toml`:
```bash
make test-client
```

Output includes:
- Test results for each prompt category
- Success/failure status
- Response previews
- Summary statistics
Interactive chat interface:
```bash
make run-gradio-headless
```

Opens at http://localhost:7860 with a user-friendly chat UI.
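The UI itself lives in `client/gradio_app.py`. A minimal sketch of the same idea (not a copy of the repo's code; it assumes the `gradio` and `openai` packages are available):

```python
# Sketch: a chat UI that forwards each message to the mock API and returns the reply.
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-mock-key")

def respond(message, history):
    result = client.chat.completions.create(
        model="gpt-oss:20b",
        messages=[{"role": "user", "content": message}],
    )
    return result.choices[0].message.content

gr.ChatInterface(respond).launch(server_name="0.0.0.0", server_port=7860)
```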
```
.
├── config/                   # Configuration files
│   ├── client_config.toml    # Client settings
│   ├── model.toml            # Model settings (default model, Ollama config)
│   └── prompts.toml          # Test prompts for automated testing
├── data/                     # Placeholder for document files
├── app/                      # FastAPI mock server package
│   ├── __init__.py
│   ├── main.py               # FastAPI entry point
│   └── mocks/                # Modular mock service implementations
│       ├── __init__.py
│       ├── openai.py         # Mock OpenAI API using Ollama
│       └── README.md         # Guide for adding new mocks
├── client/                   # Client scripts
│   ├── main.py               # Automated test runner
│   └── gradio_app.py         # Web UI client
├── threat_model/             # Threat modeling artifacts
│   ├── LLM_TM_diagram.json
│   ├── LLM_TM_report.md
│   └── LLM_TM_report.pdf
├── Containerfile             # Podman container definition
├── entrypoint.sh             # Container entrypoint script
├── Makefile                  # Developer commands
├── packages.txt              # System packages
├── pyproject.toml            # uv project definition
├── uv.lock                   # Lock file generated by uv
└── README.md                 # This file
```
The template is designed to be easily extensible. While Ollama is the default backend, you can add support for other backends (such as HuggingFace Transformers or vLLM) or other services (such as vector databases) by creating new mock services.
To add a new mock service (e.g., Pinecone, Anthropic, etc.):
- Create a new module in `app/mocks/` (e.g., `pinecone_mock.py`)
- Implement your mock service as a FastAPI router
- Export the router in `app/mocks/__init__.py`
- Mount it in `app/main.py`
👉 See app/mocks/README.md for detailed step-by-step instructions and code examples.
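For a flavour of what such a service looks like, here is a minimal, hypothetical `anthropic_mock.py` sketch; the module name, route, and response shape are illustrative assumptions, not the exact pattern documented in `app/mocks/README.md`:

```python
# Hypothetical app/mocks/anthropic_mock.py -- illustrative only.
from fastapi import APIRouter

router = APIRouter(prefix="/anthropic/v1", tags=["anthropic-mock"])

@router.post("/messages")
async def create_message(payload: dict) -> dict:
    # A real mock would forward the request to the local backend (e.g. Ollama);
    # this stub just echoes a canned reply in an Anthropic-like shape.
    return {"role": "assistant", "content": [{"type": "text", "text": "mock reply"}]}

# In app/main.py (exact import/variable names in the repo may differ):
#   from app.mocks.anthropic_mock import router as anthropic_router
#   app.include_router(anthropic_router)
```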
- Edit code in `app/` or `client/`
- Format code: `make format`
- Type check: `make mypy`
- Rebuild and test: `make run-gradio-headless`
- Edit `config/prompts.toml`
- Add prompts to existing categories or create new ones
- Run tests: `make test-client`
- Edit `config/model.toml`
- Update the `model` field under `[default]`
- Pull the new model: `ollama pull <model-name>`
- Restart: `make down && make up`
- All commands are designed for Podman; replace `podman` with `docker` in the Makefile if you prefer Docker
- The mock API uses `sk-mock-key` as the authentication token for testing purposes
- Container name: `app_container`
- Image name: `llm-mock-api`
- Extend mock services in `app/mocks/` to add support for additional APIs
Port conflicts:
- If port 8000 is in use: run `make clean` to remove old containers
- If port 7860 is in use: `make run-gradio-headless` automatically kills existing Gradio instances
Ollama connection issues:
- Ensure Ollama is running: `ollama serve`
- Check if the model is available: `ollama list`
- Pull the model if needed: `make ollama-pull`
Container issues:
- View logs: `podman logs app_container`
- Restart: `make down && make up`
- Full cleanup: `make clean && make build && make up`