This repository provides a robust template for creating local LLM sandboxes. It is designed for Red Teaming: it lets you mimic production environments without external dependencies or API costs.
This project serves as a "Local OpenAI API Mirror". It tricks applications into believing they are communicating with the real OpenAI API, while actually routing requests to a local LLM backend (defaulting to Ollama).
Why use this for Red Teaming?
- Controlled Environment: Test attacks and defenses in a safe, isolated container.
- No Cost: Run extensive fuzzing or automated scans without burning API credits.
- Offline Capable: Work in air-gapped or restricted network environments.
- Model Agnostic: Swap between different model families (Llama, Mistral, Gemma, etc.) to test model-specific vulnerabilities.
The template includes a FastAPI-based mock server, modular service implementations, automated testing, client scripts, and container orchestration using Podman.
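From a client's point of view, the sandbox is a drop-in stand-in for the real API: only the base URL and key change. Below is a minimal sketch using the official openai Python SDK (assuming it is installed in your client environment); the endpoint, key, and model name match the defaults documented later in this README.

```python
# Minimal sketch: point the openai SDK at the local mock instead of api.openai.com.
from openai import OpenAI

# The mock accepts "sk-mock-key" as its test bearer token (see Notes below).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-mock-key")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```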
Production architecture (what the sandbox mimics):

```mermaid
graph TB
    subgraph "Client Environment"
        Client[Client Application]
    end

    subgraph "Application Server"
        API[LLM API Gateway]
        AppLogic[Application Logic]
    end

    subgraph "External Services"
        LLM[Language Model Service<br/>OpenAI/Anthropic/etc.]
    end

    Client -->|HTTPS| API
    API --> AppLogic
    AppLogic -->|API Call| LLM
    LLM -->|Response| AppLogic
    AppLogic --> API
    API -->|Response| Client

    style Client fill:#e1f5ff
    style API fill:#fff4e1
    style AppLogic fill:#fff4e1
    style LLM fill:#ffe1f5
```
Local sandbox architecture:

```mermaid
graph LR
    subgraph "Client Environment (Local)"
        GradioUI[Gradio Web UI<br/>:7860]
        TestClient[Automated Test Client<br/>config/prompts.toml]
    end

    subgraph "Application Server (Container)"
        MockAPI[Mock API Gateway<br/>FastAPI :8000]
        MockLogic[Mock App Logic<br/>app/mocks/openai.py]
    end

    subgraph "External Services (Local Host)"
        Ollama[Ollama Server<br/>:11434]
        Model[gpt-oss:20b Model<br/>config/model.toml]
    end

    GradioUI -->|HTTP| MockAPI
    TestClient -->|HTTP| MockAPI
    MockAPI --> MockLogic
    MockLogic -->|HTTP| Ollama
    Ollama --> Model
    Model --> Ollama
    Ollama -->|Response| MockLogic
    MockLogic --> MockAPI
    MockAPI -->|Response| GradioUI
    MockAPI -->|Response| TestClient

    style GradioUI fill:#e1f5ff
    style TestClient fill:#e1f5ff
    style MockAPI fill:#fff4e1
    style MockLogic fill:#fff4e1
    style Ollama fill:#ffe1f5
    style Model fill:#ffe1f5
```
Mapping to Production:
- Client Environment → Local browser/scripts (instead of remote client)
- Application Server → Containerized mock API (instead of cloud deployment)
- External Services → Local Ollama + model (instead of cloud LLM/VectorDB)
The threat model for this local LLM architecture is available in the `threat_model/` directory. It includes:
- Diagram: `LLM_TM_diagram.json` (ThreatCanvas compatible)
- Report: `LLM_TM_report.md` and `LLM_TM_report.pdf`
- uv – Python package manager (`pip install uv` if not already installed)
- Podman (or Docker – replace `podman` with `docker` in the Makefile if desired)
- Ollama (local LLM runner)
- Install Ollama.
- Pull a model: `make ollama-pull` (pulls the default `gpt-oss:20b`; use `ollama pull <model_name>` for another model, e.g., Llama 3)
- Start the Ollama server (usually runs automatically): `ollama serve`
- Note: the containerized app accesses Ollama on the host via `host.containers.internal:11434`
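Before building the container, it can be handy to confirm that Ollama is reachable and the expected model is pulled. A small sketch against Ollama's model-listing endpoint (`/api/tags`), using only the Python standard library:

```python
# Sketch: list the models Ollama currently has available.
import json
import urllib.request

# From inside the container, "localhost" becomes "host.containers.internal".
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]

print("Available models:", models)  # expect e.g. "gpt-oss:20b" after `make ollama-pull`
```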
Because this template uses Ollama as the default backend, you can use any model supported by Ollama from its library. This includes a wide range of open-weights models perfect for testing different capabilities and safety filters:
- Llama 3 (Meta)
- Mistral / Mixtral (Mistral AI)
- Gemma (Google)
- Qwen (Alibaba)
- DeepSeek (DeepSeek)
- Phi-3 (Microsoft)
- GPT-OSS (OpenAI)
The default configuration of this sandbox uses the `gpt-oss:20b` model, a 20-billion-parameter (20B) model with 4-bit (Q4) quantization. To ensure low-latency performance and prevent resource exhaustion, the following specifications are recommended:
- Dedicated GPU Memory: 16 GB.
- System Memory: 32 GB.
- Storage: 14 GB available space.
For Apple Silicon Macs, you can use the gpt-oss:20b model with the following specifications or better:
- Chip: Apple M4 Pro.
- Memory: 24 GB.
- Storage: 14 GB available space.
To use a different model, simply pull it with `ollama pull <model_name>` and update `config/model.toml` (see next subsection).
`config/model.toml` controls which LLM model to use:

```toml
[default]
model = "gpt-oss:20b"  # Change to switch models

[ollama]
base_url = "http://host.containers.internal:11434/v1"
```

`config/prompts.toml` defines automated test prompts organized by category:
- `basic` - Simple functionality tests
- `custom` - Your own test prompts
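Both files are plain TOML, so they can be inspected or reused from your own tooling. A minimal sketch using Python's built-in `tomllib` (Python 3.11+); the exact key layout inside each prompt category is defined by the file itself and only assumed here:

```python
# Sketch: read the model and prompt configuration with the standard library.
import tomllib

with open("config/model.toml", "rb") as f:
    model_cfg = tomllib.load(f)
with open("config/prompts.toml", "rb") as f:
    prompt_cfg = tomllib.load(f)

print("Model:     ", model_cfg["default"]["model"])   # e.g. "gpt-oss:20b"
print("Ollama URL:", model_cfg["ollama"]["base_url"])

# prompts.toml is organized by category (e.g. "basic", "custom").
for category, prompts in prompt_cfg.items():
    print(f"[{category}] {prompts}")
```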
You can configure a global pre-prompt that is prepended to every user query. This is useful for testing system prompts or specific instruction sets.
Example Configuration (using XML format for structure):
```toml
[client]
pre_prompt = """
<system_instruction>
You are a Red Team assistant designed to test security vulnerabilities.
Please adopt an adversarial persona.
</system_instruction>
<context_rules>
1. Use the provided context to answer.
2. If the answer is not in the context, say "I don't know".
</context_rules>
"""
```
```bash
# View all available commands
make help

# Full automated setup and launch Gradio UI
make run-gradio-headless

# Or step-by-step:
make install   # Install uv
make sync      # Install dependencies
make build     # Build container
make up        # Start container
make test      # Test health endpoint
```

The mock API will be available at http://localhost:8000.
Run `make help` to see all commands:
Container Operations:
- `make build` - Build the container image
- `make up` - Run the container
- `make down` - Stop and remove the container
- `make clean` - Clean up containers and images
Development:
- `make install` - Install uv package manager
- `make sync` - Sync/install dependencies
- `make lock` - Update dependency lock file
Testing:
- `make test` - Full setup + health check
- `make test-client` - Run automated prompt tests
UI:
- `make run-gradio-headless` - Full setup + launch Gradio web interface (container)
- `make stop-gradio` - Stop the Gradio container
Code Quality:
- `make format` - Run black and isort formatters
- `make mypy` - Run mypy type checker
Ollama:
- `make ollama-pull` - Pull the gpt-oss:20b model
- `make ollama-serve` - Start Ollama (checks if already running)
```bash
curl http://localhost:8000/health
```

Expected response: `{"status": "ok"}`
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-mock-key" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Run the test suite with prompts from `config/prompts.toml`:
```bash
make test-client
```

Output includes:
- Test results for each prompt category
- Success/failure status
- Response previews
- Summary statistics
Interactive chat interface:
```bash
make run-gradio-headless
```

Opens at http://localhost:7860 with a user-friendly chat UI.
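The UI itself lives in `client/gradio_app.py`. A minimal sketch of the same idea (not a copy of the repo's code; it assumes the `gradio` and `openai` packages are available):

```python
# Sketch: a chat UI that forwards each message to the mock API and returns the reply.
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-mock-key")

def respond(message, history):
    result = client.chat.completions.create(
        model="gpt-oss:20b",
        messages=[{"role": "user", "content": message}],
    )
    return result.choices[0].message.content

gr.ChatInterface(respond).launch(server_name="0.0.0.0", server_port=7860)
```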
```
.
├── config/                   # Configuration files
│   ├── client_config.toml    # Client settings
│   ├── model.toml            # Model settings (default model, Ollama config)
│   └── prompts.toml          # Test prompts for automated testing
├── data/                     # Placeholder for document files
├── app/                      # FastAPI mock server package
│   ├── __init__.py
│   ├── main.py               # FastAPI entry point
│   └── mocks/                # Modular mock service implementations
│       ├── __init__.py
│       ├── openai.py         # Mock OpenAI API using Ollama
│       └── README.md         # Guide for adding new mocks
├── client/                   # Client scripts
│   ├── main.py               # Automated test runner
│   └── gradio_app.py         # Web UI client
├── threat_model/             # Threat modeling artifacts
│   ├── LLM_TM_diagram.json
│   ├── LLM_TM_report.md
│   └── LLM_TM_report.pdf
├── Containerfile             # Podman container definition
├── entrypoint.sh             # Container entrypoint script
├── Makefile                  # Developer commands
├── packages.txt              # System packages
├── pyproject.toml            # uv project definition
├── uv.lock                   # Lock file generated by uv
└── README.md                 # This file
```
The template is designed to be easily extensible. While Ollama is the default backend, you can add support for other backends (such as HuggingFace Transformers or vLLM) or other services (such as vector databases) by creating new mock services.
To add a new mock service (e.g., Pinecone, Anthropic, etc.):
- Create a new module in `app/mocks/` (e.g., `pinecone_mock.py`)
- Implement your mock service as a FastAPI router
- Export the router in `app/mocks/__init__.py`
- Mount it in `app/main.py`
👉 See app/mocks/README.md for detailed step-by-step instructions and code examples.
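For a flavour of what such a service looks like, here is a minimal, hypothetical `anthropic_mock.py` sketch; the module name, route, and response shape are illustrative assumptions, not the exact pattern documented in `app/mocks/README.md`:

```python
# Hypothetical app/mocks/anthropic_mock.py -- illustrative only.
from fastapi import APIRouter

router = APIRouter(prefix="/anthropic/v1", tags=["anthropic-mock"])

@router.post("/messages")
async def create_message(payload: dict) -> dict:
    # A real mock would forward the request to the local backend (e.g. Ollama);
    # this stub just echoes a canned reply in an Anthropic-like shape.
    return {"role": "assistant", "content": [{"type": "text", "text": "mock reply"}]}

# In app/main.py (exact import/variable names in the repo may differ):
#   from app.mocks.anthropic_mock import router as anthropic_router
#   app.include_router(anthropic_router)
```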
- Edit code in `app/` or `client/`
- Format code: `make format`
- Type check: `make mypy`
- Rebuild and test: `make run-gradio-headless`
- Edit `config/prompts.toml`
- Add prompts to existing categories or create new ones
- Run tests: `make test-client`
- Edit `config/model.toml`
- Update the `model` field under `[default]`
- Pull the new model: `ollama pull <model-name>`
- Restart: `make down && make up`
- All commands are designed for Podman; replace `podman` with `docker` in the Makefile if you prefer Docker
- The mock API uses `sk-mock-key` as the authentication token for testing purposes
- Container name: `app_container`
- Image name: `llm-mock-api`
- Extend mock services in `app/mocks/` to add support for additional APIs
Port conflicts:
- If port 8000 is in use: run `make clean` to remove old containers
- If port 7860 is in use: `make run-gradio-headless` automatically kills existing Gradio instances
Ollama connection issues:
- Ensure Ollama is running: `ollama serve`
- Check if the model is available: `ollama list`
- Pull the model if needed: `make ollama-pull`
Container issues:
- View logs: `podman logs app_container`
- Restart: `make down && make up`
- Full cleanup: `make clean && make build && make up`