This repository provides a robust template for creating local RAG sandboxes. It is designed for Red Teaming: it lets you mimic production environments without external dependencies or API costs.
This project serves as a "Local OpenAI API Mirror". It tricks applications into believing they are communicating with the real OpenAI API, while actually routing requests to a local LLM backend (defaulting to Ollama).
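To illustrate the "mirror" idea, the sketch below builds the exact request an OpenAI client would send, but aimed at the local mock instead of api.openai.com. The endpoint path, port, token, and model name are this template's defaults; the request is constructed but not sent, so no server is required:

```python
import json
import urllib.request

# The mock gateway exposes the same /v1 surface as the real OpenAI API.
MOCK_BASE = "http://localhost:8000/v1"

payload = {
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    f"{MOCK_BASE}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer sk-mock-key",  # mock accepts this test token
    },
    method="POST",
)
# The wire format is identical to the real API; only the host differs.
print(req.full_url)
```

Because the wire format matches, unmodified client applications (and attack tooling built for the real API) work against the sandbox with only a base-URL change.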
Why use this for Red Teaming?
- Controlled Environment: Test attacks and defenses in a safe, isolated container.
- No Cost: Run extensive fuzzing or automated scans without burning API credits.
- Offline Capable: Work in air-gapped or restricted network environments.
- Model Agnostic: Swap between different model families (Llama, Mistral, Gemma, etc.) to test model-specific vulnerabilities.
The template includes a FastAPI-based mock server, modular service implementations, automated testing, client scripts, and container orchestration using Podman.
```mermaid
graph TB
    subgraph "Client Environment"
        Client[Client Application]
    end
    subgraph "Application Server"
        API[LLM API Gateway]
        AppLogic[Application Logic]
    end
    subgraph "External Services"
        LLM[Language Model Service<br/>OpenAI/Anthropic/etc.]
        VectorDB[Vector Database<br/>Pinecone/Weaviate/etc.]
        S3[Object Storage<br/>Amazon S3]
    end
    Client -->|HTTPS| API
    API --> AppLogic
    AppLogic -->|API Call| LLM
    AppLogic -->|Query/Store| VectorDB
    AppLogic -->|Upload/Download| S3
    LLM -->|Response| AppLogic
    VectorDB -->|Results| AppLogic
    S3 -->|File Data| AppLogic
    AppLogic --> API
    API -->|Response| Client
    style Client fill:#e1f5ff
    style API fill:#fff4e1
    style AppLogic fill:#fff4e1
    style LLM fill:#ffe1f5
    style VectorDB fill:#ffe1f5
    style S3 fill:#ffe1f5
```
```mermaid
graph LR
    subgraph "Client Environment (Local)"
        GradioUI[Gradio Web UI<br/>:7860]
        TestClient[Automated Test Routine<br/>config/prompts.toml]
        ETL[ETL Routine<br/>ETL/ingest.py]
    end
    subgraph "Application Server (Container)"
        MockAPI[API Gateway<br/>FastAPI :8000]
        RAGEngine[RAG Engine<br/>app/rag_engine.py]
        MockOpenAI[Mock OpenAI API<br/>app/mocks/openai.py]
        MockPinecone[Mock Pinecone API<br/>app/mocks/pinecone.py]
        MockS3[Mock S3 API<br/>app/mocks/s3.py]
        ChromaDB[(Mock Vector DB<br/>ChromaDB)]
    end
    subgraph "External Services (Local Host)"
        Ollama[Ollama Server<br/>:11434]
        Model[gpt-oss:20b Model<br/>config/model.toml]
        FileStorage[File Storage<br/>data/documents]
    end
    GradioUI -->|HTTP /rag/v1| MockAPI
    TestClient -->|HTTP /rag/v1| MockAPI
    MockAPI -->|Route /rag| RAGEngine
    MockAPI -->|Route /v1| MockOpenAI
    MockAPI -->|Route /pinecone| MockPinecone
    MockAPI -->|Route /s3| MockS3
    RAGEngine -.->|HTTP /v1| MockAPI
    RAGEngine -.->|HTTP /pinecone| MockAPI
    MockOpenAI -->|HTTP| Ollama
    Ollama --> Model
    Model --> Ollama
    Ollama -->|Response| MockOpenAI
    MockPinecone -->|Query/Upsert| ChromaDB
    ChromaDB -->|Results| MockPinecone
    MockS3 -->|Save| FileStorage
    ETL -->|Read| FileStorage
    ETL -->|Upsert| MockPinecone
    style GradioUI fill:#e1f5ff
    style TestClient fill:#e1f5ff
    style ETL fill:#e1f5ff
    style MockAPI fill:#fff4e1
    style RAGEngine fill:#fff4e1
    style MockOpenAI fill:#ffe1f5
    style MockPinecone fill:#ffe1f5
    style MockS3 fill:#ffe1f5
    style ChromaDB fill:#ffe1f5
    style Ollama fill:#ffe1f5
    style Model fill:#ffe1f5
    style FileStorage fill:#ffe1f5
```
Mapping to Production:
- Client Environment → Local browser/scripts (instead of remote client)
- Application Server → Containerized mock API + ChromaDB (instead of cloud deployment)
- External Services → Local Ollama + model (instead of cloud LLM)
The threat model for this RAG architecture is available in the threat_model/ directory. It includes:
- Diagram: `RAG_TM_diagram.json` (ThreatCanvas compatible)
- Report: `RAG_TM_report.md` and `RAG_TM_report.pdf`
These artifacts were generated using ThreatCanvas by SecureFlag.
This template includes a mock Pinecone API backed by a local ChromaDB instance.
- Endpoints:
  - `POST /pinecone/vectors/upsert`: Upserts vectors.
  - `POST /pinecone/query`: Queries vectors.
- Authentication: Requires header `Api-Key: bar`.
- Persistence: Data is stored locally in `data/chromadb`.
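Conceptually, the query endpoint runs a nearest-neighbour search over previously upserted vectors. The toy in-memory sketch below illustrates those semantics only; the actual mock delegates storage and similarity search to ChromaDB, and these function names are illustrative:

```python
import math

# Toy in-memory index mimicking the upsert/query semantics of the mock
# Pinecone API (illustrative only; the real mock is backed by ChromaDB).
index: dict[str, dict] = {}

def upsert(vec_id: str, values: list[float], metadata: dict) -> None:
    # Insert-or-update a vector record, keyed by its id.
    index[vec_id] = {"values": values, "metadata": metadata}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query(vector: list[float], top_k: int) -> list[dict]:
    # Score every stored vector against the query and return the top_k matches.
    scored = [
        {"id": vid, "score": cosine(vector, rec["values"]), "metadata": rec["metadata"]}
        for vid, rec in index.items()
    ]
    return sorted(scored, key=lambda m: m["score"], reverse=True)[:top_k]

upsert("vec1", [0.1, 0.2, 0.3], {"text": "The secret code is 12345."})
matches = query([0.1, 0.2, 0.3], top_k=1)
```

Querying with the same vector that was upserted returns it as the top match with a cosine score of 1.0, which is why the curl examples later in this README reuse `[0.1, 0.2, 0.3]` for both upsert and query.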
The RAG Engine (app/rag_engine.py) orchestrates the retrieval and generation process by communicating with the mock services:
- Embed: Calls the Mock OpenAI API (`/v1/embeddings`) to generate embeddings for the user's query.
- Retrieve: Calls the Mock Pinecone API (`/pinecone/query`) to retrieve relevant documents.
- Augment: Injects retrieved context into the system prompt.
- Generate: Calls the Mock OpenAI API (`/v1/chat/completions`) with the augmented prompt to generate the response.
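The four steps above can be sketched as follows. The stub functions stand in for the HTTP calls to the mock services; the function names and return values are illustrative assumptions, not the actual `app/rag_engine.py` API:

```python
# Illustrative embed → retrieve → augment → generate loop (sketch only).

def embed(query: str) -> list[float]:
    # Stands in for POST /v1/embeddings
    return [float(len(query)), 0.0, 0.0]

def retrieve(vector: list[float], top_k: int = 3) -> list[str]:
    # Stands in for POST /pinecone/query
    return ["The secret code is 12345."]

def augment(query: str, docs: list[str]) -> list[dict]:
    # Inject retrieved context into the system prompt.
    context = "\n".join(docs)
    return [
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": query},
    ]

def generate(messages: list[dict]) -> str:
    # Stands in for POST /v1/chat/completions
    return f"(model reply to {len(messages)} messages)"

question = "What is the secret code?"
messages = augment(question, retrieve(embed(question)))
answer = generate(messages)
```

Note that the retrieved context lands in the system prompt, which is exactly the surface a Red Teamer probes with poisoned documents (indirect prompt injection).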
- uv – Python package manager (`pip install uv` if not already installed)
- Podman (or Docker – replace `podman` with `docker` in the Makefile if desired)
- Ollama (Local LLM runner)
- Install Ollama.
- Pull a model (e.g., Llama 3): `make ollama-pull`
- Start the Ollama server (usually runs automatically): `ollama serve`
- Note: The containerized app accesses Ollama on the host via `host.containers.internal:11434`.
Because this template uses Ollama as the default backend, you can use any model supported by Ollama from its library. This includes a wide range of open-weight models, ideal for testing different capabilities and safety filters:
- Llama 3 (Meta)
- Mistral / Mixtral (Mistral AI)
- Gemma (Google)
- Qwen (Alibaba)
- DeepSeek (DeepSeek)
- Phi-3 (Microsoft)
- GPT-OSS (OpenAI)
The default configuration of this sandbox uses the gpt-oss:20b model, a 20-billion-parameter (20B) model with 4-bit (Q4) quantization. To ensure low-latency performance and prevent resource exhaustion, the following specifications are recommended:
- Dedicated GPU Memory: 16 GB.
- System Memory: 32 GB.
- Storage: 14 GB available space.
For Apple Silicon Macs, you can use the gpt-oss:20b model with the following specifications or better:
- Chip: Apple M4 Pro.
- Memory: 24 GB.
- Storage: 14 GB available space.
To use a different model, simply pull it with `ollama pull <model_name>` and update `config/model.toml` (see next subsection).
`config/model.toml` controls which LLM model to use:

```toml
[default]
model = "gpt-oss:20b"  # Change to switch models

[ollama]
base_url = "http://host.containers.internal:11434/v1"
```

`config/prompts.toml` defines automated test prompts organized by category:

- `basic` – Simple functionality tests
- `custom` – Your own test prompts
You can configure a global pre-prompt that is prepended to every user query. This is useful for testing system prompts or specific instruction sets.
Example Configuration (using XML format for structure):
```toml
[client]
pre_prompt = """
<system_instruction>
You are a Red Team assistant designed to test security vulnerabilities.
Please adopt an adversarial persona.
</system_instruction>

<context_rules>
1. Use the provided context to answer.
2. If the answer is not in the context, say "I don't know".
</context_rules>
"""
```

To ingest PDF documents into the RAG system:
- Place PDFs: Put your PDF files in `data/documents/`.
- Run Ingestion: `make ingest`

This command:

- Parses all PDFs in `data/documents/`.
- Chunks the text.
- Generates embeddings using Ollama.
- Upserts vectors to the mock Pinecone API.
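The chunking step can be illustrated with a simple overlapping-window splitter. This is a sketch only; the actual logic in `ETL/ingest.py` may use different sizes and split on sentence or page boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, so content that
    straddles a boundary still appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # each window starts `step` chars after the last
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final window already covers the end of the text
    return chunks

doc = "x" * 1200
chunks = chunk_text(doc, chunk_size=500, overlap=50)
```

Each chunk is then embedded and upserted individually, so retrieval returns passages rather than whole documents.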
```shell
# View all available commands
make help

# Full automated setup and launch Gradio UI
make run-gradio-headless  # Run in container
# OR
make gradio               # Run locally

# Or step-by-step:
make install  # Install uv
make sync     # Install dependencies
make build    # Build container
make up       # Start container
make test     # Test health endpoint
```

The mock API will be available at http://localhost:8000.
Run `make help` to see all commands:
Container Operations:

- `make build` – Build the container image
- `make up` – Run the container
- `make down` – Stop and remove the container
- `make clean` – Clean up containers and images

Development:

- `make install` – Install uv package manager
- `make sync` – Sync/install dependencies
- `make lock` – Update dependency lock file

Testing:

- `make test` – Full setup + health check
- `make test-client` – Run automated prompt tests

UI:

- `make gradio` – Full setup + launch Gradio web interface (local)
- `make run-gradio-headless` – Full setup + launch Gradio web interface (container)
- `make stop-gradio` – Stop the Gradio container

Code Quality:

- `make format` – Run black and isort formatters
- `make mypy` – Run mypy type checker

Ollama:

- `make ollama-pull` – Pull gpt-oss:20b model
- `make ollama-serve` – Start Ollama (checks if already running)
```shell
curl http://localhost:8000/health
```

Expected response: `{"status": "ok"}`
```shell
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-mock-key" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Upsert Vectors:
```shell
curl -X POST http://localhost:8000/pinecone/vectors/upsert \
  -H "Api-Key: bar" \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": [
      {
        "id": "vec1",
        "values": [0.1, 0.2, 0.3],
        "metadata": {"text": "The secret code is 12345."}
      }
    ]
  }'
```

Query Vectors:
```shell
curl -X POST http://localhost:8000/pinecone/query \
  -H "Api-Key: bar" \
  -H "Content-Type: application/json" \
  -d '{"vector": [0.1, 0.2, 0.3], "topK": 1}'
```

Upload File (PUT Object):
```shell
curl -X PUT http://localhost:8000/s3/documents/example.pdf \
  -H "Api-Key: foobar" \
  --data-binary @path/to/example.pdf
```

Note: This mock saves files to `data/documents/` when the bucket name is `documents`.
Run the test suite with prompts from `config/prompts.toml`:

```shell
make test-client
```

Output includes:
- Test results for each prompt category
- Success/failure status
- Response previews
- Summary statistics
Interactive chat interface:
```shell
make run-gradio-headless
```

Opens at http://localhost:7860 with a user-friendly chat UI.
```
.
├── config/                    # Configuration files
│   ├── client_config.toml     # Client settings
│   ├── model.toml             # Model settings (default model, Ollama config)
│   └── prompts.toml           # Test prompts for automated testing
├── data/                      # Data directory
│   ├── chromadb/              # ChromaDB persistence
│   └── documents/             # Document files
├── app/                       # FastAPI mock server package
│   ├── __init__.py
│   ├── main.py                # FastAPI entry point
│   ├── rag_engine.py          # RAG Engine implementation
│   └── mocks/                 # Modular mock service implementations
│       ├── __init__.py
│       ├── openai.py          # Mock OpenAI API using Ollama
│       ├── pinecone.py        # Mock Pinecone API using ChromaDB
│       ├── s3.py              # Mock S3 API
│       └── README.md          # Guide for adding new mocks
├── client/                    # Client scripts
│   ├── main.py                # Automated test runner
│   └── gradio_app.py          # Web UI client
├── threat_model/              # Threat modeling artifacts
│   ├── RAG_TM_diagram.json
│   ├── RAG_TM_report.md
│   └── RAG_TM_report.pdf
├── Containerfile              # Podman container definition
├── entrypoint.sh              # Container entrypoint script
├── Makefile                   # Developer commands
├── packages.txt               # System packages
├── pyproject.toml             # uv project definition
├── uv.lock                    # Lock file generated by uv
└── README.md                  # This file
```
The template is designed to be easily extensible. While Ollama is the default, you can add support for other backends (like HuggingFace Transformers, vLLM, or other vector databases) by creating new mock services.
To add a new mock service (e.g., Pinecone, Anthropic, etc.):
- Create a new module in `app/mocks/` (e.g., `pinecone_mock.py`)
- Implement your mock service as a FastAPI router
- Export the router in `app/mocks/__init__.py`
- Mount it in `app/main.py`
👉 See app/mocks/README.md for detailed step-by-step instructions and code examples.
- Edit code in `app/` or `client/`
- Format code: `make format`
- Type check: `make mypy`
- Rebuild and test: `make run-gradio-headless` (or `make gradio`)
- Edit `config/prompts.toml`
- Add prompts to existing categories or create new ones
- Run tests: `make test-client`
- Edit `config/model.toml`
- Update the `model` field under `[default]`
- Pull the new model: `ollama pull <model-name>`
- Restart: `make down && make up`
- All commands are designed for Podman; replace `podman` with `docker` in the Makefile if you prefer Docker
- The mock API uses `sk-mock-key` as the authentication token for testing purposes
- Container name: `app_container`
- Image name: `llm-mock-api`
- Extend mock services in `app/mocks/` to add support for additional APIs
Port conflicts:
- If port 8000 is in use: `make clean` to remove old containers
- If port 7860 is in use: `make run-gradio-headless` automatically kills existing Gradio instances
Ollama connection issues:
- Ensure Ollama is running: `ollama serve`
- Check if model is available: `ollama list`
- Pull model if needed: `make ollama-pull`
Container issues:
- View logs: `podman logs app_container`
- Restart: `make down && make up`
- Full cleanup: `make clean && make build && make up`