Complete configuration reference for customizing your llama-chat installation with llama.cpp.
| Configuration Type | File/Method | Purpose |
|---|---|---|
| Runtime Settings | `config.json` | Model parameters, timeouts, performance |
| Service Configuration | `llama-chat.conf` | Server ports, paths, GPU settings |
| Environment | Environment variables | Runtime overrides and debugging |
| Database | `DATABASE_PATH` | SQLite database location |
| llama.cpp | Command-line arguments | Server-specific configuration |
Create a `config.json` file in your project root to customize application behavior.

```json
{
"timeouts": {
"llamacpp_timeout": 600,
"llamacpp_connect_timeout": 45
},
"model_options": {
"temperature": 0.1,
"top_p": 0.95,
"top_k": 50,
"min_p": 0.01,
"num_predict": 4096,
"repeat_penalty": 1.15,
"stop": ["\n\nHuman:", "\n\nUser:"]
},
"performance": {
"context_history_limit": 15,
"num_thread": -1,
"use_mlock": true,
"use_mmap": true
},
"system_prompt": "You are Dost, a knowledgeable and thoughtful AI assistant. Take time to provide detailed, accurate, and well-reasoned responses. Consider multiple perspectives and provide comprehensive information when helpful.",
"response_optimization": {
"stream": false,
"keep_alive": "10m"
}
}
```

The main configuration file, `llama-chat.conf`, handles service management and hardware optimization.

```bash
# llama-chat Configuration File
# This file contains configuration options for llama-chat and llama.cpp server
# ============================================================================
# INSTALLATION SETTINGS
# ============================================================================
# Installation directory
INSTALL_DIR=$HOME/llama-chat
# ============================================================================
# FLASK APPLICATION SETTINGS
# ============================================================================
# Flask web server configuration
FLASK_HOST=127.0.0.1
FLASK_PORT=3000
FLASK_DEBUG=false
# ============================================================================
# LLAMA.CPP SERVER SETTINGS
# ============================================================================
# Basic server configuration
LLAMACPP_HOST=127.0.0.1
LLAMACPP_PORT=8080
MODELS_DIR=$INSTALL_DIR/models
# Model settings
DEFAULT_MODEL=
CONTEXT_SIZE=4096
GPU_LAYERS=0
THREADS=4
BATCH_SIZE=512
# ============================================================================
# ADVANCED LLAMA.CPP SERVER OPTIONS
# ============================================================================
# Processing and Performance
# LLAMA_ARG_N_PARALLEL=1
# LLAMA_ARG_CONT_BATCHING=false
# LLAMA_ARG_N_THREADS_BATCH=4
# LLAMA_ARG_N_UBATCH=512
# LLAMA_ARG_N_KEEP=-1
# Memory Management
# LLAMA_ARG_MLOCK=false
# LLAMA_ARG_NO_MMAP=false
# LLAMA_ARG_NUMA=false
# Model Loading
# LLAMA_ARG_N_CTX=4096
# LLAMA_ARG_N_BATCH=512
# LLAMA_ARG_N_GPU_LAYERS=0
# LLAMA_ARG_MAIN_GPU=0
# LLAMA_ARG_TENSOR_SPLIT=
# Security Settings
# LLAMA_ARG_API_KEY=
# LLAMA_ARG_API_KEY_FILE=
# ============================================================================
# LOGGING CONFIGURATION
# ============================================================================
# Log file locations
LOG_DIR=$INSTALL_DIR/logs
LLAMACPP_LOG_FILE=$LOG_DIR/llamacpp.log
FLASK_LOG_FILE=$LOG_DIR/flask.log
# Log rotation
LOG_MAX_SIZE=100M
LOG_ROTATE_COUNT=5
```

The `timeouts` block controls connection and response timing behavior.

```json
{
"timeouts": {
"llamacpp_timeout": 600,
"llamacpp_connect_timeout": 45
}
}
```

| Parameter | Default | Description |
|---|---|---|
| `llamacpp_timeout` | 600 | Maximum seconds to wait for an AI response |
| `llamacpp_connect_timeout` | 45 | Maximum seconds to wait for a connection |
- **Fast responses:** Set `llamacpp_timeout` to 120-300 seconds
- **Complex queries:** Use 600-1200 seconds
- **Slow networks:** Increase `llamacpp_connect_timeout` to 60
- **Local setup:** Keep the defaults or reduce the timeouts
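As a rough illustration, here is how the two timeout values could be applied when the Flask app calls the llama.cpp server over HTTP. This is a minimal sketch, not llama-chat's actual code: the helper name and payload are assumptions, while the `/completion` endpoint and the `(connect, read)` timeout tuple come from llama.cpp and `requests` respectively.

```python
import json
import requests

# Minimal sketch, assuming config.json sits in the working directory and the
# llama.cpp server runs on the default host/port. The helper name is illustrative.
def ask_llamacpp(prompt: str) -> str:
    with open("config.json") as f:
        timeouts = json.load(f)["timeouts"]

    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={"prompt": prompt, "n_predict": 256},
        # requests interprets a tuple as (connect timeout, read timeout)
        timeout=(timeouts["llamacpp_connect_timeout"], timeouts["llamacpp_timeout"]),
    )
    resp.raise_for_status()
    return resp.json()["content"]
```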
Fine-tune AI model behavior and response characteristics.

```json
{
"model_options": {
"temperature": 0.1,
"top_p": 0.95,
"top_k": 50,
"min_p": 0.01,
"num_predict": 4096,
"repeat_penalty": 1.15,
"stop": ["\n\nHuman:", "\n\nUser:"]
}
}
```

| Parameter | Range | Default | Description |
|---|---|---|---|
| `temperature` | 0.0-2.0 | 0.1 | Response creativity (0 = deterministic, 2 = very creative) |
| `top_p` | 0.0-1.0 | 0.95 | Nucleus sampling threshold |
| `top_k` | 1-100 | 50 | Consider the top K most probable next tokens |
| `min_p` | 0.0-1.0 | 0.01 | Minimum probability threshold |
| `num_predict` | 1-8192 | 4096 | Maximum tokens to generate |
| `repeat_penalty` | 0.5-2.0 | 1.15 | Penalty for repeating tokens |
| `stop` | Array | `["\n\nHuman:", "\n\nUser:"]` | Sequences that stop generation |
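For reference, a small sketch of how the `model_options` block might be forwarded to the llama.cpp server's `/completion` endpoint. The key renames (for example `num_predict` to `n_predict`) follow llama.cpp's own parameter names; the forwarding code itself is an assumption, not taken from llama-chat.

```python
import json
import requests

# Minimal sketch: map config.json model_options onto llama.cpp request fields.
opts = json.load(open("config.json"))["model_options"]

payload = {
    "prompt": "Explain memory mapping in one paragraph.",
    "temperature": opts["temperature"],
    "top_p": opts["top_p"],
    "top_k": opts["top_k"],
    "min_p": opts["min_p"],
    "n_predict": opts["num_predict"],        # llama.cpp calls this n_predict
    "repeat_penalty": opts["repeat_penalty"],
    "stop": opts["stop"],
}

response = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=600)
print(response.json()["content"])
```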
Some example sampling presets:

Higher temperature for more varied, creative output:

```json
{
  "temperature": 0.8,
  "top_p": 0.9,
  "top_k": 50
}
```

Low temperature for precise, deterministic answers:

```json
{
  "temperature": 0.1,
  "top_p": 0.7,
  "top_k": 20
}
```

Conservative sampling with a little more variety:

```json
{
  "temperature": 0.2,
  "top_p": 0.8,
  "top_k": 25
}
```

A balanced middle ground:

```json
{
  "temperature": 0.3,
  "top_p": 0.85,
  "top_k": 40
}
```

Optimize memory usage and processing speed for llama.cpp.

```json
{
"performance": {
"context_history_limit": 15,
"num_thread": -1,
"use_mlock": true,
"use_mmap": true
}
}
```

```bash
# Model settings
CONTEXT_SIZE=4096 # Context window size
GPU_LAYERS=0 # Number of layers to offload to GPU
THREADS=4 # CPU threads to use
BATCH_SIZE=512 # Batch size for processing
# Advanced llama.cpp settings
LLAMA_ARG_N_PARALLEL=1 # Parallel processing slots
LLAMA_ARG_CONT_BATCHING=false # Continuous batching
LLAMA_ARG_MLOCK=false # Lock model in memory
LLAMA_ARG_NUMA=false # NUMA optimization
```

| Parameter | Default | Description |
|---|---|---|
| `context_history_limit` | 15 | Number of previous messages to include |
| `CONTEXT_SIZE` | 4096 | Context window size for the model |
| `GPU_LAYERS` | 0 | GPU layers to offload (0 = CPU only) |
| `THREADS` | 4 | CPU threads (-1 = auto-detect) |
| `BATCH_SIZE` | 512 | Batch processing size |
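To connect these settings to the server itself: the conf values above typically end up as llama-server command-line flags. The sketch below shows one plausible way a start script could build that command. The flag names (`--ctx-size`, `--n-gpu-layers`, `--threads`, `--batch-size`, `--host`, `--port`) are standard llama.cpp server options, but the script and the model filename are illustrative assumptions, not the project's actual start script.

```python
import shlex
import subprocess

# Values as they might be read from llama-chat.conf; the model filename is assumed.
conf = {
    "MODELS_DIR": "./models",
    "DEFAULT_MODEL": "llama-3.2-3b-instruct-q4_k_m.gguf",
    "LLAMACPP_HOST": "127.0.0.1",
    "LLAMACPP_PORT": "8080",
    "CONTEXT_SIZE": "4096",
    "GPU_LAYERS": "0",
    "THREADS": "4",
    "BATCH_SIZE": "512",
}

cmd = (
    f"llama-server -m {conf['MODELS_DIR']}/{conf['DEFAULT_MODEL']} "
    f"--host {conf['LLAMACPP_HOST']} --port {conf['LLAMACPP_PORT']} "
    f"--ctx-size {conf['CONTEXT_SIZE']} --n-gpu-layers {conf['GPU_LAYERS']} "
    f"--threads {conf['THREADS']} --batch-size {conf['BATCH_SIZE']}"
)
subprocess.run(shlex.split(cmd), check=True)
```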
CPU-only configuration:

```bash
# llama-chat.conf
GPU_LAYERS=0
THREADS=-1 # Use all CPU cores
LLAMA_ARG_MLOCK=false
LLAMA_ARG_NO_MMAP=false
```

GPU offloading:

```bash
# llama-chat.conf
GPU_LAYERS=32 # Or -1 for all layers
THREADS=8 # Fewer CPU threads when using GPU
LLAMA_ARG_MAIN_GPU=0 # Primary GPU ID
LLAMA_ARG_N_GPU_LAYERS=32
```

Apple Silicon (Metal):

```bash
# llama-chat.conf
GPU_LAYERS=-1 # Use all GPU layers
THREADS=8
# Metal is automatically enabled during compilation
```

Low-memory systems:

```json
{
"performance": {
"context_history_limit": 5,
"use_mlock": false
}
}
```

```bash
# llama-chat.conf
CONTEXT_SIZE=2048
BATCH_SIZE=256
LLAMA_ARG_N_CTX=2048
```

Large-memory systems:

```json
{
"performance": {
"context_history_limit": 25,
"use_mlock": true
}
}
```

```bash
# llama-chat.conf
CONTEXT_SIZE=8192
BATCH_SIZE=1024
LLAMA_ARG_N_CTX=8192
```

Define your AI assistant's personality and behavior with the `system_prompt` setting.

```json
{
"system_prompt": "You are Dost, a knowledgeable and thoughtful AI assistant. Take time to provide detailed, accurate, and well-reasoned responses. Consider multiple perspectives and provide comprehensive information when helpful."
}
```

Technical assistant:

```json
{
"system_prompt": "You are a senior software architect with expertise in Python, web development, and system design. Provide detailed technical explanations with code examples when helpful. Focus on best practices, performance, and maintainability."
}
```

Creative writing assistant:

```json
{
"system_prompt": "You are a creative writing assistant specializing in storytelling, character development, and narrative structure. Help users develop compelling stories with vivid descriptions and engaging dialogue."
}
```

Tutoring assistant:

```json
{
"system_prompt": "You are a patient and encouraging tutor. Break down complex concepts into digestible steps, provide examples, and ask clarifying questions to ensure understanding. Adapt your teaching style to the student's level."
}
```
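As a sketch of how `system_prompt` and `context_history_limit` might combine at request time: the template below uses the same `Human:`/`Assistant:` markers as the default `stop` sequences, but the helper and formatting are assumptions rather than llama-chat's actual implementation.

```python
import json

cfg = json.load(open("config.json"))

def build_prompt(history: list[tuple[str, str]], user_msg: str) -> str:
    """Assemble a prompt from the system prompt plus recent (user, assistant) turns."""
    limit = cfg["performance"]["context_history_limit"]
    parts = [cfg["system_prompt"]]
    for user, assistant in history[-limit:]:   # keep only the most recent turns
        parts.append(f"\n\nHuman: {user}\n\nAssistant: {assistant}")
    parts.append(f"\n\nHuman: {user_msg}\n\nAssistant:")
    return "".join(parts)
```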
Configure application runtime through environment variables.

```bash
# llama.cpp Configuration
LLAMACPP_HOST=127.0.0.1
LLAMACPP_PORT=8080
# Flask Configuration
FLASK_HOST=127.0.0.1
FLASK_PORT=3000
DEBUG=false
# Paths
MODELS_DIR=./models
DATABASE_PATH=./data/llama-chat.db
LOG_DIR=./logs
# Performance
GPU_LAYERS=0
THREADS=4
CONTEXT_SIZE=4096
```

```bash
# llama.cpp Server Arguments (prefix with LLAMA_ARG_)
LLAMA_ARG_N_GPU_LAYERS=32
LLAMA_ARG_N_PARALLEL=1
LLAMA_ARG_CONT_BATCHING=false
LLAMA_ARG_MLOCK=false
LLAMA_ARG_NUMA=false
LLAMA_ARG_N_CTX=4096
LLAMA_ARG_N_BATCH=512
```
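Environment variables are intended to override file-based defaults at runtime. A minimal sketch of that precedence, with an assumed helper name:

```python
import os

# Defaults mirroring llama-chat.conf; an exported variable (e.g. GPU_LAYERS=32) wins.
DEFAULTS = {"GPU_LAYERS": "0", "THREADS": "4", "CONTEXT_SIZE": "4096"}

def get_setting(name: str) -> int:
    # Illustrative helper: llama-chat's own override logic may differ.
    return int(os.environ.get(name, DEFAULTS[name]))

print(get_setting("GPU_LAYERS"), get_setting("THREADS"), get_setting("CONTEXT_SIZE"))
```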