🦙 LLaMA Kotlin Android

Run LLaMA models on Android with idiomatic Kotlin

An Android library for running LLaMA models on-device using llama.cpp.
Lightweight, easy-to-use API with full coroutine support following modern Android best practices.

Quick Start • API Reference • Contributing • Used By

📸 Screenshots

Chat Interface

🚀 Quick Start

1. Add Dependency

dependencies {
    implementation("org.codeshipping:llama-kotlin-android:0.1.0")
}

2. Download a Model

Recommended models for Android:

Model	Size	RAM	Download
Phi-3.5-mini ⭐	~2.4GB	4-6GB	Download
TinyLlama-1.1B	~670MB	2GB	Download
Llama-3.2-3B	~2GB	6-8GB	Download
Qwen2.5-1.5B	~1GB	3-4GB	Download

3. Basic Usage

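import android.os.Bundle
import android.widget.TextView
import androidx.appcompat.app.AppCompatActivity
import androidx.lifecycle.lifecycleScope
import kotlinx.coroutines.launch
import org.codeshipping.llamakotlin.LlamaModel

class MyActivity : AppCompatActivity() {
    private var model: LlamaModel? = null
    private lateinit var textView: TextView

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        textView = TextView(this).also { setContentView(it) }

        lifecycleScope.launch {
            // Load model (load() is a suspend function, so it runs inside a coroutine)
            val loaded = LlamaModel.load("/path/to/model.gguf") {
                contextSize = 2048
                threads = 4
                temperature = 0.7f
            }
            model = loaded

            // Generate response (streaming) only after the model has finished loading
            loaded.generateStream("Hello, how are you?")
                .collect { token ->
                    textView.append(token)
                }
        }
    }

    // Clean up
    override fun onDestroy() {
        super.onDestroy()
        model?.close()
    }
}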

4. Run Sample App

git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
cd llama-kotlin-android
./gradlew :sample:installDebug

✨ Features

Feature	Description
🔒 On-device inference	No internet required, complete privacy
🎯 Kotlin-first API	Idiomatic, DSL-style configuration
🌊 Coroutine Support	`Flow<String>` streaming, structured concurrency
💬 Conversation History	Multi-turn context in sample app
⚡ Multiple quantization	Q4_0, Q4_K_M, Q5_K_M, Q8_0 support
📦 Small footprint	~15 MB library size (without models)
🧹 Memory safe	Automatic resource cleanup with Closeable pattern

📚 API Reference

LlamaModel

class LlamaModel : Closeable {
    
    companion object {
        suspend fun load(modelPath: String, config: LlamaConfig.() -> Unit = {}): LlamaModel
        fun getVersion(): String
    }
    
    // One-shot generation
    suspend fun generate(prompt: String): String
    
    // Streaming generation (recommended)
    fun generateStream(prompt: String): Flow<String>
    
    // Cancel ongoing generation
    fun cancelGeneration()
    
    val isLoaded: Boolean
    override fun close()
}
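
Because `LlamaModel` implements `Closeable`, Kotlin's standard `use` extension can scope its lifetime to a block. The sketch below illustrates that pattern with a hypothetical stand-in class, `FakeModel`, which is not part of the library and exists only so the example runs without a model file. It shows that `use` calls `close()` whether the block returns normally or throws.

```kotlin
import java.io.Closeable

// Hypothetical stand-in for a Closeable model handle; not the library's class.
class FakeModel : Closeable {
    var isLoaded: Boolean = true
        private set

    fun generate(prompt: String): String {
        check(isLoaded) { "model is closed" }
        return "echo: $prompt"
    }

    override fun close() {
        isLoaded = false
    }
}

// Runs one generation and releases the model afterwards, even on failure.
fun generateOnce(model: FakeModel, prompt: String): String =
    model.use { it.generate(prompt) }
```

For a model that lives as long as an Activity, closing it in `onDestroy()` as in Basic Usage is likely the better fit; `use` suits short-lived, one-shot work.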

Configuration DSL

val model = LlamaModel.load(modelPath) {
    // Context
    contextSize = 2048          // Max context length
    batchSize = 512             // Batch size for prompt processing
    
    // Threading
    threads = 4                 // Number of threads
    threadsBatch = 4            // Threads for batch processing
    
    // Sampling
    temperature = 0.7f          // Randomness (0.0 - 2.0)
    topP = 0.9f                // Nucleus sampling
    topK = 40                   // Top-K sampling
    repeatPenalty = 1.1f       // Repetition penalty
    
    // Generation limits
    maxTokens = 512            // Max tokens to generate
    seed = -1                  // Random seed (-1 = random)
    
    // Memory options
    useMmap = true             // Memory-map model file
    useMlock = false           // Lock model in RAM
    gpuLayers = 0              // GPU layers (0 = CPU only)
}
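
The trailing block passed to `load` is a lambda with receiver (`LlamaConfig.() -> Unit` in the signature above), which is why the fields can be assigned without a qualifier. The following is a minimal sketch of how such a DSL can be built. `SketchConfig` and `buildConfig` are hypothetical names, the defaults are illustrative, and the validation rules are assumptions apart from the 0.0 to 2.0 temperature range noted above. The library's real `LlamaConfig` may behave differently.

```kotlin
// Hypothetical config holder mirroring a few of the fields listed above.
// Defaults are illustrative, not the library's documented defaults.
class SketchConfig {
    var contextSize: Int = 2048
    var threads: Int = 4
    var temperature: Float = 0.7f
}

// Applies the caller's block to a fresh config, then validates the result.
// The checks are assumptions made for this sketch.
fun buildConfig(block: SketchConfig.() -> Unit = {}): SketchConfig {
    val config = SketchConfig().apply(block)
    require(config.contextSize > 0) { "contextSize must be positive" }
    require(config.threads > 0) { "threads must be positive" }
    require(config.temperature in 0.0f..2.0f) { "temperature must be in 0.0..2.0" }
    return config
}
```

Calling `buildConfig { threads = 2 }` overrides one field and keeps the other defaults, which mirrors how the `load` call above can set only the options it cares about.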

Exception Handling

try {
    val model = LlamaModel.load(path)
} catch (e: LlamaException.ModelNotFound) {
    // File doesn't exist
} catch (e: LlamaException.ModelLoadError) {
    // Failed to load (invalid format, OOM, etc.)
} catch (e: LlamaException.GenerationError) {
    // Generation failed
}
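
One way to turn these failures into user-facing text is a single `when` over the exception type. The sketch below assumes the exceptions form a sealed hierarchy, which the nested names suggest but the source does not confirm. `SketchLlamaException` and `userMessage` are hypothetical stand-ins so the example compiles without the library.

```kotlin
// Hypothetical stand-in hierarchy named after the cases above; the library's
// real LlamaException may be declared differently.
sealed class SketchLlamaException(message: String) : Exception(message) {
    class ModelNotFound(path: String) : SketchLlamaException("No model at $path")
    class ModelLoadError(reason: String) : SketchLlamaException("Load failed: $reason")
    class GenerationError(reason: String) : SketchLlamaException("Generation failed: $reason")
}

// Maps each failure to a short message suitable for showing in the UI.
// With a sealed hierarchy the compiler checks that every case is covered.
fun userMessage(e: SketchLlamaException): String = when (e) {
    is SketchLlamaException.ModelNotFound -> "Model file is missing. Download it first."
    is SketchLlamaException.ModelLoadError -> "Could not load the model. Try a smaller one."
    is SketchLlamaException.GenerationError -> "Generation stopped unexpectedly. Please retry."
}
```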

💬 Chat Templates

Llama 3.2 / 3.1 Format

fun formatLlama3Prompt(system: String, user: String): String {
    return buildString {
        append("<|begin_of_text|>")
        append("<|start_header_id|>system<|end_header_id|>\n\n")
        append(system)
        append("<|eot_id|>")
        append("<|start_header_id|>user<|end_header_id|>\n\n")
        append(user)
        append("<|eot_id|>")
        append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    }
}

Phi-3 Format

fun formatPhi3Prompt(system: String, user: String): String {
    return "<|system|>\n$system<|end|>\n<|user|>\n$user<|end|>\n<|assistant|>\n"
}

ChatML Format (Qwen, etc.)

fun formatChatML(system: String, user: String): String {
    return buildString {
        append("<|im_start|>system\n$system<|im_end|>\n")
        append("<|im_start|>user\n$user<|im_end|>\n")
        append("<|im_start|>assistant\n")
    }
}
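
For multi-turn conversations, the same ChatML markers can be applied to a whole message history. This is a minimal sketch, not code from the sample app: `ChatMessage` and `formatChatMLHistory` are hypothetical names introduced for illustration.

```kotlin
// Hypothetical message type for illustration; not part of the library.
data class ChatMessage(val role: String, val content: String)

// Renders the whole history in ChatML and leaves the assistant turn open
// so the model continues from there.
fun formatChatMLHistory(messages: List<ChatMessage>): String = buildString {
    for (message in messages) {
        append("<|im_start|>${message.role}\n${message.content}<|im_end|>\n")
    }
    append("<|im_start|>assistant\n")
}
```

With exactly one system and one user message, this produces the same string as `formatChatML(system, user)`. Earlier assistant replies can be appended to the list with the role `assistant` before the next user turn.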

📦 Supported Models

Model	Size	Quality	Speed	Best For
Phi-3.5-mini	2.4GB	⭐⭐⭐⭐	⭐⭐⭐	General use
TinyLlama-1.1B	670MB	⭐⭐	⭐⭐⭐⭐⭐	Testing, low-end devices
Qwen2.5-1.5B	1GB	⭐⭐⭐	⭐⭐⭐⭐	Coding, reasoning
Llama-3.2-3B	2GB	⭐⭐⭐⭐⭐	⭐⭐	High quality chat
Gemma-2B	1.2GB	⭐⭐⭐	⭐⭐⭐⭐	Google alternative

Quantization Formats

Format	Size	Quality	When to Use
Q4_0	Smallest	Lower	Memory constrained
Q4_K_M	Small	Good	Recommended
Q5_K_M	Medium	Better	Quality priority
Q8_0	Large	Best	Accuracy critical
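
To get a feel for how format choice affects file size, the rough estimate below multiplies the parameter count by the bits stored per weight. It covers only Q4_0 and Q8_0, whose llama.cpp block layouts store 32 weights plus one 16-bit scale. The K-quant formats mix several layouts and are left out. `BlockQuant` and `estimateBytes` are hypothetical helpers, and real GGUF files will differ somewhat because of metadata and tensors kept at other precisions.

```kotlin
// Bits per weight implied by the block layout: 32 quantized weights
// plus one 16-bit scale per block.
enum class BlockQuant(val bitsPerWeight: Double) {
    Q4_0((32 * 4 + 16) / 32.0),  // 4.5 bits per weight
    Q8_0((32 * 8 + 16) / 32.0),  // 8.5 bits per weight
}

// Rough file-size estimate in bytes for a model with the given parameter count.
fun estimateBytes(parameters: Long, quant: BlockQuant): Long =
    (parameters * quant.bitsPerWeight / 8.0).toLong()
```

For a 1.1 billion parameter model this gives about 0.62 GB at Q4_0 and about 1.17 GB at Q8_0, so moving from Q4_0 to Q8_0 roughly doubles the storage and memory needed.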

🏗️ Architecture

┌─────────────────────────────────────────────┐
│              Your Application               │
│    (Activity/ViewModel using LlamaModel)    │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│            Kotlin API Layer                 │
│  ┌─────────────┐  ┌─────────────────────┐  │
│  │ LlamaModel  │  │   LlamaConfig DSL   │  │
│  │  (suspend)  │  │   LlamaException    │  │
│  └─────────────┘  └─────────────────────┘  │
│         Coroutines + Flow                   │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│              JNI Bridge                     │
│         llama_jni.cpp + Wrappers           │
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│              Native Layer                   │
│    llama.cpp (GGML, GGUF, Inference)       │
│         ARM NEON optimizations              │
└─────────────────────────────────────────────┘

Project Structure

llama-kotlin-android/
├── app/                          # 📦 Library module
│   ├── src/main/
│   │   ├── cpp/                  # C++ native code
│   │   │   ├── llama.cpp/        # llama.cpp submodule
│   │   │   ├── llama_jni.cpp     # JNI bridge
│   │   │   └── llama_context_wrapper.cpp
│   │   └── java/org/codeshipping/llamakotlin/
│   │       ├── LlamaModel.kt     # Main API
│   │       ├── LlamaConfig.kt    # Configuration
│   │       └── exception/        # Exceptions
│
├── sample/                       # 📱 Sample app
│   └── src/main/
│       ├── java/.../MainActivity.kt
│       └── res/layout/activity_main.xml
│
└── README.md

🛠️ Building from Source

Prerequisites

Android Studio Ladybug+
NDK 27.3.13750724
CMake 3.22.1+

Build Steps

# Clone with submodules
git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
cd llama-kotlin-android

# Update submodule if needed
git submodule update --init --recursive

# Build library
./gradlew :app:assembleRelease

# Build and install sample
./gradlew :sample:installDebug

📋 Requirements

Requirement	Minimum	Recommended
Android API	24 (7.0)	26+ (8.0+)
RAM	3 GB	6+ GB
Storage	1 GB	4+ GB
Architecture	arm64-v8a	arm64-v8a
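
An app may want to check these minimums before offering local inference. The sketch below is one possible approach, written as a pure function so it runs anywhere. `unmetRequirements` and the constants are hypothetical names, and reading 3 GB as 3,000,000,000 bytes is an assumption. On Android the inputs would typically come from `Build.VERSION.SDK_INT`, `ActivityManager.MemoryInfo.totalMem`, and `Build.SUPPORTED_ABIS`.

```kotlin
// Minimums taken from the requirements table above.
const val MIN_API_LEVEL = 24
const val MIN_RAM_BYTES = 3_000_000_000L  // assumption: 3 GB read as decimal gigabytes
const val REQUIRED_ABI = "arm64-v8a"

// Returns the list of unmet requirements; an empty list means the device qualifies.
fun unmetRequirements(
    apiLevel: Int,
    totalRamBytes: Long,
    supportedAbis: List<String>,
): List<String> = buildList {
    if (apiLevel < MIN_API_LEVEL) add("Android API $apiLevel is below $MIN_API_LEVEL")
    if (totalRamBytes < MIN_RAM_BYTES) add("Less than 3 GB of RAM")
    if (REQUIRED_ABI !in supportedAbis) add("$REQUIRED_ABI is not supported")
}
```

Devices often report slightly less total memory than their advertised capacity, so a production check may need some slack on the RAM threshold.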

🏆 Used By

Projects using LLaMA Kotlin Android:

Project	Description
MultiGPT	Multi-provider AI chat app for Android with local inference support

Using this library? Open a PR to add your project here!

🤝 Contributing

We welcome contributions from the community! Whether it's a bug fix, new feature, or documentation improvement — all contributions are appreciated.

How to Contribute

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to your branch (git push origin feature/amazing-feature)
5. Open a Pull Request

Good First Issues

Looking to contribute? Check out issues labeled good first issue for beginner-friendly tasks.

Areas We Need Help

🧪 Test coverage improvements
📖 Documentation and examples
🐛 Bug reports and fixes
✨ New model support and chat templates
⚡ Performance optimizations

⭐ Star History

If you find this library useful, please consider giving it a star! It helps others discover the project.

📄 License

MIT License — see LICENSE for details.

🙏 Acknowledgments

llama.cpp — Incredible C++ inference engine
ggml — Tensor library for ML
Meta AI — LLaMA models

Made with ❤️ for the Android community

Report Bug · Request Feature · Discussions
