Run LLaMA models on Android with idiomatic Kotlin
An Android library for running LLaMA models on-device using llama.cpp.
Lightweight, easy-to-use API with full coroutine support following modern Android best practices.
Quick Start • API Reference • Contributing • Used By

    dependencies {
        implementation("org.codeshipping:llama-kotlin-android:0.1.0")
    }

Recommended models for Android:
| Model | Size | RAM | Download |
|---|---|---|---|
| Phi-3.5-mini ⭐ | ~2.4GB | 4-6GB | Download |
| TinyLlama-1.1B | ~670MB | 2GB | Download |
| Llama-3.2-3B | ~2GB | 6-8GB | Download |
| Qwen2.5-1.5B | ~1GB | 3-4GB | Download |
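
The table above can be turned into a simple selection rule. The sketch below is a hypothetical helper, not part of the library: `ModelOption` and `pickModel` are names invented here, the sizes and the lower bound of each RAM range are copied from the table, and "pick the most demanding model that still fits" is an assumed policy rather than the author's recommendation.

```kotlin
// Hypothetical helper, not part of the library: encodes the table above.
data class ModelOption(val name: String, val approxSizeMb: Int, val minRamGb: Int)

// Sizes and lower-bound RAM figures copied from the table above.
val recommendedModels = listOf(
    ModelOption("TinyLlama-1.1B", approxSizeMb = 670, minRamGb = 2),
    ModelOption("Qwen2.5-1.5B", approxSizeMb = 1000, minRamGb = 3),
    ModelOption("Phi-3.5-mini", approxSizeMb = 2400, minRamGb = 4),
    ModelOption("Llama-3.2-3B", approxSizeMb = 2000, minRamGb = 6),
)

/** Returns the most demanding model whose minimum RAM fits, or null if none fits. */
fun pickModel(deviceRamGb: Int): ModelOption? =
    recommendedModels
        .filter { it.minRamGb <= deviceRamGb }
        .maxByOrNull { it.minRamGb }

fun main() {
    println(pickModel(4)?.name) // Phi-3.5-mini
    println(pickModel(1)?.name) // null
}
```

On a device, the RAM figure could come from `ActivityManager.MemoryInfo.totalMem`. The reported value is typically somewhat below the marketed capacity, so a real check may want to round up before comparing.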
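
    import android.os.Bundle
    import android.widget.TextView
    import androidx.appcompat.app.AppCompatActivity
    import androidx.lifecycle.lifecycleScope
    import kotlinx.coroutines.launch
    import org.codeshipping.llamakotlin.LlamaModel

    class MyActivity : AppCompatActivity() {
        private var model: LlamaModel? = null
        private lateinit var textView: TextView

        override fun onCreate(savedInstanceState: Bundle?) {
            super.onCreate(savedInstanceState)
            textView = TextView(this)
            setContentView(textView)

            lifecycleScope.launch {
                // Load model
                val loaded = LlamaModel.load("/path/to/model.gguf") {
                    contextSize = 2048
                    threads = 4
                    temperature = 0.7f
                }
                model = loaded

                // Generate response (streaming); runs only after loading has finished
                loaded.generateStream("Hello, how are you?")
                    .collect { token -> textView.append(token) }
            }
        }

        // Clean up
        override fun onDestroy() {
            super.onDestroy()
            model?.close()
        }
    }

    # Build and install the sample app
    git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
    cd llama-kotlin-android
    ./gradlew :sample:installDebug

| Feature | Description |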
|---|---|
| On-device inference | No internet required, complete privacy |
| Kotlin-first API | Idiomatic, DSL-style configuration |
| Coroutine Support | `Flow<String>` streaming, structured concurrency |
| Conversation History | Multi-turn context in sample app |
| Multiple quantization | Q4_0, Q4_K_M, Q5_K_M, Q8_0 support |
| Small footprint | ~15 MB library size (without models) |
| Memory safe | Automatic resource cleanup with Closeable pattern |
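
Multi-turn context generally means replaying earlier turns in the prompt on every request. The sketch below shows one way this could be done for ChatML-style models, using the same `<|im_start|>` and `<|im_end|>` markers as the `formatChatML` helper in this README. `Role`, `Turn`, and `formatChatMLHistory` are hypothetical names introduced for illustration, and this is not necessarily how the sample app implements its history.

```kotlin
// Hypothetical types for illustration; the library does not define them.
enum class Role(val tag: String) { SYSTEM("system"), USER("user"), ASSISTANT("assistant") }

data class Turn(val role: Role, val text: String)

/** Renders the whole history in ChatML and leaves the assistant turn open for the model. */
fun formatChatMLHistory(history: List<Turn>): String = buildString {
    for (turn in history) {
        append("<|im_start|>").append(turn.role.tag).append('\n')
        append(turn.text).append("<|im_end|>\n")
    }
    append("<|im_start|>assistant\n")
}

fun main() {
    val history = listOf(
        Turn(Role.SYSTEM, "You are a helpful assistant."),
        Turn(Role.USER, "Hi"),
        Turn(Role.ASSISTANT, "Hello!"),
        Turn(Role.USER, "What is GGUF?"),
    )
    println(formatChatMLHistory(history))
}
```

The resulting string would be passed to `generateStream`, and the collected reply appended to the history as an `ASSISTANT` turn. Long conversations eventually exceed `contextSize`, so a real app would also need to drop or summarize old turns.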

    class LlamaModel : Closeable {
        companion object {
            suspend fun load(modelPath: String, config: LlamaConfig.() -> Unit = {}): LlamaModel
            fun getVersion(): String
        }

        // One-shot generation
        suspend fun generate(prompt: String): String

        // Streaming generation (recommended)
        fun generateStream(prompt: String): Flow<String>

        // Cancel ongoing generation
        fun cancelGeneration()

        val isLoaded: Boolean

        override fun close()
    }

    val model = LlamaModel.load(modelPath) {
        // Context
        contextSize = 2048      // Max context length
        batchSize = 512         // Batch size for prompt processing

        // Threading
        threads = 4             // Number of threads
        threadsBatch = 4        // Threads for batch processing

        // Sampling
        temperature = 0.7f      // Randomness (0.0 - 2.0)
        topP = 0.9f             // Nucleus sampling
        topK = 40               // Top-K sampling
        repeatPenalty = 1.1f    // Repetition penalty

        // Generation limits
        maxTokens = 512         // Max tokens to generate
        seed = -1               // Random seed (-1 = random)

        // Memory options
        useMmap = true          // Memory-map model file
        useMlock = false        // Lock model in RAM
        gpuLayers = 0           // GPU layers (0 = CPU only)
    }

    try {
        val model = LlamaModel.load(path)
    } catch (e: LlamaException.ModelNotFound) {
        // File doesn't exist
    } catch (e: LlamaException.ModelLoadError) {
        // Failed to load (invalid format, OOM, etc.)
    } catch (e: LlamaException.GenerationError) {
        // Generation failed
    }

Llama 3.2 / 3.1 Format

    fun formatLlama3Prompt(system: String, user: String): String {
        return buildString {
            append("<|begin_of_text|>")
            append("<|start_header_id|>system<|end_header_id|>\n\n")
            append(system)
            append("<|eot_id|>")
            append("<|start_header_id|>user<|end_header_id|>\n\n")
            append(user)
            append("<|eot_id|>")
            append("<|start_header_id|>assistant<|end_header_id|>\n\n")
        }
    }

Phi-3 Format

    fun formatPhi3Prompt(system: String, user: String): String {
        return "<|system|>\n$system<|end|>\n<|user|>\n$user<|end|>\n<|assistant|>\n"
    }

ChatML Format (Qwen, etc.)

    fun formatChatML(system: String, user: String): String {
        return buildString {
            append("<|im_start|>system\n$system<|im_end|>\n")
            append("<|im_start|>user\n$user<|im_end|>\n")
            append("<|im_start|>assistant\n")
        }
    }

| Model | Size | Quality | Speed | Best For |
|---|---|---|---|---|
| Phi-3.5-mini | 2.4GB | ⭐⭐⭐⭐ | ⭐⭐⭐ | General use |
| TinyLlama-1.1B | 670MB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Testing, low-end devices |
| Qwen2.5-1.5B | 1GB | ⭐⭐⭐ | ⭐⭐⭐⭐ | Coding, reasoning |
| Llama-3.2-3B | 2GB | ⭐⭐⭐⭐⭐ | ⭐⭐ | High quality chat |
| Gemma-2B | 1.2GB | ⭐⭐⭐ | ⭐⭐⭐⭐ | Google alternative |
| Format | Size | Quality | When to Use |
|---|---|---|---|
| Q4_0 | Smallest | Lower | Memory constrained |
| Q4_K_M | Small | Good | Recommended |
| Q5_K_M | Medium | Better | Quality priority |
| Q8_0 | Large | Best | Accuracy critical |
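
The size column can be made concrete with a rough estimate: file size is approximately parameter count times bits per weight, divided by 8. In the sketch below, the Q4_0 and Q8_0 figures follow from the ggml block layout (18 and 34 bytes per 32 weights). The Q4_K_M and Q5_K_M figures are approximate averages and should be treated as assumptions, since K-quants mix precisions across tensors. `bitsPerWeight` and `estimateSizeBytes` are hypothetical names, and the estimate ignores file metadata.

```kotlin
// Approximate bits per weight. Q4_0 and Q8_0 follow from the ggml block layout
// (18 and 34 bytes per 32 weights); the K-quant values are rough averages and
// are assumptions here, since K-quants mix precisions per tensor.
val bitsPerWeight = mapOf(
    "Q4_0" to 4.5,
    "Q4_K_M" to 4.85,
    "Q5_K_M" to 5.7,
    "Q8_0" to 8.5,
)

/** Estimated GGUF file size in bytes for a model with [paramCount] weights. */
fun estimateSizeBytes(paramCount: Long, format: String): Long {
    val bpw = bitsPerWeight[format]
        ?: throw IllegalArgumentException("Unknown format: $format")
    return (paramCount * bpw / 8.0).toLong()
}

fun main() {
    val params = 1_100_000_000L // TinyLlama-1.1B, parameter count rounded
    for (format in bitsPerWeight.keys) {
        val mb = estimateSizeBytes(params, format) / 1_000_000
        println("$format: ~$mb MB")
    }
}
```

For a 1.1B-parameter model the Q4_K_M estimate comes out at roughly 0.67 GB, which is in line with the ~670MB listed for TinyLlama-1.1B earlier in this README.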

    ┌───────────────────────────────────────────┐
    │             Your Application              │
    │   (Activity/ViewModel using LlamaModel)   │
    └───────────────────────────────────────────┘
                          │
                          ▼
    ┌───────────────────────────────────────────┐
    │             Kotlin API Layer              │
    │  ┌─────────────┐ ┌─────────────────────┐  │
    │  │ LlamaModel  │ │   LlamaConfig DSL   │  │
    │  │  (suspend)  │ │   LlamaException    │  │
    │  └─────────────┘ └─────────────────────┘  │
    │             Coroutines + Flow             │
    └───────────────────────────────────────────┘
                          │
                          ▼
    ┌───────────────────────────────────────────┐
    │                JNI Bridge                 │
    │         llama_jni.cpp + Wrappers          │
    └───────────────────────────────────────────┘
                          │
                          ▼
    ┌───────────────────────────────────────────┐
    │               Native Layer                │
    │     llama.cpp (GGML, GGUF, Inference)     │
    │          ARM NEON optimizations           │
    └───────────────────────────────────────────┘

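
In a layered design like this, the Kotlin object fronts memory allocated by native code, which the garbage collector does not reclaim on its own. That is presumably why `LlamaModel` implements `Closeable`. The sketch below uses a stand-in class (`FakeNativeModel`, purely illustrative, with no JNI involved) to show how Kotlin's `use` guarantees that `close()` runs even when the block throws.

```kotlin
import java.io.Closeable

// Stand-in for a class that owns a native handle; purely illustrative.
class FakeNativeModel : Closeable {
    var isLoaded = true
        private set

    fun generate(prompt: String): String {
        check(isLoaded) { "Model is closed" }
        if (prompt.isBlank()) throw IllegalArgumentException("Empty prompt")
        return "echo: $prompt"
    }

    // In a real JNI wrapper this is where the native context would be freed.
    override fun close() {
        isLoaded = false
    }
}

fun main() {
    val model = FakeNativeModel()
    // close() runs when the block exits, normally or with an exception.
    val reply = model.use { it.generate("hi") }
    println(reply)
    println(model.isLoaded)
}
```

`use` suits short-lived work such as a one-shot generation. For a model that lives as long as a screen, closing it in `onDestroy()` or `ViewModel.onCleared()` is the more natural fit.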
Project Structure

    llama-kotlin-android/
    ├── app/                              # Library module
    │   └── src/main/
    │       ├── cpp/                      # C++ native code
    │       │   ├── llama.cpp/            # llama.cpp submodule
    │       │   ├── llama_jni.cpp         # JNI bridge
    │       │   └── llama_context_wrapper.cpp
    │       └── java/org/codeshipping/llamakotlin/
    │           ├── LlamaModel.kt         # Main API
    │           ├── LlamaConfig.kt        # Configuration
    │           └── exception/            # Exceptions
    │
    ├── sample/                           # Sample app
    │   └── src/main/
    │       ├── java/.../MainActivity.kt
    │       └── res/layout/activity_main.xml
    │
    └── README.md

- Android Studio Ladybug+
- NDK 27.3.13750724
- CMake 3.22.1+

    # Clone with submodules
    git clone --recursive https://github.com/it5prasoon/llama-kotlin-android.git
    cd llama-kotlin-android

    # Update submodule if needed
    git submodule update --init --recursive

    # Build library
    ./gradlew :app:assembleRelease

    # Build and install sample
    ./gradlew :sample:installDebug

| Requirement | Minimum | Recommended |
|---|---|---|
| Android API | 24 (7.0) | 26+ (8.0+) |
| RAM | 3 GB | 6+ GB |
| Storage | 1 GB | 4+ GB |
| Architecture | arm64-v8a | arm64-v8a |
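
An app could check these minimums before offering local inference. The sketch below is a hypothetical pre-flight check, not part of the library: `DeviceInfo` and `unmetRequirements` are invented names, and the thresholds are the "Minimum" column above. On Android the inputs would typically come from `Build.VERSION.SDK_INT`, `Build.SUPPORTED_ABIS`, and `ActivityManager.MemoryInfo.totalMem`. Storage is left out for brevity.

```kotlin
// Hypothetical pre-flight check; thresholds are the "Minimum" column above.
const val MIN_API = 24
const val MIN_RAM_BYTES = 3L * 1024 * 1024 * 1024
const val REQUIRED_ABI = "arm64-v8a"

data class DeviceInfo(
    val apiLevel: Int,
    val supportedAbis: List<String>,
    val totalRamBytes: Long,
)

/** Returns the list of unmet requirements; empty means the device qualifies. */
fun unmetRequirements(device: DeviceInfo): List<String> = buildList {
    if (device.apiLevel < MIN_API) add("Android API $MIN_API or newer")
    if (REQUIRED_ABI !in device.supportedAbis) add("$REQUIRED_ABI CPU")
    if (device.totalRamBytes < MIN_RAM_BYTES) add("at least 3 GB RAM")
}

fun main() {
    val device = DeviceInfo(
        apiLevel = 23,
        supportedAbis = listOf("x86_64"),
        totalRamBytes = 2L * 1024 * 1024 * 1024,
    )
    unmetRequirements(device).forEach { println("Missing: $it") }
}
```

Reported total memory is typically somewhat below the marketed figure, so a real check may want some slack in the RAM threshold.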
Projects using LLaMA Kotlin Android:
| Project | Description |
|---|---|
| MultiGPT | Multi-provider AI chat app for Android with local inference support |
Using this library? Open a PR to add your project here!
We welcome contributions from the community! Whether it's a bug fix, new feature, or documentation improvement, all contributions are appreciated.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to your branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Looking to contribute? Check out issues labeled good first issue for beginner-friendly tasks.
- Test coverage improvements
- Documentation and examples
- Bug reports and fixes
- New model support and chat templates
- Performance optimizations
If you find this library useful, please consider giving it a star! It helps others discover the project.
MIT License. See LICENSE for details.
- llama.cpp: Incredible C++ inference engine
- ggml: Tensor library for ML
- Meta AI: LLaMA models
Made with ❤️ for the Android community
