Your processing runs are failing at the classification stage with Ollama errors like:

```
model requires more system memory (12.8 GiB) than is available (9.3 GiB)
memory layout cannot be allocated
GGML_ASSERT(ctx->mem_buffer != NULL) failed
```
Based on your logs from November 10th, 2025, the issue is VRAM contention:
- PyAnnote (diarization) loads into VRAM (~8GB)
- PyAnnote doesn't fully unload after diarization completes
- Ollama tries to load `gpt-oss:20b` (requires 12.8GB)
- Only 9.3GB of VRAM is available (PyAnnote still occupies ~3-4GB)
- Ollama fails with memory allocation errors
```
2025-11-10 21:09:17 | WARNING | DDSessionProcessor.classifier.ollama |
Classification failed for segment 0 using gpt-oss:20b:
model requires more system memory (12.8 GiB) than is available (9.3 GiB)
```
Multiple subsequent attempts show:
- The low-VRAM retry also fails
- Persistent `memory layout cannot be allocated` errors
- Classification continues failing for all segments
Solution 1: Switch Classification to Groq (Recommended)
Why this works: Offloads classification to the cloud, eliminating VRAM contention entirely.
Steps:
- Get a free Groq API key at https://console.groq.com/
- Add it to `.env`:
  ```
  GROQ_API_KEY=your_groq_api_key_here
  ```
- Test the connection (a standalone sketch also follows these steps):
  ```
  python test_api_keys.py
  ```
- Configure in the UI:
  - Step 2: Configure Session
  - Advanced Backend Settings
  - Classification: `groq`
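If you want to sanity-check the key outside the app, here is a minimal sketch using the official `groq` Python package (the package, the `llama-3.3-70b-versatile` model id, and the test prompt are assumptions; the app's own client may differ):

```python
# Minimal Groq connectivity check (sketch - not the app's internal client).
# Assumes `pip install groq` and GROQ_API_KEY set in the environment.
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Send a tiny prompt to confirm the key and model respond.
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # model id assumed; check the Groq console
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```

If this prints a reply, classification requests from the app should work with the same key.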
Benefits:
- ✅ 100% free (no credit card required)
- ✅ Faster than local Ollama
- ✅ No VRAM usage
- ✅ No memory contention
- ✅ No VRAM-related failures
Recommended Configuration (12GB VRAM):

```
Transcription:  groq      (cloud - free)
Diarization:    pyannote  (local - 8GB VRAM)
Classification: groq      (cloud - free)
```
Solution 2: Use a Smaller Ollama Model
Why this works: A smaller model fits in the VRAM remaining after PyAnnote (a quick VRAM-check sketch follows the comparison table below).
Steps:
-
Pull a smaller model:
ollama pull llama3.1:8b # or ollama pull mistral:7b -
Update
.env:OLLAMA_MODEL=llama3.1:8b
-
Restart application
Model Size Comparison:
| Model | Size | VRAM Required | Fits After PyAnnote? |
|---|---|---|---|
| gpt-oss:20b | 20B params | ~12.8GB | ❌ No (requires 12.8GB) |
| llama3.1:8b | 8B params | ~8GB | ⚠️ Tight fit (9.3GB free) |
| llama3.2:3b | 3B params | ~4GB | ✅ Yes (plenty of room) |
| phi3:mini | 3.8B params | ~4GB | ✅ Yes |
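Before pulling, you can check how much VRAM is actually free after PyAnnote has run. A rough sketch using PyTorch (the candidate list and the ~1GB headroom are illustrative assumptions, not measured requirements):

```python
# Sketch: pick an Ollama model that fits the VRAM currently free.
import torch

assert torch.cuda.is_available(), "No CUDA GPU visible to PyTorch"
free_bytes, _total = torch.cuda.mem_get_info()  # free/total VRAM on device 0, in bytes
free_gb = free_bytes / 1024**3

# (model tag, approximate VRAM needed in GB) - actual usage varies with context size
candidates = [("llama3.1:8b", 8.0), ("phi3:mini", 4.0), ("llama3.2:3b", 4.0)]
fits = [tag for tag, need_gb in candidates if free_gb >= need_gb + 1.0]  # ~1GB headroom
print(f"Free VRAM: {free_gb:.1f} GB; should fit: {fits or 'none - use Groq'}")
```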
Drawbacks:
- Smaller models = lower classification quality
- Still risk of VRAM issues if PyAnnote doesn't release memory
- Slower than cloud (local inference)
Solution 3: Run Ollama on CPU
Why this works: Moves Ollama to the CPU, leaving the GPU entirely to PyAnnote.
Steps:
- Check the current Ollama configuration:
  ```
  ollama show gpt-oss:20b --modelfile
  ```
- Create a CPU-only version:
  ```
  # Create a Modelfile that forces CPU inference
  cat > Modelfile-cpu <<EOF
  FROM gpt-oss:20b
  PARAMETER num_gpu 0
  PARAMETER num_thread 8
  EOF

  # Create the model
  ollama create gpt-oss-cpu -f Modelfile-cpu
  ```
- Update `.env`:
  ```
  OLLAMA_MODEL=gpt-oss-cpu
  ```
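To confirm the placement, run the model once and check `ollama ps`: recent Ollama versions show a PROCESSOR column that should read something like "100% CPU" for `gpt-oss-cpu`. If it still reports GPU usage, the `num_gpu 0` parameter did not take effect.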
Drawbacks:
- 🐌 VERY SLOW on CPU (10-20x slower)
- May timeout on long segments
- Not practical for large sessions
Solution 4: Sequential Processing (Manual)
Why this works: Ensures complete VRAM release between stages (a code sketch for forcing the release follows the steps below).
Steps:
- Run diarization only:
  - Configure: Diarization = `pyannote`
  - Skip the classification stage
  - Let it complete
- Restart the application (clears VRAM):
  - Settings & Tools tab
  - Application Control
  - Restart Application button
- Run classification only:
  - Load the previous session
  - Skip diarization (already done)
  - Run classification with Ollama
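If you are scripting the pipeline yourself rather than restarting, the usual pattern for forcing a release between stages looks like this (a sketch; how the app actually holds the PyAnnote pipeline object is an assumption):

```python
# Sketch: explicitly release PyAnnote's VRAM between stages.
# The caller must drop its own reference first (e.g. `pipeline = None`);
# empty_cache() can only return blocks that are no longer referenced.
import gc

import torch

def release_vram() -> None:
    gc.collect()                  # collect unreferenced model objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
        allocated = torch.cuda.memory_allocated() / 1024**3
        print(f"VRAM still allocated: {allocated:.2f} GB")

# Usage (hypothetical): pipeline = None; release_vram()
```

Even then, restarting the process is the only guaranteed way to return all VRAM, which is why this solution recommends a restart.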
Drawbacks:
- 🕐 Manual intervention required
- 🔄 Two-step process
- 📦 Complex for large sessions
Solution 5: Mixed Cloud/Local Configuration (Best Overall)
Why this works: Offloads classification to the cloud while keeping diarization local (the HF diarization API is rate-limited).
Configuration:

```
Transcription:  groq      (cloud - free, fast)
Diarization:    pyannote  (local - best quality, no rate limits)
Classification: groq      (cloud - free, fast)
```
Benefits:
- ✅ No VRAM contention (only PyAnnote uses GPU)
- ✅ Best diarization quality (local PyAnnote)
- ✅ No HF API rate limits for diarization
- ✅ Fast classification (Groq)
- ✅ 100% free
Steps:
- Set up Groq API (Solution 1)
- Configure mixed backends in UI
- Process normally
To diagnose, check VRAM and Ollama state:

```
nvidia-smi
ollama list
ollama show gpt-oss:20b
ollama run gpt-oss:20b "What is 2+2?"
```

If the last command fails with memory errors, the model is too large for your system.
To check GPU memory from Python:

```python
import torch

print(f"GPU Memory Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU Memory Cached: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
```

Error reference:
- `model requires more system memory (X GiB) than is available (Y GiB)`
  Meaning: The Ollama model requires X GB, but only Y GB of VRAM is free.
  Fix: Use a smaller model or switch to Groq.
- `memory layout cannot be allocated`
  Meaning: VRAM fragmentation, or not enough contiguous free memory.
  Fix: Restart the application to clear VRAM, or use Groq.
- `GGML_ASSERT(ctx->mem_buffer != NULL) failed`
  Meaning: Internal Ollama error caused by a memory allocation failure.
  Fix: The model is too large for the available VRAM. Use Groq or a smaller model.
- Low-VRAM retry fails
  Meaning: Even with a reduced context window, the model doesn't fit.
  Fix: The model is fundamentally too large. Use Groq or a much smaller model.
Based on your logs and typical processing times:
| Backend | Time | Success Rate | VRAM Usage |
|---|---|---|---|
| Ollama gpt-oss:20b (GPU) | ~10-15 min | ❌ 0% (fails) | 12.8GB (too much) |
| Ollama llama3.1:8b (GPU) | ~8-10 min | ⚠️ VRAM-dependent | 8GB (tight fit) |
| Ollama llama3.1:8b (CPU) | ~60-90 min | ✅ 100% | 0GB (CPU only) |
| Groq llama-3.3-70b | ~2-3 min | ✅ 100% | 0GB (cloud) |
Conclusion: Groq is faster, more reliable, and eliminates VRAM issues entirely.
Your system: 12GB VRAM (likely RTX 3060 or similar)
What you'd need for gpt-oss:20b: roughly 24GB of VRAM to handle:
- PyAnnote diarization (8GB)
- Ollama gpt-oss:20b (12.8GB)
- Overhead and fragmentation (~2-4GB)
Upgrade options:
- RTX 4090 (24GB) - $1,600+
- RTX A5000 (24GB) - $2,000+
- RTX 6000 Ada (48GB) - $6,000+
Better option: Use free Groq API instead of hardware upgrade.
Recommended next steps:
- Switch to Groq (Solution 1):
  - Takes 5 minutes to set up
  - Eliminates all Ollama errors
  - Faster than local
  - 100% free
- Use the mixed configuration (Solution 5):
  ```
  Transcription:  groq
  Diarization:    pyannote (local)
  Classification: groq
  ```
- Try a smaller model (Solution 2):
  ```
  ollama pull llama3.2:3b
  ```
  Then update `.env`: `OLLAMA_MODEL=llama3.2:3b`
After implementing a solution:
- Start a fresh session:
  - Upload a small test video (~5-10 minutes)
  - Configure your chosen backend
- Monitor the logs:
  ```
  tail -f logs/session_processor_*.log | grep -i "classif\|error"
  ```
- Check for success:
  - No "memory layout cannot be allocated" errors
  - No "model requires more system memory" errors
  - Classification completes for all segments
- Verify the output (see the sketch below for a programmatic check):
  - Check `output/[session]/[session]_classified.json`
  - Should have classifications for all segments
  - No null/missing classifications
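If you prefer to check programmatically, a minimal sketch (the `segments` and `classification` key names are assumptions; match them to the actual schema of the classified JSON):

```python
# Sketch: verify every segment got a classification.
import json
import sys

# e.g. python check_classified.py output/mysession/mysession_classified.json
with open(sys.argv[1]) as f:
    data = json.load(f)

missing = [i for i, seg in enumerate(data.get("segments", []))
           if not seg.get("classification")]
print(f"{len(missing)} segments missing classifications: {missing[:10]}")
```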
See also:
- CLOUD_INFERENCE_OPTIONS.md - Complete guide to cloud backends
- test_api_keys.py - Test your Groq/HF API keys
- Groq Console - Get a free API key
- Ollama Model Library - Browse smaller models
If you're still experiencing problems after trying these solutions:
- Share your logs:
  ```
  tail -100 logs/session_processor_*.log
  ```
- Share your configuration:
  ```
  cat .env | grep -v "API_KEY"   # Don't share actual keys
  ```
- Share VRAM status:
  ```
  nvidia-smi
  ```
- Check Ollama status:
  ```
  ollama list
  ollama ps   # Show running models
  ```
Last Updated: 2025-11-11
Your Specific Issue: gpt-oss:20b requires 12.8GB, only 9.3GB available after PyAnnote
Recommended Fix: Switch classification to Groq (free, fast, reliable)