This document outlines 100% free cloud API options for offloading compute-intensive tasks in the D&D Session Processor. These are particularly useful if local inference (Ollama, Whisper, PyAnnote) encounters issues or if you want to leverage cloud resources.
- Best for: Transcription, classification (IC/OOC)
- Cost: Completely free with registration (no credit card required)
- Speed: Fastest inference available (up to 1000 tokens/second)
- Visit https://console.groq.com/
- Register for a free account (no payment method needed)
- Generate an API key
- Add to your `.env` file: `GROQ_API_KEY=your_groq_api_key_here`
- LLaMA 3.3 70B Versatile (`llama-3.3-70b-versatile`) - DEFAULT for classification/transcription
- LLaMA 3.1 8B Instant (`llama-3.1-8b-instant`) - Fast, efficient alternative
- Mixtral 8x7B (`mixtral-8x7b-32768`) - Large context window
- Whisper Large V3 (`whisper-large-v3`) - Speech-to-text transcription
Note: Older models like `llama3-8b-8192` and `llama-3.2-1b-preview` have been decommissioned.
Free-tier Groq accounts enforce tight burst limits. Configure the `.env` knobs to prevent `rate_limit_exceeded` responses:

- `GROQ_MAX_CALLS_PER_SECOND` - steady throughput target (defaults to 2 req/s)
- `GROQ_RATE_LIMIT_BURST` - how many calls may fire inside the window before throttling
- `GROQ_RATE_LIMIT_PERIOD_SECONDS` - duration of the moving window used by the limiter
The IC/OOC classifier now enforces these limits automatically and backs off whenever the API reports a rate limit.
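The limiter behavior described above can be sketched as a sliding window over recent call timestamps. This is an illustrative sketch only: the default values below are assumptions, and the project's actual limiter implementation may differ.

```python
import os
import time
from collections import deque


class SlidingWindowLimiter:
    """Throttle calls to at most `burst` per `period` seconds."""

    def __init__(self):
        # Defaults here are illustrative; they mirror the .env knobs above.
        # The steady-rate target is read but this sketch only enforces the
        # burst window.
        self.rate = float(os.getenv("GROQ_MAX_CALLS_PER_SECOND", "2"))
        self.burst = int(os.getenv("GROQ_RATE_LIMIT_BURST", "4"))
        self.period = float(os.getenv("GROQ_RATE_LIMIT_PERIOD_SECONDS", "2"))
        self.calls = deque()  # timestamps of recent calls

    def acquire(self):
        """Block until another API call is allowed, then record it."""
        now = time.monotonic()
        # Drop timestamps that have fallen out of the moving window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.burst:
            # Sleep until the oldest call leaves the window.
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Each API call is then preceded by `limiter.acquire()`, which spaces requests out instead of letting them all hit the rate limit at once.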
- Navigate to Step 2: Configure Session
- Expand Advanced Backend Settings accordion
- Select backends:
  - Transcription: `groq`
  - Classification: `groq`
Run the API validation script: `python test_api_keys.py`

**OpenAI Whisper API**

- Best for: High-quality transcription with official OpenAI support
- Cost: Pay-per-use ($0.006 per minute of audio)
- Speed: Fast cloud processing
- Visit https://platform.openai.com/api-keys
- Create a new API key (requires payment method on file)
- Add to your `.env` file: `OPENAI_API_KEY=your_openai_api_key_here`
- Whisper-1 - Official OpenAI Whisper model with excellent multilingual support
- Navigate to Step 2: Configure Session
- Expand Advanced Backend Settings accordion
- Select backends:
  - Transcription: `openai`
- $0.006 per minute of audio
- Example: 4-hour D&D session = 240 minutes × $0.006 = $1.44
- Much cheaper than real-time transcription services
- No monthly minimums or subscription required
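The cost arithmetic above generalizes to a one-liner; `PRICE_PER_MINUTE` reflects the rate quoted in this document:

```python
PRICE_PER_MINUTE = 0.006  # USD per minute of audio (OpenAI Whisper API)


def whisper_cost(minutes: float) -> float:
    """Estimated transcription cost in USD, rounded to whole cents."""
    return round(minutes * PRICE_PER_MINUTE, 2)


# A 4-hour D&D session:
print(whisper_cost(240))  # → 1.44
```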
- Verbose JSON response with segment and word-level timestamps
- Automatic language detection
- Excellent Dutch language support
- Built-in retry logic with exponential backoff
- Temporary file cleanup after processing
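The retry behavior can be illustrated with a generic exponential-backoff wrapper. This is a sketch, not the project's actual implementation, and `transcribe` in the usage comment stands in for whatever function performs the API call:

```python
import time


def with_backoff(fn, max_retries=4, base_delay=1.0):
    """Call fn(), retrying on failure with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Usage (hypothetical):
# result = with_backoff(lambda: transcribe("session.wav"))
```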
Run the API validation script: `python test_api_keys.py`

**HuggingFace Inference API**

- Best for: Diarization (speaker identification)
- Cost: Free tier with rate limits
- Limitations: ~1000 requests/day on free tier
- Visit https://huggingface.co/settings/tokens
- Create a new token with "Make calls to Inference Providers" permission
- Add to your `.env` file: `HUGGING_FACE_API_KEY=your_hf_token_here`
- PyAnnote Audio 3.1 - Speaker diarization (who is speaking when)
- Whisper - Speech recognition
- Custom fine-tuned models - Upload your own
- Navigate to Step 2: Configure Session
- Expand Advanced Backend Settings accordion
- Select backends:
  - Diarization: `hf_api`
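Calling the HF Inference API directly looks roughly like the sketch below, using only the standard library. The endpoint URL, model ID, and JSON response shape are assumptions based on HuggingFace's usual conventions, so check their documentation for the current contract before relying on this:

```python
import json
import os
import urllib.request

# Assumed serverless endpoint for the model mentioned above.
API_URL = "https://api-inference.huggingface.co/models/pyannote/speaker-diarization-3.1"


def auth_headers(token: str) -> dict:
    """Bearer-token header expected by the HF Inference API."""
    return {"Authorization": f"Bearer {token}"}


def diarize(audio_path: str) -> dict:
    """POST raw audio bytes and parse the JSON response (speaker turns)."""
    with open(audio_path, "rb") as f:
        req = urllib.request.Request(
            API_URL,
            data=f.read(),
            headers=auth_headers(os.environ["HUGGING_FACE_API_KEY"]),
        )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Free-tier requests may queue before a worker picks them up, which is why the timing estimates later in this document show the HF API as slower than local PyAnnote.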
Run the API validation script: `python test_api_keys.py`

**Deepgram**

- Best for: Real-time transcription
- Cost: $200 in free credits (no credit card for trial)
- Speed: Extremely fast real-time processing
- Website: https://deepgram.com/
Pros:
- Superior accuracy
- Real-time streaming
- Multiple languages
Cons:
- Not implemented in this project yet
- Credits expire after trial period
**Gladia**

- Best for: Podcast/interview transcription
- Cost: 10 hours/month free
- Website: https://www.gladia.io/
Pros:
- Speaker diarization included
- Multi-language support
- Webhook callbacks
Cons:
- Not implemented in this project yet
- Limited free tier hours
**Cohere**

- Best for: Text classification
- Cost: Free tier with rate limits
- Website: https://cohere.com/
Pros:
- Excellent for classification tasks
- Good documentation
- Simple API
Cons:
- Not implemented in this project yet
- Primarily focused on text (not audio)
**Google Gemini**

- Best for: Multi-modal classification
- Cost: Free tier (60 requests/minute)
- Website: https://ai.google.dev/
Pros:
- Multi-modal (text, image, audio)
- Generous free tier
- Fast inference
Cons:
- Not implemented in this project yet
| Task | Local Backend | Cloud Backend | Free Cloud Option |
|---|---|---|---|
| Transcription | Whisper (GPU/CPU) | Groq / OpenAI | ✅ Groq (free, rate-limited) / 💰 OpenAI ($0.006/min) |
| Diarization | PyAnnote (GPU) | HF Inference | ✅ HuggingFace (~1000/day) |
| Classification | Ollama (CPU/GPU) | Groq LLaMA | ✅ Groq (free, rate-limited) |
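The table maps naturally onto a per-task backend configuration. The dict and the validation helper below are illustrative (they mirror the backend names used in this document: `whisper`, `groq`, `openai`, `pyannote`, `hf_api`, `ollama`), not the app's actual config code:

```python
# Valid backends per pipeline task, as described in this document.
KNOWN_BACKENDS = {
    "transcription": ["whisper", "groq", "openai"],
    "diarization": ["pyannote", "hf_api"],
    "classification": ["ollama", "groq"],
}


def validate_backends(selection: dict) -> None:
    """Raise ValueError if any selected backend is invalid for its task."""
    for task, backend in selection.items():
        if backend not in KNOWN_BACKENDS.get(task, []):
            raise ValueError(f"{backend!r} is not a known backend for {task}")


# The recommended hybrid setup passes validation:
validate_backends({
    "transcription": "groq",
    "diarization": "pyannote",
    "classification": "groq",
})
```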
Local Processing (12GB VRAM):
- ✅ Whisper Large V3 (~10GB VRAM)
- ✅ PyAnnote Audio (~8GB VRAM)
- ⚠️ Ollama LLaMA 3.1 8B (~8GB VRAM) - may conflict with other models
Cloud Processing:
- ✅ No VRAM required
- ✅ No local compute
- ✅ Scales automatically
Symptoms:
- Processing fails at classification stage
- Errors like: `model requires more system memory (12.8 GiB) than is available (9.3 GiB)`
- Errors like: `memory layout cannot be allocated`
- CUDA out-of-memory errors
- Ollama timeouts
Root Cause: VRAM contention between PyAnnote (diarization) and Ollama (classification). PyAnnote occupies ~8GB VRAM and doesn't fully release it before Ollama tries to load large models.
Quick Fix: Switch classification to Groq (100% free, no VRAM usage):
Classification Backend: groq
Detailed Troubleshooting: See `TROUBLESHOOTING_OLLAMA.md` for:
- Complete root cause analysis with log examples
- 5 different solution approaches
- Model size comparisons
- Performance benchmarks
- Step-by-step diagnostic commands
Best setup to avoid Ollama errors:
Transcription: groq (cloud - free, fast)
Diarization: pyannote (local - uses 8GB VRAM)
Classification: groq (cloud - free, fast)
Alternative with OpenAI (for highest quality):
Transcription: openai (cloud - paid, high quality)
Diarization: pyannote (local - uses 8GB VRAM)
Classification: groq (cloud - free, fast)
Why this works:
- Cloud transcription handles audio processing (no local VRAM usage)
- PyAnnote runs on GPU with 8GB VRAM (plenty of headroom)
- Groq handles classification (no local VRAM usage)
- No VRAM contention = no Ollama errors
Alternative (all local, but risky with 12GB VRAM):
Transcription: whisper (local - ~10GB VRAM)
Diarization: pyannote (local - ~8GB VRAM, runs after Whisper)
Classification: ollama (local - ~8GB VRAM, runs after PyAnnote)
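With the approximate footprints above, a quick budget check shows why the all-local stack is tight on 12GB. The figures are the rough estimates from this document, not measurements:

```python
# Approximate VRAM footprints (GB) from this document.
VRAM_GB = {"whisper": 10, "pyannote": 8, "ollama": 8}


def peak_vram(concurrent_models: list[str]) -> int:
    """Worst-case VRAM if these models are resident at the same time."""
    return sum(VRAM_GB[m] for m in concurrent_models)


# If PyAnnote hasn't released its memory before Ollama loads:
print(peak_vram(["pyannote", "ollama"]))  # → 16, which exceeds 12GB
```

This is exactly the contention described in the troubleshooting section: the stages run sequentially, but VRAM is only safe if each model is fully unloaded before the next one starts.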
- Local Whisper (GPU): ~15-20 minutes
- Groq Whisper: ~5-10 minutes
- Deepgram: ~2-3 minutes (real-time)
- Local PyAnnote (GPU): ~10-15 minutes
- HF Inference API: ~15-20 minutes (due to queue)
- Local Ollama (GPU): ~5-10 minutes
- Groq LLaMA: ~2-3 minutes
- Cohere: ~1-2 minutes
- Store keys in the `.env` file (never commit it to git)
- Use environment variables
- Rotate keys regularly
- Use read-only tokens when possible (HuggingFace)
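When logging configuration at startup, avoid printing keys verbatim. A small masking helper (illustrative; the example key below is made up):

```python
def mask_key(key: str, visible: int = 4) -> str:
    """Redact a secret for log output, keeping only the last few characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# Only the last 4 characters remain visible:
print(mask_key("gsk_example_key_3456"))
```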
- Groq: Data not used for training (check Terms of Service)
- HuggingFace: Inference API doesn't store audio
- Local: 100% private (no data leaves your machine)
For sensitive campaigns:
- Use local backends only
- Or review cloud provider privacy policies carefully
- Groq Console
- HuggingFace Inference API
- Deepgram API Docs
- Gladia API Docs
- Cohere API Docs
- Google AI Studio
If your local Ollama keeps erroring out:

- Get a Groq API key:
  - Visit https://console.groq.com/
  - Sign up (free, no credit card)
  - Copy your API key
- Add it to `.env`:
  - `echo "GROQ_API_KEY=your_key_here" >> .env`
- Test the connection:
  - `python test_api_keys.py`
- Configure in the UI:
  - Open the Gradio app
  - Go to Step 2: Configure Session
  - Expand Advanced Backend Settings
  - Set Classification to `groq`
- Process your session:
  - Should complete without Ollama errors
  - Cloud classification is often faster
  - Completely free within Groq's rate limits
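A minimal version of such a validation step just checks that the expected environment variables are present. This is a sketch of the idea behind `test_api_keys.py`, not its actual contents:

```python
import os

# Only Groq is strictly required for the recommended hybrid setup;
# OpenAI and HuggingFace keys are optional depending on chosen backends.
REQUIRED = ["GROQ_API_KEY"]


def check_keys(env=os.environ) -> list[str]:
    """Return the list of required keys that are missing or empty."""
    return [k for k in REQUIRED if not env.get(k)]


missing = check_keys()
if missing:
    print(f"Missing keys: {', '.join(missing)}")
else:
    print("All required API keys present.")
```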
Before processing a session with cloud backends:
- API keys added to the `.env` file
- Ran `python test_api_keys.py` successfully
- Selected cloud backends in UI (Step 2)
- Internet connection is stable
- (Optional) Set up local fallback if cloud fails
Last Updated: 2025-11-11 Version: 1.0