
Cloud Inference Options for D&D Session Processing

Overview

This document outlines free (and one low-cost pay-per-use) cloud API options for offloading compute-intensive tasks in the D&D Session Processor. These are particularly useful if local inference (Ollama, Whisper, PyAnnote) runs into issues or if you want to leverage cloud resources.


🎯 Recommended: Groq (100% FREE)

Best for: Transcription, classification (IC/OOC)
Cost: Completely free with registration (no credit card required)
Speed: Among the fastest inference available (up to ~1,000 tokens/second)

Setup

  1. Visit https://console.groq.com/
  2. Register for a free account (no payment method needed)
  3. Generate an API key
  4. Add to your .env file:
    GROQ_API_KEY=your_groq_api_key_here
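With the key in place, a quick sanity check from Python looks roughly like this (a minimal sketch using the official groq package; the prompt is illustrative, and the model is the default listed under Supported Models below):

    # pip install groq
    from groq import Groq

    client = Groq()  # reads GROQ_API_KEY from the environment

    # Illustrative IC/OOC-style prompt; the project's real classification prompt differs.
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user",
                   "content": "Answer IC or OOC: 'Can we order pizza after this fight?'"}],
    )
    print(resp.choices[0].message.content)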
    

Supported Models

  • LLaMA 3.3 70B Versatile (llama-3.3-70b-versatile) - DEFAULT for classification/transcription
  • LLaMA 3.1 8B Instant (llama-3.1-8b-instant) - Fast, efficient alternative
  • Mixtral 8x7B (mixtral-8x7b-32768) - Large context window
  • Whisper Large V3 (whisper-large-v3) - Speech-to-text transcription

Note: Older models like llama3-8b-8192 and llama-3.2-1b-preview have been decommissioned.
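Transcription goes through the same client's audio endpoint; a sketch (the file path is illustrative, and response options may evolve, so check Groq's docs):

    from groq import Groq

    client = Groq()  # reads GROQ_API_KEY from the environment

    # Transcribe a session recording with Groq-hosted Whisper.
    with open("session_01.mp3", "rb") as audio:  # illustrative path
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio,
            response_format="verbose_json",  # includes segment timestamps
        )
    print(result.text)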

Rate limit tuning

Free-tier Groq accounts enforce tight burst limits. Configure the .env knobs below to prevent rate_limit_exceeded responses:

  • GROQ_MAX_CALLS_PER_SECOND – steady throughput target (defaults to 2 req/s)
  • GROQ_RATE_LIMIT_BURST – how many calls may fire inside the window before throttling
  • GROQ_RATE_LIMIT_PERIOD_SECONDS – moving window duration used by the limiter

The IC/OOC classifier now enforces these limits automatically and backs off whenever the API reports a rate limit.
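A minimal sketch of the moving-window throttle those knobs describe (variable names mirror the .env entries above; the fallback defaults are illustrative, and the project's actual limiter may differ):

    import os, time
    from collections import deque

    BURST = int(os.getenv("GROQ_RATE_LIMIT_BURST", "4"))
    PERIOD = float(os.getenv("GROQ_RATE_LIMIT_PERIOD_SECONDS", "2"))

    _calls = deque()  # monotonic timestamps of recent calls

    def throttle() -> None:
        """Block until another call fits inside the moving window."""
        now = time.monotonic()
        while _calls and now - _calls[0] > PERIOD:
            _calls.popleft()  # forget calls that have left the window
        if len(_calls) >= BURST:
            time.sleep(PERIOD - (now - _calls[0]))  # wait for the oldest call to expire
        _calls.append(time.monotonic())

Call throttle() immediately before each API request; GROQ_MAX_CALLS_PER_SECOND can additionally cap steady throughput by enforcing a minimum spacing between calls.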

Configuration in UI

  1. Navigate to Step 2: Configure Session
  2. Expand Advanced Backend Settings accordion
  3. Select backends:
    • Transcription: groq
    • Classification: groq

Testing

Run the API validation script:

python test_api_keys.py

🟢 OpenAI Whisper API (PAY-PER-USE)

Best for: High-quality transcription with official OpenAI support
Cost: Pay-per-use ($0.006 per minute of audio)
Speed: Fast cloud processing

Setup

  1. Visit https://platform.openai.com/api-keys
  2. Create a new API key (requires payment method on file)
  3. Add to your .env file:
    OPENAI_API_KEY=your_openai_api_key_here
    

Supported Models

  • Whisper-1 - Official OpenAI Whisper model with excellent multilingual support

Configuration in UI

  1. Navigate to Step 2: Configure Session
  2. Expand Advanced Backend Settings accordion
  3. Select backends:
    • Transcription: openai

Pricing

  • $0.006 per minute of audio
  • Example: 4-hour D&D session = 240 minutes × $0.006 = $1.44
  • Much cheaper than real-time transcription services
  • No monthly minimums or subscription required
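The flat per-minute rate makes cost estimates a one-liner (a sketch using the rate quoted above):

    def whisper_cost_usd(minutes: float, rate_per_minute: float = 0.006) -> float:
        """Estimated OpenAI Whisper API cost for a session of the given length."""
        return minutes * rate_per_minute

    print(whisper_cost_usd(240))  # 4-hour session -> 1.44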

Features

  • Verbose JSON response with segment and word-level timestamps
  • Automatic language detection
  • Excellent Dutch language support
  • Built-in retry logic with exponential backoff
  • Temporary file cleanup after processing
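A sketch of what those features look like as a direct API call (official openai package; the retry parameters are illustrative, and the project's actual backoff settings may differ):

    import time
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def transcribe(path: str, attempts: int = 3):
        """Transcribe with whisper-1, retrying with exponential backoff."""
        for attempt in range(attempts):
            try:
                with open(path, "rb") as audio:
                    return client.audio.transcriptions.create(
                        model="whisper-1",
                        file=audio,
                        response_format="verbose_json",  # segment-level detail
                        timestamp_granularities=["segment", "word"],
                    )
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(2 ** attempt)  # 1s, 2s, 4s ...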

Testing

Run the API validation script:

python test_api_keys.py

🤗 HuggingFace Inference API (FREE TIER)

Best for: Diarization (speaker identification)
Cost: Free tier with rate limits
Limitations: ~1000 requests/day on free tier

Setup

  1. Visit https://huggingface.co/settings/tokens
  2. Create a new token with "Make calls to Inference Providers" permission
  3. Add to your .env file:
    HUGGING_FACE_API_KEY=your_hf_token_here
    

Supported Models

  • PyAnnote Audio 3.1 - Speaker diarization (who is speaking when)
  • Whisper - Speech recognition
  • Custom fine-tuned models - Upload your own
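The HF Inference API follows one generic pattern: POST raw audio bytes to the model endpoint with your token. A sketch (the model id, endpoint availability, and response schema are assumptions; check the pyannote model card for specifics):

    import os
    import requests

    API_URL = "https://api-inference.huggingface.co/models/pyannote/speaker-diarization-3.1"  # assumed model id
    headers = {"Authorization": f"Bearer {os.getenv('HUGGING_FACE_API_KEY')}"}

    with open("session_01.wav", "rb") as f:  # illustrative path
        response = requests.post(API_URL, headers=headers, data=f.read())
    response.raise_for_status()
    print(response.json())  # speaker turns; exact schema depends on the model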

Configuration in UI

  1. Navigate to Step 2: Configure Session
  2. Expand Advanced Backend Settings accordion
  3. Select backends:
    • Diarization: hf_api

Testing

Run the API validation script:

python test_api_keys.py

🆓 Alternative Free Options

Deepgram

Best for: Real-time transcription
Cost: $200 free credits (no credit card for trial)
Speed: Extremely fast real-time processing
Website: https://deepgram.com/

Pros:

  • Superior accuracy
  • Real-time streaming
  • Multiple languages

Cons:

  • Not implemented in this project yet
  • Credits expire after trial period

Gladia

Best for: Podcast/interview transcription
Cost: 10 hours/month free
Website: https://www.gladia.io/

Pros:

  • Speaker diarization included
  • Multi-language support
  • Webhook callbacks

Cons:

  • Not implemented in this project yet
  • Limited free tier hours

Cohere

Best for: Text classification
Cost: Free tier with rate limits
Website: https://cohere.com/

Pros:

  • Excellent for classification tasks
  • Good documentation
  • Simple API

Cons:

  • Not implemented in this project yet
  • Primarily focused on text (not audio)

Google AI Studio (Gemini)

Best for: Multi-modal classification
Cost: Free tier (60 requests/minute)
Website: https://ai.google.dev/

Pros:

  • Multi-modal (text, image, audio)
  • Generous free tier
  • Fast inference

Cons:

  • Not implemented in this project yet

💻 Local vs Cloud: Quick Comparison

Task           | Local Backend     | Cloud Backend | Free Cloud Option
Transcription  | Whisper (GPU/CPU) | Groq / OpenAI | ✅ Groq (free) / 💰 OpenAI ($0.006/min)
Diarization    | PyAnnote (GPU)    | HF Inference  | ✅ HuggingFace (~1000 requests/day)
Classification | Ollama (CPU/GPU)  | Groq LLaMA    | ✅ Groq (free)

Hardware Requirements

Local Processing (12GB VRAM):

  • ✅ Whisper Large V3 (~10GB VRAM)
  • ✅ PyAnnote Audio (~8GB VRAM)
  • ⚠️ Ollama LLaMA 3.1 8B (~8GB VRAM) - May conflict with other models

Cloud Processing:

  • ✅ No VRAM required
  • ✅ No local compute
  • ✅ Scales automatically

๐Ÿ” Troubleshooting

Issue: Ollama classification errors during local processing

Symptoms:

  • Processing fails at classification stage
  • Errors like: model requires more system memory (12.8 GiB) than is available (9.3 GiB)
  • Errors like: memory layout cannot be allocated
  • CUDA out of memory errors
  • Ollama timeouts

Root Cause: VRAM contention between PyAnnote (diarization) and Ollama (classification). PyAnnote occupies ~8GB VRAM and doesn't fully release it before Ollama tries to load large models.
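To confirm contention before changing backends, check free VRAM between stages; a sketch using PyTorch (assuming a CUDA build is installed):

    import torch

    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # bytes on the current CUDA device
        print(f"free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
        # If this stays well under ~8 GB after diarization, PyAnnote has not
        # released its allocation and a local Ollama load is likely to fail.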

Quick Fix: Switch classification to Groq (100% free, no VRAM usage):

Classification Backend: groq

Detailed Troubleshooting: See TROUBLESHOOTING_OLLAMA.md for:

  • Complete root cause analysis with log examples
  • 5 different solution approaches
  • Model size comparisons
  • Performance benchmarks
  • Step-by-step diagnostic commands

🚀 Recommended Configuration for 12GB VRAM

Best setup to avoid Ollama errors:

Transcription:  groq       (cloud - free, fast)
Diarization:    pyannote   (local - uses 8GB VRAM)
Classification: groq       (cloud - free, fast)

Alternative with OpenAI (for highest quality):

Transcription:  openai     (cloud - paid, high quality)
Diarization:    pyannote   (local - uses 8GB VRAM)
Classification: groq       (cloud - free, fast)

Why this works:

  • Cloud transcription handles audio processing (no local VRAM usage)
  • PyAnnote runs on GPU with 8GB VRAM (plenty of headroom)
  • Groq handles classification (no local VRAM usage)
  • No VRAM contention = no Ollama errors

Alternative (all local, but risky with 12GB VRAM):

Transcription:  whisper    (local - ~10GB VRAM)
Diarization:    pyannote   (local - ~8GB VRAM, runs after Whisper)
Classification: ollama     (local - ~8GB VRAM, runs after PyAnnote)

โš ๏ธ This may cause VRAM errors if models don't unload properly.


📊 Performance Comparison

Transcription (60-minute video)

  • Local Whisper (GPU): ~15-20 minutes
  • Groq Whisper: ~5-10 minutes
  • Deepgram: ~2-3 minutes (real-time)

Diarization (60-minute video)

  • Local PyAnnote (GPU): ~10-15 minutes
  • HF Inference API: ~15-20 minutes (due to queue)

Classification (100 chunks)

  • Local Ollama (GPU): ~5-10 minutes
  • Groq LLaMA: ~2-3 minutes
  • Cohere: ~1-2 minutes

๐Ÿ” Security Considerations

API Keys

  • Store in .env file (never commit to git)
  • Use environment variables
  • Rotate keys regularly
  • Use read-only tokens when possible (HuggingFace)
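Loading keys from .env instead of hard-coding them is one call with python-dotenv (a sketch; assuming the project loads keys this way):

    import os
    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # reads .env so keys never appear in source or shell history
    if not os.getenv("GROQ_API_KEY"):
        raise RuntimeError("GROQ_API_KEY missing from .env")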

Data Privacy

  • Groq: States that data is not used for training (verify in the current Terms of Service)
  • HuggingFace: Inference API doesn't store audio
  • Local: 100% private (no data leaves your machine)

For sensitive campaigns:

  • Use local backends only
  • Or review cloud provider privacy policies carefully

📚 Additional Resources


🎮 Quick Start: Switching from Ollama to Groq

If your local Ollama keeps erroring out:

  1. Get Groq API Key:

    # Visit https://console.groq.com/
    # Sign up (free, no credit card)
    # Copy your API key
  2. Add to .env:

    echo "GROQ_API_KEY=your_key_here" >> .env
  3. Test the connection:

    python test_api_keys.py
  4. Configure in UI:

    • Open the Gradio app
    • Go to Step 2: Configure Session
    • Expand Advanced Backend Settings
    • Set Classification to: groq
  5. Process your session:

    • Should complete without Ollama errors
    • Cloud classification is often faster
    • Completely free (within Groq's free-tier rate limits)

✅ Validation Checklist

Before processing a session with cloud backends:

  • API keys added to .env file
  • Ran python test_api_keys.py successfully
  • Selected cloud backends in UI (Step 2)
  • Internet connection is stable
  • (Optional) Set up local fallback if cloud fails
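For a quick pre-flight check beyond test_api_keys.py, a sketch that verifies the expected keys are present (variable names match those used in this document):

    import os

    REQUIRED = ["GROQ_API_KEY"]  # needed for the recommended setup
    OPTIONAL = ["OPENAI_API_KEY", "HUGGING_FACE_API_KEY"]

    for name in REQUIRED:
        assert os.getenv(name), f"{name} is missing from the environment/.env"
    for name in OPTIONAL:
        print(f"{name}: {'set' if os.getenv(name) else 'not set'}")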

Last Updated: 2025-11-11
Version: 1.0