
Cloud Inference Options for D&D Session Processing

Overview

This document outlines free (and one low-cost pay-per-use) cloud API options for offloading compute-intensive tasks in the D&D Session Processor. These are particularly useful if local inference (Ollama, Whisper, PyAnnote) runs into issues or if you want to leverage cloud resources.


🎯 Recommended: Groq (100% FREE)

Best for: Transcription, classification (IC/OOC)
Cost: Completely free with registration (no credit card required)
Speed: Among the fastest inference available (up to ~1,000 tokens/second)

Setup

  1. Visit https://console.groq.com/
  2. Register for a free account (no payment method needed)
  3. Generate an API key
  4. Add to your .env file:
    GROQ_API_KEY=your_groq_api_key_here
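With the key in place, a quick sanity check from Python looks roughly like this (a minimal sketch using the official groq package; the prompt is illustrative, and the model is the default listed under Supported Models below):

    # pip install groq
    from groq import Groq

    client = Groq()  # reads GROQ_API_KEY from the environment

    # Illustrative IC/OOC-style prompt; the project's real classification prompt differs.
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user",
                   "content": "Answer IC or OOC: 'Can we order pizza after this fight?'"}],
    )
    print(resp.choices[0].message.content)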
    

Supported Models

  • LLaMA 3.3 70B Versatile (llama-3.3-70b-versatile) - DEFAULT for classification/transcription
  • LLaMA 3.1 8B Instant (llama-3.1-8b-instant) - Fast, efficient alternative
  • Mixtral 8x7B (mixtral-8x7b-32768) - Large context window
  • Whisper Large V3 (whisper-large-v3) - Speech-to-text transcription

Note: Older models like llama3-8b-8192 and llama-3.2-1b-preview have been decommissioned.
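Transcription goes through the same client's audio endpoint; a sketch (the file path is illustrative, and response options may evolve, so check Groq's docs):

    from groq import Groq

    client = Groq()  # reads GROQ_API_KEY from the environment

    # Transcribe a session recording with Groq-hosted Whisper.
    with open("session_01.mp3", "rb") as audio:  # illustrative path
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio,
            response_format="verbose_json",  # includes segment timestamps
        )
    print(result.text)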

Rate limit tuning

Free-tier Groq accounts enforce tight burst limits. Configure the .env knobs below to prevent rate_limit_exceeded responses:

  • GROQ_MAX_CALLS_PER_SECOND – steady throughput target (defaults to 2 req/s)
  • GROQ_RATE_LIMIT_BURST – how many calls may fire inside the window before throttling
  • GROQ_RATE_LIMIT_PERIOD_SECONDS – moving window duration used by the limiter

The IC/OOC classifier now enforces these limits automatically and backs off whenever the API reports a rate limit.
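A minimal sketch of the moving-window throttle those knobs describe (variable names mirror the .env entries above; the fallback defaults are illustrative, and the project's actual limiter may differ):

    import os, time
    from collections import deque

    BURST = int(os.getenv("GROQ_RATE_LIMIT_BURST", "4"))
    PERIOD = float(os.getenv("GROQ_RATE_LIMIT_PERIOD_SECONDS", "2"))

    _calls = deque()  # monotonic timestamps of recent calls

    def throttle() -> None:
        """Block until another call fits inside the moving window."""
        now = time.monotonic()
        while _calls and now - _calls[0] > PERIOD:
            _calls.popleft()  # forget calls that have left the window
        if len(_calls) >= BURST:
            time.sleep(PERIOD - (now - _calls[0]))  # wait for the oldest call to expire
        _calls.append(time.monotonic())

Call throttle() immediately before each API request; GROQ_MAX_CALLS_PER_SECOND can additionally cap steady throughput by enforcing a minimum spacing between calls.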

Configuration in UI

  1. Navigate to Step 2: Configure Session
  2. Expand Advanced Backend Settings accordion
  3. Select backends:
    • Transcription: groq
    • Classification: groq

Testing

Run the API validation script:

python test_api_keys.py

🟢 OpenAI Whisper API (PAY-PER-USE)

Best for: High-quality transcription with official OpenAI support
Cost: Pay-per-use ($0.006 per minute of audio)
Speed: Fast cloud processing

Setup

  1. Visit https://platform.openai.com/api-keys
  2. Create a new API key (requires payment method on file)
  3. Add to your .env file:
    OPENAI_API_KEY=your_openai_api_key_here
    

Supported Models

  • Whisper-1 - Official OpenAI Whisper model with excellent multilingual support

Configuration in UI

  1. Navigate to Step 2: Configure Session
  2. Expand Advanced Backend Settings accordion
  3. Select backends:
    • Transcription: openai

Pricing

  • $0.006 per minute of audio
  • Example: 4-hour D&D session = 240 minutes × $0.006 = $1.44
  • Much cheaper than real-time transcription services
  • No monthly minimums or subscription required
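The flat per-minute rate makes cost estimates a one-liner (a sketch using the rate quoted above):

    def whisper_cost_usd(minutes: float, rate_per_minute: float = 0.006) -> float:
        """Estimated OpenAI Whisper API cost for a session of the given length."""
        return minutes * rate_per_minute

    print(whisper_cost_usd(240))  # 4-hour session -> 1.44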

Features

  • Verbose JSON response with segment and word-level timestamps
  • Automatic language detection
  • Excellent Dutch language support
  • Built-in retry logic with exponential backoff
  • Temporary file cleanup after processing
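A sketch of what those features look like as a direct API call (official openai package; the retry parameters are illustrative, and the project's actual backoff settings may differ):

    import time
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def transcribe(path: str, attempts: int = 3):
        """Transcribe with whisper-1, retrying with exponential backoff."""
        for attempt in range(attempts):
            try:
                with open(path, "rb") as audio:
                    return client.audio.transcriptions.create(
                        model="whisper-1",
                        file=audio,
                        response_format="verbose_json",  # segment-level detail
                        timestamp_granularities=["segment", "word"],
                    )
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(2 ** attempt)  # 1s, 2s, 4s ...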

Testing

Run the API validation script:

python test_api_keys.py

🤗 HuggingFace Inference API (FREE TIER)

Best for: Diarization (speaker identification)
Cost: Free tier with rate limits
Limitations: ~1000 requests/day on free tier

Setup

  1. Visit https://huggingface.co/settings/tokens
  2. Create a new token with "Make calls to Inference Providers" permission
  3. Add to your .env file:
    HUGGING_FACE_API_KEY=your_hf_token_here
    

Supported Models

  • PyAnnote Audio 3.1 - Speaker diarization (who is speaking when)
  • Whisper - Speech recognition
  • Custom fine-tuned models - Upload your own
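The HF Inference API follows one generic pattern: POST raw audio bytes to the model endpoint with your token. A sketch (the model id, endpoint availability, and response schema are assumptions; check the pyannote model card for specifics):

    import os
    import requests

    API_URL = "https://api-inference.huggingface.co/models/pyannote/speaker-diarization-3.1"  # assumed model id
    headers = {"Authorization": f"Bearer {os.getenv('HUGGING_FACE_API_KEY')}"}

    with open("session_01.wav", "rb") as f:  # illustrative path
        response = requests.post(API_URL, headers=headers, data=f.read())
    response.raise_for_status()
    print(response.json())  # speaker turns; exact schema depends on the model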

Configuration in UI

  1. Navigate to Step 2: Configure Session
  2. Expand Advanced Backend Settings accordion
  3. Select backends:
    • Diarization: hf_api

Testing

Run the API validation script:

python test_api_keys.py

🆓 Alternative Free Options

Deepgram

Best for: Real-time transcription
Cost: $200 free credits (no credit card for trial)
Speed: Extremely fast real-time processing
Website: https://deepgram.com/

Pros:

  • Superior accuracy
  • Real-time streaming
  • Multiple languages

Cons:

  • Not implemented in this project yet
  • Credits expire after trial period

Gladia

Best for: Podcast/interview transcription
Cost: 10 hours/month free
Website: https://www.gladia.io/

Pros:

  • Speaker diarization included
  • Multi-language support
  • Webhook callbacks

Cons:

  • Not implemented in this project yet
  • Limited free tier hours

Cohere

Best for: Text classification
Cost: Free tier with rate limits
Website: https://cohere.com/

Pros:

  • Excellent for classification tasks
  • Good documentation
  • Simple API

Cons:

  • Not implemented in this project yet
  • Primarily focused on text (not audio)

Google AI Studio (Gemini)

Best for: Multi-modal classification
Cost: Free tier (60 requests/minute)
Website: https://ai.google.dev/

Pros:

  • Multi-modal (text, image, audio)
  • Generous free tier
  • Fast inference

Cons:

  • Not implemented in this project yet

💻 Local vs Cloud: Quick Comparison

Task           | Local Backend     | Cloud Backend | Free Cloud Option
Transcription  | Whisper (GPU/CPU) | Groq / OpenAI | ✅ Groq (free) / 💰 OpenAI ($0.006/min)
Diarization    | PyAnnote (GPU)    | HF Inference  | ✅ HuggingFace (~1000 requests/day)
Classification | Ollama (CPU/GPU)  | Groq LLaMA    | ✅ Groq (free)

Hardware Requirements

Local Processing (12GB VRAM):

  • ✅ Whisper Large V3 (~10GB VRAM)
  • ✅ PyAnnote Audio (~8GB VRAM)
  • ⚠️ Ollama LLaMA 3.1 8B (~8GB VRAM) - May conflict with other models

Cloud Processing:

  • ✅ No VRAM required
  • ✅ No local compute
  • ✅ Scales automatically

๐Ÿ” Troubleshooting

Issue: Ollama classification errors during local processing

Symptoms:

  • Processing fails at classification stage
  • Errors like: model requires more system memory (12.8 GiB) than is available (9.3 GiB)
  • Errors like: memory layout cannot be allocated
  • CUDA out of memory errors
  • Ollama timeouts

Root Cause: VRAM contention between PyAnnote (diarization) and Ollama (classification). PyAnnote occupies ~8GB VRAM and doesn't fully release it before Ollama tries to load large models.
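To confirm contention before changing backends, check free VRAM between stages; a sketch using PyTorch (assuming a CUDA build is installed):

    import torch

    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # bytes on the current CUDA device
        print(f"free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
        # If this stays well under ~8 GB after diarization, PyAnnote has not
        # released its allocation and a local Ollama load is likely to fail.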

Quick Fix: Switch classification to Groq (100% free, no VRAM usage):

Classification Backend: groq

Detailed Troubleshooting: See TROUBLESHOOTING_OLLAMA.md for:

  • Complete root cause analysis with log examples
  • 5 different solution approaches
  • Model size comparisons
  • Performance benchmarks
  • Step-by-step diagnostic commands

🚀 Recommended Configuration for 12GB VRAM

Best setup to avoid Ollama errors:

Transcription:  groq       (cloud - free, fast)
Diarization:    pyannote   (local - uses 8GB VRAM)
Classification: groq       (cloud - free, fast)

Alternative with OpenAI (for highest quality):

Transcription:  openai     (cloud - paid, high quality)
Diarization:    pyannote   (local - uses 8GB VRAM)
Classification: groq       (cloud - free, fast)

Why this works:

  • Cloud transcription handles audio processing (no local VRAM usage)
  • PyAnnote runs on GPU with 8GB VRAM (plenty of headroom)
  • Groq handles classification (no local VRAM usage)
  • No VRAM contention = no Ollama errors

Alternative (all local, but risky with 12GB VRAM):

Transcription:  whisper    (local - ~10GB VRAM)
Diarization:    pyannote   (local - ~8GB VRAM, runs after Whisper)
Classification: ollama     (local - ~8GB VRAM, runs after PyAnnote)

โš ๏ธ This may cause VRAM errors if models don't unload properly.


📊 Performance Comparison

Transcription (60-minute video)

  • Local Whisper (GPU): ~15-20 minutes
  • Groq Whisper: ~5-10 minutes
  • Deepgram: ~2-3 minutes (real-time)

Diarization (60-minute video)

  • Local PyAnnote (GPU): ~10-15 minutes
  • HF Inference API: ~15-20 minutes (due to queue)

Classification (100 chunks)

  • Local Ollama (GPU): ~5-10 minutes
  • Groq LLaMA: ~2-3 minutes
  • Cohere: ~1-2 minutes

๐Ÿ” Security Considerations

API Keys

  • Store in .env file (never commit to git)
  • Use environment variables
  • Rotate keys regularly
  • Use read-only tokens when possible (HuggingFace)
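Loading keys from .env instead of hard-coding them is one call with python-dotenv (a sketch; assuming the project loads keys this way):

    import os
    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # reads .env so keys never appear in source or shell history
    if not os.getenv("GROQ_API_KEY"):
        raise RuntimeError("GROQ_API_KEY missing from .env")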

Data Privacy

  • Groq: States that data is not used for training (verify in the current Terms of Service)
  • HuggingFace: Inference API doesn't store audio
  • Local: 100% private (no data leaves your machine)

For sensitive campaigns:

  • Use local backends only
  • Or review cloud provider privacy policies carefully

📚 Additional Resources


🎮 Quick Start: Switching from Ollama to Groq

If your local Ollama keeps erroring out:

  1. Get Groq API Key:

    # Visit https://console.groq.com/
    # Sign up (free, no credit card)
    # Copy your API key
  2. Add to .env:

    echo "GROQ_API_KEY=your_key_here" >> .env
  3. Test the connection:

    python test_api_keys.py
  4. Configure in UI:

    • Open the Gradio app
    • Go to Step 2: Configure Session
    • Expand Advanced Backend Settings
    • Set Classification to: groq
  5. Process your session:

    • Should complete without Ollama errors
    • Cloud classification is often faster
    • Completely free (within Groq's free-tier rate limits)

✅ Validation Checklist

Before processing a session with cloud backends:

  • API keys added to .env file
  • Ran python test_api_keys.py successfully
  • Selected cloud backends in UI (Step 2)
  • Internet connection is stable
  • (Optional) Set up local fallback if cloud fails
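For a quick pre-flight check beyond test_api_keys.py, a sketch that verifies the expected keys are present (variable names match those used in this document):

    import os

    REQUIRED = ["GROQ_API_KEY"]  # needed for the recommended setup
    OPTIONAL = ["OPENAI_API_KEY", "HUGGING_FACE_API_KEY"]

    for name in REQUIRED:
        assert os.getenv(name), f"{name} is missing from the environment/.env"
    for name in OPTIONAL:
        print(f"{name}: {'set' if os.getenv(name) else 'not set'}")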

Last Updated: 2025-11-11
Version: 1.0