The Agentic Correction System includes a comprehensive human feedback loop that enables continuous improvement through learning from manual corrections. This guide explains how to use the system, analyze collected data, and improve AI accuracy over time.
- Making Corrections in the UI
- Annotation Collection Process
- Analyzing Collected Data
- Improving the Classifier
- Evaluating Improvement
- Future: Fine-Tuning & RLHF
```bash
python -m lyrics_transcriber.cli.cli_main your-song.mp3 \
  --artist "Artist Name" --title "Song Title"
```
This will:
- Transcribe the audio
- Fetch reference lyrics
- Identify anchor sequences and gaps
- Run agentic AI correction (if `USE_AGENTIC_AI=1`)
- Launch the review UI in your browser
When you make edits in the UI:
- Edit a word's text
- Delete a word
- Merge/split words
- Adjust timing
An annotation modal will appear (if annotations are enabled).
The modal asks for:
- **Correction Type** (dropdown):
  - Sound-Alike Error
  - Background Vocals
  - Extra Filler Words
  - Punctuation/Style Only
  - Repeated Section
  - Complex Multi-Error
  - Ambiguous
  - No Error
  - Manual Edit
- **Confidence** (1-5 slider):
  - 1: Very Uncertain
  - 2: Somewhat Uncertain
  - 3: Neutral
  - 4: Fairly Confident
  - 5: Very Confident
- **Reasoning** (text area, minimum 10 characters):
  - Explain WHY this correction is needed
  - Reference what you heard in the audio
  - Mention if the reference lyrics helped
- **Save & Continue**: Stores the annotation for analysis
- **Skip**: Applies the correction without an annotation (not recommended)
When you click "Finish Review":
- All corrections are applied
- All annotations are submitted to the backend
- Data is saved to `cache/correction_annotations.jsonl`
Annotations are stored in JSONL (JSON Lines) format:
- File: `cache/correction_annotations.jsonl`
- One annotation per line
- Easy to append, version-control friendly
- Can be parsed line-by-line for large datasets (see the loader sketch below)
```json
{
  "annotation_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio_hash": "abc123",
  "gap_id": "gap_1",
  "annotation_type": "SOUND_ALIKE",
  "action_taken": "REPLACE",
  "original_text": "out I'm starting over",
  "corrected_text": "now I'm starting over",
  "confidence": 5.0,
  "reasoning": "The word 'out' sounds like 'now' but the reference lyrics and context make it clear it should be 'now'",
  "word_ids_affected": ["word_123"],
  "agentic_proposal": {"action": "ReplaceWord", "replacement_text": "now"},
  "agentic_category": "SOUND_ALIKE",
  "agentic_agreed": true,
  "reference_sources_consulted": ["genius", "spotify"],
  "artist": "Rancid",
  "title": "Time Bomb",
  "session_id": "lyrics-correction-abc123",
  "timestamp": "2025-10-27T12:00:00"
}
```
For each correction, we collect:
- What changed: Original → Corrected text
- Why it changed: Human reasoning
- How confident: 1-5 scale
- What AI suggested: For comparison
- Agreement: Did human agree with AI?
- Context: Song, artist, reference sources used
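Because the store is plain JSONL, it is easy to load for ad-hoc analysis. A minimal loader sketch (the path and field names follow the schema above; corrupt lines are skipped, which matches the recovery advice in the FAQ below):

```python
import json
from pathlib import Path


def load_annotations(path="cache/correction_annotations.jsonl"):
    """Read annotations line by line, skipping any corrupt lines."""
    annotations = []
    for line_no, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue
        try:
            annotations.append(json.loads(line))
        except json.JSONDecodeError:
            print(f"Skipping corrupt line {line_no}")
    return annotations


annotations = load_annotations()
print(f"Loaded {len(annotations)} annotations")
```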
```bash
python scripts/analyze_annotations.py
```
Options:
- `--cache-dir cache`: Where to find annotations
- `--output CORRECTION_ANALYSIS.md`: Where to save the report
The report includes:
- **Overview Statistics**
  - Total annotations collected
  - Number of unique songs/artists
  - Date range
  - Average confidence
  - High-confidence percentage
- **Breakdown by Type**
  - How many of each error category
  - Percentage distribution
- **Actions Taken**
  - REPLACE, DELETE, NO_ACTION, etc.
  - Which actions are most common
- **AI Performance**
  - Overall agreement rate
  - Agreement rate by category
  - Which categories the AI handles well or poorly
- **Common Error Patterns**
  - Top 20 most frequent corrections
  - "word A → word B" patterns
  - Examples from real songs
- **Frequently Misheard Words**
  - Sound-alike errors that occur multiple times
  - e.g., "out" → "now", "said" → "set"
- **Reference Source Usage**
  - Which sources are consulted most often
  - Helps identify the most reliable sources
- **Recommendations**
  - Categories needing improvement
  - When to regenerate few-shot examples
  - When you have enough data for fine-tuning
A sample of the generated report:

```markdown
## Most Common Error Patterns

### 1. `out → now` (15 occurrences)
- **Type:** SOUND_ALIKE
- **Average Confidence:** 4.8/5.0
- **Examples:**
  - Rancid - Time Bomb: "Classic homophone error..."
  - ...

## Agentic AI Performance
- **Overall Agreement Rate:** 65.3%

### Agreement by Category
- **SOUND_ALIKE:** 78.5% (23 samples)
- **BACKGROUND_VOCALS:** 92.1% (12 samples)
- **EXTRA_WORDS:** 45.2% (8 samples) ⚠️ Needs improvement
```

Minimum recommended:
- 20+ high-confidence annotations (confidence >= 4)
- At least 3-5 examples per category
- Multiple different songs/artists
Check if ready:
```bash
python scripts/analyze_annotations.py
```
Look for the "Training Data Available" section in the report.
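Or verify the thresholds yourself; a quick sketch reusing the `annotations` list from the loader above (field names per the annotation schema):

```python
from collections import Counter

# Count high-confidence annotations (rated 4 or 5) per category
high_conf = [a for a in annotations if a.get("confidence", 0) >= 4]
per_category = Counter(a.get("annotation_type", "UNKNOWN") for a in high_conf)

print(f"High-confidence annotations: {len(high_conf)} (target: 20+)")
for category, count in per_category.most_common():
    status = "ok" if count >= 3 else "needs more"
    print(f"  {category}: {count} examples ({status})")
```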
```bash
python scripts/generate_few_shot_examples.py
```
Options:
- `--min-confidence 4.0`: Only use annotations rated 4 or 5
- `--max-per-category 5`: How many examples per category
- `--output path/to/examples.yaml`: Custom output location

Output: `lyrics_transcriber/correction/agentic/prompts/examples.yaml`
Review the generated `examples.yaml`:
```yaml
metadata:
  generated_at: cache/correction_annotations.jsonl
  total_examples: 25
  categories: [sound_alike, background_vocals, extra_words, ...]
examples_by_category:
  sound_alike:
    - gap_text: "out I'm starting over"
      corrected_text: "now I'm starting over"
      action: REPLACE
      reasoning: "..."
      confidence: 5.0
      artist: "Rancid"
      title: "Time Bomb"
      agentic_agreed: true
```
The classifier automatically checks for `examples.yaml` on startup:
```python
from pathlib import Path
from typing import Dict, List

import yaml


def load_few_shot_examples() -> Dict[str, List[Dict]]:
    examples_path = Path(__file__).parent / "examples.yaml"
    if not examples_path.exists():
        return get_hardcoded_examples()  # Fall back to built-in defaults
    # Load curated examples from the generated file
    with open(examples_path, "r") as f:
        data = yaml.safe_load(f)
    return data.get("examples_by_category", {})
```
No code changes needed - just regenerate the examples file!
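To sanity-check a freshly generated file before relying on it, a quick inspection sketch (assuming the schema shown above):

```python
import yaml

with open("lyrics_transcriber/correction/agentic/prompts/examples.yaml") as f:
    data = yaml.safe_load(f)

# Print the headline counts so you can spot empty or lopsided categories
print("Total examples:", data["metadata"]["total_examples"])
for category, items in data["examples_by_category"].items():
    print(f"  {category}: {len(items)} examples")
```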
```bash
USE_AGENTIC_AI=1 python -m lyrics_transcriber.cli.cli_main test-song.mp3 \
  --artist "Test Artist" --title "Test Song"
```
Monitor:
- Classification accuracy (check Langfuse traces)
- Agreement rate in next batch of annotations
- Corrections that are automatically applied
Track these over time as you collect more data:
- **AI Agreement Rate**
  - Target: > 70% overall
  - Track per category (some will be harder than others)
- **Classification Accuracy**
  - What % of gaps are correctly categorized
  - Measured by human verification
- **Auto-Correction Rate**
  - What % of gaps are automatically corrected (vs. flagged)
  - Should increase over time
- **High-Confidence Annotations**
  - What % of human corrections are rated 4-5
  - Higher = clearer patterns
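The agreement metric can be computed straight from the stored annotations; a sketch using the `annotation_type` and `agentic_agreed` fields from the schema above, with `annotations` coming from the loader sketch earlier:

```python
from collections import defaultdict

# Tally AI agreement per category using the agentic_agreed field
agreed = defaultdict(int)
total = defaultdict(int)
for a in annotations:
    category = a.get("annotation_type", "UNKNOWN")
    total[category] += 1
    if a.get("agentic_agreed"):
        agreed[category] += 1

for category in sorted(total):
    rate = 100.0 * agreed[category] / total[category]
    flag = "" if rate >= 70 else "  <- below 70% target"
    print(f"{category}: {rate:.1f}% of {total[category]} samples{flag}")
```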
1. Process songs with agentic AI
2. Human reviews and corrects in UI
3. Annotations collected
4. Analyze annotations (identify patterns)
5. Regenerate few-shot examples
6. Classifier improves
7. Repeat with next batch of songs
- Weekly: Run analysis script to monitor progress
- Monthly: Regenerate few-shot examples if you have 20+ new high-confidence annotations
- Quarterly: Review agreement rates by category, adjust prompts if needed
If a category's agreement rate is low:
- Review the annotations for that category
- Check if the prompt examples are misleading
- Consider adding more specific guidelines

If the overall agreement rate drops:
- Might indicate more complex songs
- Or the classifier is making less obvious mistakes

If the same error pattern keeps recurring:
- Check if the few-shot examples cover this pattern
- You may need to add explicit handling
Once you have 100-200+ high-confidence annotations:
- **Export Training Data** (see the data-formatting sketch after the tools list below):

  ```python
  from lyrics_transcriber.correction.feedback.store import FeedbackStore

  store = FeedbackStore("cache")
  training_file = store.export_to_training_data()
  print(f"Training data: {training_file}")
  ```
- **Fine-Tune a Small Model**
  - Use Llama 3.1-8B or a similar open model
  - Fine-tune on the classification task
  - Much faster and cheaper than GPT-4 for inference
  - Can run locally via Ollama
- **Reinforcement Learning from Human Feedback (RLHF)**
  - Collect preference rankings (A vs B comparisons)
  - Fine-tune the model to align with human preferences
  - More advanced but very powerful
Useful tools:
- Hugging Face Transformers: Standard fine-tuning pipeline
- Axolotl: Easy fine-tuning for open models
- Langfuse: Can track model versions and performance
- PEFT/LoRA: Parameter-efficient fine-tuning (faster, cheaper)
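If you go the fine-tuning route, the exported data must be reshaped into whatever format your trainer expects. A hypothetical sketch producing generic instruction/response pairs from the annotation fields shown earlier (the exact target format depends on your fine-tuning toolkit, and the output path is illustrative):

```python
import json


def to_instruction_pair(annotation):
    """Map one annotation onto a generic instruction/response pair."""
    return {
        "instruction": (
            "Classify the transcription error in this gap and propose a correction.\n"
            f"Gap text: {annotation['original_text']}"
        ),
        "response": json.dumps({
            "category": annotation["annotation_type"],
            "corrected_text": annotation["corrected_text"],
            "reasoning": annotation["reasoning"],
        }),
    }


# Write one pair per line; keep only high-confidence annotations
with open("cache/finetune_pairs.jsonl", "w") as out:
    for a in annotations:  # list from the loader sketch above
        if a.get("confidence", 0) >= 4:
            out.write(json.dumps(to_instruction_pair(a)) + "\n")
```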
Pros of fine-tuning:
- Faster inference (no external API calls)
- Lower long-term costs
- Complete control over model
- Can run offline
Cons of fine-tuning:
- Requires significant data (100+ examples)
- Initial setup time
- Need to manage model hosting
- May not match GPT-4 quality initially
Recommendation: Start with few-shot learning (what we've built), then consider fine-tuning once you have 200+ annotations.
✅ Do:
- Be specific in reasoning ("The word 'out' sounds like 'now' but context confirms 'now'")
- Reference what you heard in the audio
- Mention which reference source helped
- Use confidence 4-5 only when certain
❌ Don't:
- Use generic reasoning ("it was wrong")
- Annotate if you're guessing
- Skip annotations to save time (reduces data quality)
- Aim for diversity: Different artists, genres, decades
- Prioritize quality over quantity: Better to have 50 excellent annotations than 200 rushed ones
- Regular reviews: Process songs weekly to maintain consistent annotation quality
- Backup the annotations file: `cache/correction_annotations.jsonl` is precious data
- Version control: Consider committing `examples.yaml` to track improvements
- Monitor logs: Check Langfuse for AI performance trends
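A timestamped backup can be as simple as this sketch (the `backups/` destination is just an example):

```python
import shutil
from datetime import datetime
from pathlib import Path

# Copy the annotations file with a date suffix to an example backup folder
Path("backups").mkdir(exist_ok=True)
stamp = datetime.now().strftime("%Y%m%d")
shutil.copy("cache/correction_annotations.jsonl",
            f"backups/correction_annotations_{stamp}.jsonl")
```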
`POST /api/v1/annotations`
- Save a correction annotation
- Body: `CorrectionAnnotation` object (without ID/timestamp)
- Returns: `{"status": "success", "annotation_id": "..."}`

`GET /api/v1/annotations/{audio_hash}`
- Get all annotations for a specific song
- Returns: `{"audio_hash": "...", "count": N, "annotations": [...]}`

`GET /api/v1/annotations/stats`
- Get aggregated statistics
- Returns: `AnnotationStatistics` object
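As a quick smoke test, the stats endpoint can be called from Python; the base URL here is an assumption, adjust it to wherever the review server runs:

```python
import requests

BASE = "http://localhost:8000"  # assumed host/port for the review server

resp = requests.get(f"{BASE}/api/v1/annotations/stats")
resp.raise_for_status()
print(resp.json())  # aggregated annotation statistics
```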
From the frontend, the same endpoints are wrapped by `apiClient`:
```typescript
// Submit annotations after review
await apiClient.submitAnnotations(annotations)

// Get statistics for dashboard
const stats = await apiClient.getAnnotationStats()
```

If annotations aren't being saved, check:
- Is the modal appearing after edits?
- Are you in read-only mode? (Annotations are disabled in read-only mode)
- Check the browser console for errors
- Verify `cache/correction_annotations.jsonl` exists and is writable
If the annotation modal doesn't appear, check:
- Are annotations enabled? (localStorage key: `annotationsEnabled`)
- Did you actually change the text? (The modal only shows if original ≠ corrected)
- Are you using live API mode (not file-only)?
If the analysis script finds no annotations, check:
- Does `cache/correction_annotations.jsonl` exist?
- Is the file valid JSONL? (one JSON object per line)
- Run with verbose logging: `python scripts/analyze_annotations.py --cache-dir cache`
If few-shot generation produces no examples, check:
- Do you have any high-confidence annotations? (confidence >= 4)
- Try lowering the threshold: `--min-confidence 3.0`
- Check the YAML syntax in the generated file
Key code files:
- `lyrics_transcriber/correction/feedback/schemas.py` - Annotation data models
- `lyrics_transcriber/correction/feedback/store.py` - Storage backend
- `lyrics_transcriber/frontend/src/components/CorrectionAnnotationModal.tsx` - UI modal
- `lyrics_transcriber/correction/agentic/prompts/classifier.py` - Classification prompt builder

Data files:
- `cache/correction_annotations.jsonl` - All collected annotations (back this up!)
- `lyrics_transcriber/correction/agentic/prompts/examples.yaml` - Few-shot examples for the classifier
- `cache/training_data.jsonl` - Exported high-confidence data for fine-tuning

Scripts:
- `scripts/analyze_annotations.py` - Generate the analysis report
- `scripts/generate_few_shot_examples.py` - Update classifier examples

Reports:
- `CORRECTION_ANALYSIS.md` - Generated by the analysis script
- `AGENTIC_IMPLEMENTATION_STATUS.md` - Implementation status and architecture
```bash
# Process 5 songs with annotation collection
for song in song1.mp3 song2.mp3 song3.mp3 song4.mp3 song5.mp3; do
  python -m lyrics_transcriber.cli.cli_main "$song" --artist "..." --title "..."
  # Review in UI, make corrections, fill in annotations
done

# Check what you've collected
python scripts/analyze_annotations.py
```
Expected: 20-50 annotations
```bash
# Generate analysis report
python scripts/analyze_annotations.py

# Review CORRECTION_ANALYSIS.md
# Identify: most common errors, AI agreement rate

# Check if ready for few-shot generation
python scripts/analyze_annotations.py
# Look for "Training Data Available" message

# Generate few-shot examples
python scripts/generate_few_shot_examples.py

# Verify examples.yaml was created
cat lyrics_transcriber/correction/agentic/prompts/examples.yaml

# Process new songs (classifier now uses your examples)
python -m lyrics_transcriber.cli.cli_main new-song.mp3 ...

# Compare agreement rates
python scripts/analyze_annotations.py
# Check if agreement rate increased
```
If you have 100-200+ high-confidence annotations:
```python
from lyrics_transcriber.correction.feedback.store import FeedbackStore

store = FeedbackStore("cache")
training_file = store.export_to_training_data()
print(f"Training data ready: {training_file}")
# Use this file for fine-tuning a small LLM
```
You can manually edit `examples.yaml` to:
- Add hand-crafted examples
- Remove low-quality examples
- Adjust example ordering (first examples are most influential)
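A sketch of doing that programmatically instead of by hand (schema as shown earlier; the confidence cutoff is illustrative):

```python
import yaml

path = "lyrics_transcriber/correction/agentic/prompts/examples.yaml"
with open(path) as f:
    data = yaml.safe_load(f)

for category, items in data["examples_by_category"].items():
    # Drop weaker examples and put the most confident ones first,
    # since the first examples are the most influential
    kept = [e for e in items if e.get("confidence", 0) >= 4]
    data["examples_by_category"][category] = sorted(
        kept, key=lambda e: e["confidence"], reverse=True
    )

with open(path, "w") as f:
    yaml.safe_dump(data, f, sort_keys=False)
```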
To A/B test two sets of examples:
- Save the current `examples.yaml` as `examples_v1.yaml`
- Generate a new version with different parameters
- Process the same song with both versions
- Compare the results in Langfuse
- Keep the better version
Track time saved:
- Before AI: Average time to manually correct a song
- After AI: Average time with AI assistance
- Time saved = (Before - After) × Songs processed
Typical results:
- Without AI: 10-15 min/song (manual correction)
- With AI (50% accuracy): 5-8 min/song
- With AI (70% accuracy): 3-5 min/song
- With AI (90% accuracy): 1-2 min/song (just verification)
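Plugging in the midpoints above for an illustrative 50-song batch:

```python
before = 12.5  # min/song, fully manual (midpoint of 10-15)
after = 4.0    # min/song at ~70% AI accuracy (midpoint of 3-5)
songs = 50     # illustrative batch size

saved = (before - after) * songs
print(f"Time saved: {saved:.0f} minutes (~{saved / 60:.1f} hours)")
# -> Time saved: 425 minutes (~7.1 hours)
```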
**Q: How many annotations do I need before it's useful?**
A: You'll start seeing improvement with 20-30 high-quality annotations. Real gains come at 50-100+.

**Q: Should I annotate every single correction?**
A: Yes, if possible. But if you're in a hurry, prioritize:
- Cases where you disagree with the AI
- Complex/interesting error patterns
- High-confidence corrections (4-5 rating)

**Q: What if the AI gets worse after updating examples?**
A: Revert to the previous `examples.yaml`, review the new annotations for quality issues, or increase the `--min-confidence` threshold.

**Q: Can I disable annotation collection?**
A: Yes, toggle it in the UI (localStorage key: `annotationsEnabled`) or set it to false in code.

**Q: How do I backup my annotations?**
A: Copy `cache/correction_annotations.jsonl` to a safe location. Consider version control.

**Q: What if the annotation file gets corrupted?**
A: Each line is independent (JSONL format). Delete the corrupted lines and re-run the analysis.
After implementing this feedback loop:
- **Short term (Weeks 1-4):**
  - Collect diverse annotations
  - Run analysis weekly
  - Update few-shot examples monthly
- **Medium term (Months 2-6):**
  - Achieve a 70%+ AI agreement rate
  - Reduce manual review time by 50%
  - Build a dataset of 100-200 annotations
- **Long term (Months 6-12):**
  - Consider fine-tuning a custom model
  - Implement RLHF for preference learning
  - Achieve an 85%+ AI agreement rate
  - Reduce manual review time by 80%
If you discover patterns or improvements:
- Document in your annotations
- Share insights in `CORRECTION_ANALYSIS.md`
- Consider contributing successful prompts/examples back to the project
For issues or questions:
- Check `AGENTIC_IMPLEMENTATION_STATUS.md` for known issues
- Review `QUICK_START_AGENTIC.md` for testing guidance
- Check Langfuse traces for AI behavior insights