The Agentic Correction System includes a comprehensive human feedback loop that enables continuous improvement through learning from manual corrections. This guide explains how to use the system, analyze collected data, and improve AI accuracy over time.
- Making Corrections in the UI
- Annotation Collection Process
- Analyzing Collected Data
- Improving the Classifier
- Evaluating Improvement
- Future: Fine-Tuning & RLHF
```bash
python -m lyrics_transcriber.cli.cli_main your-song.mp3 \
  --artist "Artist Name" --title "Song Title"
```
This will:
- Transcribe the audio
- Fetch reference lyrics
- Identify anchor sequences and gaps
- Run agentic AI correction (if `USE_AGENTIC_AI=1`)
- Launch the review UI in your browser
When you make edits in the UI:
- Edit a word's text
- Delete a word
- Merge/split words
- Adjust timing
An annotation modal will appear (if annotations are enabled).
The modal asks for:
- **Correction Type** (dropdown):
  - Sound-Alike Error
  - Background Vocals
  - Extra Filler Words
  - Punctuation/Style Only
  - Repeated Section
  - Complex Multi-Error
  - Ambiguous
  - No Error
  - Manual Edit
- **Confidence** (1-5 slider):
  - 1: Very Uncertain
  - 2: Somewhat Uncertain
  - 3: Neutral
  - 4: Fairly Confident
  - 5: Very Confident
- **Reasoning** (text area, minimum 10 characters):
  - Explain WHY this correction is needed
  - Reference what you heard in the audio
  - Mention if the reference lyrics helped
- **Save & Continue**: Stores the annotation for analysis
- **Skip**: Applies the correction without an annotation (not recommended)
When you click "Finish Review":
- All corrections are applied
- All annotations are submitted to the backend
- Data is saved to `cache/correction_annotations.jsonl`
Annotations are stored in JSONL (JSON Lines) format:
- File: `cache/correction_annotations.jsonl`
- One annotation per line
- Easy to append, version-control friendly
- Can be parsed line-by-line for large datasets (see the loader sketch below)
```json
{
  "annotation_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio_hash": "abc123",
  "gap_id": "gap_1",
  "annotation_type": "SOUND_ALIKE",
  "action_taken": "REPLACE",
  "original_text": "out I'm starting over",
  "corrected_text": "now I'm starting over",
  "confidence": 5.0,
  "reasoning": "The word 'out' sounds like 'now' but the reference lyrics and context make it clear it should be 'now'",
  "word_ids_affected": ["word_123"],
  "agentic_proposal": {"action": "ReplaceWord", "replacement_text": "now"},
  "agentic_category": "SOUND_ALIKE",
  "agentic_agreed": true,
  "reference_sources_consulted": ["genius", "spotify"],
  "artist": "Rancid",
  "title": "Time Bomb",
  "session_id": "lyrics-correction-abc123",
  "timestamp": "2025-10-27T12:00:00"
}
```
For each correction, we collect:
- What changed: Original → Corrected text
- Why it changed: Human reasoning
- How confident: 1-5 scale
- What AI suggested: For comparison
- Agreement: Did human agree with AI?
- Context: Song, artist, reference sources used
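Because the store is plain JSONL, it is easy to load for ad-hoc analysis. A minimal loader sketch (the path and field names follow the schema above; corrupt lines are skipped, which matches the recovery advice in the FAQ below):

```python
import json
from pathlib import Path


def load_annotations(path="cache/correction_annotations.jsonl"):
    """Read annotations line by line, skipping any corrupt lines."""
    annotations = []
    for line_no, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue
        try:
            annotations.append(json.loads(line))
        except json.JSONDecodeError:
            print(f"Skipping corrupt line {line_no}")
    return annotations


annotations = load_annotations()
print(f"Loaded {len(annotations)} annotations")
```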
```bash
python scripts/analyze_annotations.py
```
Options:
- `--cache-dir cache`: Where to find annotations
- `--output CORRECTION_ANALYSIS.md`: Where to save the report
The report includes:
- **Overview Statistics**
  - Total annotations collected
  - Number of unique songs/artists
  - Date range
  - Average confidence
  - High-confidence percentage
- **Breakdown by Type**
  - How many of each error category
  - Percentage distribution
- **Actions Taken**
  - REPLACE, DELETE, NO_ACTION, etc.
  - Which actions are most common
- **AI Performance**
  - Overall agreement rate
  - Agreement rate by category
  - Which categories the AI handles well or poorly
- **Common Error Patterns**
  - Top 20 most frequent corrections
  - "word A → word B" patterns
  - Examples from real songs
- **Frequently Misheard Words**
  - Sound-alike errors that occur multiple times
  - e.g., "out" → "now", "said" → "set"
- **Reference Source Usage**
  - Which sources are consulted most often
  - Helps identify the most reliable sources
- **Recommendations**
  - Categories needing improvement
  - When to regenerate few-shot examples
  - When you have enough data for fine-tuning
A sample of the generated report:

```markdown
## Most Common Error Patterns

### 1. `out → now` (15 occurrences)
- **Type:** SOUND_ALIKE
- **Average Confidence:** 4.8/5.0
- **Examples:**
  - Rancid - Time Bomb: "Classic homophone error..."
  - ...

## Agentic AI Performance
- **Overall Agreement Rate:** 65.3%

### Agreement by Category
- **SOUND_ALIKE:** 78.5% (23 samples)
- **BACKGROUND_VOCALS:** 92.1% (12 samples)
- **EXTRA_WORDS:** 45.2% (8 samples) ⚠️ Needs improvement
```

Minimum recommended:
- 20+ high-confidence annotations (confidence >= 4)
- At least 3-5 examples per category
- Multiple different songs/artists
Check if ready:
```bash
python scripts/analyze_annotations.py
```
Look for the "Training Data Available" section in the report.
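Or verify the thresholds yourself; a quick sketch reusing the `annotations` list from the loader above (field names per the annotation schema):

```python
from collections import Counter

# Count high-confidence annotations (rated 4 or 5) per category
high_conf = [a for a in annotations if a.get("confidence", 0) >= 4]
per_category = Counter(a.get("annotation_type", "UNKNOWN") for a in high_conf)

print(f"High-confidence annotations: {len(high_conf)} (target: 20+)")
for category, count in per_category.most_common():
    status = "ok" if count >= 3 else "needs more"
    print(f"  {category}: {count} examples ({status})")
```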
```bash
python scripts/generate_few_shot_examples.py
```
Options:
- `--min-confidence 4.0`: Only use annotations rated 4 or 5
- `--max-per-category 5`: How many examples per category
- `--output path/to/examples.yaml`: Custom output location

Output: `lyrics_transcriber/correction/agentic/prompts/examples.yaml`
Review the generated `examples.yaml`:
```yaml
metadata:
  generated_at: cache/correction_annotations.jsonl
  total_examples: 25
  categories: [sound_alike, background_vocals, extra_words, ...]
examples_by_category:
  sound_alike:
    - gap_text: "out I'm starting over"
      corrected_text: "now I'm starting over"
      action: REPLACE
      reasoning: "..."
      confidence: 5.0
      artist: "Rancid"
      title: "Time Bomb"
      agentic_agreed: true
```
The classifier automatically checks for `examples.yaml` on startup:
```python
from pathlib import Path
from typing import Dict, List

import yaml


def load_few_shot_examples() -> Dict[str, List[Dict]]:
    examples_path = Path(__file__).parent / "examples.yaml"
    if not examples_path.exists():
        return get_hardcoded_examples()  # Fall back to built-in defaults
    # Load curated examples from the generated file
    with open(examples_path, "r") as f:
        data = yaml.safe_load(f)
    return data.get("examples_by_category", {})
```
No code changes needed - just regenerate the examples file!
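To sanity-check a freshly generated file before relying on it, a quick inspection sketch (assuming the schema shown above):

```python
import yaml

with open("lyrics_transcriber/correction/agentic/prompts/examples.yaml") as f:
    data = yaml.safe_load(f)

# Print the headline counts so you can spot empty or lopsided categories
print("Total examples:", data["metadata"]["total_examples"])
for category, items in data["examples_by_category"].items():
    print(f"  {category}: {len(items)} examples")
```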
```bash
USE_AGENTIC_AI=1 python -m lyrics_transcriber.cli.cli_main test-song.mp3 \
  --artist "Test Artist" --title "Test Song"
```
Monitor:
- Classification accuracy (check Langfuse traces)
- Agreement rate in next batch of annotations
- Corrections that are automatically applied
Track these over time as you collect more data:
- **AI Agreement Rate**
  - Target: > 70% overall
  - Track per category (some will be harder than others)
- **Classification Accuracy**
  - What % of gaps are correctly categorized
  - Measured by human verification
- **Auto-Correction Rate**
  - What % of gaps are automatically corrected (vs. flagged)
  - Should increase over time
- **High-Confidence Annotations**
  - What % of human corrections are rated 4-5
  - Higher = clearer patterns
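The agreement metric can be computed straight from the stored annotations; a sketch using the `annotation_type` and `agentic_agreed` fields from the schema above, with `annotations` coming from the loader sketch earlier:

```python
from collections import defaultdict

# Tally AI agreement per category using the agentic_agreed field
agreed = defaultdict(int)
total = defaultdict(int)
for a in annotations:
    category = a.get("annotation_type", "UNKNOWN")
    total[category] += 1
    if a.get("agentic_agreed"):
        agreed[category] += 1

for category in sorted(total):
    rate = 100.0 * agreed[category] / total[category]
    flag = "" if rate >= 70 else "  <- below 70% target"
    print(f"{category}: {rate:.1f}% of {total[category]} samples{flag}")
```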
1. Process songs with agentic AI
2. Human reviews and corrects in UI
3. Annotations collected
4. Analyze annotations (identify patterns)
5. Regenerate few-shot examples
6. Classifier improves
7. Repeat with next batch of songs
- Weekly: Run analysis script to monitor progress
- Monthly: Regenerate few-shot examples if you have 20+ new high-confidence annotations
- Quarterly: Review agreement rates by category, adjust prompts if needed
If a category's agreement rate is low:
- Review the annotations for that category
- Check if the prompt examples are misleading
- Consider adding more specific guidelines

If the overall agreement rate drops:
- Might indicate more complex songs
- Or the classifier is making less obvious mistakes

If the same error pattern keeps recurring:
- Check if the few-shot examples cover this pattern
- You may need to add explicit handling
Once you have 100-200+ high-confidence annotations:
- **Export Training Data** (see the data-formatting sketch after the tools list below):

  ```python
  from lyrics_transcriber.correction.feedback.store import FeedbackStore

  store = FeedbackStore("cache")
  training_file = store.export_to_training_data()
  print(f"Training data: {training_file}")
  ```
- **Fine-Tune a Small Model**
  - Use Llama 3.1-8B or a similar open model
  - Fine-tune on the classification task
  - Much faster and cheaper than GPT-4 for inference
  - Can run locally via Ollama
- **Reinforcement Learning from Human Feedback (RLHF)**
  - Collect preference rankings (A vs B comparisons)
  - Fine-tune the model to align with human preferences
  - More advanced but very powerful
Useful tools:
- Hugging Face Transformers: Standard fine-tuning pipeline
- Axolotl: Easy fine-tuning for open models
- Langfuse: Can track model versions and performance
- PEFT/LoRA: Parameter-efficient fine-tuning (faster, cheaper)
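If you go the fine-tuning route, the exported data must be reshaped into whatever format your trainer expects. A hypothetical sketch producing generic instruction/response pairs from the annotation fields shown earlier (the exact target format depends on your fine-tuning toolkit, and the output path is illustrative):

```python
import json


def to_instruction_pair(annotation):
    """Map one annotation onto a generic instruction/response pair."""
    return {
        "instruction": (
            "Classify the transcription error in this gap and propose a correction.\n"
            f"Gap text: {annotation['original_text']}"
        ),
        "response": json.dumps({
            "category": annotation["annotation_type"],
            "corrected_text": annotation["corrected_text"],
            "reasoning": annotation["reasoning"],
        }),
    }


# Write one pair per line; keep only high-confidence annotations
with open("cache/finetune_pairs.jsonl", "w") as out:
    for a in annotations:  # list from the loader sketch above
        if a.get("confidence", 0) >= 4:
            out.write(json.dumps(to_instruction_pair(a)) + "\n")
```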
Pros of fine-tuning:
- Faster inference (no external API calls)
- Lower long-term costs
- Complete control over model
- Can run offline
Cons of fine-tuning:
- Requires significant data (100+ examples)
- Initial setup time
- Need to manage model hosting
- May not match GPT-4 quality initially
Recommendation: Start with few-shot learning (what we've built), then consider fine-tuning once you have 200+ annotations.
✅ Do:
- Be specific in reasoning ("The word 'out' sounds like 'now' but context confirms 'now'")
- Reference what you heard in the audio
- Mention which reference source helped
- Use confidence 4-5 only when certain
❌ Don't:
- Use generic reasoning ("it was wrong")
- Annotate if you're guessing
- Skip annotations to save time (reduces data quality)
- Aim for diversity: Different artists, genres, decades
- Prioritize quality over quantity: Better to have 50 excellent annotations than 200 rushed ones
- Regular reviews: Process songs weekly to maintain consistent annotation quality
- Backup the annotations file: `cache/correction_annotations.jsonl` is precious data
- Version control: Consider committing `examples.yaml` to track improvements
- Monitor logs: Check Langfuse for AI performance trends
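A timestamped backup can be as simple as this sketch (the `backups/` destination is just an example):

```python
import shutil
from datetime import datetime
from pathlib import Path

# Copy the annotations file with a date suffix to an example backup folder
Path("backups").mkdir(exist_ok=True)
stamp = datetime.now().strftime("%Y%m%d")
shutil.copy("cache/correction_annotations.jsonl",
            f"backups/correction_annotations_{stamp}.jsonl")
```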
`POST /api/v1/annotations`
- Save a correction annotation
- Body: `CorrectionAnnotation` object (without ID/timestamp)
- Returns: `{"status": "success", "annotation_id": "..."}`

`GET /api/v1/annotations/{audio_hash}`
- Get all annotations for a specific song
- Returns: `{"audio_hash": "...", "count": N, "annotations": [...]}`

`GET /api/v1/annotations/stats`
- Get aggregated statistics
- Returns: `AnnotationStatistics` object
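As a quick smoke test, the stats endpoint can be called from Python; the base URL here is an assumption, adjust it to wherever the review server runs:

```python
import requests

BASE = "http://localhost:8000"  # assumed host/port for the review server

resp = requests.get(f"{BASE}/api/v1/annotations/stats")
resp.raise_for_status()
print(resp.json())  # aggregated annotation statistics
```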
From the frontend, the same endpoints are wrapped by `apiClient`:
```typescript
// Submit annotations after review
await apiClient.submitAnnotations(annotations)

// Get statistics for dashboard
const stats = await apiClient.getAnnotationStats()
```

If annotations aren't being saved, check:
- Is the modal appearing after edits?
- Are you in read-only mode? (Annotations are disabled in read-only mode)
- Check the browser console for errors
- Verify `cache/correction_annotations.jsonl` exists and is writable
If the annotation modal doesn't appear, check:
- Are annotations enabled? (localStorage key: `annotationsEnabled`)
- Did you actually change the text? (The modal only shows if original ≠ corrected)
- Are you using live API mode (not file-only)?
If the analysis script finds no annotations, check:
- Does `cache/correction_annotations.jsonl` exist?
- Is the file valid JSONL? (one JSON object per line)
- Run with verbose logging: `python scripts/analyze_annotations.py --cache-dir cache`
If few-shot generation produces no examples, check:
- Do you have any high-confidence annotations? (confidence >= 4)
- Try lowering the threshold: `--min-confidence 3.0`
- Check the YAML syntax in the generated file
Key code files:
- `lyrics_transcriber/correction/feedback/schemas.py` - Annotation data models
- `lyrics_transcriber/correction/feedback/store.py` - Storage backend
- `lyrics_transcriber/frontend/src/components/CorrectionAnnotationModal.tsx` - UI modal
- `lyrics_transcriber/correction/agentic/prompts/classifier.py` - Classification prompt builder

Data files:
- `cache/correction_annotations.jsonl` - All collected annotations (back this up!)
- `lyrics_transcriber/correction/agentic/prompts/examples.yaml` - Few-shot examples for the classifier
- `cache/training_data.jsonl` - Exported high-confidence data for fine-tuning

Scripts:
- `scripts/analyze_annotations.py` - Generate the analysis report
- `scripts/generate_few_shot_examples.py` - Update classifier examples

Reports:
- `CORRECTION_ANALYSIS.md` - Generated by the analysis script
- `AGENTIC_IMPLEMENTATION_STATUS.md` - Implementation status and architecture
```bash
# Process 5 songs with annotation collection
for song in song1.mp3 song2.mp3 song3.mp3 song4.mp3 song5.mp3; do
  python -m lyrics_transcriber.cli.cli_main "$song" --artist "..." --title "..."
  # Review in UI, make corrections, fill in annotations
done

# Check what you've collected
python scripts/analyze_annotations.py
```
Expected: 20-50 annotations
```bash
# Generate analysis report
python scripts/analyze_annotations.py

# Review CORRECTION_ANALYSIS.md
# Identify: most common errors, AI agreement rate

# Check if ready for few-shot generation
python scripts/analyze_annotations.py
# Look for "Training Data Available" message

# Generate few-shot examples
python scripts/generate_few_shot_examples.py

# Verify examples.yaml was created
cat lyrics_transcriber/correction/agentic/prompts/examples.yaml

# Process new songs (classifier now uses your examples)
python -m lyrics_transcriber.cli.cli_main new-song.mp3 ...

# Compare agreement rates
python scripts/analyze_annotations.py
# Check if agreement rate increased
```
If you have 100-200+ high-confidence annotations:
```python
from lyrics_transcriber.correction.feedback.store import FeedbackStore

store = FeedbackStore("cache")
training_file = store.export_to_training_data()
print(f"Training data ready: {training_file}")
# Use this file for fine-tuning a small LLM
```
You can manually edit `examples.yaml` to:
- Add hand-crafted examples
- Remove low-quality examples
- Adjust example ordering (first examples are most influential)
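A sketch of doing that programmatically instead of by hand (schema as shown earlier; the confidence cutoff is illustrative):

```python
import yaml

path = "lyrics_transcriber/correction/agentic/prompts/examples.yaml"
with open(path) as f:
    data = yaml.safe_load(f)

for category, items in data["examples_by_category"].items():
    # Drop weaker examples and put the most confident ones first,
    # since the first examples are the most influential
    kept = [e for e in items if e.get("confidence", 0) >= 4]
    data["examples_by_category"][category] = sorted(
        kept, key=lambda e: e["confidence"], reverse=True
    )

with open(path, "w") as f:
    yaml.safe_dump(data, f, sort_keys=False)
```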
To A/B test two sets of examples:
- Save the current `examples.yaml` as `examples_v1.yaml`
- Generate a new version with different parameters
- Process the same song with both versions
- Compare the results in Langfuse
- Keep the better version
Track time saved:
- Before AI: Average time to manually correct a song
- After AI: Average time with AI assistance
- Time saved = (Before - After) × Songs processed
Typical results:
- Without AI: 10-15 min/song (manual correction)
- With AI (50% accuracy): 5-8 min/song
- With AI (70% accuracy): 3-5 min/song
- With AI (90% accuracy): 1-2 min/song (just verification)
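Plugging in the midpoints above for an illustrative 50-song batch:

```python
before = 12.5  # min/song, fully manual (midpoint of 10-15)
after = 4.0    # min/song at ~70% AI accuracy (midpoint of 3-5)
songs = 50     # illustrative batch size

saved = (before - after) * songs
print(f"Time saved: {saved:.0f} minutes (~{saved / 60:.1f} hours)")
# -> Time saved: 425 minutes (~7.1 hours)
```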
**Q: How many annotations do I need before it's useful?**
A: You'll start seeing improvement with 20-30 high-quality annotations. Real gains come at 50-100+.

**Q: Should I annotate every single correction?**
A: Yes, if possible. But if you're in a hurry, prioritize:
- Cases where you disagree with the AI
- Complex/interesting error patterns
- High-confidence corrections (4-5 rating)

**Q: What if the AI gets worse after updating examples?**
A: Revert to the previous `examples.yaml`, review the new annotations for quality issues, or increase the `--min-confidence` threshold.

**Q: Can I disable annotation collection?**
A: Yes, toggle it in the UI (localStorage key: `annotationsEnabled`) or set it to false in code.

**Q: How do I backup my annotations?**
A: Copy `cache/correction_annotations.jsonl` to a safe location. Consider version control.

**Q: What if the annotation file gets corrupted?**
A: Each line is independent (JSONL format). Delete the corrupted lines and re-run the analysis.
After implementing this feedback loop:
- **Short term (Weeks 1-4):**
  - Collect diverse annotations
  - Run analysis weekly
  - Update few-shot examples monthly
- **Medium term (Months 2-6):**
  - Achieve a 70%+ AI agreement rate
  - Reduce manual review time by 50%
  - Build a dataset of 100-200 annotations
- **Long term (Months 6-12):**
  - Consider fine-tuning a custom model
  - Implement RLHF for preference learning
  - Achieve an 85%+ AI agreement rate
  - Reduce manual review time by 80%
If you discover patterns or improvements:
- Document in your annotations
- Share insights in `CORRECTION_ANALYSIS.md`
- Consider contributing successful prompts/examples back to the project
For issues or questions:
- Check `AGENTIC_IMPLEMENTATION_STATUS.md` for known issues
- Review `QUICK_START_AGENTIC.md` for testing guidance
- Check Langfuse traces for AI behavior insights