The classification-first agentic correction system is now functional with:
- ✅ 8 gap categories automatically detected by the LLM
- ✅ 8 specialized handlers, one per category
- ✅ Two-step workflow: classify gap → route to handler → generate proposals
- ✅ Backend annotation system ready to collect human feedback
```shell
cd /Users/andrew/Projects/karaoke-gen/lyrics_transcriber_local
USE_AGENTIC_AI=1 python -m lyrics_transcriber.cli.cli_main Time-Bomb.flac \
  --artist "Rancid" --title "Time Bomb"
```

For each gap in the transcription:
- Classification Step: the LLM analyzes the gap and classifies it into one of 8 categories:
  - SOUND_ALIKE: homophones like "out" vs "now"
  - BACKGROUND_VOCALS: parenthesized backing vocals
  - EXTRA_WORDS: filler words like "And", "But"
  - PUNCTUATION_ONLY: styling differences only
  - NO_ERROR: matches at least one reference source
  - REPEATED_SECTION: chorus/verse repetitions
  - COMPLEX_MULTI_ERROR: multiple error types
  - AMBIGUOUS: needs human review
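As a minimal sketch, the eight categories above could be modeled as a string-valued enum (the class name `GapCategory` is an assumption; the member names come straight from the list):

```python
from enum import Enum

class GapCategory(str, Enum):
    # Uppercase values match the LLM's output format directly,
    # avoiding enum-validation errors when parsing its response.
    SOUND_ALIKE = "SOUND_ALIKE"
    BACKGROUND_VOCALS = "BACKGROUND_VOCALS"
    EXTRA_WORDS = "EXTRA_WORDS"
    PUNCTUATION_ONLY = "PUNCTUATION_ONLY"
    NO_ERROR = "NO_ERROR"
    REPEATED_SECTION = "REPEATED_SECTION"
    COMPLEX_MULTI_ERROR = "COMPLEX_MULTI_ERROR"
    AMBIGUOUS = "AMBIGUOUS"

# Parsing a raw LLM response string into a category:
category = GapCategory("SOUND_ALIKE")
```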
- Handler Step: the appropriate handler processes the gap:
  - Deterministic handlers (no LLM needed): PunctuationHandler, NoErrorHandler, BackgroundVocalsHandler, ExtraWordsHandler
  - LLM-assisted handlers: SoundAlikeHandler (extracts the replacement from references)
  - Human review handlers: RepeatedSectionHandler, ComplexMultiErrorHandler, AmbiguousHandler
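The routing step can be sketched as a category-to-handler lookup. The stub classes below are stand-ins for the real handlers in `lyrics_transcriber/correction/agentic/handlers/`, and the registry shape is an assumption:

```python
# Stand-ins for the real deterministic handlers (no LLM call needed).
class NoErrorHandler:
    def handle(self, gap):
        return []  # transcription already matches a reference; nothing to do

class PunctuationHandler:
    def handle(self, gap):
        return []  # styling-only differences; no correction needed

# Hypothetical registry: deterministic categories route straight to a
# handler; everything else goes through an LLM-assisted or
# human-review handler (not shown here).
DETERMINISTIC_HANDLERS = {
    "PUNCTUATION_ONLY": PunctuationHandler,
    "NO_ERROR": NoErrorHandler,
}

def route(category: str, gap):
    handler_cls = DETERMINISTIC_HANDLERS.get(category)
    if handler_cls is None:
        raise KeyError(f"category {category} needs an LLM or human review")
    return handler_cls().handle(gap)

proposals = route("NO_ERROR", gap={"text": "time bomb"})
```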
- Proposal Generation: the handler returns correction proposals with:
- Action type (ReplaceWord, DeleteWord, NoAction, Flag)
- Confidence score
- Reasoning
- Metadata (category, artist, title)
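A proposal carrying those four fields could be sketched as a dataclass (the class name `CorrectionProposal` is an assumption; the field names follow the bullet list above, and the sample values come from the log excerpt below):

```python
from dataclasses import dataclass, field

@dataclass
class CorrectionProposal:
    action: str            # "ReplaceWord" | "DeleteWord" | "NoAction" | "Flag"
    confidence: float      # 0.0 - 1.0
    reasoning: str
    metadata: dict = field(default_factory=dict)  # category, artist, title

proposal = CorrectionProposal(
    action="ReplaceWord",
    confidence=0.75,
    reasoning="Sound-alike error: 'out' vs 'now'",
    metadata={"category": "SOUND_ALIKE", "artist": "Rancid", "title": "Time Bomb"},
)
```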
You should see log messages like:
```
🤖 Classified gap gap_1 as SOUND_ALIKE (confidence: 0.95)
🤖 Agent returned 1 proposals
🤖 Adapter returned 1 corrections
🤖 Applying 1 agentic corrections for gap 1
Made correction: 'out' -> 'now' (confidence: 0.75, reason: Sound-alike error...)
```
Issue: Classification was failing with enum validation errors
Solution: Updated the enum values to match the LLM's output format (uppercase "SOUND_ALIKE" instead of lowercase "sound_alike")
Status: ✅ Fixed and tested
If you have Langfuse configured, you can view:
- Each classification LLM call
- Handler processing
- All traces grouped under a session ID: `lyrics-correction-{uuid}`
Check your Langfuse dashboard at: https://cloud.langfuse.com
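Generating that session ID could look like this (a sketch assuming Python's `uuid` module; only the `lyrics-correction-{uuid}` naming comes from the source):

```python
import uuid

# One session ID per run, so every classification and handler trace
# for that run is grouped together in Langfuse.
session_id = f"lyrics-correction-{uuid.uuid4()}"
```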
❌ Frontend UI for human feedback collection
- Annotation modal component
- Edit workflow integration
- Unable to collect human corrections yet
❌ Analysis scripts
- Can't generate reports from annotations
- Can't update few-shot examples automatically
❌ Comprehensive tests
- Unit tests for handlers
- Integration tests for full workflow
- Test the classification workflow with your Time-Bomb.flac file
- Review the corrections it proposes
- Check Langfuse traces to see how LLM classifies each gap
- Provide feedback on classification accuracy
Once you're satisfied with the classification accuracy, the next priority is implementing the frontend annotation modal so you can start collecting human feedback to improve the system over time.
- Check: Model is running (Ollama, OpenAI, etc.)
- Check: API keys are set if using cloud providers
- Check: Langfuse keys if observability needed
- This is normal for:
  - PUNCTUATION_ONLY gaps (no changes needed)
  - NO_ERROR gaps (transcription is correct)
  - Gaps flagged for human review
- Not an error: the system is working as designed
- Each gap requires 1-2 LLM calls (classification + optional handler)
- Consider using faster models (GPT-4-turbo, Claude Instant)
- Local models (Ollama) will be slower but free
Core Logic:
- `lyrics_transcriber/correction/agentic/agent.py` - Main orchestrator
- `lyrics_transcriber/correction/agentic/handlers/` - Category handlers
- `lyrics_transcriber/correction/agentic/prompts/classifier.py` - Classification prompt
Storage:
- `lyrics_transcriber/correction/feedback/store.py` - Annotation storage
- `cache/correction_annotations.jsonl` - Where annotations will be saved
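Appending annotations to that JSONL file could be sketched as below. Only the file path comes from the source; the function name and record fields are assumptions for illustration:

```python
import json
from pathlib import Path

def save_annotation(record: dict, path: str = "cache/correction_annotations.jsonl") -> None:
    """Append one human annotation as a single JSON line."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical annotation recording that a proposed correction was accepted.
save_annotation({"gap_id": "gap_1", "accepted": True, "category": "SOUND_ALIKE"})
```

JSONL keeps each annotation independent, so the file can be appended to safely across runs and streamed line-by-line by analysis scripts later.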
Documentation:
- `AGENTIC_IMPLEMENTATION_STATUS.md` - Full status and architecture
- `.cursor/plans/agentic-correction-system-*.plan.md` - Original plan