
feat(fetchers): enhance YouTubeFetcher with transcript extraction #88

Merged
chaliy merged 1 commit into main from fix/issue-56-youtube-fetcher on Apr 3, 2026

chaliy commented Apr 3, 2026

What

Enhance YouTubeFetcher with transcript/captions extraction via the timedtext API.

Why

Closes #56. Agents encounter YouTube links but can't watch video. Extracting transcripts turns video content into LLM-consumable text. The existing implementation only fetched oEmbed metadata and had no transcript support.

How

  • Added transcript extraction via YouTube timedtext XML API (English captions)
  • Parse timedtext XML segments and join into continuous text
  • Truncate very long transcripts (>15k chars) with indicator
  • Gracefully handle videos without transcripts
  • Added mobile URL support (m.youtube.com)
  • Comprehensive tests: XML parsing, entity decoding, truncation, formatting
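
The parsing and truncation steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes the legacy timedtext layout (`<transcript>` wrapping `<text start dur>` elements), and the function names and truncation marker wording are hypothetical. Only the 15k-character threshold comes from the description.

```python
import html
import xml.etree.ElementTree as ET

MAX_TRANSCRIPT_CHARS = 15_000  # threshold stated in the PR description
TRUNCATION_MARKER = "... [transcript truncated]"  # hypothetical wording


def parse_timedtext(xml_text: str) -> str:
    """Join the <text> segments of a timedtext document into one string.

    Assumes the legacy layout: <transcript><text start=".." dur="..">..</text></transcript>.
    """
    root = ET.fromstring(xml_text)
    segments = []
    for node in root.iter("text"):
        raw = "".join(node.itertext())
        # ElementTree already decodes XML entities once; caption text is
        # often double-escaped (e.g. &amp;#39;), so unescape a second time.
        cleaned = " ".join(html.unescape(raw).split())
        if cleaned:
            segments.append(cleaned)
    return " ".join(segments)


def truncate_transcript(text: str, limit: int = MAX_TRANSCRIPT_CHARS) -> str:
    """Cut text to at most `limit` characters and append a marker if it was cut."""
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + TRUNCATION_MARKER
```

Splitting parsing from truncation keeps each step testable on its own, which matches the test categories listed above (XML parsing, entity decoding, truncation).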

Risk

  • Low
  • The timedtext API is undocumented but widely used; the fetcher falls back gracefully when a transcript is unavailable
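
One way to picture the graceful fallback is a wrapper that turns any fetch failure or empty response into a fixed notice. The sketch below is an assumption about the shape of that logic, not the PR's implementation: `transcript_or_notice` and the injected `fetch` callable are hypothetical, and only the "No transcript available" wording comes from the commit message.

```python
from typing import Callable, Optional

NO_TRANSCRIPT_NOTICE = "No transcript available"  # wording from the commit message


def transcript_or_notice(
    video_id: str,
    fetch: Callable[[str], Optional[str]],
) -> str:
    """Return the transcript text, or a notice when it cannot be obtained.

    `fetch` is a hypothetical callable that returns transcript text, returns
    None or an empty string when there are no captions, or raises on
    network or parse errors.
    """
    try:
        text = fetch(video_id)
    except (OSError, ValueError):
        # Network failures (OSError family) and malformed responses should
        # not fail the whole fetch; metadata can still be returned.
        text = None
    if not text or not text.strip():
        return NO_TRANSCRIPT_NOTICE
    return text
```

Injecting `fetch` keeps the fallback testable without touching the network.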

Checklist

  • Unit tests pass
  • Smoke tests pass
  • Specs are up to date and not in conflict

- Add transcript extraction via YouTube timedtext API
- Parse timedtext XML format into joined transcript text
- Truncate very long transcripts (>15k chars) with indicator
- Show "No transcript available" when captions are unavailable
- Add mobile URL support (m.youtube.com)
- Add comprehensive tests: timedtext parsing, entity decoding,
  transcript truncation, formatting with/without all fields
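
As an illustration of the mobile URL support item, host matching can be sketched like this. It is a simplified assumption, not the fetcher's actual matching logic: `extract_video_id` is a hypothetical helper, only `/watch?v=...` URLs are handled, and every host other than `m.youtube.com` is assumed rather than taken from this PR.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# m.youtube.com is the host this PR adds; the other two are assumed to be
# the hosts that were already supported.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the `v` query parameter for a YouTube watch URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in YOUTUBE_HOSTS or parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None
```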

Closes #56
chaliy merged commit 0960c35 into main Apr 3, 2026
11 checks passed
chaliy deleted the fix/issue-56-youtube-fetcher branch April 3, 2026 03:01

Development

Successfully merging this pull request may close these issues.

feat(fetchers): YouTubeFetcher — video metadata and transcript extraction
