lmaoclost/yt-subtitle-markdown

YouTube Subtitle Markdown

This project is a Node.js pipeline that reads a Markdown file containing a list of YouTube videos and generates a single Markdown file with full subtitle transcriptions, in the original language of each video, whenever available.

It is designed to be:

  • resilient to failures
  • restartable
  • parallelized
  • compatible with modern YouTube subtitle quirks

✨ Features

  • Reads a .md file with YouTube links
  • Downloads subtitles using yt-dlp
  • Automatically selects the best available subtitle language
  • Supports manual subtitles, auto-generated subtitles, and *-orig tracks
  • Cleans VTT files (removes timestamps, tags, and formatting)
  • Writes a clean, readable transcription per video
  • Parallel processing (configurable)
  • Retry mechanism with failure tracking
  • Persistent progress (can resume after crashes or network loss)
  • Detailed logging to file
  • Modular architecture for easy maintenance and testing

📁 Project Structure

.
├── src/
│   ├── config/
│   │   └── constants.js           # All configuration constants
│   ├── utils/
│   │   ├── file.js                # File operations
│   │   ├── logger.js              # Logging system
│   │   └── vtt.js                 # VTT cleaning utilities
│   ├── services/
│   │   ├── markdown.js            # Input MD parsing
│   │   ├── ytdlp.js               # yt-dlp integration
│   │   └── progress.js            # Progress tracking
│   ├── workers/
│   │   └── videoProcessor.js      # Video processing logic
│   └── index.js                   # Main entry point
├── video-list.md                  # Input file (list of videos)
├── video-list-transcription.md    # Output file (generated)
├── progress.json                  # Progress tracking
├── tmp_subs/                      # Temporary subtitle downloads
├── logs/
│   └── app.log                    # Detailed execution logs
├── package.json
└── README.md

🏗️ Architecture

The project follows a modular architecture with clear separation of concerns:

config/

Centralized configuration and constants

  • Paths, timeouts, parallel limits, log levels

utils/

Reusable utility functions

  • logger.js: Multi-level logging system
  • file.js: File and directory operations
  • vtt.js: VTT subtitle cleaning

services/

Business logic and external integrations

  • markdown.js: Parse input Markdown file
  • ytdlp.js: All yt-dlp interactions (download, list subs, etc.)
  • progress.js: Save/load progress tracking

workers/

Processing and orchestration

  • videoProcessor.js: Individual video processing and parallel execution

index.js

Main entry point that orchestrates the entire pipeline


📝 Input Format

The input file must be a Markdown file with links in this format:

[Video Title](https://www.youtube.com/watch?v=VIDEO_ID)

Example:

[How to Build a CLI Tool](https://www.youtube.com/watch?v=abc123)
[Node.js Best Practices](https://www.youtube.com/watch?v=xyz789)
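As a rough illustration of how links in this format could be extracted, here is a minimal sketch. The function name `parseVideoList`, the regular expression, and the returned field names are assumptions made for this example; the project's actual parser lives in src/services/markdown.js and may work differently.

```javascript
// Hypothetical helper, not the project's real parser.
// Matches [Title](https://www.youtube.com/watch?v=VIDEO_ID) links.
const LINK_PATTERN =
  /\[([^\]]+)\]\((https?:\/\/(?:www\.)?youtube\.com\/watch\?v=[\w-]+[^)\s]*)\)/g;

function parseVideoList(markdown) {
  const videos = [];
  for (const match of markdown.matchAll(LINK_PATTERN)) {
    const [, title, url] = match;
    // The video ID is the value of the "v" query parameter.
    const id = new URL(url).searchParams.get('v');
    videos.push({ title: title.trim(), url, id });
  }
  return videos;
}

const sample = [
  '[How to Build a CLI Tool](https://www.youtube.com/watch?v=abc123)',
  '[Node.js Best Practices](https://www.youtube.com/watch?v=xyz789)',
].join('\n');

console.log(parseVideoList(sample));
```

Lines that do not contain a link in this shape are simply ignored, so headings or notes in the input file would not break this sketch.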

📤 Output Format

## Video Title
https://www.youtube.com/watch?v=VIDEO_ID

Full transcription text goes here...

## Another Video Title
https://www.youtube.com/watch?v=ANOTHER_ID

Another full transcription...
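The layout above can be produced with a small formatter. The following is only a sketch: `renderSection`, `renderDocument`, and the `title`, `url`, and `transcript` field names are assumptions for illustration, not the project's actual code.

```javascript
// Hypothetical formatter for one video: heading, URL, blank line, transcript.
function renderSection({ title, url, transcript }) {
  return `## ${title}\n${url}\n\n${transcript.trim()}\n`;
}

// Join the sections with a blank line between them.
function renderDocument(videos) {
  return videos.map(renderSection).join('\n');
}

console.log(
  renderDocument([
    {
      title: 'Video Title',
      url: 'https://www.youtube.com/watch?v=VIDEO_ID',
      transcript: 'Full transcription text goes here...',
    },
  ])
);
```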

🧠 Subtitle Selection Logic

Tracks are tried in this order:

  1. *-orig subtitles (original-language track)
  2. The manual subtitle, when exactly one is available
  3. Auto-generated subtitles (fallback)
  4. Failure, only if no subtitles exist at all

This order prefers the video's original language and uses auto-generated subtitles only as a fallback.
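The priority order above can be sketched as a single function. Everything here is an assumption made for illustration: the name `pickSubtitleTrack`, the input shape (two arrays of language codes, `manual` and `automatic`), and the last-resort rule for videos that have several manual tracks and nothing else, which the list above does not spell out.

```javascript
// Hypothetical selection helper; the real logic may differ.
function pickSubtitleTrack({ manual = [], automatic = [] } = {}) {
  const isOrig = (lang) => lang.endsWith('-orig');

  // 1. Prefer an original-language (*-orig) track, wherever it is listed.
  const origManual = manual.find(isOrig);
  if (origManual) return { lang: origManual, kind: 'manual' };
  const origAuto = automatic.find(isOrig);
  if (origAuto) return { lang: origAuto, kind: 'auto' };

  // 2. A single manual subtitle is an unambiguous choice.
  if (manual.length === 1) return { lang: manual[0], kind: 'manual' };

  // 3. Fall back to auto-generated subtitles.
  if (automatic.length > 0) return { lang: automatic[0], kind: 'auto' };

  // Assumption: with several manual tracks and nothing else, take the first
  // one so that selection fails only when no subtitles exist.
  if (manual.length > 0) return { lang: manual[0], kind: 'manual' };

  // 4. No subtitles at all.
  return null;
}

console.log(pickSubtitleTrack({ manual: ['pt'], automatic: ['pt-orig', 'en'] }));
```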


🚀 Usage

# Using npm scripts
npm start

# Or directly with Node.js
node src/index.js

# Development mode with auto-reload
npm run dev

You can stop and restart at any time. Progress is saved automatically in progress.json.


⚙️ Configuration

Edit src/config/constants.js to customize:

export const SETTINGS = {
  MAX_PARALLEL: 6,      // Number of concurrent downloads
  MAX_RETRIES: 3,       // Retry attempts per video
  TIMEOUT_MS: 30_000,   // Timeout for each operation
};
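To show how these three settings could interact, here is a minimal sketch of a concurrency-limited pool with per-task retries and timeouts. The helper names (`withTimeout`, `withRetries`, `runPool`) are hypothetical, and the sketch assumes MAX_RETRIES counts total attempts; the project's actual implementation in src/workers/videoProcessor.js may differ.

```javascript
const SETTINGS = { MAX_PARALLEL: 6, MAX_RETRIES: 3, TIMEOUT_MS: 30_000 };

// Reject if `promise` does not settle within `ms` milliseconds.
// Note: this does not cancel the underlying work, it only stops waiting.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Call `task` up to `attempts` times and return the first success.
async function withRetries(
  task,
  attempts = SETTINGS.MAX_RETRIES,
  timeoutMs = SETTINGS.TIMEOUT_MS
) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(task(attempt), timeoutMs);
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}

// Run `worker` over `items` with at most `limit` calls in flight.
// Failures are recorded per item instead of aborting the whole run.
async function runPool(items, worker, limit = SETTINGS.MAX_PARALLEL) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: 'done', value: await worker(items[index]) };
      } catch (error) {
        results[index] = { status: 'failed', error };
      }
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane);
  await Promise.all(lanes);
  return results;
}
```

A caller could combine the two, for example `runPool(videos, (video) => withRetries(() => processVideo(video)))`, where `processVideo` stands in for whatever downloads and cleans one video's subtitles.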

📦 Requirements

  • Node.js 18+ (ES modules support)
  • yt-dlp executable (in the project root or on your PATH)
  • Recommended: ffmpeg, plus a Node.js runtime available to yt-dlp

Installing yt-dlp

# Windows
# Download from https://github.com/yt-dlp/yt-dlp/releases

# macOS/Linux
pip install yt-dlp
# or
brew install yt-dlp

🔧 Development

Adding New Features

  1. New utilities → src/utils/
  2. New services → src/services/
  3. New processing logic → src/workers/
  4. Configuration changes → src/config/constants.js

Testing

The modular structure makes it easy to test individual components:

// Example: Testing VTT cleaning
import { cleanVtt } from './src/utils/vtt.js';

const dirty = 'WEBVTT\n\n00:00:01.000 --> 00:00:05.000\nHello world';
const clean = cleanVtt(dirty);
console.log(clean); // "Hello world"
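The implementation of cleanVtt is not shown in this README. As a rough idea of what such cleaning involves (dropping the header, cue timings, and inline tags), here is a hypothetical stand-in named `cleanVttSketch`. Collapsing consecutive duplicate lines is an added assumption, based on the way auto-generated captions tend to repeat text across cues; the project's real function may behave differently.

```javascript
// Hypothetical stand-in for the project's cleanVtt; behavior is assumed.
function cleanVttSketch(vtt) {
  // Cue timing lines look like "00:00:01.000 --> 00:00:05.000 ...".
  const TIMING = /^(?:\d{2}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+/;
  const lines = [];
  let skipBlock = false; // true inside the header or a NOTE/STYLE/REGION block

  for (const raw of vtt.split(/\r?\n/)) {
    // Strip inline tags such as <c>, </c>, and <00:00:01.500>.
    const line = raw.replace(/<[^>]+>/g, '').trim();

    if (line === '') {
      skipBlock = false; // a blank line ends the current block
      continue;
    }
    if (/^(WEBVTT|NOTE|STYLE|REGION)\b/.test(line)) {
      skipBlock = true; // skip the header and metadata blocks entirely
      continue;
    }
    if (skipBlock || TIMING.test(line)) continue;

    // Assumption: drop a line that repeats the previous one.
    if (lines[lines.length - 1] === line) continue;
    lines.push(line);
  }
  return lines.join(' ');
}

console.log(cleanVttSketch('WEBVTT\n\n00:00:01.000 --> 00:00:05.000\nHello world'));
```

This sketch does not decode HTML entities or handle numeric cue identifiers, both of which a more complete cleaner would need to consider.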

📊 Logging

Logs are written to logs/app.log with timestamps and severity levels:

  • DEBUG: Detailed execution info
  • INFO: General progress updates
  • WARN: Non-critical issues
  • ERROR: Failures with retry info
  • FATAL: Critical errors that stop execution
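For illustration, the severity levels above can be treated as an ordered list, which makes filtering by a minimum level straightforward. The exact line format used in logs/app.log is not documented here, so the `[timestamp] [LEVEL] message` layout and the names `shouldLog` and `formatLogLine` are assumptions.

```javascript
// Levels ordered from least to most severe.
const LEVELS = ['DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'];

// True when `level` is at least as severe as `minLevel`.
function shouldLog(level, minLevel = 'INFO') {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(minLevel);
}

// Hypothetical line layout: [ISO timestamp] [LEVEL] message
function formatLogLine(level, message, now = new Date()) {
  if (!LEVELS.includes(level)) {
    throw new RangeError(`Unknown log level: ${level}`);
  }
  return `[${now.toISOString()}] [${level}] ${message}`;
}

if (shouldLog('WARN')) {
  console.log(formatLogLine('WARN', 'No manual subtitles, using auto-generated'));
}
```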

🔄 Progress Tracking

The progress.json file tracks:

  • ✅ "done": Successfully processed
  • ❌ "failed": Failed after all retries

Delete this file to reprocess all videos.
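The exact layout of progress.json is not documented in this README. As a sketch under that caveat, the example below assumes a flat object that maps each video ID to "done" or "failed", and assumes that only "done" videos are skipped on restart. The names `markVideo` and `pendingVideos` are hypothetical.

```javascript
// Return a new progress object with `videoId` recorded as done or failed.
function markVideo(progress, videoId, status) {
  if (status !== 'done' && status !== 'failed') {
    throw new RangeError(`Unknown status: ${status}`);
  }
  return { ...progress, [videoId]: status };
}

// Assumption: only "done" videos are skipped; failed ones are tried again.
function pendingVideos(videos, progress) {
  return videos.filter((video) => progress[video.id] !== 'done');
}

let progress = {};
progress = markVideo(progress, 'abc123', 'done');
progress = markVideo(progress, 'xyz789', 'failed');

// The serialized form is what would be written to progress.json.
console.log(JSON.stringify(progress, null, 2));
console.log(pendingVideos([{ id: 'abc123' }, { id: 'xyz789' }], progress));
```

Saving this object after every video, rather than once at the end, is what would let a run resume after a crash or network loss.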


🐛 Troubleshooting

No subtitles found

  • Check if the video has subtitles enabled
  • Try running yt-dlp --list-subs VIDEO_URL manually

Timeout errors

  • Increase TIMEOUT_MS in src/config/constants.js
  • Check your internet connection

yt-dlp not found

  • Ensure the yt-dlp executable (yt-dlp.exe on Windows) is in the project root or on your PATH
  • Or update PATHS.YT_DLP in src/config/constants.js

📜 License

This project is released under the MIT License. See the LICENSE file for details.


Made with 💗 by Renan Oliveira
