🎬 MP4 to Text Transcription Tool

πŸ“ Description

This Python script converts MP4 video files to text transcriptions. It uses OpenAI's Whisper for high-quality speech recognition and PyAnnote for speaker diarization, allowing you to identify the different speakers in your videos. It is tuned for Italian content but supports the other languages Whisper handles.

✨ Features

  • 🔊 Transcribe MP4 files to text with high accuracy
  • 👥 Speaker diarization to identify different speakers in your video
  • ⏰ Timestamps for each transcription segment
  • 🧠 GPU acceleration for faster processing (if available)
  • 🔧 Command-line interface for batch processing
  • 📂 Automatic output file organization
  • 🧹 Temporary file cleanup

🛠 Prerequisites

Software Requirements

  • Python 3.10+
  • FFmpeg
  • CUDA (optional, for GPU acceleration)
  • HuggingFace API token (required for speaker diarization)
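
Once the Python dependencies are installed, you can confirm whether GPU acceleration will actually be picked up; the tool checks torch.cuda.is_available() internally (see the configuration section below):

python -c "import torch; print(torch.cuda.is_available())"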

🚀 Installation

1. Clone the Repository

git clone https://github.com/Marini97/MP4-Transcription-Tool.git
cd MP4-Transcription-Tool

2. Install FFmpeg

Windows

Download an FFmpeg build from https://ffmpeg.org/download.html and add its bin folder to your PATH, or install it with a package manager, e.g. choco install ffmpeg

macOS

brew install ffmpeg

Linux

sudo apt-get update
sudo apt-get install ffmpeg
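
On any platform, you can verify that FFmpeg is reachable from your PATH before continuing:

ffmpeg -version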

3. Create Virtual Environment (Optional but Recommended)

python -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`

4. Install Python Dependencies

pip install -r requirements.txt

5. Set up HuggingFace Token

Create a .env file in the project directory and add your token:

HUGGINGFACE_TOKEN=your_huggingface_token_here

You can get your token by:

  1. Creating an account at HuggingFace
  2. Going to your profile → Settings → Access Tokens
  3. Creating a new token with at least read access
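
To double-check that the token is readable from your .env file, here is a minimal sketch (assuming the token is loaded with python-dotenv, the usual pattern for .env files; check requirements.txt for the actual package used):

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env from the current directory
token = os.getenv("HUGGINGFACE_TOKEN")
print("Token found" if token else "HUGGINGFACE_TOKEN is missing")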

🖥 Usage

Basic Usage

python mp4-transcription.py

  1. Run the script
  2. When prompted, enter the full path to your MP4 file
  3. Wait for transcription
  4. Find the transcription in the output folder

Command-line Usage

python mp4-transcription.py path/to/video.mp4 --speakers 3 --language en

Available Command-line Options

--output-dir DIR    Directory for output files (default: output)
--model MODEL       Whisper model to use (default: turbo)
--language LANG     Primary language in the video (default: it)
--speakers NUM      Number of speakers for diarization (default: 2)
--cpu               Force CPU usage even if GPU is available
--keep-temp         Keep temporary files after processing
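
The documented usage takes a single file path per run, so batch processing a whole folder is easiest with a small wrapper script. A hypothetical sketch (the videos/ folder name is a placeholder):

import pathlib
import subprocess

# Run the tool once per MP4, reusing the same options for every file
for video in sorted(pathlib.Path("videos").glob("*.mp4")):
    subprocess.run(
        ["python", "mp4-transcription.py", str(video), "--speakers", "2", "--language", "it"],
        check=True,
    )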

Example Input Paths

  • Relative: video.mp4
  • Full Path: C:\Users\YourName\Videos\video.mp4
  • You can also drag and drop files into the terminal window

📦 Output

The tool generates two types of transcription files:

1. Basic Transcription

  • Filename: [video_name]_transcription.txt
  • Contains timestamps and text without speaker identification

2. Transcription with Speaker Diarization

  • Filename: [video_name]_transcription_with_speakers.txt
  • Includes timestamps, speaker identification, and text
  • Format: [00:01:23] Speaker 1: Text of what was said
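
If you want to post-process the diarized transcript (for example, to count words per speaker), the line format above is straightforward to parse. A minimal sketch (the file name is a placeholder):

import re
from collections import Counter

line_re = re.compile(r"\[(\d{2}:\d{2}:\d{2})\] (Speaker \d+): (.*)")
words_per_speaker = Counter()

with open("output/video_transcription_with_speakers.txt", encoding="utf-8") as f:
    for line in f:
        match = line_re.match(line.strip())
        if match:
            timestamp, speaker, text = match.groups()
            words_per_speaker[speaker] += len(text.split())

print(words_per_speaker)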

πŸ” Advanced Configuration

Whisper Models

You can choose from various Whisper models with the --model flag:

  • tiny: Fastest, least accurate
  • base: Fast with reasonable accuracy
  • small: Good balance of speed and accuracy
  • medium: High accuracy, slower processing
  • large: Highest accuracy, slowest processing
  • turbo: OpenAI's optimized large-model variant, much faster with near-large accuracy (default)

Languages

Set the primary language with the --language flag:

  • it: Italian (default)
  • en: English
  • fr: French
  • de: German
  • And many others supported by Whisper
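
For example, to transcribe an English video with a more accurate model:

python mp4-transcription.py path/to/video.mp4 --model medium --language en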

🛠 Customization

The tool can be customized by modifying the config dictionary in the code:

self.config = {
    'output_dir': 'output',
    'temp_dir': 'temp',
    'whisper_model': 'turbo',
    'language': 'it',
    'num_speakers': 2,
    'sample_rate': '44100',
    'channels': '2',
    'use_gpu': torch.cuda.is_available(),
    'cleanup_temp': True
}
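
These values feed into the Whisper and pyannote calls. The repository's exact code may differ, but a condensed sketch of how such a config is typically consumed (the pyannote/speaker-diarization-3.1 checkpoint name and the temp/audio.wav path are assumptions):

import os
import torch
import whisper
from pyannote.audio import Pipeline

config = {'whisper_model': 'turbo', 'language': 'it', 'num_speakers': 2,
          'use_gpu': torch.cuda.is_available()}

# Transcription: Whisper returns timestamped text segments
model = whisper.load_model(config['whisper_model'])
result = model.transcribe("temp/audio.wav", language=config['language'])

# Diarization: pyannote assigns speaker labels to time ranges
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # assumed checkpoint; the repo may pin a different one
    use_auth_token=os.environ["HUGGINGFACE_TOKEN"],
)
if config['use_gpu']:
    pipeline.to(torch.device("cuda"))
diarization = pipeline("temp/audio.wav", num_speakers=config['num_speakers'])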

🔧 Troubleshooting

Speaker Diarization Not Working

  • Check that your .env file contains a valid HUGGINGFACE_TOKEN
  • Make sure you have accepted the user conditions of the pyannote diarization model on its HuggingFace model page (gated models require this in addition to the token)
  • Ensure you have internet access so the model can be downloaded

Poor Transcription Quality

  • Try using a larger Whisper model with --model medium or --model large
  • Ensure your audio has minimal background noise
  • Try adjusting the number of speakers with --speakers option

Error: "Input file not found"

  • Check that the file path is correct
  • If you're dragging and dropping files, the path may include quotes
  • Use absolute paths if relative paths aren't working
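
If you wrap the prompt in your own script, stripping those quotes is a one-liner (hypothetical helper, not necessarily what the script does):

path = input("Enter the path to your MP4 file: ").strip().strip('"').strip("'")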

Slow Processing

  • Enable GPU acceleration if available
  • For large files, use smaller Whisper models like --model small

πŸ“ Notes

  • Speaker diarization works best for clear audio with distinct speakers
  • For multilingual content, set --language to the primary language
  • Using GPU acceleration can significantly improve processing speed

πŸ™ Acknowledgments


Happy Transcribing! 🎧📝
