This Python tool transcribes MP4 video files to text. It uses OpenAI's Whisper for high-quality speech recognition and PyAnnote for speaker diarization, letting you identify different speakers in your videos. It is specialized for Italian content but supports multiple languages.
## Features

- Transcribe MP4 files to text with high accuracy
- Speaker diarization to identify different speakers in your video
- Timestamps for each transcription segment
- GPU acceleration for faster processing (if available)
- Command-line interface for batch processing
- Automatic output file organization
- Temporary file cleanup
## Requirements

- Python 3.10+
- FFmpeg
- CUDA (optional, for GPU acceleration)
- HuggingFace API token (required for speaker diarization)
## Installation

1. Clone the repository:

```bash
git clone https://github.com/Marini97/MP4-Transcription-Tool.git
cd MP4-Transcription-Tool
```

2. Install FFmpeg:

**Windows:**
- Download from the FFmpeg official site
- Add to system PATH

**macOS:**
```bash
brew install ffmpeg
```

**Linux (Debian/Ubuntu):**
```bash
sudo apt-get update
sudo apt-get install ffmpeg
```

3. Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
```

4. Install the dependencies:

```bash
pip install -r requirements.txt
```

5. Create a `.env` file in the project directory and add your token:

```
HUGGINGFACE_TOKEN=your_huggingface_token_here
```
You can get your token by:

- Creating an account at HuggingFace
- Going to your profile → Settings → Access Tokens
- Creating a new token with at least read access
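To sanity-check your setup, you can read the token back out of the `.env` file. The sketch below is a minimal, stdlib-only reader for illustration; the tool itself may load the file differently (for example via `python-dotenv`), and the helper name is hypothetical:

```python
def load_env_token(env_path=".env", key="HUGGINGFACE_TOKEN"):
    """Minimal .env reader: return the value for `key`, or None if absent.

    Skips blank lines and `#` comments; splits on the first `=` only,
    so values containing `=` are preserved.
    """
    with open(env_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    return None
```

If this returns `None`, the token line is missing or misspelled, which is the most common cause of diarization failures.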
## Usage

### Interactive Mode

```bash
python mp4-transcription.py
```

- Run the script
- When prompted, enter the full path to your MP4 file
- Wait for transcription
- Find the transcription in the `output` folder
### Command-Line Mode

```bash
python mp4-transcription.py path/to/video.mp4 --speakers 3 --language en
```

Options:

```
--output-dir DIR   Directory for output files (default: output)
--model MODEL      Whisper model to use (default: turbo)
--language LANG    Primary language in the video (default: it)
--speakers NUM     Number of speakers for diarization (default: 2)
--cpu              Force CPU usage even if GPU is available
--keep-temp        Keep temporary files after processing
```
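For reference, the documented options map onto a standard `argparse` setup roughly like the sketch below. This mirrors the flags and defaults listed above; the actual script's parser may differ in details:

```python
import argparse

def build_parser():
    """Argument parser mirroring the documented command-line options."""
    p = argparse.ArgumentParser(description="Transcribe an MP4 file")
    p.add_argument("input", help="Path to the MP4 file")
    p.add_argument("--output-dir", default="output",
                   help="Directory for output files")
    p.add_argument("--model", default="turbo", help="Whisper model to use")
    p.add_argument("--language", default="it",
                   help="Primary language in the video")
    p.add_argument("--speakers", type=int, default=2,
                   help="Number of speakers for diarization")
    p.add_argument("--cpu", action="store_true",
                   help="Force CPU usage even if GPU is available")
    p.add_argument("--keep-temp", action="store_true",
                   help="Keep temporary files after processing")
    return p

# Example: parse the command-line-mode invocation shown above
args = build_parser().parse_args(
    ["path/to/video.mp4", "--speakers", "3", "--language", "en"]
)
```

Note that `--speakers` is typed as an integer, so invalid values fail fast at parse time rather than mid-transcription.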
Supported file path formats:

- Relative: `video.mp4`
- Full path: `C:\Users\YourName\Videos\video.mp4`
- You can also drag and drop files into the terminal window
## Output

The tool generates two types of transcription files:

1. **Basic transcription**
   - Filename: `[video_name]_transcription.txt`
   - Contains timestamps and text without speaker identification

2. **Transcription with speakers**
   - Filename: `[video_name]_transcription_with_speakers.txt`
   - Includes timestamps, speaker identification, and text
   - Format: `[00:01:23] Speaker 1: Text of what was said`
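If you want to post-process the speaker-labelled output, the documented line format can be parsed with a small regular expression. This is an illustrative helper, not part of the tool:

```python
import re

# Matches lines like "[00:01:23] Speaker 1: Text of what was said"
LINE_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s+(Speaker \d+):\s*(.*)")

def parse_line(line):
    """Return (seconds, speaker, text) for one transcript line, or None."""
    m = LINE_RE.match(line)
    if not m:
        return None
    h, mi, s = (int(x) for x in m.group(1, 2, 3))
    return h * 3600 + mi * 60 + s, m.group(4), m.group(5)
```

This makes it easy to, for example, group segments by speaker or compute per-speaker talk time.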
## Whisper Models

You can choose from various Whisper models with the `--model` flag:

- `tiny`: Fastest, least accurate
- `base`: Fast with reasonable accuracy
- `small`: Good balance of speed and accuracy
- `medium`: High accuracy, slower processing
- `large`: Highest accuracy, slowest processing
- `turbo`: OpenAI's optimized model (default)
## Language Support

Set the primary language with the `--language` flag:

- `it`: Italian (default)
- `en`: English
- `fr`: French
- `de`: German
- And many others supported by Whisper
## Configuration

The tool can be customized by modifying the `config` dictionary in the code:

```python
self.config = {
    'output_dir': 'output',
    'temp_dir': 'temp',
    'whisper_model': 'turbo',
    'language': 'it',
    'num_speakers': 2,
    'sample_rate': '44100',
    'channels': '2',
    'use_gpu': torch.cuda.is_available(),
    'cleanup_temp': True
}
```

## Troubleshooting

**Speaker diarization not working:**
- Check that your `.env` file contains a valid `HUGGINGFACE_TOKEN`
- Ensure you have internet access for API calls

**Poor transcription quality:**
- Try using a larger Whisper model with `--model medium` or `--model large`
- Ensure your audio has minimal background noise
- Try adjusting the number of speakers with the `--speakers` option
**File not found errors:**
- Check that the file path is correct
- If you're dragging and dropping files, the path may include quotes
- Use absolute paths if relative paths aren't working
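A quick way to handle the quoted paths that drag-and-drop produces is to strip the quotes before use. The helper below is a small illustrative sketch (not part of the tool):

```python
from pathlib import Path

def normalize_input_path(raw):
    """Strip surrounding whitespace and quotes from a user-supplied path.

    Terminals often wrap dragged-in paths in single or double quotes,
    which makes the literal string an invalid filename.
    """
    cleaned = raw.strip().strip('"').strip("'")
    return Path(cleaned).expanduser()
```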
**Slow processing:**
- Enable GPU acceleration if available
- For large files, use smaller Whisper models like `--model small`
## Tips

- Speaker diarization works best for clear audio with distinct speakers
- For multilingual content, set `--language` to the primary language
- Using GPU acceleration can significantly improve processing speed
Happy Transcribing!