WhisperX
Fast automatic speech recognition with word-level timestamps and speaker diarization
Alternative To
- Google Speech-to-Text
- Amazon Transcribe
- AssemblyAI
Difficulty Level
Requires some technical experience. Moderate setup complexity.
Overview
WhisperX is an enhanced version of OpenAI’s Whisper that provides fast automatic speech recognition with accurate word-level timestamps and speaker diarization. It achieves 70x realtime transcription speed with the large-v2 model and can identify different speakers in audio recordings. The tool combines OpenAI’s Whisper with WAV2VEC2 alignment and pyannote-audio diarization for a comprehensive audio transcription solution.
System Requirements
- CPU: 4+ cores (8+ recommended for faster processing)
- RAM: 16GB+ recommended
- GPU: NVIDIA GPU with 4GB+ VRAM (8GB+ recommended for large-v2 model)
- Storage: 10GB+ for installation and models
- OS: Linux, Windows, or macOS (tested on Python 3.10 with PyTorch 2.0)
- Dependencies: NVIDIA libraries cuBLAS 11.x and cuDNN 8.x for GPU execution
Installation Guide
Prerequisites
- Python 3.10 and PyTorch 2.0+ (other versions may work but are not officially supported)
- Git installed on your system
- NVIDIA GPU with appropriate drivers, cuBLAS 11.x, and cuDNN 8.x (for GPU acceleration)
- Hugging Face account (for accessing diarization models)
Installation
Clone the repository:
git clone https://github.com/m-bain/whisperX.git
Navigate to the project directory:
cd whisperX
Install the package:
pip install -e .
Alternatively, install directly from GitHub:
pip install git+https://github.com/m-bain/whisperX.git
Set up Hugging Face access for diarization (optional but recommended):
Create a Hugging Face account at huggingface.co
Get your access token from your Hugging Face profile settings
Accept the user conditions for pyannote/speaker-diarization and pyannote/segmentation models
Set the environment variable:
export HF_TOKEN=your_hugging_face_token
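Before moving on, you can quickly confirm that the package imports and that the token you just exported is visible to Python. This is a minimal sanity-check sketch, not part of WhisperX itself:
import os

import whisperx  # raises ImportError if the installation above did not succeed

# The HF_TOKEN exported in the previous step should be visible to this process.
if os.environ.get("HF_TOKEN"):
    print("whisperx imported and HF_TOKEN found.")
else:
    print("Warning: HF_TOKEN is not set; diarization models may not be downloadable.")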
Note: WhisperX is a command-line tool, not a web application. It is run via terminal commands to process audio files rather than through a web interface.
Practical Exercise: Getting Started with WhisperX
Let’s walk through a simple exercise to help you get familiar with WhisperX’s features.
Step 1: Basic Transcription with Word-Level Timestamps
Prepare an audio file for transcription (e.g., sample.mp3)
Run the basic transcription command:
whisperx sample.mp3 --model large-v2
This will create several output files in the same directory:
- .json file with the full transcription and word-level timestamps
- .srt subtitle file
- .txt plain text transcription
- .vtt Web Video Text Tracks file
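If you want to post-process the results programmatically, the JSON output is the most convenient file to read. The sketch below assumes the layout used by WhisperX's JSON writer (a top-level "segments" list whose entries carry a "words" list with "start"/"end" timestamps) and that the file is named sample.json after the input sample.mp3; adjust the keys and path if your version differs:
import json

# Load the JSON produced by the command above.
with open("sample.json", encoding="utf-8") as f:
    data = json.load(f)

# Print each word with its start/end time, skipping gracefully when a word
# has no timestamp (this can happen for numerals or unaligned tokens).
for segment in data.get("segments", []):
    for word in segment.get("words", []):
        start = word.get("start")
        end = word.get("end")
        if start is not None and end is not None:
            print(f"{start:7.2f}-{end:7.2f}  {word['word']}")
        else:
            print(f"  (no timestamp)  {word.get('word', '')}")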
Step 2: Adding Speaker Diarization
Now let’s identify who is speaking in a conversation:
Use the same audio file and add the diarization flag:
whisperx sample.mp3 --model large-v2 --diarize --highlight_words True
This will:
- Transcribe the audio with word-level timestamps
- Identify different speakers, labeling them as “SPEAKER_00”, “SPEAKER_01”, etc.
- Create enhanced output files with speaker labels
If you know the number of speakers in advance, you can specify them:
whisperx sample.mp3 --model large-v2 --diarize --min_speakers 2 --max_speakers 2
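Once diarization has run, a small script can summarize who spoke for how long. This sketch assumes the JSON written by the diarized run attaches a "speaker" label and "start"/"end" times to each segment, matching the labels described above:
import json
from collections import defaultdict

with open("sample.json", encoding="utf-8") as f:
    data = json.load(f)

# Accumulate speaking time per speaker from the segment timestamps.
talk_time = defaultdict(float)
for segment in data.get("segments", []):
    speaker = segment.get("speaker", "UNKNOWN")
    talk_time[speaker] += segment.get("end", 0.0) - segment.get("start", 0.0)

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f} s")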
Step 3: Using Different Languages
WhisperX supports multiple languages with automatic language detection:
For non-English audio, let WhisperX auto-detect the language:
whisperx foreign_sample.mp3 --model large-v2
Or specify the language for better results (e.g., French):
whisperx french_sample.mp3 --model large-v2 --language fr
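The same language detection is available from the Python API: the transcription result reports the detected language code, which you can inspect before deciding whether to re-run with an explicit --language flag. A minimal sketch, assuming a CUDA-capable GPU (switch device and compute_type for CPU):
import whisperx

device = "cuda"            # or "cpu"
compute_type = "float16"   # use "int8" on CPU or low-memory GPUs

model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio("foreign_sample.mp3")
result = model.transcribe(audio, batch_size=16)

# The detected language code (e.g. "fr") is included in the result.
print("Detected language:", result["language"])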
Step 4: Advanced Usage with Python API
For more control, you can use the Python API in your scripts:
import whisperx
import gc
# Device setup
device = "cuda" # or "cpu" for CPU processing
audio_file = "sample.mp3"
batch_size = 16 # Reduce if low on GPU memory
compute_type = "float16" # Change to "int8" if low on GPU memory
# 1. Transcribe with original whisper
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
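# If GPU memory is tight, the transcription model can be released here before
# loading the alignment model (requires `import torch` for empty_cache):
# gc.collect(); torch.cuda.empty_cache(); del model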
# 2. Align whisper output to get word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
# 3. Speaker diarization
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
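# If the number of speakers is known in advance, constrain the search instead:
# diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=2)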
result = whisperx.assign_word_speakers(diarize_segments, result)
# Print the diarized output
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")  # some segments may have no speaker label
    print(f"Speaker {speaker}: {segment['text']}")
# Free up memory
del model, diarize_model
gc.collect()
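To turn the diarized result into a subtitle file, the aligned segments can be written out in SRT format by hand. The snippet below is meant to be appended to the end of the script above (it reuses the result variable) and is a minimal sketch using only the fields produced there, not the project's built-in writers:
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("sample_diarized.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(result["segments"], start=1):
        speaker = segment.get("speaker", "UNKNOWN")
        srt.write(f"{i}\n")
        srt.write(f"{to_srt_time(segment['start'])} --> {to_srt_time(segment['end'])}\n")
        srt.write(f"[{speaker}] {segment['text'].strip()}\n\n")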
Resources
Official Documentation
- GitHub Repository README - Main documentation with usage examples
- WhisperX Academic Paper - Research paper explaining the methodology
- Example Code - Additional examples in different languages
Related Projects and Dependencies
- OpenAI Whisper - The base ASR model
- Faster-Whisper - Optimized Whisper backend
- Pyannote Audio - Speaker diarization models
- WAV2VEC2 Models - For word-level alignment
Community Support
- GitHub Issues - Bug reports and feature requests
- GitHub Discussions - Community discussions and questions
Online Demo
- Replicate Demo - Test WhisperX online without installation
Tutorials and Guides
- Hugging Face Integration Guide - Using WhisperX with Hugging Face
- Gladia: Top Whisper GitHub Projects - Comparison of WhisperX with other Whisper implementations
Suggested Projects
You might also be interested in these similar projects:
CrewAI is a standalone Python framework for orchestrating role-playing, autonomous AI agents that collaborate intelligently to tackle complex tasks through defined roles, tools, and workflows.
Model Context Protocol (MCP) is an open protocol that connects AI models to data sources and tools with a standardized interface.
PydanticAI is a Python agent framework designed to make it less painful to build production-grade applications with Generative AI, featuring strong type safety and validation.