WhisperX
Fast automatic speech recognition with word-level timestamps and speaker diarization
Alternative To
- Google Speech-to-Text
- Amazon Transcribe
- AssemblyAI
Difficulty Level
Requires some technical experience. Moderate setup complexity.
Overview
WhisperX is an enhanced version of OpenAI’s Whisper that provides fast automatic speech recognition with accurate word-level timestamps and speaker diarization. It achieves 70x realtime transcription speed with the large-v2 model and can identify different speakers in audio recordings. The tool combines OpenAI’s Whisper with WAV2VEC2 alignment and pyannote-audio diarization for a comprehensive audio transcription solution.
System Requirements
- CPU: 4+ cores (8+ recommended for faster processing)
- RAM: 16GB+ recommended
- GPU: NVIDIA GPU with 4GB+ VRAM (8GB+ recommended for large-v2 model)
- Storage: 10GB+ for installation and models
- OS: Linux, Windows, or macOS (tested on Python 3.10 with PyTorch 2.0)
- Dependencies: NVIDIA libraries cuBLAS 11.x and cuDNN 8.x for GPU execution
Installation Guide
Prerequisites
- Python 3.10 and PyTorch 2.0+ (other versions may work but are not officially supported)
- Git installed on your system
- NVIDIA GPU with appropriate drivers, cuBLAS 11.x, and cuDNN 8.x (for GPU acceleration)
- Hugging Face account (for accessing diarization models)
Installation
Clone the repository:
git clone https://github.com/m-bain/whisperX.git
Navigate to the project directory:
cd whisperX
Install the package:
pip install -e .
Alternatively, install directly from GitHub:
pip install git+https://github.com/m-bain/whisperX.git
Set up Hugging Face access for diarization (optional but recommended):
Create a Hugging Face account at huggingface.co
Get your access token from your Hugging Face profile settings
Accept the user conditions for pyannote/speaker-diarization and pyannote/segmentation models
Set the environment variable:
export HF_TOKEN=your_hugging_face_token
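Before moving on, you can quickly confirm that the package imports and that the token you just exported is visible to Python. This is a minimal sanity-check sketch, not part of WhisperX itself:
import os

import whisperx  # raises ImportError if the installation above did not succeed

# The HF_TOKEN exported in the previous step should be visible to this process.
if os.environ.get("HF_TOKEN"):
    print("whisperx imported and HF_TOKEN found.")
else:
    print("Warning: HF_TOKEN is not set; diarization models may not be downloadable.")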
Note: WhisperX is a command-line tool, not a web application. It is run via terminal commands to process audio files rather than through a web interface.
Practical Exercise: Getting Started with WhisperX
Let’s walk through a simple exercise to help you get familiar with WhisperX’s features.
Step 1: Basic Transcription with Word-Level Timestamps
Prepare an audio file for transcription (e.g., sample.mp3)
Run the basic transcription command:
whisperx sample.mp3 --model large-v2
This will create several output files in the same directory:
- .json file with the full transcription and word-level timestamps
- .srt subtitle file
- .txt plain text transcription
- .vtt Web Video Text Tracks file
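If you want to post-process the results programmatically, the JSON output is the most convenient file to read. The sketch below assumes the layout used by WhisperX's JSON writer (a top-level "segments" list whose entries carry a "words" list with "start"/"end" timestamps) and that the file is named sample.json after the input sample.mp3; adjust the keys and path if your version differs:
import json

# Load the JSON produced by the command above.
with open("sample.json", encoding="utf-8") as f:
    data = json.load(f)

# Print each word with its start/end time, skipping gracefully when a word
# has no timestamp (this can happen for numerals or unaligned tokens).
for segment in data.get("segments", []):
    for word in segment.get("words", []):
        start = word.get("start")
        end = word.get("end")
        if start is not None and end is not None:
            print(f"{start:7.2f}-{end:7.2f}  {word['word']}")
        else:
            print(f"  (no timestamp)  {word.get('word', '')}")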
Step 2: Adding Speaker Diarization
Now let’s identify who is speaking in a conversation:
Use the same audio file and add the diarization flag:
whisperx sample.mp3 --model large-v2 --diarize --highlight_words True
This will:
- Transcribe the audio with word-level timestamps
- Identify different speakers, labeling them as “SPEAKER_00”, “SPEAKER_01”, etc.
- Create enhanced output files with speaker labels
If you know the number of speakers in advance, you can specify them:
whisperx sample.mp3 --model large-v2 --diarize --min_speakers 2 --max_speakers 2
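Once diarization has run, a small script can summarize who spoke for how long. This sketch assumes the JSON written by the diarized run attaches a "speaker" label and "start"/"end" times to each segment, matching the labels described above:
import json
from collections import defaultdict

with open("sample.json", encoding="utf-8") as f:
    data = json.load(f)

# Accumulate speaking time per speaker from the segment timestamps.
talk_time = defaultdict(float)
for segment in data.get("segments", []):
    speaker = segment.get("speaker", "UNKNOWN")
    talk_time[speaker] += segment.get("end", 0.0) - segment.get("start", 0.0)

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f} s")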
Step 3: Using Different Languages
WhisperX supports multiple languages with automatic language detection:
For non-English audio, let WhisperX auto-detect the language:
whisperx foreign_sample.mp3 --model large-v2
Or specify the language for better results (e.g., French):
whisperx french_sample.mp3 --model large-v2 --language fr
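The same language detection is available from the Python API: the transcription result reports the detected language code, which you can inspect before deciding whether to re-run with an explicit --language flag. A minimal sketch, assuming a CUDA-capable GPU (switch device and compute_type for CPU):
import whisperx

device = "cuda"            # or "cpu"
compute_type = "float16"   # use "int8" on CPU or low-memory GPUs

model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio("foreign_sample.mp3")
result = model.transcribe(audio, batch_size=16)

# The detected language code (e.g. "fr") is included in the result.
print("Detected language:", result["language"])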
Step 4: Advanced Usage with Python API
For more control, you can use the Python API in your scripts:
import whisperx
import gc
# Device setup
device = "cuda" # or "cpu" for CPU processing
audio_file = "sample.mp3"
batch_size = 16 # Reduce if low on GPU memory
compute_type = "float16" # Change to "int8" if low on GPU memory
# 1. Transcribe with original whisper
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
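# If GPU memory is tight, the transcription model can be released here before
# loading the alignment model (requires `import torch` for empty_cache):
# gc.collect(); torch.cuda.empty_cache(); del model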
# 2. Align whisper output to get word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
# 3. Speaker diarization
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
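# If the number of speakers is known in advance, constrain the search instead:
# diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=2)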
result = whisperx.assign_word_speakers(diarize_segments, result)
# Print the diarized output
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")  # some segments may have no speaker label
    print(f"Speaker {speaker}: {segment['text']}")
# Free up memory
del model, diarize_model
gc.collect()
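To turn the diarized result into a subtitle file, the aligned segments can be written out in SRT format by hand. The snippet below is meant to be appended to the end of the script above (it reuses the result variable) and is a minimal sketch using only the fields produced there, not the project's built-in writers:
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("sample_diarized.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(result["segments"], start=1):
        speaker = segment.get("speaker", "UNKNOWN")
        srt.write(f"{i}\n")
        srt.write(f"{to_srt_time(segment['start'])} --> {to_srt_time(segment['end'])}\n")
        srt.write(f"[{speaker}] {segment['text'].strip()}\n\n")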
Resources
Official Documentation
- GitHub Repository README - Main documentation with usage examples
- WhisperX Academic Paper - Research paper explaining the methodology
- Example Code - Additional examples in different languages
Related Projects and Dependencies
- OpenAI Whisper - The base ASR model
- Faster-Whisper - Optimized Whisper backend
- Pyannote Audio - Speaker diarization models
- WAV2VEC2 Models - For word-level alignment
Community Support
- GitHub Issues - Bug reports and feature requests
- GitHub Discussions - Community discussions and questions
Online Demo
- Replicate Demo - Test WhisperX online without installation
Tutorials and Guides
- Hugging Face Integration Guide - Using WhisperX with Hugging Face
- Gladia: Top Whisper GitHub Projects - Comparison of WhisperX with other Whisper implementations
Suggested Projects
You might also be interested in these similar projects:
CrewAI is a standalone Python framework for orchestrating role-playing, autonomous AI agents that collaborate intelligently to tackle complex tasks through defined roles, tools, and workflows.
Model Context Protocol (MCP) is an open protocol that connects AI models to data sources and tools with a standardized interface.
PydanticAI is a Python agent framework designed to make it less painful to build production-grade applications with Generative AI, featuring strong type safety and validation.