🎙️ WhisperX

Fast automatic speech recognition with word-level timestamps and speaker diarization

Categories: Speech Recognition · Audio Processing · Transcription
Tags: intermediate · open-source · self-hosted · speaker-diarization · timestamps

Alternative To

  • Google Speech-to-Text
  • Amazon Transcribe
  • AssemblyAI

Difficulty Level

Intermediate

Requires some technical experience. Moderate setup complexity.

Overview

WhisperX is an enhanced version of OpenAI’s Whisper that provides fast automatic speech recognition with accurate word-level timestamps and speaker diarization. It achieves up to 70x real-time transcription speed with the large-v2 model and can identify different speakers in audio recordings. The tool combines Whisper transcription with wav2vec 2.0 forced alignment and pyannote-audio speaker diarization into a complete audio transcription pipeline.

System Requirements

  • CPU: 4+ cores (8+ recommended for faster processing)
  • RAM: 16GB+ recommended
  • GPU: NVIDIA GPU with 4GB+ VRAM (8GB+ recommended for large-v2 model)
  • Storage: 10GB+ for installation and models
  • OS: Linux, Windows, or macOS (tested on Python 3.10 with PyTorch 2.0)
  • Dependencies: NVIDIA libraries cuBLAS 11.x and cuDNN 8.x for GPU execution
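
Before installing, it is worth confirming that your Python, PyTorch, and GPU setup meets these requirements. The following is a minimal sketch using PyTorch's standard API; the thresholds in the comments mirror the list above:

    import sys

    import torch

    print(f"Python: {sys.version.split()[0]}")   # 3.10 is the tested version
    print(f"PyTorch: {torch.__version__}")       # 2.0 is the tested version
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        # 4GB+ VRAM required; 8GB+ recommended for the large-v2 model
        print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
    else:
        print("No CUDA GPU detected; WhisperX will run on CPU (much slower).")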

Installation Guide

Prerequisites

  • Python 3.10 and PyTorch 2.0+ (other versions may work but are not officially supported)
  • Git installed on your system
  • NVIDIA GPU with appropriate drivers, cuBLAS 11.x, and cuDNN 8.x (for GPU acceleration)
  • Hugging Face account (for accessing diarization models)

Installation

  1. Clone the repository:

    git clone https://github.com/m-bain/whisperX.git
    
  2. Navigate to the project directory:

    cd whisperX
    
  3. Install the package:

    pip install -e .
    

    Alternatively, install directly from GitHub:

    pip install git+https://github.com/m-bain/whisperX.git
    
  4. Set up Hugging Face access for diarization (optional but recommended):

    • Create a Hugging Face account at huggingface.co

    • Get your access token from your Hugging Face profile settings

    • Accept the user conditions for pyannote/speaker-diarization and pyannote/segmentation models

    • Set the environment variable:

      export HF_TOKEN=your_hugging_face_token
      

Note: WhisperX is a command-line tool, not a web application. It is run via terminal commands to process audio files rather than through a web interface.
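
If you prefer not to hard-code the token in scripts, you can read it from the HF_TOKEN environment variable set above and pass it to the diarization pipeline. This is a minimal sketch; the DiarizationPipeline call mirrors the Python API example later in this guide:

    import os

    import whisperx

    # Fail early if the token from `export HF_TOKEN=...` is missing.
    hf_token = os.environ.get("HF_TOKEN")
    if hf_token is None:
        raise RuntimeError("Set HF_TOKEN before running speaker diarization.")

    diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device="cuda")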

Practical Exercise: Getting Started with WhisperX

Let’s walk through a simple exercise to help you get familiar with WhisperX’s features.

Step 1: Basic Transcription with Word-Level Timestamps

  1. Prepare an audio file for transcription (e.g., sample.mp3)

  2. Run the basic transcription command:

    whisperx sample.mp3 --model large-v2
    
  3. This will create several output files in the current working directory:

    • .json file with full transcription and word-level timestamps
    • .srt subtitle file
    • .txt plain text transcription
    • .vtt Web Video Text Tracks file
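
To inspect the word-level timestamps programmatically, you can load the .json output. The sketch below assumes the layout WhisperX typically writes (a top-level "segments" list whose entries carry a "words" list with "start"/"end" times); check your own output file if the keys differ:

    import json

    # sample.json is produced by the transcription command above.
    with open("sample.json") as f:
        result = json.load(f)

    # Print each word with its start and end timestamps.
    for segment in result["segments"]:
        for word in segment.get("words", []):
            # Words that could not be aligned may lack timestamps.
            if "start" in word:
                print(f"{word['start']:7.2f} - {word['end']:7.2f}  {word['word']}")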

Step 2: Adding Speaker Diarization

Now let’s identify who is speaking in a conversation:

  1. Use the same audio file and add the diarization flag:

    whisperx sample.mp3 --model large-v2 --diarize --highlight_words True
    
  2. This will:

    • Transcribe the audio with word-level timestamps
    • Identify different speakers, labeling them as “SPEAKER_00”, “SPEAKER_01”, etc.
    • Create enhanced output files with speaker labels
  3. If you know the number of speakers in advance, you can specify them:

    whisperx sample.mp3 --model large-v2 --diarize --min_speakers 2 --max_speakers 2
    
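One way to sanity-check the diarization is to total up speaking time per speaker from the .json output. This sketch assumes diarized segments carry a "speaker" key alongside "start" and "end" times, matching the SPEAKER_00/SPEAKER_01 labels described above:

    import json
    from collections import defaultdict

    with open("sample.json") as f:
        result = json.load(f)

    # Accumulate seconds of speech per speaker label.
    talk_time = defaultdict(float)
    for segment in result["segments"]:
        speaker = segment.get("speaker", "UNKNOWN")
        talk_time[speaker] += segment["end"] - segment["start"]

    for speaker, seconds in sorted(talk_time.items()):
        print(f"{speaker}: {seconds:.1f}s")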

Step 3: Using Different Languages

WhisperX supports multiple languages with automatic language detection:

  1. For non-English audio, let WhisperX auto-detect the language:

    whisperx foreign_sample.mp3 --model large-v2
    
  2. Or specify the language for better results (e.g., French):

    whisperx french_sample.mp3 --model large-v2 --language fr
    
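The same language hint is available from the Python API. The sketch below passes language="fr" to transcribe; recent WhisperX versions accept this keyword, but treat it as an assumption and fall back to the --language CLI flag if yours differs:

    import whisperx

    device = "cuda"  # or "cpu"
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    audio = whisperx.load_audio("french_sample.mp3")

    # Naming the language skips auto-detection and can improve accuracy.
    result = model.transcribe(audio, batch_size=16, language="fr")
    print(result["segments"][0]["text"])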

Step 4: Advanced Usage with Python API

For more control, you can use the Python API in your scripts:

import whisperx
import gc

# Device setup
device = "cuda"  # or "cpu" for CPU processing
audio_file = "sample.mp3"
batch_size = 16  # Reduce if low on GPU memory
compute_type = "float16"  # Change to "int8" if low on GPU memory

# 1. Transcribe with original whisper
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align whisper output to get word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

# 3. Speaker diarization
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Print the diarized output; segments without an assigned speaker are labeled UNKNOWN
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"Speaker {speaker}: {segment['text']}")

# Free up memory
del model, model_a, diarize_model
gc.collect()
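
On a memory-constrained GPU, it can also help to free each model as soon as its stage is done rather than only at the end. For example, you could insert the following between steps 2 and 3 of the example above (torch.cuda.empty_cache() is standard PyTorch; whether you need it depends on your setup):

    import gc

    import torch

    # Release the alignment model before loading the diarization pipeline.
    del model_a
    gc.collect()
    torch.cuda.empty_cache()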

Resources

  • Official repository and documentation: https://github.com/m-bain/whisperX
  • Community support: GitHub Issues at https://github.com/m-bain/whisperX/issues

Suggested Projects

You might also be interested in these similar projects:

🤖 CrewAI

CrewAI is a standalone Python framework for orchestrating role-playing, autonomous AI agents that collaborate intelligently to tackle complex tasks through defined roles, tools, and workflows.

Difficulty: Intermediate
Updated: Mar 23, 2025

Model Context Protocol (MCP)

An open protocol that connects AI models to data sources and tools with a standardized interface

Difficulty: Intermediate
Updated: Mar 23, 2025

PydanticAI

PydanticAI is a Python agent framework designed to make it less painful to build production-grade applications with Generative AI, featuring strong type safety and validation.

Difficulty: Intermediate
Updated: Mar 23, 2025