Speech Recognition
πŸŽ™οΈ Whisper

OpenAI's state-of-the-art speech recognition model for transcription and translation

Beginner to Intermediate · open-source · transcription · translation · multilingual

Alternative To

  • Google Speech-to-Text
  • Amazon Transcribe
  • Azure Speech Services

Difficulty Level

Beginner to Intermediate

Suitable for newcomers with basic Python experience; the default setup is straightforward, and the larger models mainly call for more RAM or a GPU.

Overview

Whisper is OpenAI's open-source, state-of-the-art speech recognition model that can transcribe and translate audio in multiple languages. The latest release, large-v3, was trained on 5 million hours of audio and reduces error rates by roughly 10-20% compared to the previous large-v2 model. Whisper supports automatic speech recognition (ASR), speech translation, language identification, and voice activity detection across a wide range of languages.

Model Sizes and Capabilities

Whisper is available in various sizes to accommodate different hardware constraints and accuracy needs:

  • Tiny: 39M parameters (~1 GB VRAM) - Fastest, least accurate
  • Base: 74M parameters (~1 GB VRAM)
  • Small: 244M parameters (~2 GB VRAM)
  • Medium: 769M parameters (~5 GB VRAM)
  • Large: 1.5B parameters (~10 GB VRAM)
  • Large-v3: 1.5B parameters (~10 GB VRAM, latest) - Most accurate, trained on 5M hours of audio

The English-only models (e.g., tiny.en, base.en) tend to perform better for English transcription, especially for the smaller models.
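
As an illustration, a minimal sketch that loads an English-only checkpoint (the file name audio.mp3 is a placeholder):

import whisper

# "base.en" is the English-only counterpart of the multilingual "base" model;
# for English audio it is usually a bit more accurate at the same size.
model = whisper.load_model("base.en")
result = model.transcribe("audio.mp3")
print(result["text"])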

System Requirements

Requirements vary based on the model size you choose:

  • Tiny/Base models:

    • CPU: 2+ cores
    • RAM: 4GB+
    • GPU: Optional, improves performance
    • Storage: 1GB+ for model files
  • Medium/Large models:

    • CPU: 4+ cores
    • RAM: 8GB+
    • GPU: Recommended, 4GB+ VRAM for better performance
    • Storage: 3GB+ for model files
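
A rough sketch of picking a model size from the hardware that is actually available at runtime (the VRAM thresholds below are illustrative assumptions, not official figures):

import torch
import whisper

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # Illustrative cut-offs: large-v3 wants roughly 10 GB, medium roughly 5 GB.
    model_name = "large-v3" if vram_gb >= 10 else "medium" if vram_gb >= 5 else "small"
else:
    # CPU-only machines are usually better served by the smaller checkpoints.
    model_name = "base"

model = whisper.load_model(model_name)
print(f"Loaded '{model_name}' on {model.device}")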

Installation Guide

Prerequisites

  • Python 3.8-3.11
  • PyTorch (1.10.1 or later)
  • FFmpeg (for loading audio files)

Option 1: Install with pip

  1. Install FFmpeg:

    # On Ubuntu or Debian
    sudo apt update && sudo apt install ffmpeg
    
    # On macOS using Homebrew
    brew install ffmpeg
    
    # On Windows using Chocolatey
    choco install ffmpeg
    
  2. Install Whisper:

    pip install openai-whisper
    

Option 2: Install from source

  1. Clone the repository:

    git clone https://github.com/openai/whisper.git
    
  2. Navigate to the project directory:

    cd whisper
    
  3. Install the package and its dependencies:

    pip install -e .
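
Whichever installation option you use, a quick sanity check from Python (a minimal sketch that just lists the checkpoints the package knows about):

import whisper

# Prints the available checkpoint names, e.g. "tiny", "base", ..., "large-v3".
print(whisper.available_models())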
    

Using Whisper

Basic Transcription

import whisper

# Load the model
model = whisper.load_model("base")  # Options: tiny, base, small, medium, large, large-v3

# Transcribe an audio file
result = model.transcribe("audio.mp3")
print(result["text"])
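
The returned dictionary also carries per-segment timestamps, which is handy for things like subtitles. A short self-contained sketch:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Each segment has start/end times in seconds alongside its text.
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")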

Detect Language

import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Make log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")

Translate to English

import whisper

model = whisper.load_model("base")
result = model.transcribe("non_english_audio.mp3", task="translate")
print(result["text"])  # English translation
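
If the source language is already known, passing it explicitly skips the auto-detection pass (the language code "ja" below is only an example):

import whisper

model = whisper.load_model("base")

# task="translate" produces English text; language hints at the source language.
result = model.transcribe("non_english_audio.mp3", task="translate", language="ja")
print(result["text"])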

Using Whisper with Hugging Face Transformers

For more advanced use cases or to process longer audio files:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load processor and model
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Create pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,  # Segment-level timestamps; pass "word" for word-level
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe audio
result = pipe("audio.mp3")
print(result["text"])
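
Reusing the pipe object from above, the pipeline can also return word-level timestamps or be forced to a specific language and task via generate_kwargs (a sketch; exact parameter support may vary with your transformers version):

# Word-level timestamps instead of segment-level ones.
result = pipe("audio.mp3", return_timestamps="word")
print(result["chunks"])

# Force the source language and request an English translation.
result = pipe("audio.mp3", generate_kwargs={"language": "french", "task": "translate"})
print(result["text"])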

Resources

For more information and the latest updates, visit the Whisper GitHub repository.

Suggested Projects

You might also be interested in these similar projects:

  • πŸŽ™οΈ WhisperX - Fast automatic speech recognition with word-level timestamps and speaker diarization (Difficulty: Intermediate; Updated: Mar 1, 2025)
  • πŸ—„οΈ Chroma - The AI-native open-source embedding database for storing and searching vector embeddings (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)
  • πŸ•ΈοΈ Crawl4AI - Blazing-fast, AI-ready web crawler and scraper designed specifically for LLMs, AI agents, and data pipelines (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)