Whisper
OpenAI's state-of-the-art speech recognition model for transcription and translation
Alternative To
- Google Speech-to-Text
- Amazon Transcribe
- Azure Speech Services
Difficulty Level
For experienced users. Complex setup and configuration required.
Overview
Whisper is OpenAI's open-source, state-of-the-art speech recognition model that can transcribe and translate audio in multiple languages. Trained on 5 million hours of audio data, Whisper large-v3 offers significantly improved performance, with a 10-20% reduction in error rates compared to large-v2. It supports automatic speech recognition (ASR), speech translation, language identification, and voice activity detection across a wide range of languages.
Model Sizes and Capabilities
Whisper is available in various sizes to accommodate different hardware constraints and accuracy needs:
- Tiny: 39M parameters (~1GB VRAM) - Fastest, least accurate
- Base: 74M parameters (~1GB VRAM)
- Small: 244M parameters (~2GB VRAM)
- Medium: 769M parameters (~5GB VRAM)
- Large: 1.5B parameters (~10GB VRAM)
- Large-v3: 1.5B parameters (~10GB VRAM, latest) - Most accurate, trained on 5M hours of audio
The English-only models (e.g., tiny.en, base.en) tend to perform better for English transcription, especially for the smaller models.
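English-only checkpoints are loaded by appending .en to the size name; a minimal sketch:
import whisper
# The ".en" suffix selects the English-only variant of a given size
model = whisper.load_model("tiny.en")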
System Requirements
Requirements vary based on the model size you choose:
Tiny/Base models:
- CPU: 2+ cores
- RAM: 4GB+
- GPU: Optional, improves performance
- Storage: 1GB+ for model files
Medium/Large models:
- CPU: 4+ cores
- RAM: 8GB+
- GPU: Recommended, 4GB+ VRAM for better performance
- Storage: 3GB+ for model files
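Before picking a model size, it can help to check what PyTorch (which Whisper depends on) sees on your machine. A minimal sketch, using the VRAM thresholds above as the guide:
import torch
# Report whether a CUDA GPU is visible and how much VRAM it has
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; tiny/base models are the safer choice on CPU")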
Installation Guide
Prerequisites
- Python 3.8-3.11
- PyTorch (1.10.1 or later)
- FFmpeg (for loading audio files)
Option 1: Install with pip (Recommended)
Install FFmpeg:
# On Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# On macOS using Homebrew
brew install ffmpeg
# On Windows using Chocolatey
choco install ffmpeg
Install Whisper:
pip install openai-whisper
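To sanity-check the installation, the whisper command-line tool (installed with the package) can transcribe a file directly; audio.mp3 below is a placeholder file name:
# Transcribe a file with the base model; transcripts are written to the current directory
whisper audio.mp3 --model base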
Option 2: Install from source
Clone the repository:
git clone https://github.com/openai/whisper.git
Navigate to the project directory:
cd whisper
Install the package and its dependencies:
pip install -e .
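To confirm the package imports correctly, you can list the model names Whisper knows how to download (available_models is part of the openai-whisper API):
# Print the available model names, e.g. tiny, base, small, medium, large-v3
python -c "import whisper; print(whisper.available_models())"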
Using Whisper
Basic Transcription
import whisper
# Load the model
model = whisper.load_model("base") # Options: tiny, base, small, medium, large, large-v3
# Transcribe an audio file
result = model.transcribe("audio.mp3")
print(result["text"])
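transcribe() also accepts decoding options. For example, specifying the language skips automatic detection, and fp16=False avoids the half-precision warning on CPU (both are standard openai-whisper parameters):
# Pin the language, print progress per segment, and force FP32 on CPU
result = model.transcribe("audio.mp3", language="en", verbose=True, fp16=False)
print(result["text"])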
Detect Language
import whisper
model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
# Make log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Detect the spoken language
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")
Translate to English
import whisper
model = whisper.load_model("base")
result = model.transcribe("non_english_audio.mp3", task="translate")
print(result["text"]) # English translation
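The same task is available from the command line; non_english_audio.mp3 is a placeholder:
# Translate speech in any supported language to English text
whisper non_english_audio.mp3 --task translate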
Using Whisper with Hugging Face Transformers
For more advanced use cases or to process longer audio files:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Load processor and model
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
# Create pipeline for automatic speech recognition
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=30,
batch_size=16,
return_timestamps=True, # Segment-level timestamps; pass "word" instead of True for word-level
torch_dtype=torch_dtype,
device=device,
)
# Transcribe audio
result = pipe("audio.mp3")
print(result["text"])
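Because the pipeline was created with return_timestamps=True, the result also carries timestamped segments under the "chunks" key:
# Print each segment with its start/end times (in seconds)
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} - {end}] {chunk['text']}")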
Resources
Official Documentation and Tools
- GitHub Repository
- Whisper Large-v3 on Hugging Face
- OpenAI Whisper API
- Transformers Documentation for Whisper
Community Support
For more information and the latest updates, visit the Whisper GitHub repository.