Speech Recognition
πŸŽ™οΈ Whisper

OpenAI's state-of-the-art speech recognition model for transcription and translation

Beginner to Intermediate · open-source · transcription · translation · multilingual

Alternative To

  • Google Speech-to-Text
  • Amazon Transcribe
  • Azure Speech Services

Difficulty Level

Beginner to Intermediate

Suitable for newcomers with basic Python experience; the default setup is straightforward, and the larger models mainly call for more RAM or a GPU.

Overview

Whisper is OpenAI's open-source, state-of-the-art speech recognition model that can transcribe and translate audio in multiple languages. The latest release, large-v3, was trained on 5 million hours of audio and reduces error rates by roughly 10-20% compared to the previous large-v2 model. Whisper supports automatic speech recognition (ASR), speech translation, language identification, and voice activity detection across a wide range of languages.

Model Sizes and Capabilities

Whisper is available in various sizes to accommodate different hardware constraints and accuracy needs:

  • Tiny: 39M parameters (~1 GB VRAM) - Fastest, least accurate
  • Base: 74M parameters (~1 GB VRAM)
  • Small: 244M parameters (~2 GB VRAM)
  • Medium: 769M parameters (~5 GB VRAM)
  • Large: 1.5B parameters (~10 GB VRAM)
  • Large-v3: 1.5B parameters (~10 GB VRAM, latest) - Most accurate, trained on 5M hours of audio

The English-only models (e.g., tiny.en, base.en) tend to perform better for English transcription, especially for the smaller models.
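
As an illustration, a minimal sketch that loads an English-only checkpoint (the file name audio.mp3 is a placeholder):

import whisper

# "base.en" is the English-only counterpart of the multilingual "base" model;
# for English audio it is usually a bit more accurate at the same size.
model = whisper.load_model("base.en")
result = model.transcribe("audio.mp3")
print(result["text"])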

System Requirements

Requirements vary based on the model size you choose:

  • Tiny/Base models:

    • CPU: 2+ cores
    • RAM: 4GB+
    • GPU: Optional, improves performance
    • Storage: 1GB+ for model files
  • Medium/Large models:

    • CPU: 4+ cores
    • RAM: 8GB+
    • GPU: Recommended, 4GB+ VRAM for better performance
    • Storage: 3GB+ for model files
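
A rough sketch of picking a model size from the hardware that is actually available at runtime (the VRAM thresholds below are illustrative assumptions, not official figures):

import torch
import whisper

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # Illustrative cut-offs: large-v3 wants roughly 10 GB, medium roughly 5 GB.
    model_name = "large-v3" if vram_gb >= 10 else "medium" if vram_gb >= 5 else "small"
else:
    # CPU-only machines are usually better served by the smaller checkpoints.
    model_name = "base"

model = whisper.load_model(model_name)
print(f"Loaded '{model_name}' on {model.device}")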

Installation Guide

Prerequisites

  • Python 3.8-3.11
  • PyTorch (1.10.1 or later)
  • FFmpeg (for loading audio files)

Option 1: Install with pip

  1. Install FFmpeg:

    # On Ubuntu or Debian
    sudo apt update && sudo apt install ffmpeg
    
    # On macOS using Homebrew
    brew install ffmpeg
    
    # On Windows using Chocolatey
    choco install ffmpeg
    
  2. Install Whisper:

    pip install openai-whisper
    

Option 2: Install from source

  1. Clone the repository:

    git clone https://github.com/openai/whisper.git
    
  2. Navigate to the project directory:

    cd whisper
    
  3. Install the package and its dependencies:

    pip install -e .
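
Whichever installation option you use, a quick sanity check from Python (a minimal sketch that just lists the checkpoints the package knows about):

import whisper

# Prints the available checkpoint names, e.g. "tiny", "base", ..., "large-v3".
print(whisper.available_models())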
    

Using Whisper

Basic Transcription

import whisper

# Load the model
model = whisper.load_model("base")  # Options: tiny, base, small, medium, large, large-v3

# Transcribe an audio file
result = model.transcribe("audio.mp3")
print(result["text"])
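
The returned dictionary also carries per-segment timestamps, which is handy for things like subtitles. A short self-contained sketch:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Each segment has start/end times in seconds alongside its text.
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")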

Detect Language

import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Make log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")

Translate to English

import whisper

model = whisper.load_model("base")
result = model.transcribe("non_english_audio.mp3", task="translate")
print(result["text"])  # English translation
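
If the source language is already known, passing it explicitly skips the auto-detection pass (the language code "ja" below is only an example):

import whisper

model = whisper.load_model("base")

# task="translate" produces English text; language hints at the source language.
result = model.transcribe("non_english_audio.mp3", task="translate", language="ja")
print(result["text"])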

Using Whisper with Hugging Face Transformers

For more advanced use cases or to process longer audio files:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load processor and model
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Create pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,  # Segment-level timestamps; pass "word" for word-level
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe audio
result = pipe("audio.mp3")
print(result["text"])
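
Reusing the pipe object from above, the pipeline can also return word-level timestamps or be forced to a specific language and task via generate_kwargs (a sketch; exact parameter support may vary with your transformers version):

# Word-level timestamps instead of segment-level ones.
result = pipe("audio.mp3", return_timestamps="word")
print(result["chunks"])

# Force the source language and request an English translation.
result = pipe("audio.mp3", generate_kwargs={"language": "french", "task": "translate"})
print(result["text"])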

Resources

For more information and the latest updates, visit the Whisper GitHub repository.

Suggested Projects

You might also be interested in these similar projects:

  • πŸŽ™οΈ WhisperX - Fast automatic speech recognition with word-level timestamps and speaker diarization (Difficulty: Intermediate; Updated: Mar 1, 2025)
  • πŸ—„οΈ Chroma - The AI-native open-source embedding database for storing and searching vector embeddings (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)
  • πŸ•ΈοΈ Crawl4AI - Blazing-fast, AI-ready web crawler and scraper designed specifically for LLMs, AI agents, and data pipelines (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)