Pretrained Audio Models

Explore pretrained models for speech recognition, text-to-speech synthesis, music generation, and audio classification — with practical Python code for each task.

Speech-to-Text (ASR)

Whisper (OpenAI)

One of the most widely used speech recognition models. Trained on 680,000 hours of multilingual audio, it supports 99 languages, speech translation, and timestamped output. Available in tiny, base, small, medium, and large variants.

Python
from transformers import pipeline

# Load Whisper for speech recognition
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# Transcribe audio file
result = transcriber("audio.mp3")
print(result["text"])

# With timestamps
result = transcriber("audio.mp3", return_timestamps=True)
for chunk in result["chunks"]:
    print(f"[{chunk['timestamp'][0]:.1f}s - {chunk['timestamp'][1]:.1f}s] {chunk['text']}")

wav2vec 2.0 (Meta)

Self-supervised speech representation model. It learns representations from unlabeled audio and is then fine-tuned on labeled data, which makes it especially useful for low-resource languages.

Python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
result = asr("speech.wav")
print(result["text"])
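The wav2vec2-base-960h checkpoint is a CTC model: it emits one prediction per audio frame, and decoding collapses consecutive repeats and removes blank tokens. A toy sketch of that collapse step, with made-up frame predictions for illustration:

```python
# Toy sketch of greedy CTC decoding, the collapse step the wav2vec 2.0
# tokenizer performs internally. Frame predictions here are invented.

BLANK = "<pad>"  # wav2vec 2.0 uses its padding token as the CTC blank

def ctc_collapse(frames):
    """Merge consecutive repeats, then drop blank tokens."""
    decoded = []
    previous = None
    for token in frames:
        if token != previous:      # keep only changes between frames
            if token != BLANK:     # blanks separate repeated characters
                decoded.append(token)
        previous = token
    return "".join(decoded)

# One prediction per ~20 ms audio frame (hypothetical example)
frames = ["H", "H", BLANK, "E", "L", "L", BLANK, "L", "O", "O"]
print(ctc_collapse(frames))  # HELLO
```

Note how the blank between the two "L" runs is what lets the decoder keep a genuine double letter instead of merging it away.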

Conformer

Combines CNNs and Transformers for state-of-the-art ASR. Used in Google's speech recognition systems.
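The pairing of convolution (local acoustic detail) and self-attention (global context) is the defining idea. A minimal PyTorch sketch of one Conformer-style block, with the block structure, dimensions, and layer choices simplified for illustration (real implementations add relative positional encoding, macaron-style half-step feed-forwards, and dropout):

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Simplified Conformer block: self-attention + depthwise convolution."""

    def __init__(self, dim=144, heads=4, kernel_size=31):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        # Depthwise convolution captures local patterns along the time axis
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.ff_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (batch, time, dim)
        # Self-attention models global context across the utterance
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Convolution models local acoustic detail
        h = self.conv_norm(x).transpose(1, 2)   # (batch, dim, time) for Conv1d
        x = x + self.conv(h).transpose(1, 2)
        return x + self.ff(self.ff_norm(x))

block = ConformerBlockSketch()
out = block(torch.randn(2, 100, 144))  # batch of 2, 100 frames, 144 features
print(out.shape)  # torch.Size([2, 100, 144])
```

Residual connections around each sub-module keep the block shape-preserving, so blocks can be stacked to any depth.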

Text-to-Speech (TTS)

Bark (Suno AI)

Open-source text-to-speech model that can generate highly realistic speech, music, and sound effects. Supports multiple languages and speaker voices.

Python
from transformers import AutoProcessor, BarkModel
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")

inputs = processor("Hello, this is a test of the Bark text to speech model.")
audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()

# Bark generates 24 kHz audio; the rate is stored on the generation config
scipy.io.wavfile.write("output.wav", rate=model.generation_config.sample_rate, data=audio_array)

VITS

End-to-end TTS model that produces natural-sounding speech. Available in many languages on Hugging Face.

Coqui TTS

Open-source TTS toolkit with multiple model architectures. Supports voice cloning and multi-speaker synthesis.

Music Generation

MusicGen (Meta)

Generates music from text descriptions or melody conditioning. Available in small (300M), medium (1.5B), and large (3.3B) sizes.

Python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["upbeat electronic dance music with synths"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=512)  # ~10 seconds of audio

# Save the result at the model's native sampling rate (32 kHz)
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())

AudioCraft (Meta)

A library that includes MusicGen, AudioGen (sound effects), and EnCodec (audio codec). A complete toolkit for audio generation.
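EnCodec's core idea is residual vector quantization (RVQ): each codebook quantizes the residual left over by the previous one, so a few small codebooks together give a fine-grained approximation of each audio frame. A toy NumPy sketch with random codebooks and a random input vector (real EnCodec operates on learned latent frames, not raw vectors like this):

```python
import numpy as np

rng = np.random.default_rng(0)
# 3 quantization stages, each with 16 candidate code vectors of dimension 4
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]

def rvq_encode(x, codebooks):
    """Return one code index per stage; each stage quantizes the residual."""
    indices, residual = [], x.copy()
    for codebook in codebooks:
        distances = np.linalg.norm(codebook - residual, axis=1)
        best = int(np.argmin(distances))       # nearest code for this stage
        indices.append(best)
        residual = residual - codebook[best]   # leftover goes to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected code vectors from every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))

x = rng.normal(size=4)
codes = rvq_encode(x, codebooks)
reconstruction = rvq_decode(codes, codebooks)
print(codes)                                   # one small integer per stage
print(np.linalg.norm(x - reconstruction))      # residual reconstruction error
```

Transmitting a few small indices per frame instead of raw samples is what makes EnCodec a codec, and those discrete codes are exactly what MusicGen's language model predicts.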

Audio Classification

AST (Audio Spectrogram Transformer)

Applies Vision Transformer to audio spectrograms. Excellent for environmental sound classification and music tagging.
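The key move is treating a (mel bins × time) spectrogram like an image: cut it into 16×16 patches and feed each flattened patch to a transformer as one token. A NumPy sketch with a random stand-in spectrogram and non-overlapping patches (the actual AST uses overlapping patches with stride 10):

```python
import numpy as np

# Random stand-in for a log-mel spectrogram: 128 mel bins x 1024 time frames
spectrogram = np.random.randn(128, 1024)

patch = 16
n_freq = spectrogram.shape[0] // patch    # 8 patches along frequency
n_time = spectrogram.shape[1] // patch    # 64 patches along time

# Cut into 16x16 tiles and flatten each tile into one token vector
patches = (spectrogram
           .reshape(n_freq, patch, n_time, patch)
           .transpose(0, 2, 1, 3)
           .reshape(n_freq * n_time, patch * patch))
print(patches.shape)  # (512, 256) -> 512 tokens of dimension 256
```

After this step the model is a standard ViT: a linear projection maps each 256-dim patch to the embedding size, positional embeddings are added, and a classification token summarizes the clip.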

HuBERT

Hidden-Unit BERT for self-supervised speech representation. Useful for speaker verification, emotion recognition, and audio classification.

Python
from transformers import pipeline

# Audio classification
classifier = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")
result = classifier("audio.wav")
print(result)
# [{'label': 'Speech', 'score': 0.95}, ...]

Audio Models Comparison

Model              Task                  Size   Languages
Whisper Large-v3   Speech-to-text        1.5B   99 languages
Whisper Base       Speech-to-text        74M    99 languages
wav2vec 2.0        Speech-to-text        300M   English (+ fine-tuned variants)
Bark               Text-to-speech        ~1B    13+ languages
MusicGen Small     Music generation      300M   Text-conditioned
AST                Audio classification  87M    N/A
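For scripts that need a sensible default per task, the table can be captured as a small lookup. This is a hypothetical helper, not part of any library; the checkpoint IDs are the Hugging Face model names used in the examples above:

```python
# Hypothetical helper: map each audio task to a default Hugging Face
# checkpoint, mirroring the models shown earlier on this page.
CHECKPOINTS = {
    "speech-to-text": "openai/whisper-base",
    "text-to-speech": "suno/bark",
    "music-generation": "facebook/musicgen-small",
    "audio-classification": "MIT/ast-finetuned-audioset-10-10-0.4593",
}

def default_checkpoint(task):
    """Return a reasonable default checkpoint ID for an audio task."""
    try:
        return CHECKPOINTS[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}. Choose from {sorted(CHECKPOINTS)}")

print(default_checkpoint("speech-to-text"))  # openai/whisper-base
```

The returned string can be passed straight to `pipeline(...)` or `from_pretrained(...)` as in the snippets above.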

Next Up

Explore multi-modal models that work across text, images, video, and documents.

Next: Multi-Modal Models →