Pretrained Audio Models

Explore pretrained models for speech recognition, text-to-speech synthesis, music generation, and audio classification — with practical Python code for each task.

Speech-to-Text (ASR)

Whisper (OpenAI)

One of the most widely used speech recognition models. Trained on 680,000 hours of multilingual audio, it supports 99 languages, speech translation, and timestamped output. Available in tiny, base, small, medium, and large variants.

Python
from transformers import pipeline

# Load Whisper for speech recognition
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# Transcribe audio file
result = transcriber("audio.mp3")
print(result["text"])

# With timestamps
result = transcriber("audio.mp3", return_timestamps=True)
for chunk in result["chunks"]:
    print(f"[{chunk['timestamp'][0]:.1f}s - {chunk['timestamp'][1]:.1f}s] {chunk['text']}")

wav2vec 2.0 (Meta)

Self-supervised speech representation model. It learns representations from unlabeled audio and is then fine-tuned on labeled data, which makes it especially useful for low-resource languages.

Python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
result = asr("speech.wav")
print(result["text"])
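The wav2vec2-base-960h checkpoint is a CTC model: it emits one prediction per audio frame, and decoding collapses consecutive repeats and removes blank tokens. A toy sketch of that collapse step, with made-up frame predictions for illustration:

```python
# Toy sketch of greedy CTC decoding, the collapse step the wav2vec 2.0
# tokenizer performs internally. Frame predictions here are invented.

BLANK = "<pad>"  # wav2vec 2.0 uses its padding token as the CTC blank

def ctc_collapse(frames):
    """Merge consecutive repeats, then drop blank tokens."""
    decoded = []
    previous = None
    for token in frames:
        if token != previous:      # keep only changes between frames
            if token != BLANK:     # blanks separate repeated characters
                decoded.append(token)
        previous = token
    return "".join(decoded)

# One prediction per ~20 ms audio frame (hypothetical example)
frames = ["H", "H", BLANK, "E", "L", "L", BLANK, "L", "O", "O"]
print(ctc_collapse(frames))  # HELLO
```

Note how the blank between the two "L" runs is what lets the decoder keep a genuine double letter instead of merging it away.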

Conformer

Combines CNNs and Transformers for state-of-the-art ASR. Used in Google's speech recognition systems.
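The pairing of convolution (local acoustic detail) and self-attention (global context) is the defining idea. A minimal PyTorch sketch of one Conformer-style block, with the block structure, dimensions, and layer choices simplified for illustration (real implementations add relative positional encoding, macaron-style half-step feed-forwards, and dropout):

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Simplified Conformer block: self-attention + depthwise convolution."""

    def __init__(self, dim=144, heads=4, kernel_size=31):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        # Depthwise convolution captures local patterns along the time axis
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.ff_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (batch, time, dim)
        # Self-attention models global context across the utterance
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Convolution models local acoustic detail
        h = self.conv_norm(x).transpose(1, 2)   # (batch, dim, time) for Conv1d
        x = x + self.conv(h).transpose(1, 2)
        return x + self.ff(self.ff_norm(x))

block = ConformerBlockSketch()
out = block(torch.randn(2, 100, 144))  # batch of 2, 100 frames, 144 features
print(out.shape)  # torch.Size([2, 100, 144])
```

Residual connections around each sub-module keep the block shape-preserving, so blocks can be stacked to any depth.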

Text-to-Speech (TTS)

Bark (Suno AI)

Open-source text-to-speech model that can generate highly realistic speech, music, and sound effects. Supports multiple languages and speaker voices.

Python
from transformers import AutoProcessor, BarkModel
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")

inputs = processor("Hello, this is a test of the Bark text to speech model.")
audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()

# Bark generates 24 kHz audio; the rate is stored on the generation config
scipy.io.wavfile.write("output.wav", rate=model.generation_config.sample_rate, data=audio_array)

VITS

End-to-end TTS model that produces natural-sounding speech. Available in many languages on Hugging Face.

Coqui TTS

Open-source TTS toolkit with multiple model architectures. Supports voice cloning and multi-speaker synthesis.

Music Generation

MusicGen (Meta)

Generates music from text descriptions or melody conditioning. Available in small (300M), medium (1.5B), and large (3.3B) sizes.

Python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["upbeat electronic dance music with synths"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=512)  # ~10 seconds of audio

# Save the result at the model's native sampling rate (32 kHz)
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())

AudioCraft (Meta)

A library that includes MusicGen, AudioGen (sound effects), and EnCodec (audio codec). A complete toolkit for audio generation.
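EnCodec's core idea is residual vector quantization (RVQ): each codebook quantizes the residual left over by the previous one, so a few small codebooks together give a fine-grained approximation of each audio frame. A toy NumPy sketch with random codebooks and a random input vector (real EnCodec operates on learned latent frames, not raw vectors like this):

```python
import numpy as np

rng = np.random.default_rng(0)
# 3 quantization stages, each with 16 candidate code vectors of dimension 4
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]

def rvq_encode(x, codebooks):
    """Return one code index per stage; each stage quantizes the residual."""
    indices, residual = [], x.copy()
    for codebook in codebooks:
        distances = np.linalg.norm(codebook - residual, axis=1)
        best = int(np.argmin(distances))       # nearest code for this stage
        indices.append(best)
        residual = residual - codebook[best]   # leftover goes to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected code vectors from every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))

x = rng.normal(size=4)
codes = rvq_encode(x, codebooks)
reconstruction = rvq_decode(codes, codebooks)
print(codes)                                   # one small integer per stage
print(np.linalg.norm(x - reconstruction))      # residual reconstruction error
```

Transmitting a few small indices per frame instead of raw samples is what makes EnCodec a codec, and those discrete codes are exactly what MusicGen's language model predicts.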

Audio Classification

AST (Audio Spectrogram Transformer)

Applies Vision Transformer to audio spectrograms. Excellent for environmental sound classification and music tagging.
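The key move is treating a (mel bins × time) spectrogram like an image: cut it into 16×16 patches and feed each flattened patch to a transformer as one token. A NumPy sketch with a random stand-in spectrogram and non-overlapping patches (the actual AST uses overlapping patches with stride 10):

```python
import numpy as np

# Random stand-in for a log-mel spectrogram: 128 mel bins x 1024 time frames
spectrogram = np.random.randn(128, 1024)

patch = 16
n_freq = spectrogram.shape[0] // patch    # 8 patches along frequency
n_time = spectrogram.shape[1] // patch    # 64 patches along time

# Cut into 16x16 tiles and flatten each tile into one token vector
patches = (spectrogram
           .reshape(n_freq, patch, n_time, patch)
           .transpose(0, 2, 1, 3)
           .reshape(n_freq * n_time, patch * patch))
print(patches.shape)  # (512, 256) -> 512 tokens of dimension 256
```

After this step the model is a standard ViT: a linear projection maps each 256-dim patch to the embedding size, positional embeddings are added, and a classification token summarizes the clip.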

HuBERT

Hidden-Unit BERT for self-supervised speech representation. Useful for speaker verification, emotion recognition, and audio classification.

Python
from transformers import pipeline

# Audio classification
classifier = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")
result = classifier("audio.wav")
print(result)
# [{'label': 'Speech', 'score': 0.95}, ...]

Audio Models Comparison

Model              Task                  Size   Languages
Whisper Large-v3   Speech-to-text        1.5B   99 languages
Whisper Base       Speech-to-text        74M    99 languages
wav2vec 2.0        Speech-to-text        300M   English (+ fine-tuned variants)
Bark               Text-to-speech        ~1B    13+ languages
MusicGen Small     Music generation      300M   Text-conditioned
AST                Audio classification  87M    N/A
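For scripts that need a sensible default per task, the table can be captured as a small lookup. This is a hypothetical helper, not part of any library; the checkpoint IDs are the Hugging Face model names used in the examples above:

```python
# Hypothetical helper: map each audio task to a default Hugging Face
# checkpoint, mirroring the models shown earlier on this page.
CHECKPOINTS = {
    "speech-to-text": "openai/whisper-base",
    "text-to-speech": "suno/bark",
    "music-generation": "facebook/musicgen-small",
    "audio-classification": "MIT/ast-finetuned-audioset-10-10-0.4593",
}

def default_checkpoint(task):
    """Return a reasonable default checkpoint ID for an audio task."""
    try:
        return CHECKPOINTS[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}. Choose from {sorted(CHECKPOINTS)}")

print(default_checkpoint("speech-to-text"))  # openai/whisper-base
```

The returned string can be passed straight to `pipeline(...)` or `from_pretrained(...)` as in the snippets above.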

Next Up

Explore multi-modal models that work across text, images, video, and documents.

Next: Multi-Modal Models →