Pretrained Audio Models
Explore pretrained models for speech recognition, text-to-speech synthesis, music generation, and audio classification — with practical Python code for each task.
Speech-to-Text (ASR)
Whisper (OpenAI)
One of the most widely used open speech recognition models. Trained on 680,000 hours of multilingual audio. Supports 99 languages, speech translation to English, and timestamps. Available in tiny, base, small, medium, and large variants.
```python
from transformers import pipeline

# Load Whisper for speech recognition
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# Transcribe audio file
result = transcriber("audio.mp3")
print(result["text"])

# With timestamps
result = transcriber("audio.mp3", return_timestamps=True)
for chunk in result["chunks"]:
    print(f"[{chunk['timestamp'][0]:.1f}s - {chunk['timestamp'][1]:.1f}s] {chunk['text']}")
```
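Whisper can also translate non-English speech directly into English text by passing the `translate` task through `generate_kwargs`. A minimal sketch (`audio.mp3` is a placeholder for your own recording):

```python
from transformers import pipeline

# The same Whisper checkpoint handles both transcription and translation
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# Translate non-English speech directly into English text
result = transcriber("audio.mp3", generate_kwargs={"task": "translate"})
print(result["text"])
```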
wav2vec 2.0 (Meta)
Self-supervised speech representation model. Learns from unlabeled audio, then fine-tuned on labeled data. Excellent for low-resource languages.
```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
result = asr("speech.wav")
print(result["text"])
```
Conformer
Combines CNNs and Transformers for state-of-the-art ASR. Used in Google's speech recognition systems.
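Open Conformer-style checkpoints exist on Hugging Face; a sketch using Meta's Wav2Vec2-Conformer variant (`speech.wav` stands in for your own audio file):

```python
from transformers import pipeline

# Wav2Vec2-Conformer: a wav2vec 2.0 encoder with Conformer blocks
# (convolution + self-attention) in place of plain Transformer layers
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-conformer-rope-large-960h-ft")

result = asr("speech.wav")
print(result["text"])
```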
Text-to-Speech (TTS)
Bark (Suno AI)
Open-source text-to-speech model that can generate highly realistic speech, music, and sound effects. Supports multiple languages and speaker voices.
```python
import scipy.io.wavfile
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")

inputs = processor("Hello, this is a test of the Bark text to speech model.")
audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()

scipy.io.wavfile.write("output.wav", rate=model.generation_config.sample_rate, data=audio_array)
```
VITS
End-to-end TTS model that produces natural-sounding speech. Available in many languages on Hugging Face.
Coqui TTS
Open-source TTS toolkit with multiple model architectures. Supports voice cloning and multi-speaker synthesis.
Music Generation
MusicGen (Meta)
Generates music from text descriptions, optionally conditioned on a melody (via the dedicated melody checkpoint). Available in small (300M), medium (1.5B), and large (3.3B) sizes.
```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["upbeat electronic dance music with synths"],
                   padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=512)
```
AudioCraft (Meta)
A library that includes MusicGen, AudioGen (sound effects), and EnCodec (audio codec). A complete toolkit for audio generation.
Audio Classification
AST (Audio Spectrogram Transformer)
Applies Vision Transformer to audio spectrograms. Excellent for environmental sound classification and music tagging.
HuBERT
Hidden-Unit BERT for self-supervised speech representation. Useful for speaker verification, emotion recognition, and audio classification.
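As a sketch, the SUPERB benchmark publishes HuBERT checkpoints fine-tuned for such tasks; the emotion-recognition checkpoint below (trained on IEMOCAP) plugs into the same classification pipeline (`speech.wav` is a placeholder):

```python
from transformers import pipeline

# HuBERT fine-tuned for emotion recognition (SUPERB benchmark, IEMOCAP labels)
classifier = pipeline("audio-classification", model="superb/hubert-large-superb-er")

predictions = classifier("speech.wav")
print(predictions)  # labels such as 'neu', 'hap', 'ang', 'sad' with scores
```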
```python
from transformers import pipeline

# Audio classification with a pretrained AST checkpoint
classifier = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")
result = classifier("audio.wav")
print(result)  # [{'label': 'Speech', 'score': 0.95}, ...]
```
Audio Models Comparison
| Model | Task | Size | Languages |
|---|---|---|---|
| Whisper Large-v3 | Speech-to-text | 1.5B | 99 languages |
| Whisper Base | Speech-to-text | 74M | 99 languages |
| wav2vec 2.0 | Speech-to-text | 300M | English (+ fine-tuned variants) |
| Bark | Text-to-speech | ~1B | 13+ languages |
| MusicGen Small | Music generation | 300M | Text-conditioned |
| AST | Audio classification | 87M | N/A |
Next Up
Explore multi-modal models that work across text, images, video, and documents.
Next: Multi-Modal Models →
Lilly Tech Systems