Best Practices (Intermediate)

Getting accurate transcription in production requires more than just calling an API. This lesson covers audio preprocessing, custom vocabularies, post-processing pipelines, error handling, and deployment strategies that maximize accuracy and reliability.

Audio Preprocessing

Input audio quality is usually the largest factor in transcription accuracy. Apply these preprocessing steps before sending audio to any ASR system:

Python
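from pydub import AudioSegment
import noisereduce as nr
import numpy as np

SAMPLE_RATE = 16000  # 16 kHz suits most ASR systems; telephony audio is often 8 kHz

# Load the source file (pydub needs ffmpeg for compressed formats such as MP3)
audio = AudioSegment.from_file("raw_audio.mp3")

# Convert to mono, 16 kHz, 16-bit samples so the int16 cast below is valid
audio = audio.set_channels(1).set_frame_rate(SAMPLE_RATE).set_sample_width(2)

# Normalize volume
audio = audio.normalize()

# Apply noise reduction on floating-point samples
samples = np.array(audio.get_array_of_samples(), dtype=np.float32)
reduced = nr.reduce_noise(y=samples, sr=SAMPLE_RATE)

# Clip to the int16 range before casting so out-of-range values cannot overflow
reduced = np.clip(reduced, -32768, 32767).astype(np.int16)

# Export clean audio
clean_audio = AudioSegment(
    reduced.tobytes(),
    frame_rate=SAMPLE_RATE, sample_width=2, channels=1
)
clean_audio.export("clean_audio.wav", format="wav")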

Audio Quality Checklist

Factor       | Recommendation                                | Impact on WER
Sample rate  | 16 kHz for speech (8 kHz for telephony)       | High
Channels     | Mono (convert stereo to mono)                 | Medium
Noise        | Apply noise reduction for noisy environments  | High
Volume       | Normalize to consistent levels                | Medium
Format       | WAV (lossless) or FLAC over MP3               | Low-Medium
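
The sample-rate and channel items in the checklist can be checked automatically before audio is sent for transcription. The sketch below is one way to do that with Python's standard `wave` module. The function name `check_wav` and its `expected_rate` parameter are illustrative assumptions, not part of any ASR library, and the check only handles uncompressed WAV files.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return a list of checklist warnings for a WAV file (empty list = no issues found).

    Hypothetical helper: it only covers the channel and sample-rate rows of the checklist.
    """
    warnings = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    if channels != 1:
        warnings.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        warnings.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return warnings
```

For telephony recordings, pass `expected_rate=8000`. Noise and volume levels need signal analysis and are not covered by this check.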

Post-Processing the Transcript

Raw ASR output often needs cleanup before it is usable:

Python
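import re

def post_process_transcript(text):
    # Fix common ASR errors with domain-specific terms
    replacements = {
        "eye school": "AI School",
        "chat gee pee tee": "ChatGPT",
        "pie torch": "PyTorch",
    }
    for wrong, right in replacements.items():
        # Escape the phrase and match whole words only, so "pie torch" is
        # replaced but "magpie torch" is left alone
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text, flags=re.IGNORECASE)

    # Remove filler words, plus a comma that directly follows them.
    # "like" is deliberately not listed: removing it blindly would also
    # delete the verb, turning "I like PyTorch" into "I PyTorch".
    fillers = [r"\bum\b", r"\buh\b", r"\byou know\b"]
    for filler in fillers:
        text = re.sub(filler + r",?", "", text, flags=re.IGNORECASE)

    # Clean up extra spaces, including any left before punctuation
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([,.!?])", r"\1", text).strip()
    return text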

Measuring Accuracy

Python
from jiwer import wer, cer

reference = "The quick brown fox jumps over the lazy dog"
hypothesis = "The quick brown box jumps over the lazy dog"

word_error_rate = wer(reference, hypothesis)
char_error_rate = cer(reference, hypothesis)

print(f"WER: {word_error_rate:.1%}")   # 11.1% (1 of 9 words)
print(f"CER: {char_error_rate:.1%}")   # 2.3% (1 of 43 characters, spaces included)
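
WER is the number of word substitutions (S), deletions (D), and insertions (I) needed to turn the hypothesis into the reference, divided by the number of reference words (N): WER = (S + D + I) / N. In the example above, one substitution in nine words gives 1/9, or about 11.1%. The sketch below computes the same quantity with a word-level edit distance. It is an illustration of the metric, not jiwer's implementation, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. Comparisons here are case-sensitive, so lowercase and strip punctuation from both strings first if those differences should not count.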

Production Deployment Tips

Key Production Guidelines:
  • Chunk long audio — Split files longer than 10 minutes into chunks to avoid timeouts and memory issues
  • Implement retries — Cloud APIs can have transient failures; use exponential backoff
  • Cache results — Store transcripts to avoid re-processing the same audio
  • Monitor WER — Track word error rate over time to catch quality regressions
  • Use async processing — For batch jobs, use task queues (Celery, SQS) to handle audio in parallel
  • Consider privacy — For sensitive audio, use local Whisper instead of cloud APIs
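
Three of the guidelines above (chunking, retries, and caching) can be sketched with the standard library alone. Everything in this block is illustrative: `chunk_spans`, `with_retries`, and `cached_transcribe` are hypothetical helper names, `transcribe` stands in for whatever ASR call you use, and the set of exceptions treated as transient is an assumption you should adapt to your client library.

```python
import hashlib
import random
import time


def chunk_spans(duration_s, chunk_s=600.0, overlap_s=0.0):
    """Return (start, end) offsets in seconds that cover the audio in chunks."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back so consecutive chunks overlap
    return spans


def with_retries(func, max_attempts=4, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a transient error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))


def cached_transcribe(audio_bytes, transcribe, cache):
    """Transcribe audio once per unique content, keyed by its SHA-256 digest."""
    key = hashlib.sha256(audio_bytes).hexdigest()
    if key not in cache:
        cache[key] = transcribe(audio_bytes)
    return cache[key]
```

Here `cache` is any dict-like object. In production it would more likely be a database or key-value store, and the cache key should also include the model name and settings if those can change. A small `overlap_s` reduces the chance of cutting a word at a chunk boundary, at the cost of duplicated text that has to be merged afterwards.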

Course Complete!

You now have the skills to build production-grade speech-to-text applications. Return to the course overview to review any lessons or explore other AI School courses.
