Cloud Speech-to-Text APIs (Intermediate)

Cloud STT services provide production-ready speech recognition with high availability, real-time streaming, and specialized models. Learn to integrate Google Cloud Speech-to-Text, Azure Cognitive Services, and AWS Transcribe into your applications.

Comparing Cloud STT Services

Feature              Google Cloud STT    Azure Speech          AWS Transcribe
Languages            125+                100+                  100+
Streaming            Yes                 Yes                   Yes
Speaker diarization  Yes                 Yes                   Yes
Custom models        Yes (adaptation)    Yes (Custom Speech)   Yes (custom vocabulary)
Medical model        Yes                 No                    Yes (Medical)
Free tier            60 min/month        5 hrs/month           60 min/month
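
The free-tier row translates directly into a billing estimate. A minimal overage-cost sketch, assuming a hypothetical $0.016/min rate (illustrative only; check each provider's current pricing page):

```python
def monthly_cost(minutes_used: float, free_minutes: float, rate_per_min: float) -> float:
    """Cost after subtracting the free tier (the rate is a placeholder, not a real price)."""
    billable = max(0.0, minutes_used - free_minutes)
    return billable * rate_per_min

# 500 minutes against a 60 min/month free tier at a hypothetical $0.016/min
print(round(monthly_cost(500, 60, 0.016), 2))  # → 7.04
```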

Google Cloud Speech-to-Text

Bash
# Install the Google Cloud client library
pip install google-cloud-speech
Python
from google.cloud import speech

# Credentials come from Application Default Credentials
# (e.g. the GOOGLE_APPLICATION_CREDENTIALS environment variable)
client = speech.SpeechClient()

# Read audio file
with open("audio.wav", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)

response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(result.alternatives[0].transcript)
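
Note that recognize() is a synchronous call limited to short clips (about a minute of audio); longer files go through long_running_recognize, and live audio through streaming_recognize, which consumes an iterator of requests. A minimal sketch of the chunking side, with the cloud call shown in comments since it needs credentials (the 4096-byte chunk size is an arbitrary choice, not an API requirement):

```python
from typing import Iterator

def audio_chunks(path: str, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield fixed-size byte chunks from an audio file for a streaming request."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# With google-cloud-speech, each chunk is wrapped in a request and fed to
# client.streaming_recognize, roughly:
#
#   requests = (speech.StreamingRecognizeRequest(audio_content=c)
#               for c in audio_chunks("audio.wav"))
#   streaming_config = speech.StreamingRecognitionConfig(config=config)
#   for response in client.streaming_recognize(streaming_config, requests):
#       ...
```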

Azure Cognitive Services Speech

Bash
# Install the Azure Speech SDK
pip install azure-cognitiveservices-speech
Python
import azure.cognitiveservices.speech as speechsdk

# Configure the speech service
speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_AZURE_KEY",
    region="eastus"
)
speech_config.speech_recognition_language = "en-US"

# Transcribe from file
audio_config = speechsdk.AudioConfig(filename="audio.wav")
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

result = recognizer.recognize_once()  # returns after the first recognized utterance

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    details = result.cancellation_details
    print(f"Canceled: {details.reason}")

AWS Transcribe

Bash
# Install the AWS SDK for Python
pip install boto3
Python
import boto3
import time

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Start a transcription job (audio must be in S3)
transcribe.start_transcription_job(
    TranscriptionJobName="my-transcription",
    Media={"MediaFileUri": "s3://my-bucket/audio.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 4
    }
)

# Poll for completion
while True:
    status = transcribe.get_transcription_job(
        TranscriptionJobName="my-transcription"
    )
    if status["TranscriptionJob"]["TranscriptionJobStatus"] in ["COMPLETED", "FAILED"]:
        break
    time.sleep(5)

print(status["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
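
The completed job exposes a presigned TranscriptFileUri rather than the text itself; the JSON document behind that URI carries the full text under results.transcripts (this field path follows the documented Transcribe output format). A minimal parsing sketch, using an inline sample dict in place of the real download:

```python
import json
import urllib.request

def extract_transcript(payload: dict) -> str:
    """Pull the full transcript text out of a Transcribe result document."""
    return payload["results"]["transcripts"][0]["transcript"]

# In practice you would fetch the presigned URI first:
#   with urllib.request.urlopen(transcript_file_uri) as resp:
#       payload = json.load(resp)

sample = {"results": {"transcripts": [{"transcript": "hello world"}]}}
print(extract_transcript(sample))  # → hello world
```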

Cost Optimization: All three providers bill by the duration of audio processed (the exact billing increment and minimum vary by provider and model). For batch workloads where latency is not critical, consider running Whisper locally to save costs. Reserve the cloud APIs for real-time streaming and for production applications where uptime matters.

Try It Yourself

Sign up for the free tier of one cloud STT service and transcribe a sample audio file. Compare the results with Whisper's output from the previous lesson.

Next: Real-Time Transcription →