Cloud Speech-to-Text APIs (Intermediate)

Cloud STT services provide production-ready speech recognition with high availability, real-time streaming, and specialized models. Learn to integrate Google Cloud Speech-to-Text, Azure Cognitive Services, and AWS Transcribe into your applications.

Comparing Cloud STT Services

Feature              Google Cloud STT    Azure Speech          AWS Transcribe
Languages            125+                100+                  100+
Streaming            Yes                 Yes                   Yes
Speaker diarization  Yes                 Yes                   Yes
Custom models        Yes (adaptation)    Yes (Custom Speech)   Yes (custom vocabulary)
Medical model        Yes                 No                    Yes (Medical)
Free tier            60 min/month        5 hrs/month           60 min/month
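
The free-tier row translates directly into a billing estimate. A minimal overage-cost sketch, assuming a hypothetical $0.016/min rate (illustrative only; check each provider's current pricing page):

```python
def monthly_cost(minutes_used: float, free_minutes: float, rate_per_min: float) -> float:
    """Cost after subtracting the free tier (the rate is a placeholder, not a real price)."""
    billable = max(0.0, minutes_used - free_minutes)
    return billable * rate_per_min

# 500 minutes against a 60 min/month free tier at a hypothetical $0.016/min
print(round(monthly_cost(500, 60, 0.016), 2))  # → 7.04
```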

Google Cloud Speech-to-Text

Bash
# Install the Google Cloud client library
pip install google-cloud-speech
Python
from google.cloud import speech

# Credentials come from Application Default Credentials
# (e.g. the GOOGLE_APPLICATION_CREDENTIALS environment variable)
client = speech.SpeechClient()

# Read audio file
with open("audio.wav", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)

response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(result.alternatives[0].transcript)
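
Note that recognize() is a synchronous call limited to short clips (about a minute of audio); longer files go through long_running_recognize, and live audio through streaming_recognize, which consumes an iterator of requests. A minimal sketch of the chunking side, with the cloud call shown in comments since it needs credentials (the 4096-byte chunk size is an arbitrary choice, not an API requirement):

```python
from typing import Iterator

def audio_chunks(path: str, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield fixed-size byte chunks from an audio file for a streaming request."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# With google-cloud-speech, each chunk is wrapped in a request and fed to
# client.streaming_recognize, roughly:
#
#   requests = (speech.StreamingRecognizeRequest(audio_content=c)
#               for c in audio_chunks("audio.wav"))
#   streaming_config = speech.StreamingRecognitionConfig(config=config)
#   for response in client.streaming_recognize(streaming_config, requests):
#       ...
```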

Azure Cognitive Services Speech

Bash
# Install the Azure Speech SDK
pip install azure-cognitiveservices-speech
Python
import azure.cognitiveservices.speech as speechsdk

# Configure the speech service
speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_AZURE_KEY",
    region="eastus"
)
speech_config.speech_recognition_language = "en-US"

# Transcribe from file
audio_config = speechsdk.AudioConfig(filename="audio.wav")
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

result = recognizer.recognize_once()  # returns after the first recognized utterance

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    details = result.cancellation_details
    print(f"Canceled: {details.reason}")

AWS Transcribe

Bash
# Install the AWS SDK for Python
pip install boto3
Python
import boto3
import time

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Start a transcription job (audio must be in S3)
transcribe.start_transcription_job(
    TranscriptionJobName="my-transcription",
    Media={"MediaFileUri": "s3://my-bucket/audio.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 4
    }
)

# Poll for completion
while True:
    status = transcribe.get_transcription_job(
        TranscriptionJobName="my-transcription"
    )
    if status["TranscriptionJob"]["TranscriptionJobStatus"] in ["COMPLETED", "FAILED"]:
        break
    time.sleep(5)

print(status["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
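
The completed job exposes a presigned TranscriptFileUri rather than the text itself; the JSON document behind that URI carries the full text under results.transcripts (this field path follows the documented Transcribe output format). A minimal parsing sketch, using an inline sample dict in place of the real download:

```python
import json
import urllib.request

def extract_transcript(payload: dict) -> str:
    """Pull the full transcript text out of a Transcribe result document."""
    return payload["results"]["transcripts"][0]["transcript"]

# In practice you would fetch the presigned URI first:
#   with urllib.request.urlopen(transcript_file_uri) as resp:
#       payload = json.load(resp)

sample = {"results": {"transcripts": [{"transcript": "hello world"}]}}
print(extract_transcript(sample))  # → hello world
```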

Cost Optimization: All three providers bill by the duration of audio processed (the exact billing increment and minimum vary by provider and model). For batch workloads where latency is not critical, consider running Whisper locally to save costs. Reserve the cloud APIs for real-time streaming and for production applications where uptime matters.

Try It Yourself

Sign up for the free tier of one cloud STT service and transcribe a sample audio file. Compare the results with Whisper's output from the previous lesson.

Next: Real-Time Transcription →