Cloud Speech-to-Text APIs (Intermediate)
Cloud STT services provide production-ready speech recognition with high availability, real-time streaming, and specialized models. Learn to integrate Google Cloud Speech-to-Text, Azure Cognitive Services, and AWS Transcribe into your applications.
Comparing Cloud STT Services
| Feature | Google Cloud STT | Azure Speech | AWS Transcribe |
|---|---|---|---|
| Languages | 125+ | 100+ | 100+ |
| Streaming | Yes | Yes | Yes |
| Speaker diarization | Yes | Yes | Yes |
| Custom models | Yes (adaptation) | Yes (Custom Speech) | Yes (custom vocabulary) |
| Medical model | Yes | No | Yes (Medical) |
| Free tier | 60 min/month | 5 hrs/month | 60 min/month |
Google Cloud Speech-to-Text
```bash
# Install the Google Cloud client library
pip install google-cloud-speech
```
```python
from google.cloud import speech

client = speech.SpeechClient()

# Read the audio file into memory
with open("audio.wav", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```
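A common source of empty results is a `sample_rate_hertz` in the config that does not match the file. Since WAV headers carry this information, you can read it before building the config rather than hard-coding 16000. A minimal sketch using Python's standard `wave` module (the `demo.wav` file here is generated on the spot purely for illustration):

```python
import wave

def wav_recognition_params(path):
    """Read the sample rate and channel count from a WAV header,
    so RecognitionConfig's sample_rate_hertz matches the actual audio."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels()

# Demo: write a short silent 16 kHz, 16-bit mono file, then inspect it
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                    # 16-bit samples (LINEAR16)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)   # one second of silence

print(wav_recognition_params("demo.wav"))  # → (16000, 1)
```

Feed the returned rate into `sample_rate_hertz` instead of a literal value when you transcribe files from mixed sources.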
Azure Cognitive Services Speech
```python
import azure.cognitiveservices.speech as speechsdk

# Configure the speech service
speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_AZURE_KEY",
    region="eastus",
)
speech_config.speech_recognition_language = "en-US"

# Transcribe from file
audio_config = speechsdk.AudioConfig(filename="audio.wav")
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config,
)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    # Surfaces errors such as an invalid key or network failure
    details = result.cancellation_details
    print(f"Canceled: {details.reason}")
```
AWS Transcribe
```python
import time

import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Start a transcription job (audio must be in S3)
transcribe.start_transcription_job(
    TranscriptionJobName="my-transcription",
    Media={"MediaFileUri": "s3://my-bucket/audio.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 4,
    },
)

# Poll until the job finishes
while True:
    status = transcribe.get_transcription_job(
        TranscriptionJobName="my-transcription"
    )
    if status["TranscriptionJob"]["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

print(status["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
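Note that `TranscriptFileUri` points to a JSON document, not raw text, so one more step is needed to get the transcript itself. A sketch of extracting it, using a hard-coded sample in the shape Transcribe writes (in practice you would first download the URI with `urllib.request` or `requests`):

```python
import json

# Sample payload mirroring the Transcribe output schema; stands in for
# the JSON you would download from TranscriptFileUri.
payload = json.loads("""
{
  "jobName": "my-transcription",
  "results": {
    "transcripts": [{"transcript": "hello world"}]
  }
}
""")

def extract_transcript(doc):
    """Join the transcript segments from a Transcribe result document."""
    return " ".join(t["transcript"] for t in doc["results"]["transcripts"])

print(extract_transcript(payload))  # → hello world
```

When `ShowSpeakerLabels` is enabled, the same document also carries per-segment speaker labels alongside the plain transcript.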
Cost Optimization: All three providers bill by audio duration (typically per second, sometimes with a minimum billing increment). For batch workloads where latency is not critical, consider running Whisper locally to save costs. Reserve cloud APIs for real-time streaming and production applications where uptime matters.
Try It Yourself
Sign up for the free tier of one cloud STT service and transcribe a sample audio file. Compare the results with Whisper's output from the previous lesson.
Next: Real-Time Transcription →
Lilly Tech Systems