LLM APIs

Compare LLM API providers, understand pricing, and learn best practices for streaming, function calling, and production usage.

API Providers Comparison

| Provider | Top Models | Input $/1M tokens | Output $/1M tokens | Key Feature |
|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4 Turbo, o1 | $2.50-$15 | $10-$60 | Largest ecosystem, function calling |
| Anthropic | Claude 4, 3.5 Sonnet/Haiku | $0.25-$15 | $1.25-$75 | 200K context, strong safety, coding |
| Google | Gemini 2.0 Flash/Pro | $0.075-$7 | $0.30-$21 | 1M+ context, multimodal native |
| Mistral | Mistral Large, Medium | $2-$8 | $6-$24 | European provider, open models |
| Cohere | Command R+ | $2.50 | $10 | RAG-optimized, enterprise focus |
| Together AI | Open models (LLaMA, Mistral) | $0.20-$1.20 | $0.20-$1.20 | Cheapest open model hosting |
| Groq | LLaMA, Mistral, Gemma | $0.05-$0.64 | $0.08-$0.80 | Fastest inference (LPU hardware) |
| Fireworks | Open models + fine-tuned | $0.20-$0.90 | $0.20-$0.90 | Fast inference, custom models |

Prices are approximate and change frequently. Check provider websites for current pricing.
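
To estimate what a request costs, multiply each token count by the matching per-million rate and sum. A quick sketch (the rates in the example are the approximate GPT-4o figures from the table above):

Python — Estimating request cost

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request, given token counts and $/1M-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# 2,000 input + 500 output tokens at GPT-4o-style rates ($2.50 / $10 per 1M)
print(f"${request_cost(2000, 500, 2.50, 10.00):.4f}")  # $0.0100
```

Note that output tokens are typically 3-4x more expensive than input tokens, so verbose responses dominate the bill faster than long prompts.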

OpenAI-Compatible APIs

Many providers implement the OpenAI API format, making it easy to switch between them:

Python — Using OpenAI-compatible APIs
from openai import OpenAI

# OpenAI
client = OpenAI(api_key="sk-...")

# Anthropic: prefer the official anthropic SDK for the best experience,
# though an OpenAI-compatible endpoint exists too

# Together AI
client = OpenAI(
    api_key="your-together-key",
    base_url="https://api.together.xyz/v1",
)

# Groq
client = OpenAI(
    api_key="your-groq-key",
    base_url="https://api.groq.com/openai/v1",
)

# Local (Ollama)
client = OpenAI(
    api_key="ollama",  # Ollama ignores the key, but the SDK requires one
    base_url="http://localhost:11434/v1",
)

# The same call pattern works with any provider; only the model name
# changes (e.g. "llama-3.1-8b-instant" on Groq, "llama3.1" on Ollama)
response = client.chat.completions.create(
    model="llama3.1",  # the client above points at a local Ollama server
    messages=[{"role": "user", "content": "Hello!"}],
)

Rate Limits

Every provider has rate limits. Plan your architecture accordingly:

  • Requests per minute (RPM): How many API calls you can make per minute.
  • Tokens per minute (TPM): Total tokens (input + output) per minute.
  • Tokens per day (TPD): Daily token budget.

Handling rate limits: Implement exponential backoff with jitter. Use a token bucket or leaky bucket rate limiter. Batch requests where possible. Consider multiple API keys or providers for high-volume workloads.
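
The backoff step can be sketched as a small retry wrapper. A minimal version; in real code you would catch the SDK's specific rate-limit exception (e.g. openai.RateLimitError) rather than bare Exception:

Python — Exponential backoff with jitter

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call`, sleeping exponentially longer (with full jitter) between tries."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in real code, catch the SDK's RateLimitError
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter avoids thundering herds
```

Full jitter (sleeping a random time between 0 and the capped delay) spreads retries out so many clients hitting the same limit don't all retry in lockstep.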

Streaming

Streaming returns tokens as the model generates them, so users see output immediately instead of waiting for the full response:

Python — Streaming responses
from openai import OpenAI

client = OpenAI()

# Stream the response token by token
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
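
If you also need the complete text afterwards (for logging, caching, or saving to a database), accumulate the deltas while printing them. A small helper sketched around the same chunk shape the loop above iterates over:

Python — Collecting the streamed text

```python
def consume_stream(stream) -> str:
    """Print each delta as it arrives and return the full assembled text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the final one) carry no content
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)
```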

Function Calling (Tool Use)

Function calling lets the model request a call to a tool you define; your code runs the tool and returns the result for the model to use:

Python — Function calling with OpenAI
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
    tool_choice="auto",
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {args}")
    # e.g. {'location': 'London'}
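
The next step, not shown above, is to run the requested function yourself and send its result back as a "tool" message. A minimal sketch of the dispatch side, with get_weather as a hypothetical local implementation and a stubbed arguments string standing in for the live API response:

Python — Executing the tool call and returning the result

```python
import json

def get_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical local implementation; a real app would call a weather API."""
    return {"location": location, "temperature": 15, "unit": unit}

AVAILABLE_TOOLS = {"get_weather": get_weather}

def execute_tool_call(name: str, arguments: str) -> str:
    """Look up the tool by name, call it with parsed args, return a JSON result."""
    result = AVAILABLE_TOOLS[name](**json.loads(arguments))
    return json.dumps(result)

# Stubbed arguments string, as the API would return in tool_call.function.arguments
result = execute_tool_call("get_weather", '{"location": "London"}')

# Feed the result back and let the model compose the final answer:
# messages.append(response.choices[0].message)  # the assistant's tool-call turn
# messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
# final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
```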

Best Practices for API Usage

  • Cache responses: Cache identical or similar requests to reduce costs and latency.
  • Use the cheapest model that works: Start with smaller/cheaper models and only upgrade when needed.
  • Minimize tokens: Be concise in prompts. Remove unnecessary context.
  • Set max_tokens: Always set a reasonable max_tokens limit to prevent runaway costs.
  • Monitor usage: Track spending by feature, user, or team. Set budget alerts.
  • Implement fallbacks: If one provider is down or rate-limited, fall back to another.
  • Use batching: Many providers offer batch APIs at 50% discount for non-urgent requests.
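
The caching practice can be sketched as a thin wrapper: hash the exact request payload and reuse the stored answer on a repeat. A minimal in-memory version (a production setup would typically use Redis or similar, with a TTL):

Python — Caching identical requests

```python
import hashlib
import json

_cache: dict = {}

def cache_key(model: str, messages: list) -> str:
    """Deterministic key derived from the exact request payload."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_chat(client, model: str, messages: list) -> str:
    """Return a cached answer for an identical request, else call the API."""
    key = cache_key(model, messages)
    if key not in _cache:
        resp = client.chat.completions.create(model=model, messages=messages)
        _cache[key] = resp.choices[0].message.content
    return _cache[key]
```

Note this only catches byte-identical requests; caching "similar" requests needs semantic caching (embedding the prompt and matching by similarity), which is a separate technique.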