LLM APIs

Compare LLM API providers, understand pricing, and learn best practices for streaming, function calling, and production usage.

API Providers Comparison

| Provider | Top Models | Input $/1M tokens | Output $/1M tokens | Key Feature |
|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4 Turbo, o1 | $2.50-$15 | $10-$60 | Largest ecosystem, function calling |
| Anthropic | Claude 4, 3.5 Sonnet/Haiku | $0.25-$15 | $1.25-$75 | 200K context, strong safety, coding |
| Google | Gemini 2.0 Flash/Pro | $0.075-$7 | $0.30-$21 | 1M+ context, multimodal native |
| Mistral | Mistral Large, Medium | $2-$8 | $6-$24 | European provider, open models |
| Cohere | Command R+ | $2.50 | $10 | RAG-optimized, enterprise focus |
| Together AI | Open models (LLaMA, Mistral) | $0.20-$1.20 | $0.20-$1.20 | Cheapest open model hosting |
| Groq | LLaMA, Mistral, Gemma | $0.05-$0.64 | $0.08-$0.80 | Fastest inference (LPU hardware) |
| Fireworks | Open models + fine-tuned | $0.20-$0.90 | $0.20-$0.90 | Fast inference, custom models |

Prices are approximate and change frequently. Check provider websites for current pricing.
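
To estimate what a request costs, multiply each token count by the matching per-million rate and sum. A quick sketch (the rates in the example are the approximate GPT-4o figures from the table above):

Python — Estimating request cost

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request, given token counts and $/1M-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# 2,000 input + 500 output tokens at GPT-4o-style rates ($2.50 / $10 per 1M)
print(f"${request_cost(2000, 500, 2.50, 10.00):.4f}")  # $0.0100
```

Note that output tokens are typically 3-4x more expensive than input tokens, so verbose responses dominate the bill faster than long prompts.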

OpenAI-Compatible APIs

Many providers implement the OpenAI API format, making it easy to switch between them:

Python — Using OpenAI-compatible APIs
from openai import OpenAI

# OpenAI
client = OpenAI(api_key="sk-...")

# Anthropic: prefer the official anthropic SDK for the best experience,
# though an OpenAI-compatible endpoint exists too

# Together AI
client = OpenAI(
    api_key="your-together-key",
    base_url="https://api.together.xyz/v1",
)

# Groq
client = OpenAI(
    api_key="your-groq-key",
    base_url="https://api.groq.com/openai/v1",
)

# Local (Ollama)
client = OpenAI(
    api_key="ollama",  # Ollama ignores the key, but the SDK requires one
    base_url="http://localhost:11434/v1",
)

# The same call pattern works with any provider; only the model name
# changes (e.g. "llama-3.1-8b-instant" on Groq, "llama3.1" on Ollama)
response = client.chat.completions.create(
    model="llama3.1",  # the client above points at a local Ollama server
    messages=[{"role": "user", "content": "Hello!"}],
)

Rate Limits

Every provider has rate limits. Plan your architecture accordingly:

  • Requests per minute (RPM): How many API calls you can make per minute.
  • Tokens per minute (TPM): Total tokens (input + output) per minute.
  • Tokens per day (TPD): Daily token budget.

Handling rate limits: Implement exponential backoff with jitter. Use a token bucket or leaky bucket rate limiter. Batch requests where possible. Consider multiple API keys or providers for high-volume workloads.
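
The backoff step can be sketched as a small retry wrapper. A minimal version; in real code you would catch the SDK's specific rate-limit exception (e.g. openai.RateLimitError) rather than bare Exception:

Python — Exponential backoff with jitter

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call`, sleeping exponentially longer (with full jitter) between tries."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in real code, catch the SDK's RateLimitError
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter avoids thundering herds
```

Full jitter (sleeping a random time between 0 and the capped delay) spreads retries out so many clients hitting the same limit don't all retry in lockstep.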

Streaming

Streaming returns tokens as the model generates them, so users see output immediately instead of waiting for the full response:

Python — Streaming responses
from openai import OpenAI

client = OpenAI()

# Stream the response token by token
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
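
If you also need the complete text afterwards (for logging, caching, or saving to a database), accumulate the deltas while printing them. A small helper sketched around the same chunk shape the loop above iterates over:

Python — Collecting the streamed text

```python
def consume_stream(stream) -> str:
    """Print each delta as it arrives and return the full assembled text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the final one) carry no content
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)
```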

Function Calling (Tool Use)

Function calling lets the model request a call to a tool you define; your code runs the tool and returns the result for the model to use:

Python — Function calling with OpenAI
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
    tool_choice="auto",
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {args}")
    # e.g. {'location': 'London'}
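
The next step, not shown above, is to run the requested function yourself and send its result back as a "tool" message. A minimal sketch of the dispatch side, with get_weather as a hypothetical local implementation and a stubbed arguments string standing in for the live API response:

Python — Executing the tool call and returning the result

```python
import json

def get_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical local implementation; a real app would call a weather API."""
    return {"location": location, "temperature": 15, "unit": unit}

AVAILABLE_TOOLS = {"get_weather": get_weather}

def execute_tool_call(name: str, arguments: str) -> str:
    """Look up the tool by name, call it with parsed args, return a JSON result."""
    result = AVAILABLE_TOOLS[name](**json.loads(arguments))
    return json.dumps(result)

# Stubbed arguments string, as the API would return in tool_call.function.arguments
result = execute_tool_call("get_weather", '{"location": "London"}')

# Feed the result back and let the model compose the final answer:
# messages.append(response.choices[0].message)  # the assistant's tool-call turn
# messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
# final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
```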

Best Practices for API Usage

  • Cache responses: Cache identical or similar requests to reduce costs and latency.
  • Use the cheapest model that works: Start with smaller/cheaper models and only upgrade when needed.
  • Minimize tokens: Be concise in prompts. Remove unnecessary context.
  • Set max_tokens: Always set a reasonable max_tokens limit to prevent runaway costs.
  • Monitor usage: Track spending by feature, user, or team. Set budget alerts.
  • Implement fallbacks: If one provider is down or rate-limited, fall back to another.
  • Use batching: Many providers offer batch APIs at 50% discount for non-urgent requests.
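
The caching practice can be sketched as a thin wrapper: hash the exact request payload and reuse the stored answer on a repeat. A minimal in-memory version (a production setup would typically use Redis or similar, with a TTL):

Python — Caching identical requests

```python
import hashlib
import json

_cache: dict = {}

def cache_key(model: str, messages: list) -> str:
    """Deterministic key derived from the exact request payload."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_chat(client, model: str, messages: list) -> str:
    """Return a cached answer for an identical request, else call the API."""
    key = cache_key(model, messages)
    if key not in _cache:
        resp = client.chat.completions.create(model=model, messages=messages)
        _cache[key] = resp.choices[0].message.content
    return _cache[key]
```

Note this only catches byte-identical requests; caching "similar" requests needs semantic caching (embedding the prompt and matching by similarity), which is a separate technique.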