Intermediate
LLM APIs
Compare LLM API providers, understand pricing, and learn best practices for streaming, function calling, and production usage.
API Providers Comparison
| Provider | Top Models | Input $/1M tokens | Output $/1M tokens | Key Feature |
|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4 Turbo, o1 | $2.50-$15 | $10-$60 | Largest ecosystem, function calling |
| Anthropic | Claude 4, 3.5 Sonnet/Haiku | $0.25-$15 | $1.25-$75 | 200K context, strong safety, coding |
| Google | Gemini 2.0 Flash/Pro | $0.075-$7 | $0.30-$21 | 1M+ context, multimodal native |
| Mistral | Mistral Large, Medium | $2-$8 | $6-$24 | European provider, open models |
| Cohere | Command R+ | $2.50 | $10 | RAG-optimized, enterprise focus |
| Together AI | Open models (LLaMA, Mistral) | $0.20-$1.20 | $0.20-$1.20 | Cheapest open model hosting |
| Groq | LLaMA, Mistral, Gemma | $0.05-$0.64 | $0.08-$0.80 | Fastest inference (LPU hardware) |
| Fireworks | Open models + fine-tuned | $0.20-$0.90 | $0.20-$0.90 | Fast inference, custom models |
Prices are approximate and change frequently. Check provider websites for current pricing.
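To estimate what a request costs, multiply each token count by the per-million rate for that direction. A quick sketch (the `request_cost` helper is illustrative, and the rates used below are the approximate table figures, which will drift):

```python
def request_cost(input_tokens, output_tokens, in_per_m, out_per_m):
    """Dollar cost of one request, given per-1M-token input/output rates."""
    return (input_tokens / 1_000_000) * in_per_m + (output_tokens / 1_000_000) * out_per_m

# e.g. a model at $2.50 input / $10 output, with a 1,500-token prompt
# and a 500-token completion:
cost = request_cost(1500, 500, 2.50, 10.0)  # -> 0.00875, i.e. under a cent
```

Note that output tokens are typically several times more expensive than input tokens, so limiting response length matters as much as trimming prompts.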
OpenAI-Compatible APIs
Many providers implement the OpenAI API format, making it easy to switch between them:
Python — Using OpenAI-compatible APIs
from openai import OpenAI

# OpenAI
client = OpenAI(api_key="sk-...")

# Anthropic: use the official anthropic SDK for the best experience,
# though OpenAI-compatible access also works

# Together AI
client = OpenAI(
    api_key="your-together-key",
    base_url="https://api.together.xyz/v1",
)

# Groq
client = OpenAI(
    api_key="your-groq-key",
    base_url="https://api.groq.com/openai/v1",
)

# Local (Ollama)
client = OpenAI(
    api_key="ollama",  # Ollama ignores the key, but the client requires one
    base_url="http://localhost:11434/v1",
)

# The same code works with any provider; just use that provider's model name
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello!"}],
)
Rate Limits
Every provider has rate limits. Plan your architecture accordingly:
- Requests per minute (RPM): How many API calls you can make per minute.
- Tokens per minute (TPM): Total tokens (input + output) per minute.
- Tokens per day (TPD): Daily token budget.
Handling rate limits:
- Implement exponential backoff with jitter.
- Use a token bucket or leaky bucket rate limiter.
- Batch requests where possible.
- Consider multiple API keys or providers for high-volume workloads.
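The backoff-with-jitter idea can be sketched as a small retry wrapper. The `with_backoff` helper below is illustrative, not part of any SDK; with the OpenAI Python SDK you would typically pass `retryable=(openai.RateLimitError,)`:

```python
import random
import time

def with_backoff(fn, retryable=(Exception,), max_retries=5,
                 base_delay=1.0, max_delay=60.0):
    """Call fn(); on a retryable error, wait with exponential backoff plus full jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Cap grows 1s, 2s, 4s, ... up to max_delay
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount in [0, cap) to spread out retry storms
            time.sleep(random.uniform(0, cap))
```

Full jitter (a random delay up to the exponential cap) avoids the "thundering herd" effect where many clients retry at the same instant after a shared outage.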
Streaming
Python — Streaming responses
from openai import OpenAI

client = OpenAI()

# Stream the response token by token
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
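In an application you usually want both the incremental pieces (for display) and the full text at the end. One way is to wrap the chunk loop in a generator; the `iter_text` helper below is our own sketch, not an SDK function:

```python
def iter_text(stream):
    """Yield only the text deltas from a chat-completions stream.

    Skips chunks whose delta has no content (e.g. role or finish markers).
    """
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

Callers can then print pieces as they arrive and still join them afterward, e.g. `full_text = "".join(iter_text(stream))`.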
Function Calling (Tool Use)
Python — Function calling with OpenAI
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
    tool_choice="auto",
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {args}")
    # e.g. {"location": "London", "unit": "celsius"}
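The call above only tells you which function the model wants to run; the model never executes anything itself. You run the function locally, then send the result back in a `tool` role message so the model can compose its final answer. A minimal sketch in dict (wire) form; the `tool_result_messages` helper is our own, not part of the SDK:

```python
import json

def tool_result_messages(assistant_message, tool_call_id, result):
    """Build the two messages that carry a tool result back to the model:
    the assistant turn that requested the call, then a `tool` turn with the output."""
    return [
        assistant_message,  # must include the original tool_calls
        {
            "role": "tool",
            "tool_call_id": tool_call_id,  # ties the result to the request
            "content": json.dumps(result),
        },
    ]
```

In practice you append these two entries to the original `messages` list and call `client.chat.completions.create` again with the same `tools`; the model then answers in natural language using the tool output.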
Best Practices for API Usage
- Cache responses: Cache identical or similar requests to reduce costs and latency.
- Use the cheapest model that works: Start with smaller/cheaper models and only upgrade when needed.
- Minimize tokens: Be concise in prompts. Remove unnecessary context.
- Set max_tokens: Always set a reasonable max_tokens limit to prevent runaway costs.
- Monitor usage: Track spending by feature, user, or team. Set budget alerts.
- Implement fallbacks: If one provider is down or rate-limited, fall back to another.
- Use batching: Many providers offer batch APIs at a 50% discount for non-urgent requests.
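For the caching point, a minimal in-memory sketch keyed on a hash of the request payload. The `cached_completion` helper is illustrative; a production cache would usually add a TTL, include sampling parameters in the key, and use persistent storage such as Redis:

```python
import hashlib
import json

_cache = {}

def _cache_key(model, messages):
    """Stable hash of the request payload, so identical requests collide.

    Keyed on model + messages only; include temperature etc. if they vary.
    """
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(client, model, messages, **kwargs):
    """Return the cached response for an identical request; otherwise call the API once."""
    key = _cache_key(model, messages)
    if key not in _cache:
        _cache[key] = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
    return _cache[key]
```

Note this only helps for exact repeats; semantic caching (matching similar prompts via embeddings) is a separate, more involved technique.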