Token Counting
Learn to count tokens accurately for different content types, understand token-to-word ratios across languages, and count tokens programmatically in Python and JavaScript.
Token Counting for Different Content Types
Different types of content produce very different token counts for the same amount of text. Here is how common content types compare:
English Text
Standard English prose is the most efficiently tokenized content. Common words map to single tokens, and the ratio is roughly 0.75 words per token, or about 1.33 tokens per word.
```text
# Input (13 words): "The quick brown fox jumps over the lazy dog near the river bank."
# Token count (o200k_base): ~14 tokens
# Ratio: ~1.08 tokens per word
```
Code
Code is typically less efficient than English prose because of special characters, indentation, and variable names. Code generally uses 1.5–2.5 tokens per "word" (counting operators and symbols).
```python
# Python code (~25 tokens with o200k_base):
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)
```
JSON
JSON is one of the least token-efficient formats because of repeated structural characters like braces, colons, quotes, and commas. A JSON object uses significantly more tokens than the same data in plain text.
```text
# This JSON (~35 tokens with o200k_base):
{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
  "active": true
}

# Same data in plain text (~12 tokens):
Name: John Doe, Age: 30, Email: john@example.com, Active: yes
```
Markdown
Markdown adds some token overhead for formatting syntax (headers, bold, lists, links), but is generally more efficient than HTML or JSON for structured content.
Token-to-Word Ratios
Understanding token-to-word ratios helps you quickly estimate token counts without running the tokenizer:
| Content Type | Tokens per Word (or Character, where noted) | Words per 1K Tokens |
|---|---|---|
| English prose | 1.0 – 1.3 | ~750 |
| Technical English | 1.2 – 1.5 | ~700 |
| Code (Python/JS) | 1.5 – 2.5 | ~500 |
| JSON | 2.0 – 3.0 | ~400 |
| Chinese / Japanese | 2.0 – 3.0 per character | ~350 |
| Korean | 2.0 – 4.0 per character | ~300 |
| Arabic / Hindi | 2.0 – 3.5 per word | ~350 |
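The ratios above can be turned into a quick back-of-the-envelope estimator. A minimal sketch, using the midpoints of the ranges in the table (the multipliers here are rough assumptions, not measured values, so treat results as ballpark figures only):

```python
# Approximate midpoints of the tokens-per-word ranges in the table above
TOKENS_PER_WORD = {
    "english_prose": 1.15,
    "technical_english": 1.35,
    "code": 2.0,
    "json": 2.5,
}

def estimate_tokens(word_count: int, content_type: str = "english_prose") -> int:
    """Estimate a token count from a word count and a content type."""
    return round(word_count * TOKENS_PER_WORD[content_type])

print(estimate_tokens(1000, "english_prose"))  # 1150
print(estimate_tokens(400, "json"))            # 1000
```

For real budgeting (billing, context-window limits), always verify with the actual tokenizer rather than an estimate.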
Multi-Language Tokenization Differences
Tokenizers are heavily optimized for English. Non-English text often requires significantly more tokens to represent the same amount of information:
```text
# "Hello, how are you?" in different languages (o200k_base)
English:  "Hello, how are you?"           ~6 tokens
Spanish:  "Hola, como estas?"             ~7 tokens
French:   "Bonjour, comment allez-vous?"  ~9 tokens
Chinese:  "你好,你怎么样?"                  ~9 tokens
Japanese: "こんにちは、お元気ですか?"         ~12 tokens
Korean:   "안녕하세요, 어떻게 지내세요?"       ~15 tokens
Arabic:   "مرحبا، كيف حالك؟"              ~13 tokens
```
Special Tokens
AI models use special tokens that are not visible in your text but count toward the token total:
- `<|im_start|>` and `<|im_end|>`: Mark the beginning and end of messages in the chat format. Each message adds these overhead tokens.
- Role tokens: The role labels ("system", "user", "assistant") consume tokens.
- BOS/EOS tokens: Beginning-of-sequence and end-of-sequence markers.
System Prompt Token Overhead
System prompts are included in every API call and consume tokens from your context window. A typical system prompt can use 100–500 tokens:
```text
# Short system prompt (~20 tokens):
"You are a helpful coding assistant."

# Medium system prompt (~150 tokens):
"You are an expert Python developer. Always provide well-documented code
with type hints. Follow PEP 8 guidelines. Include error handling and edge
cases. Format responses with markdown code blocks."

# Long system prompt (~500+ tokens):
# Detailed instructions, examples, formatting rules,
# persona descriptions, constraint lists, etc.

# Remember: system prompt tokens are charged on EVERY request!
```
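Because the system prompt rides along on every request, its cost scales linearly with traffic. A back-of-the-envelope sketch (the $2.50 per million input tokens is a hypothetical rate; substitute your model's actual pricing):

```python
def system_prompt_cost(prompt_tokens: int, requests: int,
                       usd_per_million_input: float = 2.50) -> float:
    """Total cost of resending the same system prompt on every request."""
    total_tokens = prompt_tokens * requests
    return total_tokens * usd_per_million_input / 1_000_000

# A 500-token system prompt over 1,000,000 requests is 500M input tokens
print(f"${system_prompt_cost(500, 1_000_000):,.2f}")  # $1,250.00
```

Trimming 100 tokens from a high-traffic system prompt can therefore save real money, which is one reason prompt caching and prompt compression get so much attention.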
Counting Tokens Programmatically
While Tiktokenizer is great for quick checks, you often need to count tokens in your code. Here are the main libraries:
Python: tiktoken
```python
# Install: pip install tiktoken
import tiktoken

# Get the tokenizer for GPT-4o
enc = tiktoken.encoding_for_model("gpt-4o")

# Count tokens
text = "Hello, how are you doing today?"
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")  # Token count: 8
print(f"Tokens: {tokens}")            # [13225, 11, 1495, 553, ...]

# Decode individual tokens to see what they represent
for token in tokens:
    print(f"  {token} -> '{enc.decode([token])}'")

# Count tokens for a chat conversation (with per-message overhead)
def count_message_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # per-message overhead tokens
        for key, value in msg.items():
            total += len(enc.encode(value))
    total += 2  # reply priming tokens (added once per request)
    return total
```
JavaScript: js-tiktoken
```javascript
// Install: npm install js-tiktoken
import { encodingForModel } from "js-tiktoken";

// Get the tokenizer for GPT-4o
const enc = encodingForModel("gpt-4o");

// Count tokens
const text = "Hello, how are you doing today?";
const tokens = enc.encode(text);
console.log(`Token count: ${tokens.length}`);

// Decode to see individual tokens. In js-tiktoken, decode() returns a
// string directly, so no TextDecoder is needed. (The pure-JS library also
// has no free() method; that only applies to the WASM "tiktoken" package.)
for (const token of tokens) {
  console.log(`  ${token} -> '${enc.decode([token])}'`);
}
```