Intermediate

Token Counting

Learn to count tokens accurately for different content types, understand token-to-word ratios across languages, and count tokens programmatically in Python and JavaScript.

Token Counting for Different Content Types

Different types of content produce very different token counts for the same amount of text. Here is how common content types compare:

English Text

Standard English prose is the most efficiently tokenized content. Common words map to single tokens, and the ratio is roughly 1 token per 0.75 words (or about 1.33 tokens per word).

English Text Example
# Input (13 words):
"The quick brown fox jumps over the lazy dog near the river bank."

# Token count (o200k_base): ~14 tokens
# Ratio: ~1.08 tokens per word
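The word-count heuristic above can be sketched as a tiny helper. This is a rough planning estimate only (the function name and the 1.33 multiplier are illustrative assumptions); the real tokenizer is the source of truth:

```python
# Rough word-count heuristic (~1.33 tokens per word for English prose).
# Quick planning estimate only -- run the real tokenizer for exact counts.
def estimate_tokens(text: str, tokens_per_word: float = 1.33) -> int:
    return round(len(text.split()) * tokens_per_word)

sentence = "The quick brown fox jumps over the lazy dog near the river bank."
print(estimate_tokens(sentence))  # 13 words -> estimate of 17 (actual: ~14)
```

Note the heuristic overshoots for short, simple prose like this sentence, which is why measured counts beat estimates when precision matters.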

Code

Code is typically less efficient than English prose because of special characters, indentation, and variable names. Code generally uses 1.5–2.5 tokens per "word" (counting operators and symbols).

Code Example
# Python code (~25 tokens with o200k_base):
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)

JSON

JSON is one of the least token-efficient formats because of repeated structural characters like braces, colons, quotes, and commas. A JSON object uses significantly more tokens than the same data in plain text.

JSON Example
# This JSON (~35 tokens with o200k_base):
{
  "name": "John Doe",
  "age": 30,
  "email": "john@example.com",
  "active": true
}

# Same data in plain text (~12 tokens):
Name: John Doe, Age: 30, Email: john@example.com, Active: yes
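One easy saving when you must send JSON: minify it. A small sketch using only the standard library, with character counts standing in for exact token counts (fewer structural characters generally means fewer tokens):

```python
import json

data = {"name": "John Doe", "age": 30, "email": "john@example.com", "active": True}

pretty = json.dumps(data, indent=2)                 # the indented form above
compact = json.dumps(data, separators=(",", ":"))   # minified: no spaces or newlines

# The minified form drops all indentation and separator whitespace.
print(len(pretty), len(compact))
```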

Markdown

Markdown adds some token overhead for formatting syntax (headers, bold, lists, links), but is generally more efficient than HTML or JSON for structured content.

Token-to-Word Ratios

Understanding token-to-word ratios helps you quickly estimate token counts without running the tokenizer:

Content Type          Tokens per Word (approx.)    Words per 1K Tokens
English prose         1.0 – 1.3                    ~750
Technical English     1.2 – 1.5                    ~700
Code (Python/JS)      1.5 – 2.5                    ~500
JSON                  2.0 – 3.0                    ~400
Chinese / Japanese    2.0 – 3.0 (per character)    ~350
Korean                2.0 – 4.0 (per character)    ~300
Arabic / Hindi        2.0 – 3.5                    ~350
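The table's ratios translate directly into a quick budget estimator. The midpoint values below are illustrative picks from the ranges above, not tokenizer output:

```python
# Midpoints of the ratio table above -- rough planning numbers only.
TOKENS_PER_WORD = {
    "english prose": 1.15,
    "technical english": 1.35,
    "code": 2.0,
    "json": 2.5,
}

def estimate_by_type(word_count: int, content_type: str = "english prose") -> int:
    return round(word_count * TOKENS_PER_WORD[content_type])

print(estimate_by_type(1000, "code"))  # ~2000 tokens for 1000 "words" of code
```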

Multi-Language Tokenization Differences

Tokenizers are heavily optimized for English. Non-English text often requires significantly more tokens to represent the same amount of information:

Multi-Language Token Comparison
# "Hello, how are you?" in different languages (o200k_base)

English:  "Hello, how are you?"     ~6 tokens
Spanish:  "Hola, ¿cómo estás?"        ~8 tokens
French:   "Bonjour, comment allez-vous?"  ~9 tokens
Chinese:  "你好,你怎么样?"           ~9 tokens
Japanese: "こんにちは、お元気ですか?"   ~12 tokens
Korean:   "안녕하세요, 어떻게 지내세요?"  ~15 tokens
Arabic:   "مرحبا، كيف حالك؟"         ~13 tokens
💡 Cost implication: If your application primarily handles non-English text, your API costs may be 2–3x higher than for the same amount of English content. Factor this into your budget planning.
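That multiplier compounds quickly at scale. A sketch of the arithmetic, assuming a hypothetical price of $2.50 per million input tokens (check your provider's current rate card):

```python
# Hypothetical pricing -- substitute your provider's actual rates.
PRICE_PER_M_TOKENS = 2.50

def monthly_cost(tokens_per_request: int, requests_per_month: int) -> float:
    return tokens_per_request * requests_per_month * PRICE_PER_M_TOKENS / 1_000_000

english = monthly_cost(500, 100_000)    # English content
korean = monthly_cost(1_250, 100_000)   # same content at ~2.5x the tokens
print(f"English: ${english:.2f}/mo, Korean: ${korean:.2f}/mo")
```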

Special Tokens

AI models use special tokens that are not visible in your text but count toward the token total:

  • <|im_start|> and <|im_end|>: Mark the beginning and end of messages in the chat format. Each message adds these overhead tokens.
  • Role tokens: The role labels ("system", "user", "assistant") consume tokens.
  • BOS/EOS tokens: Beginning-of-sequence and end-of-sequence markers.

System Prompt Token Overhead

System prompts are included in every API call and consume tokens from your context window. A typical system prompt can use 100–500 tokens:

System Prompt Token Cost
# Short system prompt (~8 tokens):
"You are a helpful coding assistant."

# Medium system prompt (~40 tokens):
"You are an expert Python developer. Always provide
well-documented code with type hints. Follow PEP 8
guidelines. Include error handling and edge cases.
Format responses with markdown code blocks."

# Long system prompt (~500+ tokens):
# Detailed instructions, examples, formatting rules,
# persona descriptions, constraint lists, etc.

# Remember: system prompt tokens are charged on EVERY request!
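Because the system prompt rides along on every request, its cost is worth computing explicitly. A sketch with illustrative numbers (a 150-token prompt, 10,000 requests/day, and a hypothetical $2.50 per million input tokens):

```python
# Illustrative numbers -- substitute your own prompt size, traffic, and rates.
SYSTEM_PROMPT_TOKENS = 150
REQUESTS_PER_DAY = 10_000
PRICE_PER_M_TOKENS = 2.50

daily_overhead = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_DAY        # 1,500,000 tokens/day
monthly_usd = daily_overhead * 30 * PRICE_PER_M_TOKENS / 1_000_000
print(f"System prompt overhead: ~${monthly_usd:.2f}/month")
```

Trimming even 50 tokens from a high-traffic system prompt produces a proportional, permanent saving.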

Counting Tokens Programmatically

While Tiktokenizer is great for quick checks, you often need to count tokens in your code. Here are the main libraries:

Python: tiktoken

Python (tiktoken)
# Install: pip install tiktoken
import tiktoken

# Get the tokenizer for GPT-4o
enc = tiktoken.encoding_for_model("gpt-4o")

# Count tokens
text = "Hello, how are you doing today?"
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")   # Token count: 8
print(f"Tokens: {tokens}")              # [13225, 11, 1495, 553, ...]

# Decode individual tokens to see what they represent
for token in tokens:
    print(f"  {token} -> '{enc.decode([token])}'")

# Count tokens for a chat message (including per-message overhead)
def count_message_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # approximate per-message framing tokens (varies by model)
        for key, value in msg.items():
            total += len(enc.encode(value))
    total += 2  # tokens that prime the assistant's reply
    return total
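If tiktoken is not installed, the overhead arithmetic in count_message_tokens can still be sanity-checked with a stub encoder. The ~1-token-per-4-characters rule in the stub is a rough assumption, not real tokenizer behavior:

```python
# Stub "encoder" standing in for tiktoken: ~1 token per 4 characters.
class StubEncoder:
    def encode(self, text):
        return list(range(max(1, len(text) // 4)))

def count_message_tokens(messages, enc):
    total = 0
    for msg in messages:
        total += 4                      # approximate per-message framing tokens
        for value in msg.values():
            total += len(enc.encode(value))
    total += 2                          # tokens priming the assistant's reply
    return total

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "What is a token?"},
]
print(count_message_tokens(messages, StubEncoder()))  # 24 with this stub
```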

JavaScript: js-tiktoken

JavaScript (js-tiktoken)
// Install: npm install js-tiktoken
import { encodingForModel } from "js-tiktoken";

// Get the tokenizer for GPT-4o
const enc = encodingForModel("gpt-4o");

// Count tokens
const text = "Hello, how are you doing today?";
const tokens = enc.encode(text);
console.log(`Token count: ${tokens.length}`);

// Decode to see individual tokens (decode returns a string here)
for (const token of tokens) {
  console.log(`  ${token} -> '${enc.decode([token])}'`);
}

// Note: js-tiktoken is pure JavaScript, so there is no enc.free() to
// call; that cleanup step belongs to the WASM-based "tiktoken" package.

Best practice: Always count tokens using the same tokenizer that your target model uses. Use Tiktokenizer for quick visual checks and the programmatic libraries (tiktoken/js-tiktoken) for accurate counts in your applications.