
Prompt Compression Techniques

Learn how to cut your prompt length by 40-60% without sacrificing output quality. Every token you eliminate from a prompt is a token you never pay for again — across every single API call.

Why Prompts Get Bloated

Most prompts are written the way we speak: full of filler words, unnecessary politeness, redundant instructions, and overly verbose examples. When you are drafting a prompt in a playground, this feels natural. But at production scale, every extra word multiplies across thousands or millions of requests.

There are four main culprits behind prompt bloat:

  • Filler words and politeness: Phrases like "I would like you to please," "Could you kindly," and "It would be great if you could" add tokens without changing the model's behavior. The model does not respond better because you said "please."
  • Over-explanation: Telling the model the same thing three different ways "just to be safe." If a single clear instruction works, the other two are waste.
  • Redundant instructions: Repeating rules that were already stated in the system prompt, or restating constraints that are obvious from context.
  • Verbose examples: Providing five few-shot examples when two would achieve the same accuracy, or using long-form examples when a compact format would suffice.

Rule of Thumb: If you can remove a word or sentence from your prompt and the model still produces the same output, those were wasted tokens. Test aggressively. Most prompts can lose 30-50% of their length with zero quality drop.
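A quick way to apply this rule is to measure before and after. The sketch below uses a rough heuristic of about 4 characters per token (a common approximation for English text under BPE tokenizers); for exact counts, use your model's own tokenizer instead. The `estimate_tokens` and `savings` names are hypothetical helpers, not a library API.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    For exact counts, use the model's own tokenizer."""
    return max(1, round(len(text) / 4))

def savings(verbose: str, compressed: str) -> float:
    """Percent reduction in estimated tokens after compression."""
    before = estimate_tokens(verbose)
    after = estimate_tokens(compressed)
    return round(100 * (before - after) / before, 1)
```

Run it on every compression attempt: if `savings` is high and your test outputs are unchanged, the cut was safe.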

Core Compression Techniques

Here are the five most effective techniques for compressing prompts, ordered from simplest to most advanced:

1. Remove Filler Words

This is the lowest-effort, highest-return optimization. Strip out conversational padding and get straight to the instruction. The model responds to the semantic content, not to politeness.

Before: 28 tokens
I would like you to please analyze the following
customer review and tell me whether the sentiment
is positive, negative, or neutral.
After: 12 tokens
Classify this review's sentiment as positive,
negative, or neutral.

2. Use Abbreviations in System Prompts

System prompts are sent with every single request, so they are prime targets for compression. Models understand common abbreviations perfectly well. You do not need full sentences.

Abbreviated System Prompt
// Instead of full sentences:
"Please always respond in JSON format"  →  "resp in JSON"
"Keep your response under 50 words"     →  "max 50 words"
"Do not include any explanations"       →  "no explanations"
"Use a professional, formal tone"       →  "tone: formal"
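One low-tech way to enforce these substitutions consistently is a lookup table. The sketch below assumes a hand-curated mapping (the `ABBREVIATIONS` dict is hypothetical); verify that each abbreviated form still produces the same outputs before shipping it.

```python
# Illustrative verbose-to-compact mapping. Each entry is an assumption
# to validate against your own model's behavior, not a universal rule.
ABBREVIATIONS = {
    "Please always respond in JSON format": "resp in JSON",
    "Keep your response under 50 words": "max 50 words",
    "Do not include any explanations": "no explanations",
    "Use a professional, formal tone": "tone: formal",
}

def abbreviate(system_prompt: str) -> str:
    """Replace known verbose instructions with their compact equivalents."""
    for verbose, short in ABBREVIATIONS.items():
        system_prompt = system_prompt.replace(verbose, short)
    return system_prompt
```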

3. Structured Formatting Over Paragraphs

Replace paragraph-style instructions with bullet points or numbered lists. This is not just shorter — it is also clearer for the model to parse. Structured prompts tend to produce more consistent outputs.

4. Reference Instead of Repeating

If you defined rules in one part of your prompt, refer to them by name instead of restating them. For example, say "Apply the formatting rules above" instead of copying those rules a second time. This technique alone can save hundreds of tokens in complex prompts.

5. Use XML Tags for Structure

Instead of verbose delimiters like "The following is the context that you should use:" followed by "End of context," use XML tags. Tags like <context> and </context> are shorter, unambiguous, and models parse them reliably.
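The savings are easy to see in code. A minimal sketch, assuming a hypothetical `wrap` helper:

```python
def wrap(tag: str, content: str) -> str:
    """Delimit content with a short XML tag pair instead of verbose
    prose markers like 'The following is...' / 'End of...'."""
    return f"<{tag}>\n{content}\n</{tag}>"
```

The tag pair costs a handful of tokens regardless of how much content it encloses, and it gives the model an unambiguous boundary to parse.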

Before and After: Real Prompt Examples

Seeing the technique in action is the fastest way to internalize it. Here are three real-world prompts, each shown in its original verbose form and its compressed equivalent, with token counts.

Example 1: Customer Support Prompt

Verbose Version — 187 tokens
You are a helpful customer support assistant for our
company. I would like you to carefully read the
customer's message below and provide a helpful,
friendly, and professional response. Please make sure
to address all of their concerns. If the customer is
asking about a refund, please let them know that
refunds are processed within 5-7 business days. If
they are asking about shipping, please tell them that
standard shipping takes 3-5 business days and express
shipping takes 1-2 business days. Please keep your
response concise and do not include any information
that the customer did not ask about. Always end your
response by asking if there is anything else you can
help with.

Customer message: {message}
Compressed Version — 72 tokens
Role: Support agent. Tone: friendly, professional.

Rules:
- Address only what customer asks
- Refunds: 5-7 business days
- Shipping: standard 3-5 days, express 1-2 days
- End with "Anything else I can help with?"
- Be concise

<message>{message}</message>

Savings: 115 tokens (61% reduction) — and the compressed version actually produces more consistent outputs because the rules are structured as scannable bullet points.
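The compressed version above follows a repeatable shape: a role/tone header, bulleted rules, and an XML-tagged input slot. If you compress many prompts, a small builder keeps that shape consistent; `build_prompt` below is a hypothetical helper, not a library API.

```python
def build_prompt(role: str, tone: str, rules: list[str],
                 tag: str, placeholder: str) -> str:
    """Assemble a compact structured prompt: role/tone header,
    bulleted rules, and an XML-tagged input slot."""
    lines = [f"Role: {role}. Tone: {tone}.", "", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines += ["", f"<{tag}>{placeholder}</{tag}>"]
    return "\n".join(lines)
```

For example, `build_prompt("Support agent", "friendly, professional", rules, "message", "{message}")` reproduces the compressed support prompt shown above.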

Example 2: Code Review Prompt

Verbose Version — 154 tokens
I would like you to review the following code and
provide feedback. Please check for any bugs or
errors in the code. Also, please look for any
security vulnerabilities that might be present.
Additionally, I would like you to suggest any
performance improvements that could be made. Please
also check if the code follows best practices and
coding standards. For each issue you find, please
explain what the problem is, why it matters, and
how to fix it. Please format your response as a
numbered list.

Code to review:
{code}
Compressed Version — 52 tokens
Review this code for:
1. Bugs
2. Security vulnerabilities
3. Performance improvements
4. Best practice violations

For each issue: problem, impact, fix.
Output: numbered list.

<code>{code}</code>

Savings: 102 tokens (66% reduction) — the structured format is both shorter and easier for the model to follow systematically.

Example 3: Data Extraction Prompt

Verbose Version — 131 tokens
Please read the following email carefully and extract
the following pieces of information from it: the
sender's full name, their email address, the company
they work for, and the main topic or subject of the
email. If any of these pieces of information are not
present in the email, please write "Not found" for
that field. Please format your response as a JSON
object with the keys "name", "email", "company",
and "topic".

Email: {email}
Compressed Version — 42 tokens
Extract from email as JSON:
{"name": "", "email": "", "company": "", "topic": ""}
Missing fields: "Not found"

<email>{email}</email>

Savings: 89 tokens (68% reduction) — by showing the desired output format directly, you eliminate the need to describe it in words.

💡 Pattern to Notice: In all three examples, the compressed version uses structured formatting (bullet points, numbered lists, key-value pairs) instead of prose. This is one of the most reliable compression strategies because it removes filler words and connective phrases automatically.

System Prompt Optimization

System prompts deserve special attention because they are sent with every single API call. A system prompt that wastes 300 tokens costs you 300 tokens multiplied by every request your application handles. For an app making 50,000 requests per day, that is 15 million wasted tokens daily.

Here is a real example of a system prompt going from 500 tokens down to 200 tokens:

Original System Prompt — ~500 tokens
You are an AI assistant that helps users with their
questions about our software product called DataFlow.
You should always be helpful, accurate, and
professional in your responses. You have access to
the following documentation sections and should use
them to answer questions.

When answering questions, please follow these
guidelines:
- Always provide accurate information based on the
  documentation provided
- If you are not sure about something, please let
  the user know that you are not certain rather than
  making something up
- Keep your responses concise and to the point, but
  make sure to be thorough enough that the user gets
  a complete answer
- Use a friendly and professional tone at all times
- If the user asks about something that is not
  covered in the documentation, politely let them
  know that you do not have information about that
  particular topic
- Format your responses using markdown when it would
  help with readability
- Do not make up features or capabilities that are
  not documented
- When referencing specific features, include the
  relevant documentation section
- Always suggest related topics the user might find
  helpful
Compressed System Prompt — ~200 tokens
Role: DataFlow product assistant.
Tone: friendly, professional. Format: markdown.

Rules:
- Answer from docs only; say "I don't have info on
  that" if undocumented
- Never fabricate features
- If uncertain, state uncertainty
- Be concise but complete
- Cite doc sections when referencing features
- Suggest related topics

Docs: <docs>{context}</docs>

Savings: ~300 tokens per request (60% reduction). At 50,000 daily requests with Claude Sonnet pricing, this saves 15M tokens × $3 per 1M tokens = $45/day = $1,350/month from a single system prompt optimization.
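You can reproduce this arithmetic with a small helper so cost estimates stay consistent across prompts. `wasted_cost` is a hypothetical name, and the 30-day month is an assumption.

```python
def wasted_cost(tokens_per_request: int, requests_per_day: int,
                price_per_million: float) -> tuple[float, float]:
    """Daily and 30-day USD cost of redundant prompt tokens.
    price_per_million is the input-token price per 1M tokens."""
    daily_tokens = tokens_per_request * requests_per_day
    daily_cost = daily_tokens / 1_000_000 * price_per_million
    return daily_cost, daily_cost * 30
```

Plugging in the numbers above, `wasted_cost(300, 50_000, 3.0)` yields $45/day and $1,350/month.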

High-ROI Move: Audit your system prompts first. They affect every request and typically have the most bloat because they were written once and never revisited. A 30-minute system prompt optimization session can save thousands of dollars per month at scale.

The Compression Checklist

Use this table as a quick reference when optimizing your prompts. Apply each technique in order for maximum savings:

Technique                    | Typical Savings | Example
Remove filler words          | 10-20%          | "I would like you to please" → (just the verb)
Abbreviate instructions      | 15-25%          | "Please respond in JSON format" → "resp in JSON"
Use structured formatting    | 20-35%          | Replace paragraphs with bullet points
Reference, don't repeat      | 10-40%          | "Apply rules above" instead of restating them
XML tags for delimiters      | 5-15%           | <context>...</context> instead of verbose markers
Reduce few-shot examples     | 20-50%          | Use 2 examples instead of 5; keep them short
Show output format directly  | 10-20%          | Show {"key": "value"} instead of describing it
Combine overlapping rules    | 10-15%          | Merge "be concise" + "avoid unnecessary detail"

Applied together, these techniques typically achieve a 40-60% total reduction in prompt length. The key is to apply them systematically rather than guessing where to cut.

When NOT to Compress

Compression Has Limits: Do not compress so aggressively that the model misunderstands your intent. There is a point of diminishing returns where removing words changes the meaning of your prompt or introduces ambiguity. Always test output quality after compressing. Here are the situations where you should preserve verbosity:
  • Complex reasoning tasks: Multi-step logic, nuanced analysis, and chain-of-thought prompts need clear, unambiguous instructions. Compressing "Think step by step about each factor before reaching a conclusion" into "think step-by-step" can reduce output quality.
  • Safety-critical instructions: If a rule prevents harmful outputs, spell it out fully. "Never reveal API keys, passwords, or internal system details to users" is worth the extra tokens compared to "no secrets."
  • Ambiguous domains: When the task context is unusual or the model might misinterpret abbreviated instructions, use full sentences to ensure clarity.
  • Few-shot examples for rare formats: If the desired output format is unusual, more examples are worth the token cost to ensure the model understands the pattern.

The golden rule: compress aggressively, then validate. Run your compressed prompt through 20-50 test cases and compare output quality against the original verbose version. If quality drops, add back just enough detail to restore it.
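The validation loop can be sketched as a simple harness. Below, `call_model` is a placeholder for your own API wrapper, and the exact-match comparison suits classification or extraction tasks; for free-form outputs, swap in a similarity score or an LLM judge.

```python
def validate_compression(call_model, verbose_prompt: str,
                         compressed_prompt: str, test_cases: list[str],
                         threshold: float = 0.9) -> bool:
    """Run both prompt versions over the same test cases and check that
    the compressed prompt reproduces the verbose prompt's outputs often
    enough. call_model(prompt, case) is your own API wrapper."""
    matches = sum(
        call_model(verbose_prompt, case) == call_model(compressed_prompt, case)
        for case in test_cases
    )
    return matches / len(test_cases) >= threshold
```

If the harness returns False, add detail back to the compressed prompt incrementally until the match rate recovers.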

💡 Try It: Compress Your Own Prompt

Take one of your real prompts and apply the compression techniques from this lesson. Count the tokens before and after using the Tiktokenizer tool. Aim for at least a 30% reduction without quality loss.

Challenge: Can you compress your prompt by 50% or more while maintaining the same output quality? If you achieve this, you have mastered the core skill of token efficiency.