# Prompt Patterns in BMAD
A catalog of proven prompt engineering patterns with examples, plus guidance on prompt versioning, A/B testing, prompt libraries, and anti-patterns to avoid.
## Pattern Catalog
### Chain of Thought (CoT)

Instructs the model to reason step by step before giving its final answer. This can substantially improve accuracy on complex reasoning tasks.
````
System: You are a code review assistant. When analyzing code, think
through each issue step by step before providing your assessment.

User: Review this function for bugs:

```python
def calculate_discount(price, discount_pct):
    return price * discount_pct / 100
```

Assistant: Let me analyze this step by step:

Step 1: Check the function signature
- Takes price and discount_pct as parameters
- No type hints (minor issue)

Step 2: Check the calculation logic
- price * discount_pct / 100 calculates the discount AMOUNT,
  not the discounted PRICE
- If price=100, discount_pct=20, returns 20
- But the user likely wants 80 (the final price)

Step 3: Check edge cases
- No validation for negative values
- No validation for discount_pct > 100

Bug found: Function returns the discount amount instead of the
discounted price. Should be: return price * (1 - discount_pct / 100)
````
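In application code, the pattern usually amounts to appending a reasoning instruction to the system message. A minimal sketch, assuming OpenAI-style `role`/`content` message dicts (the helper name and exact instruction wording are illustrative, not from BMAD):

```python
COT_INSTRUCTION = (
    "Think through the problem step by step "
    "before giving your final answer."
)

def with_cot(messages):
    """Return a copy of a chat message list with a chain-of-thought
    instruction appended to the system message (added if missing)."""
    out = [dict(m) for m in messages]  # shallow-copy each message
    for m in out:
        if m["role"] == "system":
            m["content"] = m["content"].rstrip() + "\n\n" + COT_INSTRUCTION
            return out
    # No system message present: prepend one
    return [{"role": "system", "content": COT_INSTRUCTION}] + out

msgs = with_cot([
    {"role": "system", "content": "You are a code review assistant."},
    {"role": "user", "content": "Review this function for bugs: ..."},
])
```

Keeping the instruction in one constant makes it easy to A/B test different CoT phrasings later.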
### Tree of Thought (ToT)
Explores multiple reasoning paths and evaluates them before selecting the best approach. Useful for complex decisions with trade-offs.
```
System: When solving architecture decisions, explore at least 3 different
approaches. For each approach, list pros, cons, and a score from 1-10.
Then recommend the best option with your reasoning.

User: We need to implement real-time notifications. Our stack is Node.js
with PostgreSQL. We have ~10K concurrent users.

Assistant:
Approach 1: WebSockets (Socket.io)
Pros: True real-time, bidirectional, mature library
Cons: Stateful connections, harder to scale
Score: 7/10

Approach 2: Server-Sent Events (SSE)
Pros: Simple, HTTP-based, auto-reconnect
Cons: Unidirectional, limited browser connections
Score: 8/10

Approach 3: Polling with Long-Poll Fallback
Pros: Simplest, works everywhere, stateless
Cons: Higher latency, more server load
Score: 5/10

Recommendation: Approach 2 (SSE)
For 10K users with mostly server-to-client notifications, SSE provides
the best balance of simplicity and real-time capability.
```
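If the approaches are parsed from the model's output into structured records, selecting the winner programmatically is a one-liner. A sketch, assuming a hypothetical list-of-dicts shape (`name`, `score`) that mirrors the transcript above:

```python
def pick_best(approaches):
    """Select the highest-scoring approach from a ToT evaluation.
    `approaches` is a list of dicts like {"name": ..., "score": int},
    e.g. parsed from the model's structured output."""
    return max(approaches, key=lambda a: a["score"])

approaches = [
    {"name": "WebSockets", "score": 7},
    {"name": "SSE", "score": 8},
    {"name": "Polling", "score": 5},
]
best = pick_best(approaches)  # -> the SSE entry
```

In practice you may also want the model to emit JSON directly so no free-text parsing is needed.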
### ReAct (Reasoning + Acting)
Combines reasoning with action steps. The model thinks about what to do, takes an action, observes the result, and repeats. Essential for AI agent workflows.
```
System: You are a debugging assistant. Use the Thought/Action/Observation
cycle to diagnose issues. Available actions:
- READ_FILE(path) - Read a source file
- SEARCH(query) - Search the codebase
- RUN_TEST(name) - Run a specific test

User: The login endpoint returns 500 errors.

Assistant:
Thought: I need to find the login endpoint handler and check for
potential error sources.
Action: SEARCH("login endpoint handler")
Observation: Found in src/routes/auth.ts, line 42

Thought: Let me read the file to see the handler implementation and
identify the error source.
Action: READ_FILE("src/routes/auth.ts")
Observation: The handler calls db.users.findOne() but doesn't await
the promise.

Thought: Missing await would cause the handler to proceed with an
undefined user object, causing a TypeError when accessing user.password.

Diagnosis: Missing await on line 45. The database query returns a
Promise that is not being awaited.
```
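The harness side of ReAct is a dispatcher that parses the model's `Action:` line and invokes the matching tool, feeding the result back as the next `Observation`. A minimal sketch; the `NAME("arg")` parse format is an assumption about how the model emits actions, and the tool implementations here are stubs:

```python
import re

def dispatch(action_line, actions):
    """Parse an action like SEARCH("query") emitted by the model and
    invoke the matching tool. `actions` maps names to callables."""
    m = re.match(r'(\w+)\("?(.*?)"?\)$', action_line.strip())
    if not m:
        raise ValueError(f"Unparseable action: {action_line}")
    name, arg = m.groups()
    return actions[name](arg)

# Stub tools standing in for real codebase access
tools = {
    "SEARCH": lambda q: f"Found {q!r} in src/routes/auth.ts",
    "READ_FILE": lambda p: f"<contents of {p}>",
}
obs = dispatch('SEARCH("login endpoint handler")', tools)
```

The observation string is then appended to the conversation and the model is called again, looping until it emits a final diagnosis instead of an action.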
### Self-Consistency
Run the same prompt multiple times and select the most common answer. Reduces variability and improves reliability for critical outputs.
```python
from collections import Counter

async def classify_with_consistency(text, n=5):
    """Run classification N times, return the majority answer."""
    results = []
    for _ in range(n):
        result = await llm.classify(
            text,
            temperature=0.7,  # Some randomness so runs can disagree
        )
        results.append(result)

    # Return the most common classification, with agreement rate
    # as a rough confidence signal
    counter = Counter(results)
    best, count = counter.most_common(1)[0]
    confidence = count / n
    return {
        "classification": best,
        "confidence": confidence,
        "all_results": dict(counter),
    }
```
## Prompt Versioning

Treat prompts like code — version them, track changes, and test before deploying:
```yaml
# prompts/code-review/v2.3.yaml
name: code-review
version: 2.3
model: claude-sonnet
temperature: 0.3
changelog:
  - "v2.3: Added security check instructions"
  - "v2.2: Improved false positive rate"
  - "v2.1: Added step-by-step reasoning"
system: |
  You are a senior code reviewer. Analyze code changes for bugs,
  security issues, and style problems. Use step-by-step reasoning.
eval_dataset: datasets/code-review-v2.json
baseline_accuracy: 0.91
```
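At load time the application needs to resolve "the current version" from the files on disk. A sketch of one way to do that, assuming a `prompts/<name>/vX.Y.yaml` layout like the example above (the helper name is hypothetical); note the numeric sort, so `v2.10` correctly outranks `v2.3`:

```python
from pathlib import Path

def latest_prompt(prompt_dir):
    """Return the Path of the newest versioned prompt file in a
    directory laid out as prompts/<name>/vX.Y.yaml."""
    def version_key(path):
        # "v2.3" -> (2, 3); numeric compare, not lexicographic
        return tuple(int(part) for part in path.stem.lstrip("v").split("."))

    candidates = sorted(Path(prompt_dir).glob("v*.yaml"), key=version_key)
    if not candidates:
        raise FileNotFoundError(f"No prompt versions in {prompt_dir}")
    return candidates[-1]
```

Pinning a specific version in production (rather than always taking the latest) keeps deploys reproducible; the latest-version lookup is more useful in evaluation pipelines.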
## A/B Testing Prompts
Compare prompt versions in production to find the best performer:
```python
import hashlib
from collections import defaultdict

class PromptABTest:
    def __init__(self, variants, split=0.5):
        self.variants = variants  # e.g. {"A": prompt_a, "B": prompt_b}
        self.split = split        # fraction of traffic sent to variant A
        self.metrics = defaultdict(list)

    def assign(self, user_id):
        # Deterministic assignment based on user. Note: the builtin
        # hash() is salted per process, so use a stable digest instead.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return "A" if bucket < self.split * 100 else "B"

    async def run(self, input_data, user_id):
        variant = self.assign(user_id)
        prompt = self.variants[variant]
        result = await llm.call(prompt, input_data)

        # Track metrics for analysis
        self.metrics[variant].append({
            "latency": result.latency,
            "tokens": result.token_count,
            "user_rating": None,  # Filled later from user feedback
        })
        return result
```
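Once an experiment has run, you need to aggregate the collected records before declaring a winner. A small sketch of that analysis step, assuming metric records shaped like the `latency`/`tokens` dicts above (the helper name is illustrative):

```python
def summarize(metrics):
    """Aggregate per-variant A/B metrics: sample size, mean latency,
    and mean token count for each variant."""
    summary = {}
    for variant, records in metrics.items():
        n = len(records)
        summary[variant] = {
            "n": n,
            "mean_latency": sum(r["latency"] for r in records) / n,
            "mean_tokens": sum(r["tokens"] for r in records) / n,
        }
    return summary
```

Before acting on a difference between variants, check that the sample sizes are large enough for it to be meaningful rather than noise.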
## Prompt Libraries
Organize and share proven prompts across your team:
| Library Category | Example Prompts |
|---|---|
| Code Analysis | Code review, bug detection, refactoring suggestions, documentation generation |
| Data Processing | Classification, extraction, summarization, translation |
| Content Generation | Technical writing, email drafting, report generation |
| Quality Assurance | Test case generation, output validation, bias detection |
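A team prompt library can start as a simple registry keyed by category and name. A minimal in-memory sketch (the class, category slugs, and template are hypothetical; a real library would back this with versioned files as described above):

```python
class PromptLibrary:
    """Minimal in-memory prompt registry, keyed by (category, name)."""

    def __init__(self):
        self._prompts = {}

    def register(self, category, name, template):
        self._prompts[(category, name)] = template

    def get(self, category, name, **variables):
        """Fetch a template and fill in its placeholders."""
        return self._prompts[(category, name)].format(**variables)

lib = PromptLibrary()
lib.register("code-analysis", "code-review",
             "Review the following {language} code for bugs:\n{code}")
prompt = lib.get("code-analysis", "code-review",
                 language="Python", code="def f(): pass")
```

Centralizing prompts this way means a wording improvement lands everywhere at once instead of being copy-pasted per service.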
## Anti-Patterns to Avoid
- Mega-prompt: Cramming every instruction into one massive prompt. Break complex tasks into a chain of focused prompts instead.
- Hope-driven development: Testing a prompt once, seeing it work, and shipping it. Always test against a diverse evaluation dataset.
- Prompt hardcoding: Embedding prompts directly in source code. Use external prompt files that can be versioned and updated independently.
- Ignoring token costs: Not tracking or optimizing token usage. A prompt that costs $0.01 per call adds up to $10,000 at 1M requests.
- No fallback plan: Assuming the AI model will always be available and produce good results. Always build fallback mechanisms.
- Temperature neglect: Using default temperature for all tasks. Use low temperature (0.0-0.3) for deterministic tasks, higher (0.7-1.0) for creative ones.
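The "no fallback plan" point can be addressed with a thin wrapper around the model call. A sketch under the assumption that `primary` and `fallback` are any async callables (e.g. clients for two different models or providers); the function name and retry count are illustrative:

```python
import asyncio

async def call_with_fallback(primary, fallback, *args, retries=2):
    """Try the primary model up to `retries` times; on repeated
    failure, fall back to the secondary callable."""
    for attempt in range(retries):
        try:
            return await primary(*args)
        except Exception:
            if attempt == retries - 1:
                break  # exhausted retries; fall through to fallback
    return await fallback(*args)
```

A production version would also narrow the caught exceptions, add backoff between retries, and log which path served each request so fallback rates are visible.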
Lilly Tech Systems