Advanced

Production LLM Questions

These 12 questions test whether you can ship LLM systems that work at scale. This is the lesson that separates candidates with production experience from those who have only built demos. Expect 3–5 of these in any senior GenAI role interview.

Q1: How do you optimize LLM inference cost in production? Walk through your approach.

💡
Model Answer:

Cost = requests × (input_tokens × input_price + output_tokens × output_price) — note that output tokens are usually priced several times higher than input tokens. Optimize each factor:

  1. Model routing: Route 70–80% of simple queries to a small model (GPT-4o-mini at $0.15/M tokens) and only complex queries to a large model (GPT-4o at $5/M tokens). Use a classifier or heuristic (query length, topic) for routing. This alone can cut costs 60–80%.
  2. Prompt compression: Remove redundant system prompt text. Use abbreviations. Compress few-shot examples. A 2000-token system prompt compressed to 800 tokens saves 60% on input cost per request.
  3. Caching: Exact match cache for identical queries. Semantic cache for similar queries (embed queries, cache if similarity > 0.95). Can achieve 20–40% cache hit rate for support/FAQ workloads.
  4. Prompt caching: OpenAI and Anthropic offer prompt caching — repeated prefixes (system prompt + common context) are cached and charged at 50–90% discount.
  5. Output length control: Set max_tokens appropriately. Use "Answer in one sentence" for simple queries. Generating 500 unnecessary tokens per request adds up at scale.
  6. Batching: For non-real-time workloads, batch requests for higher throughput and lower per-request cost. OpenAI's batch API offers 50% discount.
  7. Self-hosted models: For very high volume (>1M requests/day), self-hosting open-source models (LLaMA, Mistral) on GPU instances can be 5–10x cheaper than API pricing.
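Point 1 (model routing) can be sketched as a cheap heuristic router. This is a minimal illustration, not a production classifier — the keyword list, length cutoff, and model names are illustrative assumptions; in practice you would train a small classifier on labeled query complexity.

```python
# Minimal sketch of heuristic model routing: cheap model by default,
# expensive model only for queries that look complex.
SMALL_MODEL = "gpt-4o-mini"   # cheap: handles the 70-80% of simple queries
LARGE_MODEL = "gpt-4o"        # expensive: reserved for complex queries

# Illustrative signals of complexity; a real system would use a classifier.
COMPLEX_KEYWORDS = ("analyze", "compare", "summarize", "explain why")

def route(query: str) -> str:
    """Pick a model using length and keyword heuristics."""
    lowered = query.lower()
    if len(lowered.split()) > 50:            # long queries tend to be complex
        return LARGE_MODEL
    if any(kw in lowered for kw in COMPLEX_KEYWORDS):
        return LARGE_MODEL
    return SMALL_MODEL
```

In an interview, mention that the router itself must be far cheaper than the savings it produces — which is why heuristics or a tiny classifier are preferred over an LLM-based router.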

Q2: How do you reduce LLM latency for real-time applications?

💡
Model Answer:

Latency breakdown: Network round-trip (~50ms) + Time to First Token (TTFT, 200–1000ms) + Token generation (20–100ms per token) + Post-processing (~10ms).

Optimization strategies:

  • Streaming: Return tokens as they are generated. User sees response starting in 200–500ms instead of waiting 2–5s for the complete response. Perceived latency drops dramatically.
  • Smaller models: GPT-4o-mini is 3–5x faster than GPT-4o. Often good enough for the task.
  • Speculative decoding: Use a small draft model to propose tokens, verify in parallel with the target model. 2–3x throughput improvement.
  • Shorter prompts: Fewer input tokens = faster TTFT. Compress system prompts and reduce context length.
  • Parallel processing: If the task involves multiple independent LLM calls (e.g., summarize 5 documents), run them in parallel.
  • Edge deployment: Deploy smaller models closer to users (regional GPU instances). Reduces network latency.
  • KV cache reuse: For multi-turn conversations, reuse the KV cache from previous turns instead of reprocessing the full context.
  • Quantization: Run self-hosted models in INT4/INT8 for 2–4x faster inference with minimal quality loss.

Benchmarks to target: TTFT < 500ms, total latency < 2s for short responses, streaming enabled for long responses.
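The streaming point is worth being able to sketch: perceived latency is the time to first token, not the time to the full response. The sketch below uses a stub generator in place of a real streaming API client.

```python
import time
from typing import Iterator, Optional, Tuple

def stream_tokens(text: str, delay: float = 0.0) -> Iterator[str]:
    """Stub for a streaming LLM response: yields one token at a time."""
    for token in text.split():
        time.sleep(delay)
        yield token + " "

def consume_stream(stream: Iterator[str]) -> Tuple[str, Optional[float]]:
    """Render tokens as they arrive and record time-to-first-token (TTFT)."""
    start = time.monotonic()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start   # perceived latency ends here
        parts.append(token)                   # in a real UI, render immediately
    return "".join(parts).strip(), ttft
```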

Q3: What LLM guardrails would you implement? Describe a layered approach.

💡
Model Answer:

Input guardrails (before the LLM):

  • PII detection: Scan input for SSN, credit card numbers, email addresses. Mask or reject. Use regex + NER model.
  • Prompt injection detection: Classifier trained on known injection patterns. Flag and block suspicious inputs.
  • Topic filtering: Reject queries outside the application's scope. "This is a cooking assistant. I cannot help with medical advice."
  • Rate limiting: Per-user and per-IP rate limits to prevent abuse.

Output guardrails (after the LLM):

  • Content filtering: Check for harmful, toxic, or inappropriate content. Use a classifier (OpenAI Moderation API, Llama Guard).
  • Hallucination detection: For RAG, verify that claims in the response are supported by the retrieved context.
  • PII leakage check: Ensure the response does not contain PII from the training data or retrieved context that should be masked.
  • Format validation: Verify JSON output is valid, required fields are present, values are within expected ranges.
  • Factual consistency: For critical applications, use a second LLM call to verify the response against known facts.

Tools: Guardrails AI, NeMo Guardrails (NVIDIA), Llama Guard (Meta), custom classifiers. The key is layering multiple checks — no single guardrail catches everything.
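The layering idea can be sketched as an ordered pipeline of input checks, where the request is rejected at the first check that fires. The patterns below are deliberately simplified stand-ins for real detectors.

```python
import re
from typing import Optional

# Tier of input guardrails, run in order. Patterns are illustrative;
# production systems combine regex, NER models, and trained classifiers.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
INJECTION_PHRASES = ("ignore previous instructions",
                     "disregard your system prompt")

def check_pii(text: str) -> Optional[str]:
    return "pii_detected" if SSN_RE.search(text) else None

def check_injection(text: str) -> Optional[str]:
    lowered = text.lower()
    return "prompt_injection" if any(p in lowered for p in INJECTION_PHRASES) else None

GUARDRAILS = [check_pii, check_injection]

def screen_input(text: str) -> Optional[str]:
    """Return the first triggered guardrail name, or None if input is clean."""
    for check in GUARDRAILS:
        verdict = check(text)
        if verdict:
            return verdict
    return None
```

Output guardrails follow the same pattern, run against the model's response instead of the user's input.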

Q4: How do you monitor an LLM application in production? What do you alert on?

💡
Model Answer:

Operational metrics (standard SRE):

  • Latency: p50, p95, p99 for TTFT and total response time
  • Error rate: API errors, timeouts, rate limits
  • Throughput: requests per second, tokens per second
  • Cost: daily/weekly cost tracking, cost per request

Quality metrics (LLM-specific):

  • User feedback: Thumbs up/down ratio. Track over time for regression detection.
  • Hallucination rate: Sample outputs and check factual accuracy (automated + manual).
  • Guardrail trigger rate: How often do input/output guardrails fire? Rising rate may indicate an attack or model degradation.
  • Refusal rate: How often does the model refuse to answer? Too high = over-cautious. Too low = under-cautious.
  • Response length distribution: Sudden changes may indicate model behavior shifts.

Alert on:

  • Latency p95 > 5s (user experience degradation)
  • Error rate > 5% (service reliability issue)
  • Daily cost > 2x baseline (cost runaway or abuse)
  • User satisfaction drops > 10% week-over-week
  • Guardrail trigger rate spikes > 3x baseline (possible attack)

Tools: LangSmith, Langfuse, Helicone (cost tracking), Datadog/New Relic (infra), custom dashboards.
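The alert rules above reduce to simple threshold checks over a metrics snapshot; the field names here are illustrative, and real systems would express these as alerting rules in Datadog, Prometheus, or similar.

```python
from typing import Dict, List

def evaluate_alerts(metrics: Dict[str, float], baseline: Dict[str, float]) -> List[str]:
    """Return the names of all alert rules that fire for this snapshot."""
    alerts = []
    if metrics["latency_p95_s"] > 5:                      # UX degradation
        alerts.append("latency_p95")
    if metrics["error_rate"] > 0.05:                      # reliability issue
        alerts.append("error_rate")
    if metrics["daily_cost"] > 2 * baseline["daily_cost"]:  # runaway or abuse
        alerts.append("cost_runaway")
    if metrics["guardrail_trigger_rate"] > 3 * baseline["guardrail_trigger_rate"]:
        alerts.append("guardrail_spike")                  # possible attack
    return alerts
```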

Q5: What is semantic caching? How does it differ from exact-match caching?

💡
Model Answer:

Exact-match caching: Cache key is the hash of the exact prompt text. Only returns cached results for identical queries. Hit rate: typically 5–15%.

Semantic caching: Embed the query, find cached queries with cosine similarity above a threshold (e.g., 0.95). Return the cached response for semantically equivalent queries.

Example: "What's the weather in NYC?" and "Tell me New York City weather" are semantically equivalent but textually different. Exact-match misses; semantic cache hits.

Implementation:

  1. Embed each incoming query
  2. Search the cache (vector store) for similar queries above threshold
  3. If match found: return cached response (zero LLM cost, ~10ms latency)
  4. If no match: call LLM, cache the (query_embedding, response) pair
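The four steps above can be sketched as follows. The `embed()` here is a toy bag-of-words stand-in for a real embedding model, so the 0.8 threshold only makes sense for this toy similarity; with real embeddings you would use a vector index and a threshold near 0.95.

```python
import math
from typing import Dict, List, Optional, Tuple

def embed(text: str) -> Dict[str, int]:
    """Toy word-count 'embedding' for illustration only."""
    vec: Dict[str, int] = {}
    for word in text.lower().replace("?", "").split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: List[Tuple[Dict[str, int], str]] = []

    def get(self, query: str) -> Optional[str]:
        qv = embed(query)
        for ev, response in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return response          # cache hit: zero LLM cost
        return None                      # miss: caller invokes the LLM

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))
```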

Trade-offs:

  • Higher hit rate (20–40% for FAQ/support workloads) but requires vector similarity search per request
  • Risk of returning stale or slightly wrong answers for queries that are similar but not equivalent
  • Threshold tuning: too high (0.99) = low hit rate; too low (0.85) = wrong answers
  • Cache invalidation: when underlying data changes, cached answers may become incorrect

Tools: GPTCache, Redis with vector search, custom implementation with pgvector.

Q6: How do you handle LLM model upgrades in production without breaking things?

💡
Model Answer:

The problem: When OpenAI updates GPT-4 or you switch from Claude 3 to Claude 3.5, prompts that worked before may behave differently. Output format, tone, refusal behavior, and quality can all change.

Migration strategy:

  1. Pin model versions: Always specify exact model versions (gpt-4-0613, not gpt-4). This prevents silent upgrades.
  2. Eval suite: Maintain a comprehensive evaluation dataset (100–500 examples). Run against new model versions before switching.
  3. Shadow mode: Route 5–10% of production traffic to the new model. Compare outputs side-by-side with the current model. Look for regressions in quality, format, and latency.
  4. Gradual rollout: If shadow results look good, gradually increase traffic to the new model (10% → 25% → 50% → 100%) over 1–2 weeks. Monitor closely at each stage.
  5. Rollback plan: Keep the previous model version available. Instant rollback if quality degrades.
  6. Prompt adaptation: Different models respond differently to the same prompt. Budget time for prompt re-optimization with each model change.
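Step 4's gradual rollout is usually implemented with deterministic user bucketing, so each user consistently sees one model version and the traffic percentage is a single config value. The model names in the defaults are illustrative.

```python
import hashlib

def pick_model(user_id: str, new_model_pct: int,
               old_model: str = "gpt-4-0613",
               new_model: str = "gpt-4-1106-preview") -> str:
    """Hash the user into buckets 0-99; users below the cutoff get the
    new model. Deterministic, so a user always sees the same version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return new_model if bucket < new_model_pct else old_model
```

Rolling out 10% → 25% → 50% → 100% is then just raising `new_model_pct`, and rollback is setting it to 0.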

Q7: When should you self-host an LLM vs use an API? What are the trade-offs?

💡
Model Answer:

| Factor | API (OpenAI, Anthropic) | Self-Hosted (vLLM, TGI) |
| --- | --- | --- |
| Setup time | Minutes | Days to weeks |
| Cost at low volume | Cheaper (pay per token) | Expensive (GPU idle time) |
| Cost at high volume | Expensive ($5–15/M tokens) | Cheaper (fixed GPU cost amortized) |
| Data privacy | Data sent to third party | Data stays on your infrastructure |
| Model quality | Best models (GPT-4o, Claude 3.5 Sonnet) | Open models (LLaMA, Mistral, Qwen) slightly lower quality |
| Customization | Limited (prompt, fine-tune via API) | Full control (custom decoding, LoRA, quantization) |
| Reliability | Provider handles uptime, scaling | You handle everything: monitoring, scaling, failover |
| Latency | Network round-trip + provider queue | Direct GPU access, no queue |

Decision framework:

  • Use API when: Volume < 100K requests/day, need best quality, limited ML ops team, fast time-to-market
  • Self-host when: Volume > 1M requests/day, strict data privacy requirements (healthcare, finance), need custom model modifications, have ML ops expertise
  • Hybrid: Self-host for high-volume simple queries (80%), API for complex queries needing frontier model quality (20%)
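The volume thresholds above come from a break-even calculation worth doing out loud in an interview. All prices in this sketch are illustrative assumptions, not quotes.

```python
def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     price_per_m_tokens: float) -> float:
    """API cost scales linearly with token volume."""
    return requests_per_day * 30 * tokens_per_request * price_per_m_tokens / 1e6

def monthly_gpu_cost(gpu_count: int, hourly_rate: float) -> float:
    """Self-hosting cost is roughly fixed: GPUs bill whether busy or idle."""
    return gpu_count * hourly_rate * 24 * 30

# Example: 1M requests/day at 1K tokens each, $5/M tokens,
# vs a hypothetical 8-GPU cluster at $2/GPU-hour.
api = monthly_api_cost(1_000_000, 1_000, 5.0)   # 150,000.0 per month
gpu = monthly_gpu_cost(8, 2.0)                   # 11,520.0 per month
```

The crossover moves with utilization: at low volume the GPUs sit idle and the API wins; at high volume the fixed GPU cost amortizes and self-hosting wins.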

Q8: How do you implement LLM-based content moderation at scale?

💡
Model Answer:

Multi-tier architecture:

  1. Tier 1 — Fast filters (1ms): Keyword blocklists, regex patterns for known harmful content, profanity filters. Catches obvious violations cheaply.
  2. Tier 2 — ML classifiers (10–50ms): Lightweight models (BERT-based) trained on moderation datasets. Classify content into categories (hate, violence, sexual, self-harm). High throughput, moderate accuracy.
  3. Tier 3 — LLM evaluation (500–2000ms): Use an LLM (Llama Guard, GPT-4o-mini) for nuanced cases that Tier 2 is uncertain about. Handles context-dependent moderation (sarcasm, educational content, news reporting).
  4. Tier 4 — Human review: Edge cases and appeals. Provides training data for improving Tier 2 and 3.

Scaling: Tier 1 screens all traffic at near-zero cost and confidently resolves roughly 99% of it. Tier 2 handles the ~1% Tier 1 cannot decide, and Tier 3 handles the ~0.1% Tier 2 is uncertain about. This cascade can reduce LLM cost by roughly 1000x compared to sending everything to the LLM.

Key metrics: Precision (do not over-censor), recall (do not miss harmful content), latency (do not slow down the user experience), false positive rate (do not frustrate legitimate users).
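The cascade can be sketched as a chain of checks where each tier returns a verdict when confident and `None` to escalate. The tier bodies here are stubs; real tiers would be a blocklist engine, a trained classifier, and an LLM call.

```python
from typing import Callable, Optional, Sequence

def tier1_keywords(text: str) -> Optional[str]:
    """Tier 1: near-free blocklist. Confident only on a hit."""
    blocklist = {"slur1", "slur2"}            # placeholder terms
    if any(w in text.lower().split() for w in blocklist):
        return "block"
    return None                               # uncertain: escalate

def tier2_classifier(text: str) -> Optional[str]:
    """Tier 2 stub: pretend the classifier is confident on short text."""
    if len(text.split()) < 20:
        return "allow"
    return None                               # uncertain: escalate

def tier3_llm(text: str) -> Optional[str]:
    """Tier 3 stub: in production this is an LLM judgment (e.g. Llama Guard)."""
    return "allow"

def moderate(text: str,
             tiers: Sequence[Callable[[str], Optional[str]]] =
             (tier1_keywords, tier2_classifier, tier3_llm)) -> str:
    for tier in tiers:
        verdict = tier(text)
        if verdict is not None:
            return verdict                    # stop at first confident tier
    return "human_review"                     # tier 4: escalate to people
```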

Q9: How do you handle multi-tenant LLM applications? What isolation guarantees are needed?

💡
Model Answer:

Isolation requirements:

  • Data isolation: Tenant A's data must never appear in Tenant B's responses. This includes RAG retrieval, conversation history, and cached responses.
  • Cost isolation: Track and limit usage per tenant. Prevent one tenant from consuming all resources.
  • Configuration isolation: Each tenant may have different system prompts, model choices, tool access, and guardrail settings.

Implementation:

  • Vector DB: Use metadata filtering on tenant_id. Every retrieval query includes filter: {tenant_id: "abc123"}. Never return chunks from other tenants.
  • Conversation store: Partition by tenant. Use separate database schemas or tenant_id columns with row-level security.
  • Cache: Include tenant_id in cache keys. Tenant A's cached responses must not be served to Tenant B.
  • Rate limits: Per-tenant rate limits and cost budgets. Alert when approaching limits.
  • Audit logging: Log all queries and responses with tenant_id for compliance and debugging.

Architecture pattern: Shared LLM infrastructure (API calls, GPU cluster) with per-tenant configuration (system prompts, tools, RAG indices, guardrails). This gives cost efficiency of shared infra with the data isolation of separate deployments.
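Two of the isolation points above fit in a few lines: tenant-scoped cache keys, and a retrieval query where the tenant filter is never optional. The filter shape below mirrors a common metadata-filter style but varies by vector DB.

```python
import hashlib
from typing import List

def cache_key(tenant_id: str, prompt: str) -> str:
    """Tenant ID is part of the key, so tenant A's cached responses
    can never be served to tenant B."""
    return hashlib.sha256(f"{tenant_id}:{prompt}".encode()).hexdigest()

def build_retrieval_query(tenant_id: str, query_embedding: List[float],
                          top_k: int = 5) -> dict:
    """Every retrieval query carries a mandatory tenant_id filter."""
    return {
        "vector": query_embedding,
        "top_k": top_k,
        "filter": {"tenant_id": tenant_id},   # never optional, never user-supplied
    }
```

A good follow-up point: enforce the tenant filter in a shared retrieval wrapper rather than at call sites, so a single forgotten filter cannot leak data.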

Q10: What is model distillation? When would you use it in production?

💡
Model Answer:

Model distillation trains a smaller "student" model to mimic the behavior of a larger "teacher" model. The student learns from the teacher's output probabilities (soft labels), not just the hard labels.

LLM distillation pipeline:

  1. Generate high-quality outputs from the teacher model (GPT-4, Claude 3 Opus) on your specific task
  2. Fine-tune a smaller model (LLaMA 8B, Mistral 7B) on these (input, teacher_output) pairs
  3. Evaluate: the student should achieve 80–95% of teacher quality at 10–50x lower inference cost
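Step 2's data preparation typically means packing the (input, teacher_output) pairs into a chat-style JSONL file. The exact schema depends on the fine-tuning framework; this sketch uses the common messages-array shape.

```python
import json
from typing import Iterable, Tuple

def to_training_record(user_input: str, teacher_output: str) -> str:
    """One JSONL line: the student learns to reproduce the teacher's answer."""
    record = {
        "messages": [
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": teacher_output},
        ]
    }
    return json.dumps(record)

def write_dataset(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    with open(path, "w") as f:
        for user_input, teacher_output in pairs:
            f.write(to_training_record(user_input, teacher_output) + "\n")
```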

When to use:

  • High-volume, task-specific workloads where a smaller model can learn the pattern
  • Latency-critical applications where large model inference is too slow
  • Cost reduction: replacing $5/M token API calls with $0.10/M token self-hosted inference
  • Offline/edge deployment where large models cannot run

Limitations: The student is only as good as the teacher on the specific task it was trained on. It does not generalize to new tasks. If the task changes, you need to re-distill.

Legal note: Some providers' terms of service (e.g., OpenAI's) restrict using model outputs to train competing models. Check the terms before distilling.

Q11: How do you implement A/B testing for LLM features?

💡
Model Answer:

Challenges unique to LLM A/B testing:

  • Non-determinism: Same input can produce different outputs. Need larger sample sizes for statistical significance.
  • Subjective quality: "Better" is hard to measure automatically. Need human evaluation or LLM-as-judge.
  • Long-term effects: A prompt change might improve short-term engagement but hurt long-term trust (sycophancy).

Implementation:

  1. Define metrics: Primary (task completion rate, user satisfaction) and secondary (latency, cost, response length).
  2. User assignment: Hash user ID to consistently assign to control/treatment groups. Ensure no crossover.
  3. Minimum sample size: Calculate required sample for statistical significance. LLM variance means you typically need 2–3x more samples than traditional A/B tests.
  4. Run duration: At least 1–2 weeks to capture day-of-week effects and user adaptation.
  5. Evaluation: Combine automated metrics (latency, cost, format accuracy) with human evaluation (sample 100–200 responses from each group for manual quality review).

What to A/B test: Prompt versions, model choices, system prompt changes, RAG configuration (chunk size, top-K), temperature settings. Change one variable at a time for clean attribution.
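Step 2's consistent user assignment is usually a salted hash: salting with the experiment name keeps assignments independent across experiments, and determinism prevents crossover. A minimal sketch:

```python
import hashlib

def assign_group(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically assign a user to control or treatment.
    The experiment name salts the hash so different experiments
    get independent splits of the same user base."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"
```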

Q12: Walk me through debugging a production LLM issue: "Users report the chatbot is giving wrong answers since yesterday."

💡
Model Answer:

Systematic debugging approach:

  1. Reproduce: Get specific examples of wrong answers. Compare to expected answers. Is it a pattern or random?
  2. Check for changes:
    • Did we deploy a prompt change? (Check git history, deployment logs)
    • Did the model provider update the model? (Check model version, release notes)
    • Did the RAG index get updated with bad data? (Check ingestion pipeline logs)
    • Did a dependency change? (API schema, tool behavior)
  3. Isolate the component:
    • Is retrieval returning wrong documents? (Check retrieval results for the failing queries)
    • Is the LLM generating wrong answers from correct context? (Test with manually correct context)
    • Is a guardrail interfering? (Check guardrail logs for false positives)
    • Is caching returning stale answers? (Bypass cache and retest)
  4. Fix and verify: Apply the fix. Run the failing examples through the eval suite. Deploy with monitoring.
  5. Post-mortem: How did this escape testing? Add the failing examples to the eval suite. Consider adding monitoring to catch similar issues earlier.

Key tools: Request tracing (see the full prompt, retrieval results, and response for each failing request), eval suite (automated regression testing), diff comparison (compare outputs before and after the issue started).