Advanced

LLM & GenAI Interview Questions

These 15 questions cover the most in-demand topics in 2024–2026 interviews. If you are interviewing for an LLM/GenAI role, expect 5–8 of these questions. They test whether you understand how LLMs work in production, not just how to call an API.

Q1: What is prompt engineering? What techniques improve LLM output quality?

💡
Model Answer:

Prompt engineering is the practice of designing input prompts to guide LLM behavior without changing model weights. Key techniques:

  • Zero-shot: Direct instruction with no examples. "Classify this review as positive or negative: [text]"
  • Few-shot: Include 3–5 input-output examples in the prompt. The model learns the pattern from examples. Order and diversity of examples matter.
  • Chain-of-Thought (CoT): Add "Let's think step by step" or show reasoning traces in examples. Dramatically improves performance on math, logic, and multi-step tasks.
  • Self-Consistency: Generate multiple CoT responses with temperature > 0, then take the majority answer. Improves over single-sample CoT by 5–15%.
  • System prompts: Set role, constraints, and output format. "You are a medical expert. Only answer based on provided context. Respond in JSON."
  • Structured output: Request JSON, XML, or specific formats. Use schema definitions to constrain output structure.

Anti-patterns: Overly long prompts waste tokens and can confuse the model. Vague instructions produce vague outputs. Conflicting constraints cause inconsistent behavior.
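
The zero-shot/few-shot/CoT distinction is easiest to see as actual prompt text. A minimal sketch of a few-shot chain-of-thought prompt builder (the example Q/A pairs are invented placeholders, not from a real dataset):

```python
# Few-shot CoT: show worked examples with visible reasoning, then ask the
# real question. The model continues the pattern after the final "A:".
EXAMPLES = [
    ("A shop sells pens at $2 each. How much do 3 pens cost?",
     "Each pen costs $2. 3 pens cost 3 * 2 = $6. Answer: $6"),
    ("A train travels 60 km in 1 hour. How far does it go in 3 hours?",
     "Speed is 60 km/h. In 3 hours it travels 60 * 3 = 180 km. Answer: 180 km"),
]

def build_prompt(question: str) -> str:
    parts = ["Answer the question. Think step by step before the final answer.\n"]
    for q, a in EXAMPLES:                 # few-shot: demonstrate the pattern
        parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")    # generation continues from here
    return "\n".join(parts)

print(build_prompt("A box holds 4 apples. How many apples are in 5 boxes?"))
```

Swapping the example order or dropping the reasoning traces turns this into plain few-shot prompting, which is exactly the knob interviewers ask about.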

Q2: Explain RLHF (Reinforcement Learning from Human Feedback). Why is it necessary?

💡
Model Answer:

A base language model (trained on next-token prediction) generates plausible text but is not aligned with human intent — it might produce harmful, dishonest, or unhelpful responses because it has learned to mimic all text on the internet.

RLHF pipeline (3 steps):

  1. Supervised Fine-Tuning (SFT): Fine-tune the base model on high-quality (prompt, response) pairs written by human annotators. This teaches the model to follow instructions.
  2. Reward Model Training: Collect comparison data where humans rank multiple model responses to the same prompt (A > B > C). Train a reward model to predict human preferences. The reward model outputs a scalar score for any (prompt, response) pair.
  3. PPO Optimization: Use Proximal Policy Optimization to fine-tune the SFT model to maximize the reward model's score, with a KL divergence penalty to prevent the model from deviating too far from the SFT policy (which would cause reward hacking).

DPO (Direct Preference Optimization) simplifies RLHF by eliminating the reward model and the RL loop entirely: it optimizes the policy directly on preference pairs with a classification-style loss derived in closed form from the same objective. Used by LLaMA 3, Mistral, and Zephyr. Advantages: simpler to implement, more stable training, no reward model needed.

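The DPO loss is compact enough to write out. A sketch for a single preference pair, assuming you already have the summed token log-probabilities of each response under the policy being trained and under the frozen SFT reference model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (chosen y_w, rejected y_l).

    margin = how much more the policy prefers the chosen response,
    relative to the reference model. beta plays the role of the KL
    penalty strength in RLHF.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Positive margin (policy agrees with the human preference) -> small loss.
loss_good = dpo_loss(-10.0, -20.0, -12.0, -15.0)   # margin = +7
loss_bad  = dpo_loss(-20.0, -10.0, -15.0, -12.0)   # margin = -7
assert loss_good < loss_bad
```

Minimizing this pushes the policy to widen the gap between chosen and rejected responses while staying anchored to the reference model, which is what the PPO + KL-penalty setup achieves with far more machinery.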
Why RLHF matters: The difference between a base model and a chat model (e.g., LLaMA vs LLaMA-Chat) is primarily RLHF/DPO. It is what makes models useful and safe.

Q3: What is RAG (Retrieval-Augmented Generation)? Design a production RAG system.

💡
Model Answer:

RAG augments LLMs with external knowledge by retrieving relevant documents before generating a response. This grounds the LLM's output in factual sources and enables knowledge updates without retraining.

Production RAG architecture:

  1. Document Processing: Parse documents (PDF, HTML, DOCX). Split into chunks (500–1000 tokens with 50–100 token overlap). Handle tables, images, and metadata.
  2. Embedding & Indexing: Embed chunks with a bi-encoder (e5-large-v2, BGE-large). Store in vector database (Pinecone, Weaviate, pgvector). Also maintain BM25 index for keyword search.
  3. Query Processing: Rewrite the user query for retrieval (HyDE: generate a hypothetical answer, embed that instead). Decompose complex queries into sub-queries.
  4. Hybrid Retrieval: BM25 + dense retrieval, merged with Reciprocal Rank Fusion. Retrieve top-20 candidates.
  5. Re-ranking: Cross-encoder re-ranks top-20 to top-5. Dramatically improves relevance.
  6. Generation: Feed retrieved chunks + query to LLM. Use structured prompts with citation instructions.
  7. Post-processing: Verify citations, filter hallucinations, format response.

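The hybrid-retrieval merge in step 4 fits in a few lines. Reciprocal Rank Fusion scores each document by 1/(k + rank) summed over the ranked lists it appears in; k=60 is the constant from the original RRF paper, and the doc IDs below are placeholders:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs (e.g. BM25 and dense retrieval results)."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["d3", "d1", "d7"]   # keyword search results, best first
dense = ["d1", "d9", "d3"]   # embedding search results, best first
print(reciprocal_rank_fusion([bm25, dense]))  # d1 and d3 rise to the top
```

RRF needs no score normalization across retrievers, which is why it is the default merge strategy before the cross-encoder re-ranking step.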
Common failure modes: Chunks too small (missing context) or too large (diluting signal). Wrong chunking boundaries (splitting mid-sentence). Retriever returns irrelevant results. LLM ignores retrieved context and hallucinates.

Q4: What causes LLM hallucinations? How do you mitigate them?

💡
Model Answer:

Causes of hallucination:

  • Training data memorization gaps: The model generates plausible-sounding text for topics it has limited training data on
  • Next-token prediction objective: The model optimizes for fluency, not factual accuracy. A fluent but incorrect statement is rewarded during training.
  • Knowledge cutoff: Cannot know facts after training date
  • Exposure bias: During generation, errors compound — one wrong token shifts the probability of all subsequent tokens
  • Long-context degradation: Model attention weakens over long contexts, leading to fabricated details

Mitigation strategies (layered):

  1. RAG: Ground responses in retrieved documents. Reduces hallucination by providing factual context.
  2. Temperature=0: Deterministic decoding reduces random fabrication
  3. Citation requirements: Instruct the model to cite sources. If it cannot cite, it should say "I don't know."
  4. Self-verification: Generate answer, then ask the model "Is this factually supported by the provided context?" in a second pass
  5. Constrained decoding: Use structured output (JSON with schema) to limit where the model can hallucinate
  6. Fine-tuning on verified data: Train the model to say "I don't know" when uncertain rather than guessing
  7. Automated fact-checking: Cross-reference generated claims against knowledge bases

Q5: What is instruction tuning? How does it differ from regular fine-tuning?

💡
Model Answer:

Regular fine-tuning: Train on a single task's dataset (e.g., sentiment classification with labeled reviews). The model becomes a specialist for that task but loses generality.

Instruction tuning: Fine-tune on a diverse collection of tasks, all formatted as natural language instructions:

# Same model handles all these via instructions:
{"instruction": "Summarize the following article", "input": "[article]", "output": "[summary]"}
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
{"instruction": "Is this email spam?", "input": "[email]", "output": "Yes, this is spam because..."}
{"instruction": "Write Python code to sort a list", "input": "", "output": "def sort_list(lst):..."}

Key datasets for instruction tuning:

  • FLAN: 1,836 tasks from 62 datasets (Google)
  • Self-Instruct / Alpaca: LLM-generated instruction datasets (Stanford)
  • Open Assistant: Human-written multi-turn conversations
  • Dolly: 15K human-written instruction pairs (Databricks)

Why it works: Training on diverse instructions teaches the model a general "instruction following" capability that transfers to novel instructions at inference time. A model instruction-tuned on 1,000 task types can often perform well on the 1,001st task type it has never seen.

Q6: Explain context window limitations. How do long-context models work?

💡
Model Answer:

The context window is the maximum number of tokens the model can process in a single forward pass. It determines how much information you can give the model.

Why is it limited? Self-attention has O(n^2) memory and compute complexity, so doubling the context length quadruples the attention cost. At 128K tokens, a single 128K x 128K attention score matrix in FP16 requires 128K * 128K * 2 bytes ≈ 33 GB, and that is per attention head, per layer.
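
The quadratic blow-up is easy to verify with back-of-envelope arithmetic (assuming FP16 score matrices and that the full matrix is materialized, which optimized attention kernels deliberately avoid):

```python
def attention_matrix_bytes(seq_len, n_heads, bytes_per_elem=2):
    """Memory for the full n x n attention score matrices of one layer (FP16)."""
    return seq_len * seq_len * n_heads * bytes_per_elem

per_head  = attention_matrix_bytes(128_000, 1)
per_layer = attention_matrix_bytes(128_000, 128)   # 128 heads
print(f"per head:  {per_head / 1e9:.1f} GB")       # ~32.8 GB
print(f"per layer: {per_layer / 1e12:.1f} TB")     # ~4.2 TB: naive attention fails
```

This is why long-context methods either avoid computing the full matrix (sliding window, sparse attention) or spread it across devices (ring attention).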

Context windows by model (2024–2025):

  • BERT: 512 tokens
  • GPT-3: 2,048 / 4,096 tokens
  • GPT-4: 8K / 32K / 128K tokens
  • Claude 3.5: 200K tokens
  • Gemini 1.5 Pro: 1M / 2M tokens

Techniques for long context:

  • RoPE scaling (NTK-aware, YaRN): Modify rotary position embeddings to extrapolate beyond training length. Can extend 4K-trained models to 32K+ with brief fine-tuning.
  • Sliding window attention: Each token attends to only a local window (e.g., 4096 tokens). Used by Mistral. O(n * w) instead of O(n^2).
  • Ring attention: Distribute long sequences across multiple devices, each processing a chunk and passing KV cache in a ring topology.
  • Sparse attention: Only compute attention for a subset of positions (local + global tokens). Used by Longformer, BigBird.

Key caveat: Longer context windows do not mean the model uses all context equally. The "lost in the middle" problem means information in the middle of long contexts is retrieved less reliably than information at the beginning or end.

Q7: What is the cost of running LLMs in production? How do you optimize token economics?

💡
Model Answer:

Cost breakdown for LLM inference:

  • Input tokens (prompt processing): Cheaper because they can be processed in parallel. GPT-4 Turbo: $10/M input tokens.
  • Output tokens (generation): More expensive because they are generated sequentially. GPT-4 Turbo: $30/M output tokens.
  • Latency cost: Longer prompts increase time-to-first-token (TTFT). Longer generations increase total response time.

Optimization strategies:

  1. Prompt caching: Cache common system prompts and few-shot examples. Anthropic and OpenAI offer prompt caching with 50–90% cost reduction for cached prefixes.
  2. Model routing: Use a cheap/fast model (GPT-4o-mini, Haiku) for simple queries and route complex queries to expensive models (GPT-4, Opus). A classifier or LLM judge decides routing.
  3. Prompt compression: Summarize long contexts before passing to LLM. Use extractive methods to select only relevant passages.
  4. Batching: Process multiple requests together. Tools like vLLM use continuous batching to maximize GPU utilization.
  5. Quantization: Run models in 4-bit (GPTQ, AWQ) instead of 16-bit. 4x memory reduction with minimal quality loss.
  6. Self-hosted models: For high volume (>$10K/month API spend), self-hosting open models (LLaMA 3, Mistral) on rented GPUs is often cheaper.

Cost calculation example: 1M users, each sending 5 messages/day with average 500 input + 200 output tokens. Monthly cost with GPT-4 Turbo: (5M * 500 * $10 + 5M * 200 * $30) * 30 / 1M = $1.65M/month. This is why optimization matters.
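
The same estimate as a reusable helper (traffic numbers and prices are the illustrative figures from above, not current list prices):

```python
def monthly_api_cost(users, msgs_per_day, in_tokens, out_tokens,
                     in_price_per_m, out_price_per_m, days=30):
    """Estimate monthly LLM API spend in dollars.

    in_price_per_m / out_price_per_m are dollars per million tokens.
    """
    daily_msgs = users * msgs_per_day
    daily_cost = (daily_msgs * in_tokens  * in_price_per_m  / 1e6 +
                  daily_msgs * out_tokens * out_price_per_m / 1e6)
    return daily_cost * days

# 1M users, 5 msgs/day, 500 input + 200 output tokens, $10/$30 per M tokens:
cost = monthly_api_cost(1_000_000, 5, 500, 200, 10, 30)
print(f"${cost:,.0f}/month")  # $1,650,000/month
```

Rerunning the helper with cheaper model prices or a cache hit rate makes the routing and caching arguments above concrete in an interview.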

Q8: What is a Mixture of Experts (MoE) architecture? Why is it popular for LLMs?

💡
Model Answer:

MoE replaces the single dense FFN layer in each transformer block with multiple "expert" FFN networks and a gating mechanism:

  1. Each transformer block has N expert FFN layers (e.g., 8 experts)
  2. A gating network (small linear layer + softmax) routes each token to the top-k experts (usually k=2)
  3. Only the selected experts compute the FFN for that token
  4. Outputs are weighted-summed based on gating scores
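
The routing step can be sketched in plain Python (toy "experts" that just scale their input; a real MoE layer does this per token for every transformer block):

```python
import math

def route_token(gate_logits, k=2):
    """Top-k gating: pick k experts, renormalize their softmax weights."""
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    probs = [e / sum(exps) for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]   # (expert index, weight)

def moe_forward(x, experts, gate_logits):
    """Only the selected experts run; outputs are weighted by gate scores."""
    return sum(w * experts[i](x) for i, w in route_token(gate_logits))

# 8 toy experts; only 2 of them are ever evaluated for this token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
print(moe_forward(1.0, experts, [0.1, 2.0, 0.3, 0.0, 1.5, 0.2, 0.0, 0.1]))
```

The load-balancing problem mentioned below is visible here: nothing stops the gate from always picking the same two experts, hence the auxiliary loss terms.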

Why it is popular:

  • More parameters, same compute: Mixtral 8x7B has 46.7B total parameters but only uses ~12.9B per token (2 out of 8 experts). It matches models 2–3x its active parameter count.
  • Faster inference: Only 2 experts compute per token, so inference is similar to a 13B dense model despite having 47B parameters.
  • Specialization: Different experts learn to specialize in different domains (code, math, language) without explicit training for this.

Challenges: Memory requirements are high (all expert weights must be in memory even though only 2 are active). Load balancing across experts requires auxiliary loss terms. Expert routing can be unstable during training.

GPT-4 is widely believed to use a MoE architecture (rumored 8 experts of ~220B each, total ~1.8T parameters).

Q9: How do you evaluate LLM outputs? What metrics and approaches exist?

💡
Model Answer:

LLM evaluation is one of the hardest problems in the field. There is no single metric that captures output quality.

| Approach | How it works | When to use |
| --- | --- | --- |
| Automated benchmarks | MMLU, HumanEval, GSM8K, HellaSwag: standardized test sets with ground-truth answers | Comparing models during development. Useful for tracking progress, but can be gamed. |
| LLM-as-Judge | Use a strong LLM (GPT-4) to rate outputs on helpfulness, accuracy, safety. Often uses pairwise comparison. | Automated evaluation of open-ended outputs. Correlates well with human judgment but has biases (prefers longer, more verbose responses). |
| Human evaluation | Expert annotators rate responses on multiple dimensions (accuracy, helpfulness, safety, fluency) | Gold standard. Required for safety evaluation and final quality assessment before launch. Expensive and slow. |
| Task-specific metrics | BLEU (translation), ROUGE (summarization), pass@k (code), exact match (QA) | When the output can be compared to a reference. Does not capture open-ended quality. |
| Arena / Elo rating | Users compare two model outputs side by side; Elo ratings are computed from win rates. | Chatbot Arena style. Best for overall model comparison with real user preferences. |
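
The pass@k metric mentioned above has a standard unbiased estimator (from the HumanEval paper): generate n samples per problem, count c correct, and compute the probability that at least one of k randomly drawn samples is correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: attempts allowed.
    """
    if n - c < k:
        return 1.0   # too few failures to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))   # 0.25: plain per-sample success rate
print(pass_at_k(n=20, c=5, k=5))   # much higher when 5 attempts are allowed
```

Naively computing (c/n)^k-style estimates from small n is biased; this combinatorial form is why code evals oversample (n > k) and then report pass@1, pass@10, etc.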

Production monitoring: Track user feedback (thumbs up/down), response regeneration rate, task completion rate, and time-on-task. These proxy metrics capture real-world quality better than benchmarks.

Q10: What are LLM guardrails? How do you make LLMs safe for production?

💡
Model Answer:

Guardrails are safety mechanisms that constrain LLM behavior to prevent harmful, biased, or incorrect outputs.

Input guardrails:

  • Prompt injection detection: Detect attempts to override system prompts ("ignore all previous instructions")
  • Topic filtering: Block queries on prohibited topics (illegal activities, PII extraction)
  • Rate limiting: Prevent abuse by limiting request volume per user
  • Input validation: Check for malicious payloads, excessive length, encoding attacks

Output guardrails:

  • Content safety classifier: Run a lightweight classifier on generated output to detect toxic/harmful content before showing to the user
  • PII detection: Scan for and redact personal information (SSN, emails, phone numbers) in responses
  • Factuality checking: Cross-reference claims against retrieved documents (in RAG systems)
  • Format validation: Ensure JSON output is valid, code compiles, SQL is syntactically correct
  • Refusal detection: Detect when the model should refuse but does not, and vice versa

Frameworks: NVIDIA NeMo Guardrails, Guardrails AI, LlamaGuard (Meta's safety classifier). In practice, most production systems stack multiple guardrails with different latency/accuracy tradeoffs.
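
Output-side PII redaction is often the first, cheapest layer. A simplified regex sketch (US-centric patterns chosen for illustration; production systems typically combine regexes with an NER model):

```python
import re

# Each pattern maps a PII shape to a replacement label. Order matters:
# more specific patterns (SSN) run before broader ones (phone).
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
]

def redact_pii(text: str) -> str:
    for pattern, label in PII_PATTERNS:
        text = pattern.sub(label, text)
    return text

print(redact_pii("Reach John at john.doe@example.com or 555-123-4567."))
# Reach John at [EMAIL] or [PHONE].
```

The same scan-and-replace structure generalizes to the other output guardrails: run a cheap detector over the generated text, then block, redact, or regenerate.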

Q11: What is an AI agent? How do tool-using LLMs work?

💡
Model Answer:

An AI agent is an LLM that can take actions in the real world by calling external tools (APIs, databases, code interpreters) in a loop until the task is complete.

Architecture (ReAct pattern):

  1. Thought: LLM reasons about what to do next
  2. Action: LLM selects a tool and provides arguments (e.g., search("current weather in NYC"))
  3. Observation: Tool returns results, which are appended to the context
  4. Repeat: LLM reasons about the observation and decides next action or final answer

Function calling: Modern LLMs (GPT-4, Claude, Gemini) are trained to output structured tool calls. You define tools as JSON schemas in the system prompt, and the model outputs structured calls when appropriate. The runtime executes the tool and feeds results back.
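
The loop above can be sketched with stubbed components. `fake_llm` and `search` below stand in for a real model call and a real tool; the tool registry, execution step, and step cap are the actual pattern:

```python
def search(query: str) -> str:
    return "72F and sunny"        # stub tool; a real one would call an API

TOOLS = {"search": search}

def fake_llm(context: str) -> dict:
    # A real LLM would emit a structured tool call or a final answer here.
    if "Observation" not in context:
        return {"action": "search", "args": "current weather in NYC"}
    return {"final": "It is 72F and sunny in NYC."}

def run_agent(task: str, max_steps: int = 5) -> str:
    context = f"Task: {task}"
    for _ in range(max_steps):                 # step cap bounds cost and loops
        step = fake_llm(context)
        if "final" in step:
            return step["final"]
        result = TOOLS[step["action"]](step["args"])    # execute the tool
        context += (f"\nAction: {step['action']}({step['args']})"
                    f"\nObservation: {result}")
    return "Step limit reached"

print(run_agent("What's the weather in NYC?"))
```

Note that the loop appends every observation back into the context, which is exactly why multi-step agents burn tokens quickly (the cost-explosion challenge below).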

Frameworks: LangChain, LlamaIndex, CrewAI, AutoGen, Semantic Kernel

Challenges:

  • Error propagation: One bad tool call can derail the entire chain
  • Cost explosion: Multi-step reasoning uses many LLM calls (5–20 per task)
  • Reliability: Agents fail unpredictably. Retry logic and fallbacks are essential.
  • Security: Tool-using agents can perform dangerous actions (delete files, send emails). Require human approval for sensitive actions.

Q12: What is speculative decoding? How does it speed up LLM inference?

💡
Model Answer:

Speculative decoding uses a small "draft" model to generate candidate tokens quickly, then the large "target" model verifies them in a single forward pass.

Algorithm:

  1. Draft model generates k tokens (e.g., k=5) very quickly
  2. Target model processes all k tokens in a single forward pass (parallel verification)
  3. Accept tokens where draft and target agree. Reject and regenerate from the first disagreement.
  4. On average, if draft acceptance rate is 70%, you get 3.5 tokens per target model forward pass instead of 1

Why it works: Most tokens are "easy" (predictable from context). A small 1B model can correctly predict 60–80% of what a 70B model would generate. The target model's forward pass for k tokens costs nearly the same as for 1 token (GPU memory is the bottleneck, not compute for small k).

Speed improvement: 2–3x faster generation with mathematically identical output distribution. Used by Medusa, Eagle, and implemented in vLLM and TensorRT-LLM.
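
A back-of-envelope model of the gain, assuming each draft token is accepted independently with probability p and drafting restarts at the first rejection (this is more conservative than multiplying k by the acceptance rate, since tokens after a rejection are discarded):

```python
def expected_tokens_per_pass(p, k):
    """Expected tokens produced per target-model forward pass.

    A run of i accepted draft tokens occurs with probability p^i, and the
    target pass always contributes at least one token (the correction),
    so the expectation is sum_{i=0..k} p^i = (1 - p^(k+1)) / (1 - p).
    Requires 0 <= p < 1.
    """
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.5, 0.7, 0.9):
    print(f"acceptance {p:.0%}: {expected_tokens_per_pass(p, 5):.2f} tokens/pass")
```

With p = 0 the formula correctly degrades to 1 token per pass, i.e. ordinary autoregressive decoding.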

Q13: How would you fine-tune an LLM for a specific domain (e.g., medical, legal)?

💡
Model Answer:

Decision framework (in order of preference):

  1. RAG first: Index domain documents and retrieve at inference time. No training needed. Works well when the knowledge is in documents and questions are factoid. Try this before fine-tuning.
  2. Prompt engineering: Craft domain-specific system prompts with terminology, constraints, and examples. Often sufficient for formatting and tone adjustments.
  3. LoRA/QLoRA fine-tuning: When RAG is insufficient — the model needs to learn domain-specific reasoning patterns, terminology usage, or output formats that cannot be captured in prompts.
  4. Full fine-tuning: Only when you have 100K+ domain examples and need maximum performance. Requires significant compute.
  5. Continued pre-training: Train on raw domain text (medical papers, legal documents) with the CLM objective to teach domain vocabulary and knowledge. Then apply SFT on task-specific data.

Data preparation for domain fine-tuning:

  • Curate 5K–50K high-quality (instruction, response) pairs from domain experts
  • Include diverse task types: QA, summarization, classification, extraction
  • Add "refusal" examples where the model should say "I don't know" or "consult a professional"
  • Validate with domain experts — incorrect training examples are worse than no fine-tuning

Evaluation: Domain-specific benchmarks + expert review. General-purpose benchmarks may not reflect domain performance. Always check for "catastrophic forgetting" of general capabilities.

Q14: What is model distillation? How does it differ from quantization?

💡
Model Answer:

Knowledge Distillation: Train a smaller "student" model to mimic the behavior of a larger "teacher" model.

  • Teacher generates soft probability distributions over vocabulary (logits) for training data
  • Student is trained to match teacher's output distributions (KL divergence loss) + ground truth labels
  • The student learns from the teacher's "dark knowledge" — what the teacher considers plausible alternatives, not just the correct answer
  • Result: A 1B student can achieve 85–95% of a 70B teacher's performance
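
The soft-target loss can be written out directly. A sketch with a toy 4-token vocabulary (real distillation sums this over sequence positions and mixes in the hard-label cross-entropy):

```python
import math

def softmax(logits, temperature=1.0):
    m = max(logits)                                   # shift for stability
    exps = [math.exp((l - m) / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Higher temperature flattens the teacher's distribution, exposing the
    'dark knowledge': relative probabilities of wrong-but-plausible tokens.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

teacher  = [4.0, 2.0, 0.5, -1.0]
aligned  = [3.8, 2.1, 0.4, -0.9]   # student close to the teacher
diverged = [0.0, 4.0, -1.0, 2.0]
assert distill_kl(teacher, aligned) < distill_kl(teacher, diverged)
```

Minimizing this KL term is what transfers the full output distribution rather than just the argmax label.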

Quantization: Reduce the numerical precision of an existing model's weights.

  • FP32 (4 bytes) → FP16 (2 bytes) → INT8 (1 byte) → INT4 (0.5 bytes)
  • No training required for post-training quantization (PTQ). GPTQ, AWQ, and GGUF are popular formats.
  • 4-bit quantization reduces model size by 4x with 1–3% quality degradation on benchmarks
  • QAT (Quantization-Aware Training) includes quantization during training for better quality

Key difference: Distillation changes the model architecture (fewer layers/parameters). Quantization keeps the same architecture but uses fewer bits per parameter. They can be combined: distill to a smaller model, then quantize it.

Q15: Compare fine-tuning vs RAG. When do you choose each?

💡
Model Answer:

| Factor | Fine-tuning | RAG |
| --- | --- | --- |
| Knowledge updates | Requires retraining (hours/days) | Update index in real time (minutes) |
| Factual accuracy | Can hallucinate learned facts | Grounded in retrieved documents, citable |
| Reasoning patterns | Can learn new reasoning styles | Limited to base model's reasoning ability |
| Output format/style | Excellent at learning specific formats | Relies on prompting for format control |
| Cost | GPU training cost (one-time) + storage per model | Embedding + vector DB + retrieval latency (ongoing) |
| Latency | Same as base model | +100–500 ms for the retrieval step |
| Data needed | 1K–100K labeled examples | Documents (unlabeled) in any format |

Decision:

  • Use RAG when: knowledge changes frequently, you need citations, you have documents but no labeled data, or you need to query a large knowledge base
  • Use fine-tuning when: you need to change model behavior/style/format, teach domain-specific reasoning, or when RAG retrieval quality is poor for your use case
  • Use both when: you need domain-adapted reasoning AND up-to-date factual knowledge. Fine-tune for style and reasoning, RAG for facts.

Key Takeaways

💡
  • RAG is the most commonly tested LLM topic — know the full pipeline from chunking to generation
  • Understand RLHF/DPO: why base models need alignment and how the training pipeline works
  • Know token economics: how to estimate costs and optimize them (caching, routing, batching)
  • Hallucination mitigation requires multiple layers: RAG + verification + guardrails
  • Be ready to compare fine-tuning vs RAG with concrete decision criteria
  • MoE, speculative decoding, and KV cache optimization show production-level knowledge that impresses interviewers