
Practice Questions & Tips

This final lesson brings everything together with rapid-fire questions to test your knowledge, coding challenges to practice, and strategic tips from successful NLP interview candidates.

Rapid-Fire Questions

Time yourself: try to answer each in under 60 seconds. These test breadth of knowledge and quick recall — both critical for phone screens and early interview rounds.

Each question is followed by the expected 1–2 sentence answer.

1. What does the attention mechanism compute?
   A weighted sum of value vectors, where weights come from the compatibility (dot product) between query and key vectors, scaled by sqrt(d_k) and normalized with softmax.

2. Why does BERT use the [CLS] token?
   The [CLS] token's output representation aggregates information from the entire sequence through self-attention, making it suitable as a fixed-length input for classification heads.

3. What is perplexity?
   The exponentiation of cross-entropy loss: PPL = exp(loss). It measures how "surprised" the model is by test data; lower is better. A PPL of 10 means the model is as uncertain as choosing among 10 equally likely options.

4. Name three differences between BERT and GPT.
   BERT: encoder-only, bidirectional, MLM objective. GPT: decoder-only, causal (left-to-right), next-token prediction. BERT is for understanding, GPT for generation.

5. What is teacher forcing?
   During training of seq2seq models, the ground-truth previous token is fed to the decoder at each step rather than the model's own prediction. It speeds convergence but creates exposure bias.

6. What is the curse of dimensionality in embeddings?
   As embedding dimensions increase, data points become nearly equidistant in high-dimensional space, making similarity metrics less discriminative. This is why 768-dim BERT embeddings work well but 10,000-dim would not.

7. What is label smoothing?
   Instead of one-hot targets (1.0 for the correct class, 0 for others), use soft targets (e.g., 0.9 for the correct class, 0.1/(K-1) for each other). This prevents the model from becoming overconfident and improves generalization.

8. Why is layer normalization preferred over batch normalization in transformers?
   Layer norm normalizes across features for each sample independently, so it works with variable-length sequences and small batch sizes. Batch norm normalizes across the batch dimension, which is unstable with variable-length text.

9. What is the "lost in the middle" problem?
   LLMs retrieve information more accurately from the beginning and end of long contexts but struggle with information placed in the middle. This affects RAG systems, where relevant context may land in the middle of the prompt.

10. What is contrastive learning in NLP?
    Training embeddings by pulling similar (positive) pairs closer and pushing dissimilar (negative) pairs apart. SimCSE uses dropout as augmentation: the same sentence passed through BERT twice with different dropout masks creates a positive pair.

11. What is BM25?
    A ranking function based on TF-IDF with document-length normalization and term-frequency saturation. It is the standard baseline for information retrieval and is used in Elasticsearch. Still competitive with dense retrieval for keyword-heavy queries.

12. What is the difference between precision and recall in NER?
    Precision: of the entities the model predicted, how many are correct. Recall: of the entities in the gold labels, how many the model found. In medical NER, recall is typically more important (missing a diagnosis is worse than a false positive).

13. What is tokenizer fertility?
    The average number of tokens a tokenizer produces per word. English: ~1.3 with a 32K BPE vocab. Chinese: ~1.0 (character-level). Low-resource languages: ~2.0+ (many subword splits).

14. What is Flash Attention?
    An IO-aware attention algorithm that reduces GPU memory reads/writes by tiling the attention computation. It produces the same output as standard attention but runs 2–4x faster and uses O(n) memory instead of O(n^2). Used by most modern LLM implementations.

15. What is the difference between SFT and RLHF?
    SFT (Supervised Fine-Tuning) trains on expert demonstrations (instruction, response pairs). RLHF uses human preference rankings to train a reward model, then optimizes the policy to maximize reward. SFT teaches "what to say"; RLHF teaches "how to say it better."
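Two of the answers above (perplexity and label smoothing) come up as small coding follow-ups. A minimal sketch of each; the function names here are mine, not a standard API:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood of the reference tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def smooth_targets(num_classes: int, correct: int, eps: float = 0.1) -> list[float]:
    """Soften a one-hot target: 1 - eps on the correct class, eps/(K-1) elsewhere."""
    off = eps / (num_classes - 1)
    targets = [off] * num_classes
    targets[correct] = 1.0 - eps
    return targets

# A model assigning probability 0.1 to every reference token is as uncertain
# as choosing among 10 equally likely options:
print(perplexity([0.1] * 5))         # 10.0 (up to float rounding)
print(smooth_targets(4, correct=2))  # correct class gets 0.9, the rest share 0.1
```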

Coding Challenges

These are actual coding tasks you might encounter in an NLP interview. Practice implementing them without referring to documentation.

Challenge 1: Implement BPE Tokenization from Scratch

def train_bpe(corpus: list[str], num_merges: int) -> dict:
    """Train a BPE tokenizer on a corpus.

    Args:
        corpus: List of words (pre-tokenized)
        num_merges: Number of merge operations to learn

    Returns:
        Dictionary of merge rules {(token_a, token_b): merged_token}
    """
    # Step 1: Initialize vocabulary as character-level tokens
    # Each word becomes a tuple of characters + end-of-word marker
    vocab = {}
    for word in corpus:
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] = vocab.get(chars, 0) + 1

    merges = {}

    for i in range(num_merges):
        # Step 2: Count all adjacent pairs
        pairs = {}
        for word, freq in vocab.items():
            for j in range(len(word) - 1):
                pair = (word[j], word[j + 1])
                pairs[pair] = pairs.get(pair, 0) + freq

        if not pairs:
            break

        # Step 3: Find the most frequent pair
        best_pair = max(pairs, key=pairs.get)
        merges[best_pair] = best_pair[0] + best_pair[1]

        # Step 4: Merge that pair everywhere in the vocabulary
        new_vocab = {}
        for word, freq in vocab.items():
            new_word = []
            j = 0
            while j < len(word):
                if j < len(word) - 1 and (word[j], word[j + 1]) == best_pair:
                    new_word.append(best_pair[0] + best_pair[1])
                    j += 2
                else:
                    new_word.append(word[j])
                    j += 1
            new_vocab[tuple(new_word)] = freq
        vocab = new_vocab

    return merges


# Test it
corpus = ["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6 + ["wider"] * 3
merges = train_bpe(corpus, num_merges=10)
print("Learned merges:", merges)
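A common follow-up is applying the learned merges to a new word. A minimal greedy sketch; the `apply_bpe` helper is my naming, not part of the challenge, and it relies on dicts preserving the order in which merges were learned (Python 3.7+):

```python
def apply_bpe(word: str, merges: dict) -> list[str]:
    """Encode one word by replaying merge rules in the order they were learned."""
    tokens = list(word) + ["</w>"]
    for pair, merged in merges.items():  # insertion order = learning order
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                tokens[i:i + 2] = [merged]  # merge in place, re-check same index
            else:
                i += 1
    return tokens

# Hand-written merges for illustration (real ones come from train_bpe):
merges = {("l", "o"): "lo", ("lo", "w"): "low", ("w", "</w>"): "w</w>"}
print(apply_bpe("low", merges))   # ['low', '</w>']
print(apply_bpe("slow", merges))  # ['s', 'low', '</w>']
```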

Challenge 2: Implement Self-Attention in PyTorch

import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        Q = self.W_q(x)  # (batch, seq_len, d_model)
        K = self.W_k(x)
        V = self.W_v(x)

        # Reshape for multi-head: (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply mask (for causal attention)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Softmax + weighted sum
        attn_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)

        # Reshape back: (batch, seq_len, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)

        return self.W_o(output)


# Test it
attn = SelfAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)  # batch=2, seq_len=10, d_model=512
output = attn(x)
print(f"Input shape: {x.shape}, Output shape: {output.shape}")
# Both should be torch.Size([2, 10, 512])
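The mask argument is where causal (GPT-style) attention comes in. A small sketch of how a lower-triangular mask interacts with masked_fill and softmax; a (seq_len, seq_len) mask like this broadcasts across the batch and head dimensions when passed as attn(x, mask=causal_mask):

```python
import torch

seq_len = 4
# Position i may attend only to positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# What masked_fill + softmax does to uniform scores:
scores = torch.zeros(seq_len, seq_len)
masked = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)
# Row 0 puts all weight on position 0; row i spreads 1/(i+1) over positions 0..i
```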

Challenge 3: Implement a Simple RAG Pipeline

from sentence_transformers import SentenceTransformer
import numpy as np

class SimpleRAG:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        # Encode all documents at initialization
        self.doc_embeddings = self.encoder.encode(documents, normalize_embeddings=True)

    def retrieve(self, query: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Retrieve top-k most relevant documents for a query."""
        query_embedding = self.encoder.encode([query], normalize_embeddings=True)

        # Cosine similarity (embeddings are normalized, so dot product = cosine)
        similarities = np.dot(self.doc_embeddings, query_embedding.T).flatten()

        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]

        return [(self.documents[i], float(similarities[i])) for i in top_indices]

    def generate_prompt(self, query: str, top_k: int = 3) -> str:
        """Create a RAG prompt with retrieved context."""
        retrieved = self.retrieve(query, top_k)

        context = "\n\n".join([f"[Source {i+1}]: {doc}"
                               for i, (doc, score) in enumerate(retrieved)])

        prompt = f"""Answer the question based ONLY on the provided context.
If the context does not contain the answer, say "I don't have enough information."

Context:
{context}

Question: {query}

Answer:"""
        return prompt


# Usage
docs = [
    "BERT uses masked language modeling where 15% of tokens are masked.",
    "GPT models use causal language modeling with left-to-right attention.",
    "LoRA adds low-rank matrices to transformer weights for efficient fine-tuning.",
    "RAG combines retrieval with generation to ground LLM outputs in facts.",
]
rag = SimpleRAG(docs)
prompt = rag.generate_prompt("How does BERT train?")
print(prompt)
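A detail worth being able to justify in an interview: because encode is called with normalize_embeddings=True, the dot product in retrieve is exactly cosine similarity. A toy numpy check with hand-made 2-D vectors (no model download needed):

```python
import numpy as np

docs = np.array([[3.0, 4.0], [1.0, 0.0]])
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit length, like normalize_embeddings=True
query = np.array([[0.6, 0.8]])  # already unit length

# Dot product of unit vectors = cosine similarity
sims = (docs @ query.T).flatten()
print(sims)  # approximately [1.0, 0.6]: the first doc points the same direction as the query
```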

Interview Strategy Tips

Structure Your Answers

Use the STAR-T format for NLP answers: State the concept, explain the Technical details, discuss Alternatives, mention Real-world considerations, and end with Trade-offs. This shows both breadth and depth in a structured way.

Always Start with "It Depends"

When asked "Should I use X or Y?", never jump to one answer. Say "It depends on [factor 1], [factor 2], [factor 3]." Then discuss each factor. This shows senior-level thinking and avoids the trap of giving a one-size-fits-all answer.

Draw Diagrams

For system design and architecture questions, draw before you talk. A clear diagram of the transformer architecture, RAG pipeline, or NLP serving system communicates more than 5 minutes of verbal explanation.

Mention Papers by Name

Referencing papers shows depth: "As shown in the Chinchilla paper by Hoffmann et al..." or "DeBERTa introduced disentangled attention..." You do not need to memorize every detail, but knowing key papers and their contributions impresses interviewers.

Know Your Projects Cold

Prepare to go 3 levels deep on any project you mention. "I fine-tuned BERT" will get follow-ups: "What learning rate? How did you handle class imbalance? What was the latency in production? How did you evaluate?" Have specific numbers ready.

Practice Out Loud

Reading answers is not enough. Set a timer for 3 minutes and explain a concept (attention, RLHF, BPE) out loud as if to an interviewer. Record yourself and listen back. Most candidates stumble on their first attempt — practice eliminates this.

Frequently Asked Questions

How many hours should I prepare for an NLP interview?

Plan for 40–60 hours of focused preparation spread over 3–4 weeks. Break it down: 15 hours on fundamentals (tokenization, embeddings, transformers), 15 hours on LLM/GenAI topics (RAG, RLHF, prompt engineering), 10 hours on coding practice (implement attention, BPE, simple models), and 10 hours on mock interviews. If you already work with NLP daily, you can reduce this to 25–35 hours focused on gaps.

Do I need to know PyTorch for an NLP interview?

For NLP/ML engineer roles: yes. You should be able to write a training loop, implement custom layers, and use HuggingFace Transformers fluently. For LLM/GenAI engineer roles: PyTorch is less critical — focus on API usage, prompting, and system design instead. For research roles: deep PyTorch knowledge is essential, including custom CUDA kernels and distributed training.

Should I focus on classical NLP or modern transformers?

Spend 70% of your time on modern topics (transformers, LLMs, RAG, RLHF) and 30% on classical foundations (tokenization, TF-IDF, word embeddings). Classical topics test your depth of understanding and often appear in phone screens. But the majority of interview time, especially in later rounds, will be on modern approaches.

What if I am asked about a paper or model I have not read?

Be honest: "I have not read that specific paper, but based on the name and context, it likely addresses [X]." Then pivot to what you do know: "In related work, I am familiar with [Y] which solves a similar problem by..." Interviewers respect honesty and the ability to reason from first principles far more than they penalize knowledge gaps.

How do I answer "Tell me about an NLP project you worked on"?

Use this structure: (1) Business problem and why NLP was needed, (2) Data: size, quality, how you collected/labeled it, (3) Approach: what you tried first (baseline), what worked best, and why, (4) Results: specific metrics and business impact (latency reduced 40%, accuracy improved from 78% to 93%, saved $200K/year), (5) Lessons learned: what you would do differently. Keep it to 3–4 minutes, then let the interviewer drill into specifics.

What are the most common reasons NLP candidates fail interviews?

Based on interviewer feedback from top companies: (1) Cannot explain transformers beyond surface level — when pushed on attention computation or positional encoding, they cannot go deeper, (2) No production experience or mindset — only talk about models in Jupyter notebooks, never mention latency, cost, monitoring, or deployment, (3) Cannot code — can discuss architectures verbally but cannot implement basic components, (4) Poor communication — answers are rambling, disorganized, or use jargon without explanation, (5) No opinion on trade-offs — give one answer without considering alternatives.

How important are papers and staying current?

You do not need to read every paper on arXiv, but you should know the landmark papers and recent trends. Must-know papers: Attention Is All You Need (transformer), BERT, GPT-2/3, LoRA, DPO, RAG. Nice to know: Chinchilla, Flash Attention, Mistral, Mixture of Experts. Follow 3–5 NLP researchers on Twitter/X and read their paper summaries. Subscribing to newsletters like "The Batch" (Andrew Ng) or "NLP News" gives weekly updates in 10 minutes.

Should I mention LLM limitations and failures in my answers?

Absolutely. Discussing limitations shows maturity and real-world experience. When asked about RAG, mention retrieval failures and hallucination risks. When asked about fine-tuning, mention catastrophic forgetting and data quality issues. When asked about LLM agents, mention reliability problems and cost concerns. Interviewers are specifically looking for candidates who understand what can go wrong, not just the happy path.

Final Checklist

💡 Before your interview, make sure you can:
  • Explain the transformer architecture and self-attention from first principles (Q/K/V computation, multi-head, why sqrt(d_k))
  • Compare BERT vs GPT vs T5: architectures, training objectives, and use cases
  • Describe BPE tokenization step by step and implement it in code
  • Explain LoRA: what it is, why it works, how to configure rank and alpha
  • Design a complete RAG pipeline from document ingestion to answer generation
  • Discuss RLHF/DPO: why alignment is needed and how each method works
  • Calculate LLM inference costs and propose optimization strategies
  • Explain BLEU, ROUGE, and BERTScore with their strengths and limitations
  • Write a transformer training loop in PyTorch using HuggingFace
  • Tell 3 project stories with specific metrics and technical depth
  • Discuss at least 3 recent developments (MoE, long-context, multimodal, reasoning models)
  • Articulate trade-offs for common decisions: fine-tuning vs RAG, BERT vs GPT, build vs buy
💡 Good luck with your NLP interview! Remember: the goal is not to memorize every answer in this course. It is to understand the concepts deeply enough that you can reason about novel questions from first principles. If you can explain why something works (not just how), you will stand out from 90% of candidates.