# Python Coding in ML Interviews
Understand what interviewers expect from your Python code in machine learning interviews. This lesson covers the format, tools allowed, coding style preferences, and evaluation criteria that determine whether you pass or fail.
## Why Python Fluency Matters in ML Interviews
ML interviews are not just about knowing algorithms — they test whether you can translate mathematical concepts into clean, working Python code under time pressure. Interviewers evaluate your Python fluency as a proxy for how productive you will be on the team. Candidates who write idiomatic Python with proper use of NumPy, Pandas, and library APIs consistently outperform those who write verbose, loop-heavy code.
## The Python ML Interview Landscape
Python coding questions in ML interviews fall into distinct categories. Each requires different libraries and coding patterns:
| Category | Primary Libraries | Time Limit | Example Question |
|---|---|---|---|
| Numerical Computing | NumPy | 15–20 min | “Compute pairwise cosine similarity for a matrix of embeddings without loops” |
| Data Wrangling | Pandas | 15–25 min | “Given sales data, compute rolling 7-day average revenue per region” |
| ML Pipelines | Scikit-Learn | 20–30 min | “Build a pipeline with custom transformer, imputer, and cross-validated model” |
| Deep Learning | PyTorch / TensorFlow | 25–40 min | “Implement a custom dataset class and training loop for image classification” |
| Data Puzzles | Pandas + NumPy | 20–30 min | “Deduplicate records with fuzzy matching and merge with a reference table” |
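As a taste of the data-wrangling category, here is one way the rolling 7-day average question could be answered. This is a sketch under stated assumptions: column names `date`, `region`, and `revenue` are made up for illustration, and the data is assumed to have exactly one row per region per day (so a row-based rolling window equals a 7-day window).

```python
import numpy as np
import pandas as pd

# Toy data: two regions, ten consecutive days each (assumed schema)
rng = pd.date_range("2024-01-01", periods=10)
df = pd.DataFrame({
    "date": rng.tolist() * 2,
    "region": ["east"] * 10 + ["west"] * 10,
    "revenue": np.arange(20, dtype=float),
})

df = df.sort_values(["region", "date"])
# Rolling 7-day mean within each region; min_periods=1 gives partial
# averages for the first six days instead of NaN
df["rolling_7d"] = (
    df.groupby("region")["revenue"]
      .transform(lambda s: s.rolling(7, min_periods=1).mean())
)
print(df.head(8))
```

The key idiom is `groupby(...).transform(...)`, which applies the rolling window per group and returns a result aligned to the original index, so it can be assigned straight back as a column.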
## Tools and Environments You Will Encounter

### Always Available
Python 3.8+, NumPy, Pandas, and the Python standard library (`collections`, `itertools`, `functools`, `math`). You can assume these are imported.

### Usually Available
Scikit-Learn for pipeline and preprocessing questions, matplotlib for quick plots, SciPy for statistical tests. Ask before using.

### Ask First
PyTorch / TensorFlow (only for deep learning roles), XGBoost / LightGBM, and specialized libraries such as Hugging Face Transformers.
## Coding Style That Impresses Interviewers
Your coding style sends strong signals about your experience level. Here are the patterns interviewers look for:
### Use Vectorized Operations, Not Loops
```python
# BAD - Loop-based (screams "beginner")
result = []
for i in range(len(X)):
    dot = 0
    for j in range(len(X[0])):
        dot += X[i][j] * w[j]
    result.append(dot)

# GOOD - Vectorized (shows NumPy fluency)
result = X @ w
```
### Use Pandas Idioms, Not Row Iteration
```python
# BAD - Iterating rows
for idx, row in df.iterrows():
    df.at[idx, 'ratio'] = row['revenue'] / row['cost']

# GOOD - Vectorized pandas
df['ratio'] = df['revenue'] / df['cost']
```
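The vectorized version will emit `inf` if any `cost` is zero, and the evaluation rubric below explicitly rewards handling zero divisions. One vectorized guard (a sketch; the column names are the same illustrative ones as above) is to map zero costs to `NaN` before dividing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [10.0, 5.0, 8.0], "cost": [2.0, 0.0, 4.0]})

# Replace zero costs with NaN so division yields NaN rather than inf,
# keeping the whole operation vectorized
df["ratio"] = df["revenue"] / df["cost"].replace(0, np.nan)
print(df["ratio"].tolist())  # [5.0, nan, 2.0]
```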
### Use List Comprehensions and Built-ins
```python
# BAD - Verbose loop
filtered = []
for item in data:
    if item > threshold:
        filtered.append(item)

# GOOD - Pythonic
filtered = [x for x in data if x > threshold]

# EVEN BETTER for NumPy arrays
filtered = data[data > threshold]
```
### Write Type Hints and Docstrings
```python
import numpy as np
from typing import Tuple

def normalize(X: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Standardize features to zero mean, unit variance.

    Args:
        X: Feature matrix of shape (n_samples, n_features).

    Returns:
        Tuple of (X_normalized, means, stds).
    """
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    stds[stds == 0] = 1  # Avoid division by zero
    return (X - means) / stds, means, stds
```
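A quick sanity check of the zero-std guard, with the function repeated so the snippet runs on its own. The second column is deliberately constant to exercise the guard:

```python
import numpy as np

def normalize(X):
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    stds[stds == 0] = 1  # constant columns pass through unscaled
    return (X - means) / stds, means, stds

# Second column is constant, exercising the zero-std guard
X = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
X_norm, means, stds = normalize(X)
print(X_norm.mean(axis=0))  # both columns centered at ~0
print(stds)                 # [~1.633, 1.0] -- the guard replaced 0 with 1
```

Without the guard, the constant column would produce `0 / 0` and fill the output with `NaN`; mentioning this edge case aloud is exactly the kind of detail the rubric below rewards.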
## Time Management Strategy
Most Python coding questions in ML interviews give you 20–30 minutes. Here is how to allocate your time:
| Phase | Time | What to Do |
|---|---|---|
| Clarify | 2–3 min | Ask about input format (DataFrame vs array), edge cases, allowed libraries |
| Plan | 3–5 min | Write pseudocode or outline. State your approach aloud. |
| Code | 12–18 min | Write clean, modular code. Narrate as you go. |
| Test | 3–5 min | Run with simple data. Print intermediate results. Check edge cases. |
## Evaluation Rubric: How Your Python Code Is Scored
| Criterion | Weight | What Gets High Marks |
|---|---|---|
| Correctness | 35% | Code produces correct output. Handles edge cases (empty arrays, NaN values, zero divisions). |
| Python Fluency | 25% | Uses vectorized ops, list comprehensions, proper library APIs. No unnecessary loops. |
| Code Quality | 20% | Readable variable names, modular functions, docstrings. Code a teammate would want to review. |
| Problem Solving | 10% | Systematic approach. Breaks problem into steps. Handles complexity incrementally. |
| Communication | 10% | Explains reasoning aloud. Discusses trade-offs. Responds to hints productively. |
## Common Python Pitfalls in Interviews
The classic gotcha is the mutable default argument: `def func(data=[])` creates the list once, at function definition time, so it is shared across every call. Use `def func(data=None)` and initialize the list inside the function. Interviewers specifically test for this.

### Top 8 Python Mistakes in ML Interviews
| # | Mistake | Fix |
|---|---|---|
| 1 | Using loops where NumPy vectorization works | Reach for broadcasting and built-in ufuncs first (np.vectorize is a convenience wrapper around a loop, not true vectorization) |
| 2 | Modifying a DataFrame while iterating | Use .apply(), .transform(), or vectorized ops |
| 3 | Forgetting axis parameter in NumPy/Pandas | Specify axis explicitly: axis=0 aggregates down columns, axis=1 across rows |
| 4 | Not handling NaN values before computation | Check with df.isna().sum() and use fillna() or dropna() |
| 5 | Confusing .copy() vs reference assignment | Use df.copy() when you need an independent copy |
| 6 | Integer division where float division is intended | In Python 3, / is always float division; use // only when you truly want floor division |
| 7 | Not resetting index after filtering | Use .reset_index(drop=True) after filtering DataFrames |
| 8 | Importing everything with from module import * | Import specifically: import numpy as np, import pandas as pd |
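The mutable-default-argument gotcha called out above is easiest to internalize by running it. A minimal demonstration of both the bug and the idiomatic fix:

```python
def buggy(item, data=[]):
    # The default list is created once, at def time, and reused forever
    data.append(item)
    return data

def fixed(item, data=None):
    # Idiomatic fix: create a fresh list on each call
    if data is None:
        data = []
    data.append(item)
    return data

print(buggy(1), buggy(2))  # [1, 2] [1, 2] -- the same shared list
print(fixed(1), fixed(2))  # [1] [2]   -- independent lists
```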
## Quick Warm-Up: Test Your Python Instincts
Before diving into the challenge lessons, try this warm-up question. It tests whether you think in vectorized Python or fall back on loops.
```python
import numpy as np

def pairwise_distances(X: np.ndarray) -> np.ndarray:
    """Compute pairwise Euclidean distance matrix without loops.

    Uses the identity: ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b

    Args:
        X: Array of shape (n, d).

    Returns:
        Distance matrix of shape (n, n).
    """
    # ||x_i||^2 for each row
    sq_norms = np.sum(X ** 2, axis=1)  # shape: (n,)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * x_i . x_j
    dist_sq = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * X @ X.T
    # Numerical stability: clamp negative values from floating point errors
    dist_sq = np.maximum(dist_sq, 0)
    return np.sqrt(dist_sq)

# Test
X = np.array([[0, 0], [3, 4], [1, 0]], dtype=float)
D = pairwise_distances(X)
print(D)
# Expected:
# [[0.      5.      1.     ]
#  [5.      0.      4.472...]
#  [1.      4.472...  0.    ]]
```
What makes this a strong answer:
- No loops: fully vectorized using broadcasting and matrix multiplication
- Uses the mathematical identity to avoid computing n × n × d differences explicitly
- Handles numerical stability with `np.maximum`
- Includes a docstring explaining the approach
- Includes a test with verifiable results
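The cosine-similarity question from the category table yields to the same instincts. One possible sketch (the 1e-12 floor for zero-norm rows is an illustrative choice, not the only way to guard that edge case):

```python
import numpy as np

def cosine_similarity(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for the rows of X, without loops."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # shape: (n, 1)
    norms = np.maximum(norms, 1e-12)  # guard against zero-norm rows
    X_unit = X / norms                # broadcasting: (n, d) / (n, 1)
    return X_unit @ X_unit.T

X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
S = cosine_similarity(X)
print(np.round(S, 3))
```

The pattern is the same as the distance matrix: normalize once with broadcasting, then let a single matrix multiplication produce all n × n pairs at once.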
## Course Roadmap
The remaining six lessons each focus on a specific library or problem type. Each contains 8–10 real interview challenges with complete solutions:
| Lesson | Focus | Challenges |
|---|---|---|
| NumPy Challenges | Array ops, broadcasting, vectorization, distances | 10 |
| Pandas Challenges | Groupby, merge, pivot, window functions, time series | 10 |
| Scikit-Learn Challenges | Pipelines, custom transformers, CV, grid search | 10 |
| PyTorch Challenges | Custom datasets, layers, training loops, debugging | 8 |
| Data Manipulation Puzzles | Dedup, merge strategies, aggregation, performance | 10 |
| Practice Problems & Tips | Timed challenges, optimization tips, FAQ | 10 |
Lilly Tech Systems