Intermediate

Technical Depth Questions

AI PMs do not need to write code, but they must understand ML concepts well enough to make informed product decisions and earn the trust of their data science teams. These 10 questions test the technical literacy that separates strong AI PMs from generic PMs.

Q1: How would you explain machine learning to a non-technical executive?

💡 Model Answer:

I would use an analogy: "Machine learning is like training a new employee. Instead of giving them a rule book with exact instructions for every situation, you show them thousands of examples of correct decisions. They learn the patterns and can then make decisions on new cases they have never seen before."

Key points to communicate:

  • It learns from data, not rules: "Traditional software follows explicit rules we write. ML discovers rules from data. This is powerful when the rules are too complex to write manually — like detecting spam or recommending products."
  • It is probabilistic, not deterministic: "Unlike traditional software that gives the same output every time, ML makes predictions with a confidence level. It might be 95% confident an email is spam. We need to decide what confidence threshold is acceptable."
  • It improves with more data: "The more examples we show it, the better it gets. This means our product gets smarter over time as users interact with it — a competitive advantage that compounds."
  • It has limitations: "ML can only learn patterns that exist in the training data. If the data is biased, the model will be biased. If the world changes (new user behaviors, market shifts), the model needs retraining."

What not to do: Do not mention algorithms, gradient descent, or neural networks unless asked. Executives want to understand implications for their business, not the mechanics.

Q2: When should you use AI/ML vs a rules-based system?

💡 Model Answer:

This is a critical decision that many teams get wrong by defaulting to AI when simpler solutions work. Here is my decision framework:

Use rules-based systems when:

  • The logic can be expressed in fewer than ~50 rules that domain experts can articulate
  • The rules rarely change (tax calculations, regulatory compliance checks)
  • Explainability is mandatory and every decision must be auditable
  • You have very little data (fewer than 1,000 labeled examples)
  • Errors are catastrophic and must be zero (medical device safety checks)

Use AI/ML when:

  • The patterns are too complex or numerous for humans to articulate (image recognition, natural language understanding)
  • The rules change frequently or vary by context (personalization, fraud detection)
  • You have abundant data and the patterns are learnable
  • Approximate answers are acceptable (recommendations, predictions)
  • You need to scale beyond what human reviewers can handle

The hybrid approach (often best): Use rules as guardrails around ML decisions. Example: ML predicts credit risk, but rules enforce regulatory limits. ML personalizes content, but rules filter out prohibited categories. This gives you the adaptability of ML with the safety of rules.
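The guardrail pattern above can be sketched in a few lines. A minimal illustration with a hypothetical `Transaction` type and made-up rule values (the blocked-country list, limit, and threshold are placeholders, not real compliance figures):

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    amount: float
    country: str
    ml_risk_score: float  # model output in [0, 1]

# Illustrative guardrail values only.
BLOCKED_COUNTRIES = {"XX"}
REGULATORY_LIMIT = 10_000.0
RISK_THRESHOLD = 0.8

def decide(tx: Transaction) -> str:
    # Rules run first as hard guardrails: the model can never
    # override them, no matter how confident its score is.
    if tx.country in BLOCKED_COUNTRIES:
        return "reject"          # compliance rule
    if tx.amount > REGULATORY_LIMIT:
        return "manual_review"   # regulatory limit
    # Inside the guardrails, the ML score drives the decision.
    if tx.ml_risk_score >= RISK_THRESHOLD:
        return "manual_review"
    return "approve"
```

The key design choice is ordering: deterministic, auditable rules gate the output, and the adaptive ML signal only operates within that safe envelope.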

Q3: What do you need to know about data quality as an AI PM?

💡 Model Answer:

Data quality is the single biggest determinant of AI product success. "Garbage in, garbage out" is not a cliche — it is the daily reality of every AI product team. As an AI PM, I focus on five data quality dimensions:

  • Volume: Do we have enough labeled examples? Rule of thumb: thousands for simple classification, hundreds of thousands for complex tasks, millions for generative AI. If we do not have enough, what is the plan to collect it?
  • Representativeness: Does the data reflect the actual population of users? If training data over-represents certain demographics, languages, or use cases, the model will underperform on underrepresented groups. This is a product and ethical issue, not just a technical one.
  • Freshness: How quickly does the data become stale? In rapidly changing domains (trending topics, fashion, financial markets), models trained on old data make outdated predictions. What is our data refresh cadence?
  • Label quality: Who labeled the data, and how consistent are the labels? Inter-annotator agreement below 85% means the labels themselves are noisy, and no model can learn clean patterns from noisy labels. Invest in clear labeling guidelines and quality assurance.
  • Bias: Does the data encode historical biases we want to perpetuate? Hiring data may reflect past discrimination. Loan data may encode redlining. The PM must identify these risks before model training begins.
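Label quality can be checked in code before any training starts. A minimal sketch of Cohen's kappa, a chance-corrected agreement measure (the 85% rule of thumb above refers to raw agreement; kappa is stricter because it discounts agreement expected by chance; the toy labels are invented):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both annotators independently
    # pick the same label, given each one's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a)
    return (observed - expected) / (1 - expected)

a = ["spam", "spam", "ham", "ham", "spam", "ham"]
b = ["spam", "ham",  "ham", "ham", "spam", "ham"]
print(round(cohens_kappa(a, b), 2))  # → 0.67
```

Here the annotators agree on 5 of 6 items (83%), but the chance-corrected kappa is only 0.67, which is why raw agreement alone can flatter noisy labels.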

PM-specific data responsibilities: Define what data needs to be collected in the product (implicit signals, explicit feedback). Design data collection as a product feature, not an afterthought. Advocate for data quality investments to leadership even when they are not visible to users.

Q4: Explain the precision-recall trade-off and when an AI PM needs to care about it.

💡 Model Answer:

Precision: Of the items the model flagged as positive, how many are actually positive? "When the model says yes, how often is it right?"

Recall: Of all the actual positive items, how many did the model find? "Of everything that should be flagged, how much did the model catch?"
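The two definitions reduce to simple ratios over the confusion-matrix counts. A minimal sketch with invented spam-filter numbers:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision: of everything flagged, how much was right.
    Recall: of everything that should be flagged, how much was caught."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A spam filter flags 100 emails: 90 are spam (TP), 10 are
# legitimate (FP), and it misses 30 spam emails entirely (FN).
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75
```

Raising the decision threshold would trade in the other direction: fewer false positives (higher precision) at the cost of more misses (lower recall).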

Why it matters for product decisions:

| Scenario | Optimize For | Why |
| --- | --- | --- |
| Spam filter | High precision | A legitimate email in spam (false positive) is worse than a spam email in the inbox (false negative). Users lose trust if important emails disappear. |
| Fraud detection | High recall | Missing a fraudulent transaction (false negative) costs money. Flagging a legitimate transaction for review (false positive) is a minor inconvenience. |
| Content moderation | Balance with context | Too aggressive (high recall) silences legitimate voices. Too lenient (high precision) lets harmful content through. The right balance depends on platform values and regulatory requirements. |
| Medical screening | Very high recall | Missing a disease (false negative) can be life-threatening. A false positive leads to additional testing, which is acceptable. |

The PM's role: You do not tune the model yourself. But you must define the business context that determines which direction to optimize. This means talking to users, legal, and business stakeholders to understand the relative cost of false positives vs false negatives.

Q5: What should an AI PM understand about model limitations?

💡 Model Answer:

Understanding model limitations is what separates AI PMs who ship reliable products from those who ship disasters. Here are the limitations every AI PM must internalize:

  • Distribution shift: Models perform well on data similar to training data. When real-world data diverges (new user demographics, seasonal patterns, market shifts), performance degrades silently. You need monitoring to catch this.
  • Edge cases: ML models learn the common patterns well but struggle with rare cases. If 0.1% of users trigger a failure mode and you have 10 million users, that is 10,000 unhappy users. Edge cases must be handled with fallback logic.
  • Correlation vs causation: Models find correlations in data, not causal relationships. A model might learn that umbrella purchases predict rain, but selling umbrellas does not cause rain. Be careful about actions based on model predictions.
  • Adversarial vulnerability: Users (malicious or creative) will find inputs that break the model. Content moderation models can be bypassed with creative misspellings. Recommendation models can be gamed with fake engagement.
  • Explainability gap: Complex models (deep learning) often cannot explain why they made a specific prediction. If your product requires explainability (lending decisions, medical recommendations), this limits which models you can use.
  • Data dependency: Unlike traditional software that you ship and maintain, AI models continuously depend on data quality. A data pipeline failure at 3 AM can silently degrade your product for millions of users.
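Distribution shift, the first limitation above, can be monitored with a simple statistic. One common choice is the Population Stability Index (PSI); a minimal sketch with invented score distributions (the 0.1 / 0.25 cutoffs are a widely used rule of thumb, not a universal standard):

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index between two binned distributions,
    each a list of bin proportions summing to 1. Rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, 1e-6)  # avoid log(0) on empty bins
        a = max(a, 1e-6)
        total += (a - e) * math.log(a / e)
    return total

train_dist = [0.25, 0.25, 0.25, 0.25]  # score distribution at training time
live_dist  = [0.10, 0.20, 0.30, 0.40]  # score distribution in production
print(round(psi(train_dist, live_dist), 3))  # → 0.228
```

A PSI of 0.228 on live traffic would be a moderate-shift alert: the model is now scoring a noticeably different population than it was trained on.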

Q6: How do you decide the minimum viable accuracy for an AI feature launch?

💡 Model Answer:

There is no universal answer — the minimum viable accuracy depends entirely on the product context. Here is my framework:

Step 1 — Define the baseline: What is the current user experience without AI? If users manually sort 200 emails, any reasonable AI sorting is an improvement. If users already have a good heuristic system at 85% accuracy, the AI needs to beat 85% to justify the switch.

Step 2 — Calculate the error cost: What happens when the model is wrong?

  • Low cost errors (wrong song recommendation): 70% accuracy might be fine if users can skip easily
  • Medium cost errors (wrong product search result): 85%+ needed to maintain trust
  • High cost errors (wrong medical suggestion): 99%+ or do not ship
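One way to make this error-cost reasoning concrete is to compute the expected net value of a batch of AI decisions. A toy sketch with illustrative per-decision figures (not real costs):

```python
def expected_value_per_1k(accuracy: float, error_cost: float,
                          benefit_per_correct: float) -> float:
    """Net value of 1,000 AI decisions at a given accuracy,
    using illustrative per-decision dollar figures."""
    correct = 1000 * accuracy
    wrong = 1000 * (1 - accuracy)
    return round(correct * benefit_per_correct - wrong * error_cost, 2)

# Low stakes: a wrong song costs almost nothing relative to the benefit,
# so 70% accuracy is still strongly net-positive.
print(expected_value_per_1k(accuracy=0.70, error_cost=0.01,
                            benefit_per_correct=0.05))  # → 32.0
# High stakes: each error is 50x more costly than a correct decision is
# valuable, so even 95% accuracy is net-negative.
print(expected_value_per_1k(accuracy=0.95, error_cost=5.00,
                            benefit_per_correct=0.10))  # → -155.0
```

The second case illustrates why there is no universal accuracy bar: 95% can be worse than not shipping when errors are expensive.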

Step 3 — Consider user expectations: If you present AI as a suggestion, users tolerate more errors. If you present it as a decision, they expect near-perfection. The UX framing changes the accuracy bar.

Step 4 — Test with users: Show users the AI at its current accuracy and measure satisfaction. Often the minimum viable accuracy is lower than you expect because users value speed more than perfection, or higher than you expect because trust is fragile in your domain.

Step 5 — Plan for improvement: Launch with a model you know will get better. Communicate this to stakeholders: "We are launching at 88% accuracy with a plan to reach 93% in Q3 as we collect more data."

Q7: What is the difference between fine-tuning, RAG, and prompt engineering? When would you use each?

💡 Model Answer:

These are three approaches to customizing LLMs for your product, with very different trade-offs:

| Approach | What It Is | When to Use | Trade-offs |
| --- | --- | --- | --- |
| Prompt engineering | Crafting instructions and examples within the prompt | Quick prototyping, tasks where the base model already knows how, simple customization | Cheap and fast, but limited by context window, no persistent learning, inconsistent results |
| RAG | Retrieving relevant documents and injecting them into the prompt | Knowledge-intensive tasks, frequently changing information, when you need source attribution | Uses up-to-date data, provides citations, but adds latency and depends on retrieval quality |
| Fine-tuning | Training the model on your specific data to change its behavior | Domain-specific language, consistent style/format, when you need the model to learn new patterns | Best quality, but expensive, requires labeled data, risk of catastrophic forgetting |

My decision tree as a PM:

  • Start with prompt engineering (1 day effort). If quality is good enough, ship it.
  • If the model lacks knowledge about your domain, add RAG (1–2 week effort).
  • If the model's behavior (tone, format, reasoning style) needs to change, fine-tune (4–8 week effort).
  • Often the best solution is a combination: fine-tuned model + RAG + good prompts.
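The RAG step in the middle of this tree can be illustrated without any ML libraries. A toy sketch in which simple word overlap stands in for embedding similarity and the corpus is invented:

```python
# Toy corpus; in a real system these would be chunks of your documents,
# ranked by embedding similarity rather than word overlap.
DOCS = [
    "Refunds are processed within 5 business days of approval.",
    "Premium plans include priority support and a 99.9% uptime SLA.",
    "Passwords must be at least 12 characters and rotated yearly.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (a stand-in
    for vector search) and return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, DOCS))
    # The retrieved context is injected into the prompt, so the answer
    # can cite current internal knowledge the base model never saw.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```

The structure is the whole idea: retrieval picks the relevant knowledge at query time, and the prompt constrains the model to answer from it, which is what enables source attribution.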

Q8: How do you manage the build vs buy decision for ML infrastructure?

💡 Model Answer:

The build vs buy decision for ML infrastructure is more nuanced than for traditional software because ML has unique maintenance costs that are easy to underestimate.

Hidden costs of building in-house:

  • ML ops infrastructure: Model training, versioning, deployment, monitoring, retraining pipelines. This is 60–70% of the effort, not the model itself.
  • Data pipeline maintenance: Data schemas change, sources go down, quality degrades. Someone must monitor and fix data issues 24/7.
  • Talent retention: ML engineers are expensive and in high demand. Building in-house means you need 3–5 ML engineers just for maintenance, not innovation.
  • Evaluation infrastructure: You need test datasets, benchmark suites, and human evaluation workflows. These are ongoing investments, not one-time costs.

My framework:

  • Buy when the ML capability is commoditized (OCR, speech-to-text, basic NLP), your team has fewer than 3 ML engineers, or time-to-market is critical.
  • Build when the ML is your core competitive advantage, you need deep customization for your domain, data cannot leave your infrastructure, or third-party pricing does not scale with your volume.
  • Start by buying, plan to build: Use third-party solutions to validate the product value. If validated, migrate to in-house with clear ROI justification.
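The "start by buying, plan to build" decision often hinges on a break-even volume. A back-of-the-envelope sketch with placeholder figures (not vendor pricing):

```python
def breakeven_volume(api_cost_per_call: float,
                     inhouse_fixed_monthly: float,
                     inhouse_cost_per_call: float) -> float:
    """Monthly call volume above which running in-house becomes
    cheaper than paying per API call. All inputs are illustrative."""
    return inhouse_fixed_monthly / (api_cost_per_call - inhouse_cost_per_call)

# e.g. a $0.002/call API vs. $40k/month of engineers + infra
# plus $0.0005/call of in-house serving cost.
volume = breakeven_volume(0.002, 40_000, 0.0005)
print(f"{volume:,.0f} calls/month")  # ~26.7M calls/month
```

Below that volume the fixed in-house cost dominates and buying wins; above it, per-call API pricing dominates, which is exactly the "third-party pricing does not scale" trigger for building.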

Q9: A data scientist says "We need 6 more months to improve the model." How do you respond?

💡 Model Answer:

This is a classic AI PM challenge: balancing scientific rigor with product timelines. My approach:

Step 1 — Understand the request: Ask "What specifically will 6 more months achieve? What is the expected accuracy improvement? What data or techniques make you believe this is achievable?" This separates optimism from evidence.

Step 2 — Quantify the gap: "Where is the model now vs where does it need to be for a useful product? Is the gap 70% to 95% (large, possibly unrealistic) or 88% to 93% (achievable, worth the wait)?"

Step 3 — Explore alternatives:

  • Can we launch with the current model to a subset of users? 88% accuracy might be fine for low-stakes use cases.
  • Can we use a human-in-the-loop approach? AI handles easy cases, humans handle hard cases. Launch sooner, improve over time.
  • Can we change the UX to tolerate lower accuracy? Suggestions vs auto-actions.
  • Is there a 2-month improvement that gets us 80% of the benefit? Diminishing returns are common in ML.

Step 4 — Negotiate milestones: Instead of waiting 6 months with no visibility, set monthly checkpoints. "In month 2, we should see X improvement. If not, we re-evaluate the approach." This prevents open-ended research projects.

Key principle: Respect the science but own the product timeline. Your job is to find creative ways to deliver user value sooner while the model continues to improve in parallel.

Q10: How do you evaluate whether a new ML model is ready for production?

💡 Model Answer:

A model is not production-ready just because it passes offline metrics. Here is my production readiness checklist:

Quality gates:

  • Offline metrics meet thresholds: Accuracy, precision, recall, F1 — whatever metrics were agreed upon before training.
  • Slice analysis passed: Performance is acceptable across all critical user segments (demographics, languages, device types, edge cases). Overall accuracy hiding segment-level problems is a common failure.
  • Regression testing: The new model does not break any cases that the old model handled correctly. Model regressions are especially frustrating for users.
  • Adversarial testing: The model handles known attack vectors (prompt injection for LLMs, adversarial inputs for classifiers).
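The slice analysis gate above is easy to automate. A minimal sketch that flags segments whose accuracy falls below a threshold even when overall accuracy looks healthy (segments and threshold are illustrative):

```python
def failing_slices(results: list[tuple[str, bool]],
                   threshold: float = 0.85) -> dict[str, float]:
    """results: (segment, prediction_was_correct) pairs. Returns the
    segments whose accuracy falls below the threshold."""
    by_segment: dict[str, list[bool]] = {}
    for segment, correct in results:
        by_segment.setdefault(segment, []).append(correct)
    return {
        seg: sum(v) / len(v)
        for seg, v in by_segment.items()
        if sum(v) / len(v) < threshold
    }

# Overall accuracy is 92%, but the Spanish slice sits at 60%.
results = [("en", True)] * 95 + [("en", False)] * 5 \
        + [("es", True)] * 6 + [("es", False)] * 4
overall = sum(c for _, c in results) / len(results)
print(f"overall={overall:.2f}", failing_slices(results))
```

This is the "overall accuracy hiding segment-level problems" failure in miniature: the aggregate number passes the gate while one user segment quietly gets a much worse product.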

Operational readiness:

  • Latency within budget: p50 and p99 latency meet product requirements under production load.
  • Monitoring in place: Dashboards tracking prediction distribution, error rates, latency, and data drift are live and alert-enabled.
  • Rollback plan defined: Can we revert to the previous model within minutes if something goes wrong?
  • Gradual rollout plan: Start with 1% of traffic, then 10%, then 50%, then 100%. Define success criteria for each stage.
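The staged rollout can be expressed as a simple gate between stages. A sketch with an assumed 2% error budget (the stages and budget are illustrative, and a real gate would also check latency and drift metrics):

```python
ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]  # fraction of traffic per stage

def next_stage(current: float, error_rate: float,
               max_error_rate: float = 0.02) -> float:
    """Advance to the next traffic stage only if the error rate at the
    current stage is within budget; otherwise roll back to 0 traffic
    (i.e., revert to the previous model)."""
    if error_rate > max_error_rate:
        return 0.0  # trigger the rollback plan
    idx = ROLLOUT_STAGES.index(current)
    return ROLLOUT_STAGES[min(idx + 1, len(ROLLOUT_STAGES) - 1)]

print(next_stage(0.01, error_rate=0.01))  # healthy at 1%: advance to 10%
print(next_stage(0.10, error_rate=0.05))  # budget blown at 10%: roll back
```

Encoding the success criteria per stage is what turns "gradual rollout" from an intention into an enforceable release process.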

Business readiness:

  • Support team is trained on new model behavior and common issues
  • Legal and compliance have reviewed the model's outputs for regulatory risk
  • User communication plan is ready if behavior changes are noticeable