Research Proposal Questions
Research proposal questions test whether you can generate, evaluate, and defend original research ideas. Interviewers ask questions like "What would you work on in your first year here?" or "Propose a research project that addresses X." This lesson covers how to structure a compelling proposal and how to assess feasibility and novelty, and includes three fully worked examples.
How to Pitch a Research Idea
A strong research proposal in an interview follows a predictable structure. Practice this framework until it becomes second nature:
The PROPOSAL Framework
| Step | What to Cover | Time |
|---|---|---|
| P — Problem | What is the open problem? Why is it important? Who cares about solving it? | 60 sec |
| R — Related Work | What has been tried? Why have existing approaches fallen short? | 60 sec |
| O — Observation | What is your key insight or observation that suggests a new approach is possible? | 45 sec |
| P — Plan | What is your concrete research plan? What experiments will you run? What will you build? | 90 sec |
| O — Outcomes | What would success look like? What metrics will you use? What would a positive result tell us? | 45 sec |
| S — Scope & Resources | How long will this take? What compute, data, and team do you need? | 30 sec |
| A — Alternatives | What is your backup plan if the main approach does not work? | 30 sec |
| L — Limitations | What are the known risks and limitations of your approach? | 30 sec |
Evaluating Feasibility
Interviewers will probe whether your proposal is actually doable. Before pitching, stress-test your idea against these criteria:
Compute Requirements
Can you estimate the GPU-hours needed? A proposal requiring 10,000 A100-hours for initial experiments is feasible at a top lab. A proposal requiring 1M A100-hours for a single experiment is not. Show you understand the compute landscape by providing concrete estimates.
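One way to produce such an estimate on the spot is the standard ~6·N·D rule of thumb for training FLOPs (6 FLOPs per parameter per token). A minimal sketch, where the peak-FLOPS figure and the 40% utilization are assumptions you should adjust for the hardware in question:

```python
def a100_hours(model_params, tokens, n_runs=1,
               peak_flops_per_hour=312e12 * 3600,  # A100 bf16 peak ~312 TFLOP/s
               mfu=0.4):                            # assumed FLOPs utilization
    """Back-of-envelope GPU-hours via the ~6*N*D training-FLOPs rule."""
    total_flops = 6 * model_params * tokens * n_runs
    return total_flops / (peak_flops_per_hour * mfu)

# e.g. fine-tuning a 7B model on 2B tokens, with 10 ablation runs:
print(round(a100_hours(7e9, 2e9, n_runs=10)))  # on the order of 2,000 A100-hours
```

Quoting a number like this, with your assumptions stated, is far more convincing than "we'll need some GPUs."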
Data Availability
Does the data you need exist? Can you create it? If your proposal requires a dataset that does not exist, include a plan for creating it and estimate the cost. Proposals that rely on data that is impossible to collect are dead on arrival.
Timeline Realism
Can meaningful progress be made in 6–12 months? Research proposals that require 3 years of work before any results are risky to pitch. Break your proposal into milestones with intermediate deliverables that demonstrate progress.
Technical Risk
What is the probability of failure? Every research project has risk, but strong proposals have fallback positions. "If the main hypothesis is wrong, we will still learn X and can pivot to Y" shows maturity.
Novelty Assessment
The interviewer will ask: "How is this different from [existing work]?" Prepare your novelty argument by categorizing your contribution:
| Type of Novelty | Description | Example |
|---|---|---|
| New formulation | Reframing an existing problem in a way that enables new solutions | Reformulating sequence modeling as a state space model instead of attention |
| New method | A fundamentally new algorithm or architecture | Diffusion models for image generation (vs GANs and VAEs) |
| New connection | Connecting ideas from different fields | Applying optimal transport theory to domain adaptation |
| New scale | Applying existing ideas at unprecedented scale to discover new phenomena | Scaling language models to reveal emergent capabilities |
| New understanding | Providing theoretical or empirical insight into why something works | Mechanistic interpretability of in-context learning |
Example Proposal 1: Mechanistic Interpretability of Chain-of-Thought Reasoning
Problem: Chain-of-thought (CoT) prompting dramatically improves reasoning performance, but we do not understand what happens mechanistically inside the model. Does CoT actually enable multi-step reasoning in the model's internal representations, or does it merely provide a scaffolding for pattern matching? This distinction matters for safety: if CoT reasoning is not faithful, we cannot trust it for high-stakes decisions.
Related work: Existing interpretability work (probing classifiers, activation patching) has focused on simple tasks. Lanham et al. (2023) showed CoT can be unfaithful but did not identify the mechanism. Causal tracing (Meng et al., 2022) provides tools but has not been applied to multi-step reasoning.
Key insight: By combining activation patching with causal interventions at each step of the chain of thought, we can test whether intermediate reasoning steps causally influence the final answer through internal model representations, or whether the model arrives at the answer independently and generates the chain post hoc.
Plan: (1) Select 5 reasoning tasks of increasing difficulty. (2) Collect model activations at each CoT step. (3) Apply activation patching: replace activations at step k with activations from a corrupted input and measure downstream effects. (4) Test the faithfulness prediction: if CoT is faithful, corrupting early steps should degrade later steps and the final answer. (5) Compare across model sizes to test whether faithfulness scales with capability.
Resources: 2 researchers, 6 months, ~5,000 A100-hours for activation collection and patching experiments across model sizes (7B to 70B). Requires access to open-weight models with hooks for intermediate activations.
Fallback: Even if CoT is entirely unfaithful, the mechanistic analysis will reveal what the model is actually doing when it produces reasoning chains, which is valuable for interpretability research broadly.
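The core mechanic in step (3), patching an activation from a corrupted run into a clean run and measuring the downstream effect, can be illustrated with a toy NumPy model. This is a hypothetical stand-in (a small stack of tanh layers); the real experiments would hook intermediate activations of a 7B–70B open-weight model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: a stack of tanh layers.
layers = [rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(4)]

def forward(x, patch_layer=None, patch_value=None, cache=None):
    """Run the stack; optionally record activations in `cache`, and/or
    overwrite the activation after one layer (activation patching)."""
    h = x
    for i, w in enumerate(layers):
        h = np.tanh(h @ w)
        if patch_layer == i:
            h = patch_value            # splice in the foreign activation
        if cache is not None:
            cache.append(h.copy())
    return h

clean, corrupted = rng.normal(size=8), rng.normal(size=8)
corrupt_cache = []
forward(corrupted, cache=corrupt_cache)   # record corrupted-run activations
clean_out = forward(clean)

# Causal effect of each layer: patch the corrupted activation into the
# clean run and measure how far the output moves.
for k in range(len(layers)):
    patched = forward(clean, patch_layer=k, patch_value=corrupt_cache[k])
    print(f"layer {k}: output shift = {np.linalg.norm(patched - clean_out):.3f}")
```

In the real proposal the analogous measurement is per CoT step rather than per layer: a large output shift from corrupting step k is evidence that step k causally feeds the final answer.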
Example Proposal 2: Sample-Efficient Alignment via Synthetic Preference Bootstrapping
Problem: Current alignment methods (RLHF, DPO) require large amounts of human preference data, which is expensive and introduces annotator biases. Can we develop an alignment method that achieves comparable safety and helpfulness with 10x fewer human labels?
Related work: Constitutional AI reduces human labeling for safety but still needs preference data for helpfulness. Self-play and debate show promise but lack empirical validation at scale. RLAIF uses AI feedback but inherits the base model's biases.
Key insight: Use a small seed set of high-quality human preferences (500–1000 examples) to train an initial reward model, then use that model to generate synthetic preference pairs from the model's own outputs, filtered by consistency checks and uncertainty quantification. The key is that the synthetic pairs are validated by measuring agreement with a held-out set of human judgments, creating a calibration loop.
Plan: (1) Collect 1,000 high-quality preference pairs across safety and helpfulness dimensions. (2) Train an initial reward model. (3) Generate 50,000 synthetic preference pairs using the model's own outputs. (4) Filter synthetic pairs using reward model confidence and consistency checks. (5) Train the final model on the combined dataset. (6) Evaluate against a model trained on 10,000 human-labeled pairs (the baseline).
Resources: 2–3 researchers, 9 months, ~20,000 A100-hours for training and evaluation. Human annotation budget for 1,000 seed pairs and 500 evaluation pairs.
Fallback: If the bootstrapped model underperforms, the calibration analysis between synthetic and human preferences will still yield insights about where AI judgment diverges from human judgment, informing future RLAIF work.
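The bootstrapping loop in steps (2)–(4) can be sketched end to end on synthetic data. Everything here is a toy assumption: preferences come from a hidden linear "human" direction, the reward model is a linear Bradley-Terry model, and confidence filtering is a simple margin threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 5
true_w = rng.normal(size=DIM)     # hidden "human preference" direction (toy)

def human_prefers(a, b):
    return (a - b) @ true_w > 0

# (1) Small seed set of human-labeled (chosen, rejected) pairs.
seed = []
for _ in range(200):
    a, b = rng.normal(size=DIM), rng.normal(size=DIM)
    seed.append((a, b) if human_prefers(a, b) else (b, a))

# (2) Fit a linear Bradley-Terry reward model on the seed set.
w = np.zeros(DIM)
for _ in range(300):
    grad = np.zeros(DIM)
    for chosen, rejected in seed:
        d = chosen - rejected
        grad -= d / (1 + np.exp(w @ d))   # -d * sigmoid(-w.d)
    w -= 0.05 * grad / len(seed)

# (3) Generate synthetic pairs, keeping only high-confidence ones.
synthetic, margin = [], 1.0
for _ in range(2000):
    a, b = rng.normal(size=DIM), rng.normal(size=DIM)
    score = w @ (a - b)                   # reward-model margin = confidence
    if abs(score) > margin:
        synthetic.append((a, b) if score > 0 else (b, a))

# (4) Calibration check: agreement with held-out "human" judgments.
agreement = np.mean([human_prefers(c, r) for c, r in synthetic])
print(f"kept {len(synthetic)}/2000 pairs, human agreement = {agreement:.2f}")
```

The calibration number in step (4) is the quantity the proposal hinges on: if agreement on high-confidence synthetic pairs is not substantially above chance, the bootstrap is amplifying the reward model's errors rather than the human signal.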
Example Proposal 3: Test-Time Compute Scaling for Mathematical Reasoning
Problem: Current language models allocate the same compute per token regardless of problem difficulty. Humans spend more time thinking about hard problems. Can we develop methods that allow models to adaptively allocate more compute at inference time for harder problems, specifically for mathematical reasoning?
Related work: Best-of-N sampling and self-consistency voting provide brute-force test-time compute scaling but are wasteful. Tree-of-thought and beam search over reasoning paths show promise but lack learned allocation policies. Adaptive computation (early exit, mixture of experts) operates at the architecture level rather than the reasoning level.
Key insight: Train a lightweight "difficulty estimator" that predicts how many reasoning steps a problem needs, then use this to allocate a variable compute budget per problem. Combine with a learned verifier that checks intermediate steps, allowing the model to backtrack and try alternative reasoning paths when the verifier detects errors.
Plan: (1) Build a dataset of math problems with difficulty labels based on the number of steps expert humans need. (2) Train the difficulty estimator on problem statements. (3) Implement an adaptive beam search that expands more paths for harder problems. (4) Train a step-level verifier on intermediate reasoning steps using process reward modeling. (5) Evaluate on MATH, GSM8K, and competition mathematics, comparing accuracy-vs-compute curves against fixed-budget baselines.
Resources: 2 researchers, 8 months, ~15,000 A100-hours. Requires a curated dataset of math problems with step-level annotations (can leverage existing process reward model datasets).
Fallback: Even if the difficulty estimator is imperfect, the step-level verifier alone should improve reasoning accuracy. The compute-accuracy tradeoff curves will be informative regardless of the adaptive allocation's success.
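The intuition behind adaptive allocation can be shown with a deterministic toy calculation. Under best-of-n sampling with a perfect verifier, P(solved) = 1 - (1 - p)^n for per-attempt success probability p; the success rates below are hypothetical, chosen so that easy problems saturate after one attempt:

```python
# Hypothetical per-attempt success rates for difficulty levels 1..5.
p_by_difficulty = [0.99, 0.50, 0.33, 0.25, 0.20]

def expected_accuracy(budgets):
    """Mean of 1 - (1 - p)^n across problems, one budget n per problem."""
    accs = [1 - (1 - p) ** n for p, n in zip(p_by_difficulty, budgets)]
    return sum(accs) / len(accs)

# Same total compute (15 attempts over 5 problems), two allocation policies:
fixed = [3, 3, 3, 3, 3]       # uniform budget per problem
adaptive = [1, 2, 3, 4, 5]    # budget follows estimated difficulty

print(f"fixed:    {expected_accuracy(fixed):.3f}")     # 0.728
print(f"adaptive: {expected_accuracy(adaptive):.3f}")  # 0.759
```

At equal total compute, shifting attempts from saturated easy problems to hard ones raises expected accuracy. In the actual proposal, the learned difficulty estimator replaces the oracle difficulty labels used here, and the learned step-level verifier replaces the perfect verifier, so the realized gain depends on both components' quality.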
Common Proposal Mistakes
- Too vague: "I want to work on making AI safer" is not a proposal. Specify the concrete problem, method, and evaluation.
- Too ambitious: "I will solve AGI alignment in 12 months" is not credible. Scope your proposal to what one small team can accomplish.
- No fallback plan: Every research project can fail. Showing you have a plan B demonstrates maturity.
- Ignoring related work: If you cannot articulate how your idea differs from existing approaches, the interviewer will assume you have not done your homework.
- No concrete evaluation: "We will know it works when the model is better" is not a plan. Specify datasets, metrics, and baselines.
Key Takeaways
- Use the PROPOSAL framework: Problem, Related work, Observation, Plan, Outcomes, Scope, Alternatives, Limitations
- Stress-test feasibility before pitching: compute, data, timeline, and technical risk
- Articulate your novelty clearly — is it a new formulation, method, connection, scale, or understanding?
- Always include a fallback plan and be honest about what could go wrong
- Prepare 2–3 proposals in advance and practice presenting each in under 8 minutes