Intermediate

Reinforcement Learning Models

Reinforcement learning is the paradigm where an agent learns to make decisions by interacting with an environment, receiving rewards for good actions and penalties for bad ones. From mastering Go to aligning large language models, RL is behind some of the most impressive achievements in AI.

What Is Reinforcement Learning?

Unlike supervised learning where you provide labeled examples, or unsupervised learning where you find patterns in data, reinforcement learning (RL) is about learning through trial and error. An agent takes actions in an environment, observes the consequences, and gradually learns a strategy (called a policy) that maximizes cumulative reward over time.

Think of it like training a dog: you don't show the dog a manual, you reward it when it does the right thing and it learns over time which behaviors lead to treats. RL works the same way, except the "dog" is an algorithm and the "treat" is a numerical reward signal.

Key Concepts

Every RL system is built from a small set of core components that interact in a loop:

  • Agent: The learner and decision-maker. This is the algorithm or model being trained.
  • Environment: Everything the agent interacts with. It could be a game, a robotic simulation, or a real-world system.
  • State (s): A representation of the current situation. In a chess game, the state is the board position.
  • Action (a): A choice the agent can make. In chess, an action is moving a specific piece.
  • Reward (r): A numerical signal received after taking an action. Positive rewards encourage behavior; negative rewards (penalties) discourage it.
  • Policy (π): The agent's strategy — a mapping from states to actions. The goal of RL is to find the optimal policy.
  • Value Function V(s): The expected total future reward from a given state. Helps the agent evaluate how "good" a state is.
  • Q-Function Q(s, a): The expected total future reward from taking action a in state s. Helps evaluate specific actions.
  • Discount Factor (γ): A number between 0 and 1 that determines how much the agent values future rewards vs. immediate ones.
💡
The RL loop: At each time step, the agent observes the state, chooses an action based on its policy, receives a reward, and transitions to a new state. This loop repeats until the episode ends (e.g., the game is won or lost).
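The loop above can be sketched with a toy, hand-rolled environment. Everything here is invented for illustration — the `WalkEnv` class, its states, and its reward scheme are hypothetical, not from any library:

```python
import random

# A toy environment (hypothetical): states 0..4, start at 0, goal at state 4
class WalkEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):              # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4           # episode ends at the goal
        reward = 1.0 if done else 0.0    # sparse reward: only the goal pays
        return self.state, reward, done

# The RL loop: observe state, choose action, receive reward, transition
env = WalkEnv()
state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = random.choice([-1, 1])      # a random policy stands in for pi(a|s)
    state, reward, done = env.step(action)
    total_reward += reward               # the agent's goal: maximize this sum
```

A real agent would replace `random.choice` with a learned policy that improves as rewards come in.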

Types of Reinforcement Learning

Model-Free RL

Model-free methods learn directly from experience without building an internal model of how the environment works. They are simpler to implement and work well in complex environments where modeling dynamics is impractical.

Value-Based Methods

  • Q-Learning: The classic algorithm. Maintains a table of Q-values for every state-action pair and updates them based on the Bellman equation. Works well for small, discrete state spaces.
  • DQN (Deep Q-Network): Replaces the Q-table with a neural network, allowing RL to work with high-dimensional inputs like images. Introduced by DeepMind for Atari games in 2015. Uses techniques like experience replay and target networks for stability.
  • Double DQN: Fixes the overestimation bias in DQN by using two networks to decouple action selection and evaluation.
  • Dueling DQN: Separates the value and advantage streams in the network architecture for better learning efficiency.
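The Double DQN fix is small enough to show numerically. A sketch on made-up Q-values (all numbers below are hypothetical): the online network's argmax *selects* the action, while the target network *evaluates* it, breaking the max-operator's optimism.

```python
import numpy as np

q_online = np.array([1.0, 3.5, 2.0])   # Q(s', a) from the online network
q_target = np.array([0.9, 2.8, 3.1])   # Q(s', a) from the target network
reward, gamma = 1.0, 0.99

dqn_target = reward + gamma * q_target.max()     # vanilla DQN: can overestimate
a_star = int(q_online.argmax())                  # Double DQN: select with online net...
ddqn_target = reward + gamma * q_target[a_star]  # ...evaluate with target net
```

Here vanilla DQN trusts the target network's inflated value (3.1), while Double DQN uses the value of the action the online network actually prefers, giving a lower, less biased target.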

Policy Gradient Methods

  • REINFORCE: Directly optimizes the policy by computing gradients of the expected reward. Simple but high variance.
  • Actor-Critic: Combines value-based and policy-based approaches. The "actor" learns the policy; the "critic" learns the value function to reduce variance.
  • PPO (Proximal Policy Optimization): The workhorse of modern RL. Clips policy updates to prevent destructively large changes, making training more stable. Used extensively in robotics and RLHF.
  • SAC (Soft Actor-Critic): Adds an entropy bonus to encourage exploration. Excellent for continuous action spaces like robotic control.
  • A3C / A2C: Asynchronous / synchronous advantage actor-critic. Uses parallel environments for faster, more stable training.
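PPO's clipping mechanism fits in a few lines of NumPy. The probability ratios and advantages below are made up for illustration:

```python
import numpy as np

ratio = np.array([0.8, 1.0, 1.3])       # pi_new(a|s) / pi_old(a|s) per action
advantage = np.array([1.0, -0.5, 2.0])  # how much better than average each action was
eps = 0.2                               # PPO clip range

unclipped = ratio * advantage
clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
ppo_objective = np.minimum(unclipped, clipped).mean()  # gradient ascent target
```

The third action's ratio (1.3) exceeds the clip range, so its contribution is capped at 1.2 × 2.0 = 2.4 — the "destructively large change" PPO is designed to prevent.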

Model-Based RL

Model-based methods learn or are given a model of the environment's dynamics (how states transition given actions). This allows the agent to "imagine" future outcomes and plan ahead without needing as many real interactions.

  • World Models: Learn a compressed representation of the environment and train a policy entirely within the learned "dream" world. Dramatically reduces the number of real-world interactions needed.
  • Dreamer (v1, v2, v3): A family of world model architectures that learn latent dynamics and use imagination for planning. Dreamer v3 achieves strong performance across diverse domains from a single algorithm.
  • MuZero: DeepMind's algorithm that learns a model, a policy, and a value function simultaneously. Mastered Chess, Go, Shogi, and Atari without being told the rules of any game.
  • MBPO (Model-Based Policy Optimization): Uses short model rollouts to augment real data, getting the best of both model-free and model-based approaches.
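The MBPO idea of branching short imagined rollouts from real states can be sketched as follows. The `learned_model` function here is a hypothetical stand-in for a trained dynamics model, and the reward is invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def learned_model(state, action):
    """Hypothetical learned dynamics: returns (next_state, reward)."""
    next_state = state + 0.1 * action + 0.01 * rng.standard_normal(state.shape)
    reward = -float(np.sum(next_state ** 2))   # made-up reward: stay near origin
    return next_state, reward

# Real data: (state, action) pairs collected from the true environment
real_buffer = [(np.zeros(2), np.ones(2))]
model_buffer = []

# MBPO idea: branch short imagined rollouts from real states. The short
# horizon limits how far model errors can compound.
for state, _ in real_buffer:
    s = state
    for _ in range(3):
        a = rng.uniform(-1.0, 1.0, size=2)      # stand-in for the current policy
        s_next, r = learned_model(s, a)
        model_buffer.append((s, a, r, s_next))  # imagined transition for training
        s = s_next
```

The policy is then trained on `real_buffer` plus `model_buffer`, multiplying the effective amount of experience per real interaction.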

Offline RL / Batch RL

Standard RL requires interacting with the environment during training, which can be expensive, dangerous, or impractical. Offline RL learns entirely from a pre-collected dataset of past interactions, without any new exploration.

  • Use cases: Healthcare (learning treatment policies from patient records), autonomous driving (learning from driving logs), recommendation systems
  • Key algorithms: CQL (Conservative Q-Learning), IQL (Implicit Q-Learning), Decision Transformer
  • Challenge: Distribution shift — the learned policy may encounter states not well-represented in the offline data
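The conservative idea behind CQL can be shown on toy numbers (the Q-values and dataset action below are hypothetical): push Q-values down on all actions via a soft maximum, and back up on the actions the dataset actually contains, which discourages optimism about out-of-distribution actions.

```python
import numpy as np

q_values = np.array([2.0, 5.0, 1.0])   # Q(s, a) for three actions at one state
dataset_action = 0                     # the only action observed in offline data

soft_max = np.log(np.sum(np.exp(q_values)))        # log-sum-exp over all actions
cql_penalty = soft_max - q_values[dataset_action]  # added to the usual Bellman loss
```

The penalty is large exactly when the Q-function is most optimistic about actions the data never shows — here the unobserved action with value 5.0 dominates the soft maximum.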

Deep Reinforcement Learning

Deep RL combines the decision-making framework of RL with the representational power of deep neural networks. This combination allows RL to tackle problems with high-dimensional inputs (images, sensor data) and complex action spaces that were previously intractable.

Key innovations that made deep RL practical:

  • Experience replay: Store past experiences in a buffer and sample mini-batches for training, breaking correlation between sequential samples
  • Target networks: Use a slowly-updated copy of the network to compute targets, preventing oscillation
  • Reward shaping: Design intermediate rewards to guide learning toward the final goal
  • Curriculum learning: Start with easy tasks and gradually increase difficulty
  • Parallel environments: Run many environment copies simultaneously to collect diverse experience faster
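Experience replay, the first technique above, fits in a few lines. The transitions here are dummy tuples standing in for real `(state, action, reward, next_state)` data:

```python
import random
from collections import deque

# A minimal replay buffer: append transitions, sample decorrelated mini-batches
buffer = deque(maxlen=10_000)      # oldest transitions are evicted automatically

for step in range(100):            # stand-in for the agent-environment loop
    transition = (step, step % 4, 1.0, step + 1)   # (state, action, reward, next_state)
    buffer.append(transition)

batch = random.sample(list(buffer), k=32)  # random mini-batch breaks correlation
```

Because consecutive environment steps are highly correlated, training on random samples from the buffer rather than the latest steps is what keeps gradient updates stable.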

RLHF: Aligning Language Models

Reinforcement Learning from Human Feedback (RLHF) is perhaps the most impactful recent application of RL. It is the technique used to make LLMs like ChatGPT, Claude, and Gemini helpful, harmless, and honest.

How RLHF Works

  1. Step 1 — Supervised Fine-Tuning (SFT): Start with a pre-trained LLM and fine-tune it on high-quality human-written demonstrations of desired behavior.
  2. Step 2 — Reward Model Training: Collect pairs of model outputs and have humans rank them by preference. Train a reward model to predict which output a human would prefer.
  3. Step 3 — PPO Optimization: Use the reward model as the environment's reward signal. Optimize the LLM's policy using PPO to generate outputs that maximize the predicted human preference score, while staying close to the SFT model (KL penalty) to prevent reward hacking.

Reward hacking in RLHF: Without the KL penalty, the LLM may learn to exploit quirks in the reward model rather than genuinely improving quality. For example, it might generate overly verbose or sycophantic responses that score highly but aren't actually better. Careful reward model design and constrained optimization are essential.
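The Step 3 objective — reward-model score minus a KL penalty against the SFT reference — reduces to simple arithmetic per sample. All the numbers below are hypothetical:

```python
# Per-response RLHF reward (sketch): reward-model score minus a KL penalty
# that keeps the policy close to the SFT reference model.
rm_score = 2.5        # reward model's preference score for a sampled response
logp_policy = -1.2    # log-prob of that response under the current policy
logp_ref = -1.0       # log-prob under the frozen SFT reference model
beta = 0.1            # KL penalty coefficient

kl_estimate = logp_policy - logp_ref         # simple per-sample KL estimator
rlhf_reward = rm_score - beta * kl_estimate  # what PPO actually maximizes
```

Tuning `beta` trades off reward-model score against drift from the reference: too small and the model can reward-hack; too large and it barely changes from SFT.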

Beyond RLHF

  • DPO (Direct Preference Optimization): Eliminates the separate reward model by directly optimizing the policy from preference data. Simpler and often equally effective.
  • RLAIF: Uses AI feedback instead of human feedback to scale the process.
  • Constitutional AI: Defines a set of principles and uses AI self-critique to align behavior.
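The DPO loss for a single preference pair can be written directly from policy and reference log-probabilities, with no separate reward model. The values below are made up for illustration:

```python
import math

# DPO loss for one preference pair (sketch): widen the margin by which the
# policy prefers the chosen response over the rejected one, measured
# relative to a frozen reference model.
beta = 0.1
logp_chosen, logp_rejected = -3.0, -4.0    # policy log-probs (hypothetical)
ref_chosen, ref_rejected = -3.5, -3.8      # reference log-probs (hypothetical)

margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
dpo_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

Minimizing this loss increases the margin, which implicitly plays the role of the reward model's preference score — hence "direct" preference optimization.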

Key Breakthroughs

  • AlphaGo (2016): DeepMind's system defeated the world champion in Go, a game with more possible positions than atoms in the universe. Combined deep RL with Monte Carlo tree search.
  • AlphaZero (2017): Learned Chess, Go, and Shogi from scratch (self-play only), surpassing all previous programs within hours of training.
  • AlphaFold (2020): Solved protein structure prediction, a 50-year grand challenge in biology, predicting 3D protein structures with near-atomic accuracy. Strictly a deep learning system rather than RL, but often listed alongside DeepMind's RL milestones.
  • OpenAI Five (2019): Defeated the reigning world champion team in Dota 2, a complex real-time game with imperfect information, long horizons, and five cooperating agents per side.
  • ChatGPT (2022): RLHF transformed GPT-3.5 from a text completion engine into a helpful assistant, launching the AI chatbot revolution.
  • Robotics: RL enables robots to learn dexterous manipulation, locomotion, and navigation from simulation and transfer to reality (sim-to-real).

RL Algorithms Comparison

| Algorithm | Type | Pros | Cons | Best For |
|---|---|---|---|---|
| Q-Learning | Value-based, model-free | Simple, guaranteed convergence (tabular) | Only discrete actions, doesn't scale | Small state/action spaces, education |
| DQN | Value-based, model-free | Handles high-dimensional inputs (images) | Only discrete actions, can overestimate | Atari games, discrete control |
| PPO | Policy gradient, model-free | Stable, works with continuous actions, versatile | Sample-inefficient, requires tuning | Robotics, RLHF, general-purpose RL |
| SAC | Actor-critic, model-free | Great exploration, sample-efficient (off-policy) | More complex, continuous actions only | Robotic manipulation, locomotion |
| A3C / A2C | Actor-critic, model-free | Parallelizable, stable | Requires many CPU cores | Game AI, parallel training setups |
| Dreamer v3 | Model-based | Very sample-efficient, works across domains | Complex implementation, model errors | Sample-limited environments, diverse tasks |
| MuZero | Model-based | No rules needed, plans ahead | Computationally expensive | Board games, planning-heavy tasks |
| CQL | Offline RL | No environment interaction needed | Conservative, may underperform online RL | Healthcare, safety-critical domains |

RL Frameworks and Tools

You don't have to implement RL algorithms from scratch. These libraries provide battle-tested implementations:

  • Stable Baselines3: The go-to library for RL in Python. Clean PyTorch implementations of PPO, SAC, DQN, A2C, and more. Excellent documentation and easy to use.
  • RLlib (Ray): Scalable RL library built on Ray. Supports distributed training across clusters. Best choice for production-scale RL.
  • CleanRL: Single-file implementations of RL algorithms for research and education. Prioritizes readability and reproducibility.
  • Gymnasium (formerly OpenAI Gym): The standard API for RL environments. Provides classic control, Atari, MuJoCo, and custom environment support.
  • PettingZoo: Multi-agent RL environments with a Gymnasium-like API.
  • TRL (Transformer Reinforcement Learning): Hugging Face library specifically for RLHF and LLM alignment with PPO and DPO.

Code Example: Q-Learning

Here is a simple Q-learning implementation for a grid world environment using Gymnasium:

Python
import gymnasium as gym
import numpy as np

# Create the FrozenLake environment (4x4 grid)
env = gym.make("FrozenLake-v1", is_slippery=False)

# Initialize Q-table with zeros
n_states = env.observation_space.n   # 16 states
n_actions = env.action_space.n       # 4 actions (left, down, right, up)
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.1      # Learning rate
gamma = 0.99     # Discount factor
epsilon = 1.0    # Exploration rate
epsilon_decay = 0.995
epsilon_min = 0.01
episodes = 10000

for episode in range(episodes):
    state, _ = env.reset()
    done = False

    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])        # Exploit

        # Take action, observe result
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update (Bellman equation)
        best_next = np.max(Q[next_state])
        Q[state, action] += alpha * (
            reward + gamma * best_next - Q[state, action]
        )

        state = next_state

    # Decay exploration rate
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print("Training complete!")
print(f"Success rate: {evaluate(env, Q, 100):.0%}")  # Test learned policy

Using Stable Baselines3 with PPO

Python
from stable_baselines3 import PPO
import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1")

# Train a PPO agent (just 3 lines!)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

# Evaluate the trained agent
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()

# Save and load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

Use Cases

  • Game AI: Training agents that master video games (Atari, StarCraft, Dota 2) and board games (Chess, Go). Used both for entertainment and as research benchmarks.
  • Robotics: Teaching robots to walk, grasp objects, assemble parts, and navigate environments. RL enables robots to learn complex motor skills that are difficult to program manually.
  • Autonomous Driving: Decision-making for lane changes, intersection handling, and route planning. Often combined with imitation learning from human driving data.
  • Resource Optimization: Data center cooling (DeepMind reported a 40% reduction in Google's data center cooling energy), network routing, cloud resource allocation, and supply chain optimization.
  • Trading and Finance: Portfolio optimization, order execution, and market making. RL agents learn trading strategies by maximizing returns in simulated markets.
  • LLM Alignment: RLHF is the standard technique for making language models helpful, harmless, and honest. Every major chatbot uses some form of RL-based alignment.
  • Drug Discovery: Molecular optimization, where RL agents generate and modify drug candidates to maximize desired properties like binding affinity and drug-likeness.
  • Recommendation Systems: Optimizing long-term user engagement rather than just immediate click-through rates.

Challenges and Limitations

  • Sample Efficiency: RL typically requires millions or billions of interactions to learn, making it expensive and slow. Model-based methods and offline RL help, but the gap with supervised learning remains large.
  • Reward Hacking: Agents can find unexpected shortcuts to maximize the reward signal without actually solving the intended task. Designing robust reward functions is an art as much as a science.
  • Sim-to-Real Gap: Policies trained in simulation often fail in the real world due to differences in physics, visual appearance, and sensor noise. Domain randomization and sim-to-real transfer techniques help bridge this gap.
  • Credit Assignment: When rewards are sparse and delayed, it's difficult for the agent to figure out which past actions led to the eventual success or failure.
  • Exploration vs. Exploitation: Balancing trying new things (exploration) with doing what's known to work (exploitation) remains a fundamental challenge.
  • Stability and Reproducibility: RL training can be unstable and highly sensitive to hyperparameters and random seeds. Results can vary dramatically between runs.
💡
When to use RL: RL shines when you need sequential decision-making with delayed feedback, when you can simulate the environment, or when you need to optimize a complex objective that's hard to specify with supervised learning. If you have labeled data and a clear input-output mapping, supervised learning is almost always simpler and more effective.