Reinforcement Learning Algorithms (14)
Algorithms for learning optimal actions through environment interaction
Reinforcement Learning (RL) trains an agent to make sequential decisions by interacting with an environment to maximize cumulative reward. Unlike supervised learning, RL learns from experience without explicit labels. These 14 algorithms span value-based, policy-based, and actor-critic approaches.
Quick Reference Table
| Algorithm | Category | Action Space | On/Off Policy | Key Innovation |
|---|---|---|---|---|
| Q-Learning | Value-based | Discrete | Off-policy | Tabular Q-values |
| SARSA | Value-based | Discrete | On-policy | On-policy Q-learning |
| DQN | Value-based | Discrete | Off-policy | Neural network + replay buffer |
| Double DQN | Value-based | Discrete | Off-policy | Reduces overestimation |
| Dueling DQN | Value-based | Discrete | Off-policy | Separate value/advantage streams |
| Policy Gradient | Policy-based | Both | On-policy | Direct policy optimization |
| REINFORCE | Policy-based | Both | On-policy | Monte Carlo policy gradient |
| Actor-Critic | Actor-Critic | Both | Both | Value baseline reduces variance |
| A3C | Actor-Critic | Both | On-policy | Asynchronous parallel training |
| PPO | Actor-Critic | Both | On-policy | Clipped objective, stable updates |
| TRPO | Actor-Critic | Both | On-policy | Trust region constraint |
| DDPG | Actor-Critic | Continuous | Off-policy | Deterministic policy + continuous |
| TD3 | Actor-Critic | Continuous | Off-policy | Twin critics, delayed updates |
| SAC | Actor-Critic | Continuous | Off-policy | Maximum entropy framework |
1. Q-Learning
Category: Value-based | Policy: Off-policy
Description: Learns a Q-table mapping (state, action) pairs to expected cumulative rewards. Updates using the Bellman equation: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]. The agent selects actions greedily from Q-values (with epsilon-greedy exploration).
Use Cases: Grid worlds, simple games, environments with small discrete state/action spaces.
```python
import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, done):
        target = reward
        if not done:
            target += self.gamma * np.max(self.q_table[next_state])
        self.q_table[state, action] += self.lr * (target - self.q_table[state, action])

# Usage
agent = QLearning(n_states=16, n_actions=4)
print(f"Q-table shape: {agent.q_table.shape}")
```
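To see the update rule converge, here is a minimal training loop on a toy corridor environment. The environment (a 5-state chain where moving right eventually earns reward 1), the seed, and the hyperparameters are all illustrative, not from the text above; the update line is the same Bellman rule:

```python
import numpy as np

# Toy deterministic corridor: states 0..4, actions 0 (left) / 1 (right).
# Reaching state 4 gives reward 1 and ends the episode.
n_states, n_actions = 5, 2
q = np.zeros((n_states, n_actions))
lr, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

def env_step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

for episode in range(300):
    state = 0
    for _ in range(100):  # cap episode length
        # Epsilon-greedy, with random tie-breaking so untrained states explore
        if rng.random() < epsilon or q[state, 0] == q[state, 1]:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, done = env_step(state, action)
        target = reward + (0.0 if done else gamma * np.max(q[next_state]))
        q[state, action] += lr * (target - q[state, action])
        state = next_state
        if done:
            break

greedy_policy = np.argmax(q, axis=1)
print(f"greedy policy (states 0-3): {greedy_policy[:4]}")
```

After training, the greedy policy should select "right" in every non-terminal state, since the discounted value of heading toward the goal dominates.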
2. SARSA
Category: Value-based | Policy: On-policy
Description: State-Action-Reward-State-Action. Similar to Q-Learning but uses the actual next action (not the max) for the update: Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]. Being on-policy makes it more conservative and safer in stochastic environments.
Use Cases: When the agent's exploration policy matters (e.g., cliff-walking), safer learning.
```python
class SARSA:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, next_action, done):
        target = reward
        if not done:
            target += self.gamma * self.q_table[next_state, next_action]  # On-policy
        self.q_table[state, action] += self.lr * (target - self.q_table[state, action])

agent = SARSA(n_states=16, n_actions=4)
print("SARSA: on-policy Q-learning variant")
```
3. Deep Q-Network (DQN)
Category: Value-based | Policy: Off-policy
Description: Replaces the Q-table with a neural network that approximates Q(s,a). Introduced two key innovations: an experience replay buffer (breaks correlations in sequential data) and a target network (stabilizes training by using a slowly updated copy to compute targets). This was the 2015 breakthrough that achieved human-level play on many Atari games.
Use Cases: Atari games, environments with large state spaces, discrete action problems.
```python
import torch
import torch.nn as nn
from collections import deque

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.network(x)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.q_net = DQN(state_dim, action_dim)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.replay_buffer = deque(maxlen=10000)
        self.gamma = gamma

    def update_target(self):
        self.target_net.load_state_dict(self.q_net.state_dict())

agent = DQNAgent(state_dim=4, action_dim=2)
print("DQN agent initialized")
```
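The agent above omits the actual training step. A minimal sketch of how a sampled minibatch becomes a TD loss follows; the network sizes, batch size, and the randomly generated transitions are illustrative stand-ins for real environment data:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Minimal online and target Q-networks (4-dim state, 2 actions)
q_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
target_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Fill the replay buffer with synthetic (s, a, r, s', done) transitions
buffer = deque(maxlen=10000)
for _ in range(256):
    buffer.append((torch.randn(4), random.randrange(2), random.random(),
                   torch.randn(4), random.random() < 0.1))

# One training step: sample a batch, compute TD targets, minimize MSE
batch = random.sample(buffer, 64)
states = torch.stack([t[0] for t in batch])
actions = torch.tensor([t[1] for t in batch]).unsqueeze(1)
rewards = torch.tensor([t[2] for t in batch])
next_states = torch.stack([t[3] for t in batch])
dones = torch.tensor([float(t[4]) for t in batch])

q_values = q_net(states).gather(1, actions).squeeze(1)
with torch.no_grad():  # targets come from the frozen target network
    targets = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)
loss = nn.functional.mse_loss(q_values, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"TD loss: {loss.item():.4f}")
```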
4. Double DQN
Category: Value-based | Policy: Off-policy
Description: Addresses DQN's overestimation bias by decoupling action selection from action evaluation. Uses the online network to select the best action, but the target network to evaluate that action's value. This simple change significantly improves stability.
Use Cases: Any DQN application where overestimation is a concern, improved Atari performance.
```python
# Double DQN: the only change is in the target calculation.
# Standard DQN target: r + gamma * max_a' Q_target(s', a')
# Double DQN target:   r + gamma * Q_target(s', argmax_a' Q_online(s', a'))
def double_dqn_target(q_net, target_net, next_states, rewards, dones, gamma):
    with torch.no_grad():
        # Select actions using the online network
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # Evaluate those actions using the target network
        next_q_values = target_net(next_states).gather(1, best_actions).squeeze(1)
        targets = rewards + gamma * next_q_values * (1 - dones)
    return targets

print("Double DQN: decouples selection from evaluation")
```
5. Dueling DQN
Category: Value-based | Policy: Off-policy
Description: Modifies the DQN architecture to have two separate streams: one estimates the state value V(s) and the other estimates the advantage A(s,a) of each action. Q(s,a) = V(s) + A(s,a) - mean(A). This allows the network to learn which states are valuable without having to evaluate every action.
Use Cases: Environments where many actions have similar values, improved sample efficiency.
```python
class DuelingDQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU()
        )
        # Value stream
        self.value_stream = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        # Advantage stream
        self.advantage_stream = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim)
        )

    def forward(self, x):
        shared = self.shared(x)
        value = self.value_stream(shared)
        advantage = self.advantage_stream(shared)
        # Combine: Q = V + (A - mean(A))
        q_values = value + advantage - advantage.mean(dim=1, keepdim=True)
        return q_values

model = DuelingDQN(state_dim=4, action_dim=2)
print(f"Dueling DQN parameters: {sum(p.numel() for p in model.parameters())}")
```
6. Policy Gradient
Category: Policy-based | Policy: On-policy
Description: Directly parameterizes and optimizes the policy (mapping from states to action probabilities) without learning a value function. Uses the policy gradient theorem to compute gradients of expected return with respect to policy parameters. Can handle continuous action spaces naturally.
Use Cases: Continuous action spaces, stochastic policies, when value-based methods struggle.
```python
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.network(state)

policy = PolicyNetwork(state_dim=4, action_dim=2)
print("Policy Gradient: directly optimizes action probabilities")
```
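To act with such a network, one would typically sample from the output distribution and keep the log-probability for the gradient step. A sketch using `torch.distributions.Categorical` (the network shape and random input state are illustrative):

```python
import torch
import torch.nn as nn

# Small policy head: 4-dim state, 2 discrete actions
policy = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax(dim=-1)
)

state = torch.randn(1, 4)
probs = policy(state)
dist = torch.distributions.Categorical(probs)
action = dist.sample()            # stochastic action
log_prob = dist.log_prob(action)  # needed for the policy gradient
print(f"action={action.item()}, log_prob={log_prob.item():.3f}")
```

The per-step policy-gradient loss is then `-log_prob * return`, which pushes probability mass toward actions that earned high returns.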
7. REINFORCE
Category: Policy-based | Policy: On-policy
Description: A Monte Carlo policy gradient method. Collects complete episode trajectories, then updates the policy by increasing the probability of actions that led to high returns. Simple but has high variance due to using full episode returns.
Use Cases: Episodic tasks, when simplicity is preferred over sample efficiency.
```python
class REINFORCE:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma

    def compute_returns(self, rewards):
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self, log_probs, rewards):
        returns = self.compute_returns(rewards)
        loss = (-torch.stack(log_probs) * returns).sum()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

agent = REINFORCE(state_dim=4, action_dim=2)
print("REINFORCE: Monte Carlo policy gradient")
```
8. Actor-Critic
Category: Actor-Critic | Policy: Both
Description: Combines policy-based (actor) and value-based (critic) methods. The actor learns the policy, while the critic learns the value function to reduce variance. The critic provides a baseline, replacing the high-variance Monte Carlo returns of REINFORCE with lower-variance TD estimates.
Use Cases: General RL tasks, when REINFORCE has too high variance, continuous and discrete action spaces.
```python
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU()
        )
        self.actor = nn.Sequential(nn.Linear(128, action_dim), nn.Softmax(dim=-1))
        self.critic = nn.Linear(128, 1)

    def forward(self, state):
        shared = self.shared(state)
        policy = self.actor(shared)
        value = self.critic(shared)
        return policy, value

model = ActorCritic(state_dim=4, action_dim=2)
state = torch.randn(1, 4)
policy, value = model(state)
print(f"Policy: {policy.data}, Value: {value.data}")
```
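A single one-step update can illustrate how the critic's TD error serves as the advantage estimate. This sketch uses a synthetic transition and standalone network pieces rather than the class above:

```python
import torch
import torch.nn as nn

# Shared trunk with separate actor and critic heads
shared = nn.Sequential(nn.Linear(4, 128), nn.ReLU())
actor = nn.Sequential(nn.Linear(128, 2), nn.Softmax(dim=-1))
critic = nn.Linear(128, 1)
params = list(shared.parameters()) + list(actor.parameters()) + list(critic.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
gamma = 0.99

# One synthetic transition (s, a, r, s')
state, next_state, reward = torch.randn(1, 4), torch.randn(1, 4), 1.0

h = shared(state)
probs, value = actor(h), critic(h)
dist = torch.distributions.Categorical(probs)
action = dist.sample()

with torch.no_grad():
    next_value = critic(shared(next_state))
# TD error = low-variance advantage estimate
advantage = reward + gamma * next_value - value

# Actor pushes log-prob in the direction of the (detached) advantage;
# critic regresses toward the TD target.
actor_loss = -dist.log_prob(action) * advantage.detach()
critic_loss = advantage.pow(2)
loss = (actor_loss + critic_loss).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"advantage={advantage.item():.3f}")
```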
9. A3C (Asynchronous Advantage Actor-Critic)
Category: Actor-Critic | Policy: On-policy
Description: Runs multiple actor-critic agents in parallel, each in its own copy of the environment. Each agent computes gradients locally and asynchronously updates a shared global model. The asynchronous updates naturally provide diverse experiences, eliminating the need for a replay buffer.
Use Cases: Training on multi-core CPUs, when parallelism is available, Atari games.
```python
# A3C pseudocode structure
# (Full implementation requires multiprocessing)
class A3CWorker:
    """Each worker runs in its own process with a copy of the environment."""
    def __init__(self, global_model, optimizer, env_name):
        self.global_model = global_model
        self.local_model = ActorCritic(state_dim=4, action_dim=2)
        self.optimizer = optimizer

    def sync_with_global(self):
        self.local_model.load_state_dict(self.global_model.state_dict())

    def push_gradients_to_global(self):
        for local_param, global_param in zip(
            self.local_model.parameters(),
            self.global_model.parameters()
        ):
            global_param.grad = local_param.grad
        self.optimizer.step()

print("A3C: parallel actor-critic with asynchronous gradient updates")
```
10. Proximal Policy Optimization (PPO)
Category: Actor-Critic | Policy: On-policy
Description: One of the most widely used RL algorithms today. Improves on TRPO by using a simpler clipped surrogate objective that keeps policy updates within a "trust region" without expensive second-order optimization. Strikes a strong balance among simplicity, sample efficiency, and stability.
Use Cases: Robotics, game AI, RLHF for language models, any continuous or discrete control task. Used by OpenAI for ChatGPT alignment.
```python
# PPO with clipped objective (using stable-baselines3)
from stable_baselines3 import PPO

# model = PPO(
#     "MlpPolicy",
#     "CartPole-v1",
#     learning_rate=3e-4,
#     n_steps=2048,
#     batch_size=64,
#     n_epochs=10,
#     gamma=0.99,
#     gae_lambda=0.95,
#     clip_range=0.2,  # Clipping parameter epsilon
#     verbose=1
# )
# model.learn(total_timesteps=100000)

# Core PPO clipped objective:
# L_CLIP = min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)
# where r_t = pi_new(a|s) / pi_old(a|s)
print("PPO: the go-to RL algorithm for most applications")
```
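The clipped objective translates directly into a few lines of PyTorch. This sketch uses hand-picked probability ratios and advantage estimates purely for illustration:

```python
import torch

eps = 0.2
# Synthetic probability ratios pi_new/pi_old and advantage estimates
ratios = torch.tensor([0.5, 0.9, 1.0, 1.1, 1.5])
advantages = torch.tensor([1.0, -1.0, 0.5, 2.0, -2.0])

unclipped = ratios * advantages
clipped = torch.clamp(ratios, 1 - eps, 1 + eps) * advantages
# PPO maximizes the element-wise minimum, so the loss is its negation
loss = -torch.min(unclipped, clipped).mean()
print(f"PPO surrogate loss: {loss.item():.4f}")
```

Taking the minimum means the objective never rewards moving the ratio further outside the clip range, which is what keeps updates conservative.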
11. TRPO (Trust Region Policy Optimization)
Category: Actor-Critic | Policy: On-policy
Description: Guarantees monotonic policy improvement by constraining each update to stay within a trust region (measured by KL divergence between old and new policies). Uses conjugate gradient and line search for optimization. More theoretically principled but more complex than PPO.
Use Cases: When guaranteed monotonic improvement is important, robotics, continuous control.
```python
# TRPO optimizes subject to a KL constraint:
#   maximize   E[pi_new(a|s)/pi_old(a|s) * A(s,a)]
#   subject to E[KL(pi_old || pi_new)] <= delta

# Typically used via sb3-contrib or RLlib
# from sb3_contrib import TRPO
# model = TRPO("MlpPolicy", "CartPole-v1", verbose=1)
# model.learn(total_timesteps=100000)
print("TRPO: constrained optimization for guaranteed improvement")
```
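For discrete action spaces the KL term in the constraint is a simple sum over action probabilities. A sketch with made-up old and new policy distributions:

```python
import torch

# Old and new action distributions for a batch of 3 states (rows sum to 1)
pi_old = torch.tensor([[0.5, 0.5], [0.8, 0.2], [0.1, 0.9]])
pi_new = torch.tensor([[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]])

# KL(pi_old || pi_new) per state, averaged over the batch
kl = (pi_old * (pi_old / pi_new).log()).sum(dim=1).mean()
delta = 0.01  # illustrative trust-region radius
print(f"mean KL = {kl.item():.4f}, within trust region: {kl.item() <= delta}")
```

TRPO only accepts an update if this averaged KL stays below delta; otherwise the line search backtracks toward the old policy.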
12. DDPG (Deep Deterministic Policy Gradient)
Category: Actor-Critic | Policy: Off-policy
Description: Extends DQN to continuous action spaces using a deterministic policy. The actor outputs a specific action (not a distribution), and the critic evaluates (state, action) pairs. Uses experience replay and target networks (like DQN) for stability. Adds noise to actions for exploration.
Use Cases: Continuous control (robotic arm, autonomous driving), physics simulations.
```python
# DDPG with stable-baselines3
from stable_baselines3 import DDPG

# model = DDPG(
#     "MlpPolicy",
#     "Pendulum-v1",
#     learning_rate=1e-3,
#     buffer_size=200000,
#     learning_starts=100,
#     batch_size=100,
#     tau=0.005,  # Soft target update coefficient
#     gamma=0.99,
#     verbose=1
# )
# model.learn(total_timesteps=100000)
print("DDPG: DQN + Actor-Critic for continuous actions")
```
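Because the actor is deterministic, exploration has to come from noise added to its output. A minimal sketch with Gaussian noise; the network shape, Pendulum-style action bound of 2.0, and noise scale are illustrative:

```python
import torch
import torch.nn as nn

# Deterministic actor: 3-dim state -> 1 action in [-2, 2] via tanh scaling
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
max_action = 2.0

state = torch.randn(1, 3)
action = max_action * actor(state)               # deterministic output
noisy = action + 0.1 * torch.randn_like(action)  # Gaussian exploration noise
noisy = noisy.clamp(-max_action, max_action)     # keep within action bounds
print(f"deterministic={action.item():.3f}, exploratory={noisy.item():.3f}")
```

At evaluation time the noise is simply dropped and the raw deterministic action is used.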
13. TD3 (Twin Delayed DDPG)
Category: Actor-Critic | Policy: Off-policy
Description: Improves DDPG with three key innovations: (1) twin critics -- uses two Q-networks and takes the minimum to reduce overestimation; (2) delayed actor updates -- updates the actor less frequently than the critics; (3) target policy smoothing -- adds noise to target actions for regularization.
Use Cases: Continuous control tasks, when DDPG is unstable, robotics.
```python
from stable_baselines3 import TD3

# model = TD3(
#     "MlpPolicy",
#     "Pendulum-v1",
#     learning_rate=1e-3,
#     buffer_size=200000,
#     batch_size=100,
#     tau=0.005,
#     gamma=0.99,
#     policy_delay=2,  # Update actor every 2 critic updates
#     target_policy_noise=0.2,
#     target_noise_clip=0.5,
#     verbose=1
# )
# model.learn(total_timesteps=100000)
print("TD3: three tricks to stabilize DDPG")
```
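Two of the three tricks, twin critics and target policy smoothing, meet in the target computation. A sketch with untrained networks and a random batch, all illustrative:

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 1, 0.99
# Twin target critics score concatenated (state, action) pairs
critic1 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic2 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, action_dim), nn.Tanh())

next_state = torch.randn(8, state_dim)
reward = torch.randn(8, 1)

with torch.no_grad():
    # Target policy smoothing: clipped noise on the target action
    noise = (0.2 * torch.randn(8, action_dim)).clamp(-0.5, 0.5)
    next_action = (target_actor(next_state) + noise).clamp(-1.0, 1.0)
    sa = torch.cat([next_state, next_action], dim=1)
    # Clipped double-Q: take the minimum of the twin critics
    target_q = torch.min(critic1(sa), critic2(sa))
    target = reward + gamma * target_q

print(f"target shape: {tuple(target.shape)}")
```

Taking the minimum makes the target a pessimistic estimate, directly countering the overestimation that destabilizes DDPG.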
14. Soft Actor-Critic (SAC)
Category: Actor-Critic | Policy: Off-policy
Description: Maximizes both expected return and entropy (randomness) of the policy. The entropy term encourages exploration and makes the policy robust. Uses twin critics (like TD3) and automatic temperature tuning. Often considered the best off-policy algorithm for continuous control.
Use Cases: Continuous control with exploration challenges, robotics, when robustness to different environments is needed.
```python
from stable_baselines3 import SAC

# model = SAC(
#     "MlpPolicy",
#     "Pendulum-v1",
#     learning_rate=3e-4,
#     buffer_size=1000000,
#     batch_size=256,
#     tau=0.005,
#     gamma=0.99,
#     ent_coef='auto',  # Automatic entropy coefficient tuning
#     verbose=1
# )
# model.learn(total_timesteps=100000)

# SAC objective: maximize E[sum(r + alpha * H(pi))]
# where H(pi) is the entropy of the policy
print("SAC: maximum entropy RL for robust continuous control")
```
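The entropy bonus can be made concrete with a Gaussian policy head: a wider action distribution earns a larger entropy reward, so the agent is paid to stay exploratory. The standard deviations and temperature below are illustrative:

```python
import torch

# Two Gaussian policies over a 1-D action: one narrow, one wide
narrow = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(0.1))
wide = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(1.0))

alpha = 0.2  # entropy temperature (SAC can tune this automatically)
# SAC adds alpha * H(pi) to the reward at each step
print(f"narrow policy entropy bonus: {alpha * narrow.entropy().item():.3f}")
print(f"wide policy entropy bonus:   {alpha * wide.entropy().item():.3f}")
```

As training progresses and the critic becomes confident, the learned temperature shrinks and the policy naturally narrows toward exploitation.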