Reinforcement Learning Algorithms (14)
Algorithms for learning optimal actions through environment interaction
Reinforcement Learning (RL) trains an agent to make sequential decisions by interacting with an environment to maximize cumulative reward. Unlike supervised learning, RL learns from experience without explicit labels. These 14 algorithms span value-based, policy-based, and actor-critic approaches.
Quick Reference Table
| Algorithm | Category | Action Space | On/Off Policy | Key Innovation |
|---|---|---|---|---|
| Q-Learning | Value-based | Discrete | Off-policy | Tabular Q-values |
| SARSA | Value-based | Discrete | On-policy | On-policy Q-learning |
| DQN | Value-based | Discrete | Off-policy | Neural network + replay buffer |
| Double DQN | Value-based | Discrete | Off-policy | Reduces overestimation |
| Dueling DQN | Value-based | Discrete | Off-policy | Separate value/advantage streams |
| Policy Gradient | Policy-based | Both | On-policy | Direct policy optimization |
| REINFORCE | Policy-based | Both | On-policy | Monte Carlo policy gradient |
| Actor-Critic | Actor-Critic | Both | Both | Value baseline reduces variance |
| A3C | Actor-Critic | Both | On-policy | Asynchronous parallel training |
| PPO | Actor-Critic | Both | On-policy | Clipped objective, stable updates |
| TRPO | Actor-Critic | Both | On-policy | Trust region constraint |
| DDPG | Actor-Critic | Continuous | Off-policy | Deterministic policy + continuous |
| TD3 | Actor-Critic | Continuous | Off-policy | Twin critics, delayed updates |
| SAC | Actor-Critic | Continuous | Off-policy | Maximum entropy framework |
1. Q-Learning
Category: Value-based | Policy: Off-policy
Description: Learns a Q-table mapping (state, action) pairs to expected cumulative rewards. Updates using the Bellman equation: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]. The agent selects actions greedily from Q-values (with epsilon-greedy exploration).
Use Cases: Grid worlds, simple games, environments with small discrete state/action spaces.
```python
import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, done):
        target = reward
        if not done:
            target += self.gamma * np.max(self.q_table[next_state])
        self.q_table[state, action] += self.lr * (target - self.q_table[state, action])

# Usage
agent = QLearning(n_states=16, n_actions=4)
print(f"Q-table shape: {agent.q_table.shape}")
```
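To see the update rule converge, here is a minimal training loop on a toy corridor environment. The environment (a 5-state chain where moving right eventually earns reward 1), the seed, and the hyperparameters are all illustrative, not from the text above; the update line is the same Bellman rule:

```python
import numpy as np

# Toy deterministic corridor: states 0..4, actions 0 (left) / 1 (right).
# Reaching state 4 gives reward 1 and ends the episode.
n_states, n_actions = 5, 2
q = np.zeros((n_states, n_actions))
lr, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

def env_step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

for episode in range(300):
    state = 0
    for _ in range(100):  # cap episode length
        # Epsilon-greedy, with random tie-breaking so untrained states explore
        if rng.random() < epsilon or q[state, 0] == q[state, 1]:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, done = env_step(state, action)
        target = reward + (0.0 if done else gamma * np.max(q[next_state]))
        q[state, action] += lr * (target - q[state, action])
        state = next_state
        if done:
            break

greedy_policy = np.argmax(q, axis=1)
print(f"greedy policy (states 0-3): {greedy_policy[:4]}")
```

After training, the greedy policy should select "right" in every non-terminal state, since the discounted value of heading toward the goal dominates.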
2. SARSA
Category: Value-based | Policy: On-policy
Description: State-Action-Reward-State-Action. Similar to Q-Learning but uses the actual next action (not the max) for the update: Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]. Being on-policy makes it more conservative and safer in stochastic environments.
Use Cases: When the agent's exploration policy matters (e.g., cliff-walking), safer learning.
```python
class SARSA:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state, next_action, done):
        target = reward
        if not done:
            target += self.gamma * self.q_table[next_state, next_action]  # On-policy
        self.q_table[state, action] += self.lr * (target - self.q_table[state, action])

agent = SARSA(n_states=16, n_actions=4)
print("SARSA: on-policy Q-learning variant")
```
3. Deep Q-Network (DQN)
Category: Value-based | Policy: Off-policy
Description: Replaces the Q-table with a neural network that approximates Q(s,a). Introduced two key innovations: an experience replay buffer (breaks correlations in sequential data) and a target network (stabilizes training by using a slowly updated copy to compute targets). This was the 2015 breakthrough that achieved human-level play on many Atari games.
Use Cases: Atari games, environments with large state spaces, discrete action problems.
```python
import torch
import torch.nn as nn
from collections import deque

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.network(x)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.q_net = DQN(state_dim, action_dim)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.replay_buffer = deque(maxlen=10000)
        self.gamma = gamma

    def update_target(self):
        self.target_net.load_state_dict(self.q_net.state_dict())

agent = DQNAgent(state_dim=4, action_dim=2)
print("DQN agent initialized")
```
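The agent above omits the actual training step. A minimal sketch of how a sampled minibatch becomes a TD loss follows; the network sizes, batch size, and the randomly generated transitions are illustrative stand-ins for real environment data:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Minimal online and target Q-networks (4-dim state, 2 actions)
q_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
target_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Fill the replay buffer with synthetic (s, a, r, s', done) transitions
buffer = deque(maxlen=10000)
for _ in range(256):
    buffer.append((torch.randn(4), random.randrange(2), random.random(),
                   torch.randn(4), random.random() < 0.1))

# One training step: sample a batch, compute TD targets, minimize MSE
batch = random.sample(buffer, 64)
states = torch.stack([t[0] for t in batch])
actions = torch.tensor([t[1] for t in batch]).unsqueeze(1)
rewards = torch.tensor([t[2] for t in batch])
next_states = torch.stack([t[3] for t in batch])
dones = torch.tensor([float(t[4]) for t in batch])

q_values = q_net(states).gather(1, actions).squeeze(1)
with torch.no_grad():  # targets come from the frozen target network
    targets = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)
loss = nn.functional.mse_loss(q_values, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"TD loss: {loss.item():.4f}")
```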
4. Double DQN
Category: Value-based | Policy: Off-policy
Description: Addresses DQN's overestimation bias by decoupling action selection from action evaluation. Uses the online network to select the best action, but the target network to evaluate that action's value. This simple change significantly improves stability.
Use Cases: Any DQN application where overestimation is a concern, improved Atari performance.
```python
# Double DQN: the only change is in the target calculation.
# Standard DQN target: r + gamma * max_a' Q_target(s', a')
# Double DQN target:   r + gamma * Q_target(s', argmax_a' Q_online(s', a'))
def double_dqn_target(q_net, target_net, next_states, rewards, dones, gamma):
    with torch.no_grad():
        # Select actions using the online network
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # Evaluate those actions using the target network
        next_q_values = target_net(next_states).gather(1, best_actions).squeeze(1)
        targets = rewards + gamma * next_q_values * (1 - dones)
    return targets

print("Double DQN: decouples selection from evaluation")
```
5. Dueling DQN
Category: Value-based | Policy: Off-policy
Description: Modifies the DQN architecture to have two separate streams: one estimates the state value V(s) and the other estimates the advantage A(s,a) of each action. Q(s,a) = V(s) + A(s,a) - mean(A). This allows the network to learn which states are valuable without having to evaluate every action.
Use Cases: Environments where many actions have similar values, improved sample efficiency.
```python
class DuelingDQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU()
        )
        # Value stream
        self.value_stream = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        # Advantage stream
        self.advantage_stream = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim)
        )

    def forward(self, x):
        shared = self.shared(x)
        value = self.value_stream(shared)
        advantage = self.advantage_stream(shared)
        # Combine: Q = V + (A - mean(A))
        q_values = value + advantage - advantage.mean(dim=1, keepdim=True)
        return q_values

model = DuelingDQN(state_dim=4, action_dim=2)
print(f"Dueling DQN parameters: {sum(p.numel() for p in model.parameters())}")
```
6. Policy Gradient
Category: Policy-based | Policy: On-policy
Description: Directly parameterizes and optimizes the policy (mapping from states to action probabilities) without learning a value function. Uses the policy gradient theorem to compute gradients of expected return with respect to policy parameters. Can handle continuous action spaces naturally.
Use Cases: Continuous action spaces, stochastic policies, when value-based methods struggle.
```python
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.network(state)

policy = PolicyNetwork(state_dim=4, action_dim=2)
print("Policy Gradient: directly optimizes action probabilities")
```
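To act with such a network, one would typically sample from the output distribution and keep the log-probability for the gradient step. A sketch using `torch.distributions.Categorical` (the network shape and random input state are illustrative):

```python
import torch
import torch.nn as nn

# Small policy head: 4-dim state, 2 discrete actions
policy = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax(dim=-1)
)

state = torch.randn(1, 4)
probs = policy(state)
dist = torch.distributions.Categorical(probs)
action = dist.sample()            # stochastic action
log_prob = dist.log_prob(action)  # needed for the policy gradient
print(f"action={action.item()}, log_prob={log_prob.item():.3f}")
```

The per-step policy-gradient loss is then `-log_prob * return`, which pushes probability mass toward actions that earned high returns.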
7. REINFORCE
Category: Policy-based | Policy: On-policy
Description: A Monte Carlo policy gradient method. Collects complete episode trajectories, then updates the policy by increasing the probability of actions that led to high returns. Simple but has high variance due to using full episode returns.
Use Cases: Episodic tasks, when simplicity is preferred over sample efficiency.
```python
class REINFORCE:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma

    def compute_returns(self, rewards):
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self, log_probs, rewards):
        returns = self.compute_returns(rewards)
        loss = (-torch.stack(log_probs) * returns).sum()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

agent = REINFORCE(state_dim=4, action_dim=2)
print("REINFORCE: Monte Carlo policy gradient")
```
8. Actor-Critic
Category: Actor-Critic | Policy: Both
Description: Combines policy-based (actor) and value-based (critic) methods. The actor learns the policy, while the critic learns the value function to reduce variance. The critic provides a baseline, replacing the high-variance Monte Carlo returns of REINFORCE with lower-variance TD estimates.
Use Cases: General RL tasks, when REINFORCE has too high variance, continuous and discrete action spaces.
```python
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU()
        )
        self.actor = nn.Sequential(nn.Linear(128, action_dim), nn.Softmax(dim=-1))
        self.critic = nn.Linear(128, 1)

    def forward(self, state):
        shared = self.shared(state)
        policy = self.actor(shared)
        value = self.critic(shared)
        return policy, value

model = ActorCritic(state_dim=4, action_dim=2)
state = torch.randn(1, 4)
policy, value = model(state)
print(f"Policy: {policy.data}, Value: {value.data}")
```
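A single one-step update can illustrate how the critic's TD error serves as the advantage estimate. This sketch uses a synthetic transition and standalone network pieces rather than the class above:

```python
import torch
import torch.nn as nn

# Shared trunk with separate actor and critic heads
shared = nn.Sequential(nn.Linear(4, 128), nn.ReLU())
actor = nn.Sequential(nn.Linear(128, 2), nn.Softmax(dim=-1))
critic = nn.Linear(128, 1)
params = list(shared.parameters()) + list(actor.parameters()) + list(critic.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
gamma = 0.99

# One synthetic transition (s, a, r, s')
state, next_state, reward = torch.randn(1, 4), torch.randn(1, 4), 1.0

h = shared(state)
probs, value = actor(h), critic(h)
dist = torch.distributions.Categorical(probs)
action = dist.sample()

with torch.no_grad():
    next_value = critic(shared(next_state))
# TD error = low-variance advantage estimate
advantage = reward + gamma * next_value - value

# Actor pushes log-prob in the direction of the (detached) advantage;
# critic regresses toward the TD target.
actor_loss = -dist.log_prob(action) * advantage.detach()
critic_loss = advantage.pow(2)
loss = (actor_loss + critic_loss).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"advantage={advantage.item():.3f}")
```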
9. A3C (Asynchronous Advantage Actor-Critic)
Category: Actor-Critic | Policy: On-policy
Description: Runs multiple actor-critic agents in parallel, each in its own copy of the environment. Each agent computes gradients locally and asynchronously updates a shared global model. The asynchronous updates naturally provide diverse experiences, eliminating the need for a replay buffer.
Use Cases: Training on multi-core CPUs, when parallelism is available, Atari games.
```python
# A3C pseudocode structure
# (Full implementation requires multiprocessing)
class A3CWorker:
    """Each worker runs in its own process with a copy of the environment."""
    def __init__(self, global_model, optimizer, env_name):
        self.global_model = global_model
        self.local_model = ActorCritic(state_dim=4, action_dim=2)
        self.optimizer = optimizer

    def sync_with_global(self):
        self.local_model.load_state_dict(self.global_model.state_dict())

    def push_gradients_to_global(self):
        for local_param, global_param in zip(
            self.local_model.parameters(),
            self.global_model.parameters()
        ):
            global_param.grad = local_param.grad
        self.optimizer.step()

print("A3C: parallel actor-critic with asynchronous gradient updates")
```
10. Proximal Policy Optimization (PPO)
Category: Actor-Critic | Policy: On-policy
Description: One of the most widely used RL algorithms today. Improves on TRPO by using a simpler clipped surrogate objective that keeps policy updates within a "trust region" without expensive second-order optimization. Strikes a strong balance among simplicity, sample efficiency, and stability.
Use Cases: Robotics, game AI, RLHF for language models, any continuous or discrete control task. Used by OpenAI for ChatGPT alignment.
```python
# PPO with clipped objective (using stable-baselines3)
from stable_baselines3 import PPO

# model = PPO(
#     "MlpPolicy",
#     "CartPole-v1",
#     learning_rate=3e-4,
#     n_steps=2048,
#     batch_size=64,
#     n_epochs=10,
#     gamma=0.99,
#     gae_lambda=0.95,
#     clip_range=0.2,  # Clipping parameter epsilon
#     verbose=1
# )
# model.learn(total_timesteps=100000)

# Core PPO clipped objective:
# L_CLIP = min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)
# where r_t = pi_new(a|s) / pi_old(a|s)
print("PPO: the go-to RL algorithm for most applications")
```
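The clipped objective translates directly into a few lines of PyTorch. This sketch uses hand-picked probability ratios and advantage estimates purely for illustration:

```python
import torch

eps = 0.2
# Synthetic probability ratios pi_new/pi_old and advantage estimates
ratios = torch.tensor([0.5, 0.9, 1.0, 1.1, 1.5])
advantages = torch.tensor([1.0, -1.0, 0.5, 2.0, -2.0])

unclipped = ratios * advantages
clipped = torch.clamp(ratios, 1 - eps, 1 + eps) * advantages
# PPO maximizes the element-wise minimum, so the loss is its negation
loss = -torch.min(unclipped, clipped).mean()
print(f"PPO surrogate loss: {loss.item():.4f}")
```

Taking the minimum means the objective never rewards moving the ratio further outside the clip range, which is what keeps updates conservative.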
11. TRPO (Trust Region Policy Optimization)
Category: Actor-Critic | Policy: On-policy
Description: Guarantees monotonic policy improvement by constraining each update to stay within a trust region (measured by KL divergence between old and new policies). Uses conjugate gradient and line search for optimization. More theoretically principled but more complex than PPO.
Use Cases: When guaranteed monotonic improvement is important, robotics, continuous control.
```python
# TRPO optimizes subject to a KL constraint:
#   maximize   E[pi_new(a|s)/pi_old(a|s) * A(s,a)]
#   subject to E[KL(pi_old || pi_new)] <= delta

# Typically used via sb3-contrib or RLlib
# from sb3_contrib import TRPO
# model = TRPO("MlpPolicy", "CartPole-v1", verbose=1)
# model.learn(total_timesteps=100000)
print("TRPO: constrained optimization for guaranteed improvement")
```
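For discrete action spaces the KL term in the constraint is a simple sum over action probabilities. A sketch with made-up old and new policy distributions:

```python
import torch

# Old and new action distributions for a batch of 3 states (rows sum to 1)
pi_old = torch.tensor([[0.5, 0.5], [0.8, 0.2], [0.1, 0.9]])
pi_new = torch.tensor([[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]])

# KL(pi_old || pi_new) per state, averaged over the batch
kl = (pi_old * (pi_old / pi_new).log()).sum(dim=1).mean()
delta = 0.01  # illustrative trust-region radius
print(f"mean KL = {kl.item():.4f}, within trust region: {kl.item() <= delta}")
```

TRPO only accepts an update if this averaged KL stays below delta; otherwise the line search backtracks toward the old policy.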
12. DDPG (Deep Deterministic Policy Gradient)
Category: Actor-Critic | Policy: Off-policy
Description: Extends DQN to continuous action spaces using a deterministic policy. The actor outputs a specific action (not a distribution), and the critic evaluates (state, action) pairs. Uses experience replay and target networks (like DQN) for stability. Adds noise to actions for exploration.
Use Cases: Continuous control (robotic arm, autonomous driving), physics simulations.
```python
# DDPG with stable-baselines3
from stable_baselines3 import DDPG

# model = DDPG(
#     "MlpPolicy",
#     "Pendulum-v1",
#     learning_rate=1e-3,
#     buffer_size=200000,
#     learning_starts=100,
#     batch_size=100,
#     tau=0.005,  # Soft target update coefficient
#     gamma=0.99,
#     verbose=1
# )
# model.learn(total_timesteps=100000)
print("DDPG: DQN + Actor-Critic for continuous actions")
```
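Because the actor is deterministic, exploration has to come from noise added to its output. A minimal sketch with Gaussian noise; the network shape, Pendulum-style action bound of 2.0, and noise scale are illustrative:

```python
import torch
import torch.nn as nn

# Deterministic actor: 3-dim state -> 1 action in [-2, 2] via tanh scaling
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
max_action = 2.0

state = torch.randn(1, 3)
action = max_action * actor(state)               # deterministic output
noisy = action + 0.1 * torch.randn_like(action)  # Gaussian exploration noise
noisy = noisy.clamp(-max_action, max_action)     # keep within action bounds
print(f"deterministic={action.item():.3f}, exploratory={noisy.item():.3f}")
```

At evaluation time the noise is simply dropped and the raw deterministic action is used.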
13. TD3 (Twin Delayed DDPG)
Category: Actor-Critic | Policy: Off-policy
Description: Improves DDPG with three key innovations: (1) twin critics -- uses two Q-networks and takes the minimum to reduce overestimation; (2) delayed actor updates -- updates the actor less frequently than the critics; (3) target policy smoothing -- adds noise to target actions for regularization.
Use Cases: Continuous control tasks, when DDPG is unstable, robotics.
```python
from stable_baselines3 import TD3

# model = TD3(
#     "MlpPolicy",
#     "Pendulum-v1",
#     learning_rate=1e-3,
#     buffer_size=200000,
#     batch_size=100,
#     tau=0.005,
#     gamma=0.99,
#     policy_delay=2,  # Update actor every 2 critic updates
#     target_policy_noise=0.2,
#     target_noise_clip=0.5,
#     verbose=1
# )
# model.learn(total_timesteps=100000)
print("TD3: three tricks to stabilize DDPG")
```
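Two of the three tricks, twin critics and target policy smoothing, meet in the target computation. A sketch with untrained networks and a random batch, all illustrative:

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 1, 0.99
# Twin target critics score concatenated (state, action) pairs
critic1 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic2 = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, action_dim), nn.Tanh())

next_state = torch.randn(8, state_dim)
reward = torch.randn(8, 1)

with torch.no_grad():
    # Target policy smoothing: clipped noise on the target action
    noise = (0.2 * torch.randn(8, action_dim)).clamp(-0.5, 0.5)
    next_action = (target_actor(next_state) + noise).clamp(-1.0, 1.0)
    sa = torch.cat([next_state, next_action], dim=1)
    # Clipped double-Q: take the minimum of the twin critics
    target_q = torch.min(critic1(sa), critic2(sa))
    target = reward + gamma * target_q

print(f"target shape: {tuple(target.shape)}")
```

Taking the minimum makes the target a pessimistic estimate, directly countering the overestimation that destabilizes DDPG.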
14. Soft Actor-Critic (SAC)
Category: Actor-Critic | Policy: Off-policy
Description: Maximizes both expected return and entropy (randomness) of the policy. The entropy term encourages exploration and makes the policy robust. Uses twin critics (like TD3) and automatic temperature tuning. Often considered the best off-policy algorithm for continuous control.
Use Cases: Continuous control with exploration challenges, robotics, when robustness to different environments is needed.
```python
from stable_baselines3 import SAC

# model = SAC(
#     "MlpPolicy",
#     "Pendulum-v1",
#     learning_rate=3e-4,
#     buffer_size=1000000,
#     batch_size=256,
#     tau=0.005,
#     gamma=0.99,
#     ent_coef='auto',  # Automatic entropy coefficient tuning
#     verbose=1
# )
# model.learn(total_timesteps=100000)

# SAC objective: maximize E[sum(r + alpha * H(pi))]
# where H(pi) is the entropy of the policy
print("SAC: maximum entropy RL for robust continuous control")
```
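The entropy bonus can be made concrete with a Gaussian policy head: a wider action distribution earns a larger entropy reward, so the agent is paid to stay exploratory. The standard deviations and temperature below are illustrative:

```python
import torch

# Two Gaussian policies over a 1-D action: one narrow, one wide
narrow = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(0.1))
wide = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(1.0))

alpha = 0.2  # entropy temperature (SAC can tune this automatically)
# SAC adds alpha * H(pi) to the reward at each step
print(f"narrow policy entropy bonus: {alpha * narrow.entropy().item():.3f}")
print(f"wide policy entropy bonus:   {alpha * wide.entropy().item():.3f}")
```

As training progresses and the critic becomes confident, the learned temperature shrinks and the policy naturally narrows toward exploitation.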