Reinforcement learning fundamentals and practical implementations. Covers RL concepts (agent, environment, reward, policy), Q-Learning, Deep Q-Network (DQN), Policy Gradient methods, PPO, Actor-Critic, Gymnasium environments, Stable-Baselines3, reward shaping, and exploration-exploitation trade-off. Use when user asks about 'reinforcement learning', 'RL', 'Q-learning', 'DQN', 'PPO', 'policy gradient', 'reward function', 'agent', 'environment', 'Gym', 'Gymnasium', 'exploration', 'exploitation', 'Stable-Baselines3', 'Actor-Critic', 'SARSA', 'Bellman equation', or 'Markov decision process'.
Reinforcement Learning - Learning from Interaction
Reinforcement learning: the agent learns through trial and error.
Quick Start - Q-Learning (No Libraries)
import numpy as np

# Simple GridWorld
# Agent starts at (0,0), goal at (3,3), in 4x4 grid
grid_size = 4
n_states = grid_size * grid_size
n_actions = 4  # up, down, left, right

# Q-table: state × action → expected return
Q = np.zeros((n_states, n_actions))

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.99  # gamma
epsilon = 1.0  # exploration rate
epsilon_decay = 0.995
epsilon_min = 0.01
episodes = 1000

def state_to_pos(state):
    return state // grid_size, state % grid_size

def pos_to_state(row, col):
    return row * grid_size + col

def step(state, action):
    """Take action, return (next_state, reward, done)."""
    row, col = state_to_pos(state)
    if action == 0:
        row = max(0, row - 1)  # up
    elif action == 1:
        row = min(grid_size - 1, row + 1)  # down
    elif action == 2:
        col = max(0, col - 1)  # left
    elif action == 3:
        col = min(grid_size - 1, col + 1)  # right
    next_state = pos_to_state(row, col)
    done = (next_state == n_states - 1)  # Goal at last cell
    reward = 1.0 if done else -0.01  # Small penalty per step
    return next_state, reward, done

# Training loop
for episode in range(episodes):
    state = 0  # Start at (0,0)
    total_reward = 0
    for t in range(100):  # Max steps per episode
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = np.random.randint(n_actions)  # Explore
        else:
            action = np.argmax(Q[state])  # Exploit
        next_state, reward, done = step(state, action)
        total_reward += reward
        # Q-Learning update (off-policy)
        # Q(s,a) ← Q(s,a) + α[r + γ·max_a'Q(s',a') - Q(s,a)]
        best_next = np.max(Q[next_state])
        Q[state, action] += learning_rate * (
            reward + discount_factor * best_next - Q[state, action]
        )
        state = next_state
        if done:
            break
    # Decay exploration once per episode
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    if episode % 100 == 0:
        print(f"Episode {episode}, Reward: {total_reward:.2f}, "
              f"Epsilon: {epsilon:.3f}")

print("Learned Q-table:")
print(Q.reshape(grid_size, grid_size, n_actions).round(2))
When This Skill Activates
Use this skill when:
- Learning RL concepts (agent, environment, reward)
- Implementing Q-Learning or DQN
- Training agents with PPO or A2C
- Working with Gymnasium environments
- Designing reward functions
- Understanding exploration vs exploitation
Core Patterns
Pattern 1: RL Framework
RL Components:
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  ┌───────────┐   action a    ┌──────────────┐           │
│  │           │──────────────▶│              │           │
│  │   Agent   │               │ Environment  │           │
│  │ (Policy)  │◀──────────────│   (World)    │           │
│  │           │   state s'    │              │           │
│  │           │   reward r    │              │           │
│  └───────────┘               └──────────────┘           │
│                                                         │
│  Agent's Goal: Maximize cumulative reward               │
│  G = r₁ + γr₂ + γ²r₃ + ...  (discounted return)         │
│                                                         │
│  Policy π(a|s): probability of action a in state s      │
│  Value V(s): expected return from state s               │
│  Q-Value Q(s,a): expected return from (state, action)   │
└─────────────────────────────────────────────────────────┘
Pattern 2: DQN (Deep Q-Network) with PyTorch
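DQN regresses Q(s, a) toward the bootstrapped target r + γ·max_a' Q_target(s', a'), with the bootstrap term dropped when s' is terminal. The following is a minimal sketch of that target on made-up numbers, independent of PyTorch; the helper name `td_targets` and the sample values are illustrative assumptions, not part of the implementation in this section.

```python
def td_targets(rewards, next_q_values, terminated, gamma=0.99):
    """Return r + gamma * max_a' Q(s', a') per transition, without bootstrap at terminal states."""
    targets = []
    for r, q_next, term in zip(rewards, next_q_values, terminated):
        # No future value is added once the episode has truly ended
        bootstrap = 0.0 if term else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets

# Two made-up transitions; the second one ends the episode.
targets = td_targets(
    rewards=[1.0, 1.0],
    next_q_values=[[0.5, 2.0], [3.0, 4.0]],
    terminated=[False, True],
    gamma=0.9,
)
print(targets)  # first target is about 1 + 0.9 * 2.0, second is just the reward
```

The second transition ignores its next-state Q-values entirely, which is what the `(1 - dones_t)` factor does in the batched PyTorch version.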
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random
import gymnasium as gym

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.network(x)

class ReplayBuffer:
    """Experience replay for stable training."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        # Explicit dtypes so the tensors built in train_step get the expected types
        return (np.array(states, dtype=np.float32),
                np.array(actions, dtype=np.int64),
                np.array(rewards, dtype=np.float32),
                np.array(next_states, dtype=np.float32),
                np.array(dones, dtype=np.float32))

    def __len__(self):
        return len(self.buffer)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3):
        self.action_dim = action_dim
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.gamma = 0.99
        self.batch_size = 64
        self.q_network = DQN(state_dim, action_dim)
        self.target_network = DQN(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        self.buffer = ReplayBuffer()

    def select_action(self, state):
        # Epsilon-greedy: random action with probability epsilon
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.q_network(state_t)
            return q_values.argmax().item()

    def train_step(self):
        if len(self.buffer) < self.batch_size:
            return
        states, actions, rewards, next_states, dones = \
            self.buffer.sample(self.batch_size)
        states_t = torch.FloatTensor(states)
        actions_t = torch.LongTensor(actions)
        rewards_t = torch.FloatTensor(rewards)
        next_states_t = torch.FloatTensor(next_states)
        dones_t = torch.FloatTensor(dones)
        # Current Q-values for the actions actually taken
        current_q = self.q_network(states_t).gather(1, actions_t.unsqueeze(1))
        # Target Q-values (from target network)
        with torch.no_grad():
            next_q = self.target_network(next_states_t).max(1)[0]
            target_q = rewards_t + self.gamma * next_q * (1 - dones_t)
        # Loss and update
        loss = nn.MSELoss()(current_q.squeeze(1), target_q)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # Epsilon decays after every gradient step (not once per episode)
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

    def update_target(self):
        """Copy Q-network weights to target network."""
        self.target_network.load_state_dict(self.q_network.state_dict())

# Training
env = gym.make("CartPole-v1")
agent = DQNAgent(state_dim=4, action_dim=2)
for episode in range(500):
    state, _ = env.reset()
    total_reward = 0
    for t in range(500):
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Store `terminated` rather than `done`: a time-limit truncation is not
        # a true terminal state, so the target should still bootstrap from it
        agent.buffer.push(state, action, reward, next_state, terminated)
        agent.train_step()
        state = next_state
        total_reward += reward
        if done:
            break
    # Update target network periodically
    if episode % 10 == 0:
        agent.update_target()
    if episode % 50 == 0:
        print(f"Episode {episode}, Reward: {total_reward:.0f}, "
              f"Epsilon: {agent.epsilon:.3f}")
Pattern 3: Stable-Baselines3 (High-Level RL)
from stable_baselines3 import PPO, DQN, A2C, SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback
import gymnasium as gym

# Create vectorized environment (parallel)
env = make_vec_env("CartPole-v1", n_envs=4)

# Train with PPO (most versatile algorithm)
model = PPO(
    "MlpPolicy",          # Neural network policy
    env,
    learning_rate=3e-4,
    n_steps=2048,         # Steps per update
    batch_size=64,
    n_epochs=10,          # Optimization epochs per update
    gamma=0.99,           # Discount factor
    verbose=1
)

# Evaluation callback (saves best model)
eval_callback = EvalCallback(
    eval_env=gym.make("CartPole-v1"),
    eval_freq=5000,
    best_model_save_path="./best_model/",
    deterministic=True
)

# Train
model.learn(total_timesteps=100_000, callback=eval_callback)

# Test the trained agent
env = gym.make("CartPole-v1", render_mode="human")
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()

# Save / Load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")
Pattern 4: Algorithm Selection Guide
RL Algorithm Decision Tree:
┌──────────────────────────────────────────────────────┐
│                                                      │
│  Is the action space DISCRETE?                       │
│  ├── YES                                             │
│  │   ├── Simple problem? → Q-Learning (tabular)      │
│  │   ├── Complex states? → DQN                       │
│  │   └── Want stability? → PPO (recommended)         │
│  │                                                   │
│  └── NO (continuous actions)                         │
│      ├── Want simplicity? → PPO (works for both!)    │
│      ├── Want sample efficiency? → SAC               │
│      └── Want deterministic? → TD3                   │
│                                                      │
│  General Recommendation:                             │
│  Start with PPO → it works well almost everywhere    │
└──────────────────────────────────────────────────────┘
| Algorithm | Action Space | Sample Efficiency | Stability | Use Case |
|---|---|---|---|---|
| Q-Learning | Discrete | N/A (tabular) | Good | Simple, small state space |
| DQN | Discrete | Low | Medium | Atari games, discrete control |
| PPO | Both | Medium | High | General purpose, recommended default |
| A2C | Both | Low | Medium | Simpler alternative to PPO |
| SAC | Continuous | High | High | Robotics, continuous control |
| TD3 | Continuous | High | High | Deterministic continuous control |
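The decision tree and table above can also be read as a simple lookup. The following is a minimal sketch, assuming a hypothetical helper `suggest_algorithm` and made-up priority labels; it only restates the guide and is not a substitute for benchmarking on the actual task.

```python
def suggest_algorithm(discrete_actions, priority="stability"):
    """Hypothetical helper that mirrors the selection guide; returns an algorithm name."""
    if discrete_actions:
        choices = {
            "simple": "Q-Learning",      # small, tabular state space
            "complex_states": "DQN",     # large or continuous observations
            "stability": "PPO",
        }
    else:
        choices = {
            "simplicity": "PPO",
            "sample_efficiency": "SAC",
            "deterministic": "TD3",
        }
    # Fall back to the guide's general recommendation
    return choices.get(priority, "PPO")

print(suggest_algorithm(True, "simple"))
print(suggest_algorithm(False, "sample_efficiency"))
```

Any priority the guide does not list falls back to PPO, matching its general recommendation.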
Pattern 5: Custom Gymnasium Environment
import gymnasium as gym
from gymnasium import spaces
import numpy as np
from stable_baselines3 import PPO

class TradingEnv(gym.Env):
    """Custom environment for stock trading."""
    metadata = {"render_modes": ["human"]}

    def __init__(self, prices, initial_balance=10000):
        super().__init__()
        self.prices = prices
        self.initial_balance = initial_balance
        # Action: 0=hold, 1=buy, 2=sell
        self.action_space = spaces.Discrete(3)
        # Observation: [balance, shares_held, current_price, price_change]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        # Gymnasium's reset signature takes both `seed` and `options`
        super().reset(seed=seed)
        self.balance = self.initial_balance
        self.shares = 0
        self.current_step = 0
        return self._get_obs(), {}

    def _get_obs(self):
        price = self.prices[self.current_step]
        prev_price = self.prices[max(0, self.current_step - 1)]
        change = (price - prev_price) / prev_price if prev_price > 0 else 0
        return np.array([
            self.balance, self.shares, price, change
        ], dtype=np.float32)

    def step(self, action):
        price = self.prices[self.current_step]
        # Execute action (one share per trade)
        if action == 1 and self.balance >= price:  # Buy
            self.shares += 1
            self.balance -= price
        elif action == 2 and self.shares > 0:  # Sell
            self.shares -= 1
            self.balance += price
        self.current_step += 1
        done = self.current_step >= len(self.prices) - 1
        # Reward: portfolio return relative to the initial balance
        # (cumulative since reset, valued at the price the trade executed at)
        portfolio = self.balance + self.shares * price
        reward = (portfolio - self.initial_balance) / self.initial_balance
        # Returns (obs, reward, terminated, truncated, info)
        return self._get_obs(), reward, done, False, {}

# Use with Stable-Baselines3
# `stock_prices` is a 1-D sequence of prices supplied by the caller (not defined here)
env = TradingEnv(prices=stock_prices)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)
Pattern 6: Policy Gradient (REINFORCE)
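REINFORCE weights each action's log-probability by the discounted return G_t = r_t + γ·G_{t+1} that followed it. The following is a small standalone sketch of that backward recursion on a toy reward list; the function name `discounted_returns` is illustrative and not part of any library.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    g = 0.0
    # Walk the episode backwards so each return reuses the one after it
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Earlier steps accumulate more future reward, so they receive larger returns; this is the quantity the REINFORCE code in this section normalizes before the gradient step.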
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import gymnasium as gym

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)

def reinforce(env_name="CartPole-v1", episodes=1000, gamma=0.99, lr=1e-3):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    policy = PolicyNetwork(state_dim, action_dim)
    optimizer = optim.Adam(policy.parameters(), lr=lr)
    for episode in range(episodes):
        state, _ = env.reset()
        log_probs = []
        rewards = []
        # Collect trajectory
        done = False
        while not done:
            state_t = torch.FloatTensor(state)
            probs = policy(state_t)
            dist = Categorical(probs)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        # Compute discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Policy gradient update
        # ∇J(θ) = E[∑ ∇log π(a|s) · G_t]
        loss = -torch.stack(log_probs) @ returns
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if episode % 50 == 0:
            print(f"Episode {episode}, Total Reward: {sum(rewards):.0f}")

reinforce()
Reference Navigation
For detailed content, see:
- RL Fundamentals: reference/rl_fundamentals.md - MDP, Bellman equation, value/policy
- Q-Learning & DQN: reference/q_learning_dqn.md - Tabular Q, DQN tricks, Double DQN
- Policy Gradient: reference/policy_gradient.md - REINFORCE, Actor-Critic, PPO math
- Gym Environments: reference/gym_environments.md - Custom envs, wrappers, benchmarks
Common Mistakes to Avoid
1. No Experience Replay in DQN
# WRONG: Train on single transitions (correlated, unstable)
loss = compute_loss(state, action, reward, next_state)
# CORRECT: Sample random batch from replay buffer
batch = buffer.sample(batch_size=64) # Breaks correlation!
loss = compute_batch_loss(batch)
2. No Target Network
# WRONG: Q-network chases its own tail
target = reward + gamma * q_network(next_state).max() # Moving target!
# CORRECT: Stable target from frozen network
target = reward + gamma * target_network(next_state).max()
# Update target_network periodically (every N episodes)
3. Reward Shaping Gone Wrong
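Shaping itself is not the problem; the form of the shaping term is. Potential-based shaping adds F(s, s') = γ·Φ(s') − Φ(s) to the environment reward, a form known to leave the optimal policy unchanged. The sketch below is illustrative only: the potential Φ (negative Manhattan distance to the goal) and the helper names are assumptions, applied to the 4x4 grid from the Quick Start.

```python
GRID_SIZE = 4
GOAL = (3, 3)

def potential(state):
    """Hypothetical potential: negative Manhattan distance to the goal cell.

    The goal itself has potential 0, so no extra bonus is paid at the terminal state.
    """
    row, col = divmod(state, GRID_SIZE)
    return -(abs(GOAL[0] - row) + abs(GOAL[1] - col))

def shaped_reward(reward, state, next_state, gamma=0.99):
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)

# Moving toward the goal (state 0 -> 1) earns more than moving away (1 -> 0)
print(shaped_reward(-0.01, 0, 1))
print(shaped_reward(-0.01, 1, 0))
```

Because the shaping term is a difference of potentials, a loop that returns to its starting state collects no net bonus when γ = 1 and a small net penalty when γ < 1, which removes the incentive to circle near the goal.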
# WRONG: Dense reward that creates shortcuts
reward = -distance_to_goal # Agent might find exploit
# BETTER: Sparse + well-defined
reward = 1.0 if reached_goal else 0.0
# Or: shaped reward that preserves optimal policy
Teaching Mode
RL vs Supervised Learning
Supervised Learning:              Reinforcement Learning:
┌──────────┐                      ┌──────────┐
│ Dataset  │ = Fixed              │  Agent   │ = Learns by doing
│  (X, y)  │   answers            │          │
└────┬─────┘                      └────┬─────┘
     │                                 │
     ▼                                 ▼
Learn:                            Explore:
"input → output"                  "try action → see result"
                                  "good result → do more"
                                  "bad result → do less"
Key difference:
- Supervised: Teacher gives correct answers
- RL: Agent discovers good actions through trial and error
- RL: Reward may be delayed (not immediate feedback)
Exploration vs Exploitation
Restaurant Dilemma:
├── Exploitation: Go to your favorite restaurant (known good)
├── Exploration: Try a new restaurant (might be better... or worse)
└── Balance: Mostly exploit, sometimes explore
Epsilon-Greedy:
With probability ε: Random action (explore)
With probability 1-ε: Best known action (exploit)
ε starts high (1.0) → lots of exploration
ε decays over time → more exploitation
ε minimum (0.01) → always a little exploration
Cross-References
- Neural networks: ../deep-learning-core/SKILL.md - NN architecture for policy/value nets
- PyTorch: ../pytorch-mastery/SKILL.md - Training loops, GPU optimization
- MLOps: ../mlops-experiment/SKILL.md - Tracking RL experiments
- ML fundamentals: ../ml-fundamentals/SKILL.md - Evaluation metrics