levy-n

reinforcement-learning

Reinforcement learning fundamentals and practical implementations. Covers RL concepts (agent, environment, reward, policy), Q-Learning, Deep Q-Network (DQN), Policy Gradient methods, PPO, Actor-Critic, Gymnasium environments, Stable-Baselines3, reward shaping, and exploration-exploitation trade-off. Use when user asks about 'reinforcement learning', 'RL', 'Q-learning', 'DQN', 'PPO', 'policy gradient', 'reward function', 'agent', 'environment', 'Gym', 'Gymnasium', 'exploration', 'exploitation', 'Stable-Baselines3', 'Actor-Critic', 'SARSA', 'Bellman equation', or 'Markov decision process'.

levy-n 10 1 Updated 4mo ago

Install

npx skillscat add levy-n/claude-useful-skills/reinforcement-learning

Install via the SkillsCat registry.

SKILL.md

Reinforcement Learning - Learning from Interaction

Reinforcement learning: the agent learns by trial and error.

Quick Start - Q-Learning (No Libraries)

import numpy as np

# Simple GridWorld
# Agent starts at (0,0), goal at (3,3), in 4x4 grid
grid_size = 4
n_states = grid_size * grid_size
n_actions = 4  # up, down, left, right

# Q-table: state × action → expected discounted return
Q = np.zeros((n_states, n_actions))

# Hyperparameters
learning_rate = 0.1
discount_factor = 0.99  # gamma
epsilon = 1.0           # exploration rate
epsilon_decay = 0.995
epsilon_min = 0.01
episodes = 1000

def state_to_pos(state):
    return state // grid_size, state % grid_size

def pos_to_state(row, col):
    return row * grid_size + col

def step(state, action):
    """Take action, return (next_state, reward, done)."""
    row, col = state_to_pos(state)
    if action == 0: row = max(0, row - 1)          # up
    elif action == 1: row = min(grid_size-1, row + 1)  # down
    elif action == 2: col = max(0, col - 1)          # left
    elif action == 3: col = min(grid_size-1, col + 1)  # right

    next_state = pos_to_state(row, col)
    done = (next_state == n_states - 1)  # Goal at last cell
    reward = 1.0 if done else -0.01      # Small penalty per step
    return next_state, reward, done

# Training loop
for episode in range(episodes):
    state = 0  # Start at (0,0)
    total_reward = 0

    for t in range(100):  # Max steps per episode
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = np.random.randint(n_actions)     # Explore
        else:
            action = np.argmax(Q[state])               # Exploit

        next_state, reward, done = step(state, action)
        total_reward += reward

        # Q-Learning update (off-policy)
        # Q(s,a) ← Q(s,a) + α[r + γ·max_a'Q(s',a') - Q(s,a)]
        best_next = np.max(Q[next_state])
        Q[state, action] += learning_rate * (
            reward + discount_factor * best_next - Q[state, action]
        )

        state = next_state
        if done:
            break

    # Decay exploration
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    if episode % 100 == 0:
        print(f"Episode {episode}, Reward: {total_reward:.2f}, "
              f"Epsilon: {epsilon:.3f}")

print("Learned Q-table:")
print(Q.reshape(grid_size, grid_size, n_actions).round(2))
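A learned Q-table is easier to judge by following its greedy actions than by reading the raw numbers. The sketch below assumes the same state numbering and action order as the Quick Start (0=up, 1=down, 2=left, 3=right); `greedy_path` and `demo_q` are illustrative names, and `demo_q` is a hand-made table rather than a trained one.

```python
def greedy_path(Q, grid_size=4, start=0, max_steps=50):
    """Follow argmax actions from `start`; return the list of visited states."""
    goal = grid_size * grid_size - 1
    moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
    state, path = start, [start]
    for _ in range(max_steps):
        if state == goal:
            break
        row, col = divmod(state, grid_size)
        # First maximal action wins ties, matching np.argmax
        action = max(range(4), key=lambda a: Q[state][a])
        d_row, d_col = moves[action]
        row = min(grid_size - 1, max(0, row + d_row))
        col = min(grid_size - 1, max(0, col + d_col))
        state = row * grid_size + col
        path.append(state)
    return path

# Hand-made table: prefer "down" until the last row, then prefer "right"
demo_q = [[0, 1, 0, 0] if s < 12 else [0, 0, 0, 1] for s in range(16)]
print(greedy_path(demo_q))
```

Passing the trained `Q` array from the Quick Start works the same way. A path that stops short of state 15 suggests the table has not converged yet.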

When This Skill Activates

Use this skill when:

  • Learning RL concepts (agent, environment, reward)
  • Implementing Q-Learning or DQN
  • Training agents with PPO or A2C
  • Working with Gymnasium environments
  • Designing reward functions
  • Understanding exploration vs exploitation

Core Patterns

Pattern 1: RL Framework

RL Components:
┌─────────────────────────────────────────────────────────┐
│                                                         │
│    ┌───────────┐   action a    ┌──────────────┐        │
│    │           │──────────────▶│              │        │
│    │   Agent   │               │ Environment  │        │
│    │  (Policy) │◀──────────────│   (World)    │        │
│    │           │  state s'     │              │        │
│    │           │  reward r     │              │        │
│    └───────────┘               └──────────────┘        │
│                                                         │
│    Agent's Goal: Maximize cumulative reward              │
│    G = r₁ + γr₂ + γ²r₃ + ... (discounted return)      │
│                                                         │
│    Policy π(a|s): probability of action a in state s    │
│    Value V(s): expected return from state s             │
│    Q-Value Q(s,a): expected return from (state, action) │
└─────────────────────────────────────────────────────────┘
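The discounted return G from the diagram can be computed by folding the rewards from the end of the episode backwards. This is a minimal sketch; `discounted_return` is an illustrative helper, not part of any library.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r1 + gamma*r2 + gamma^2*r3 + ... for one episode."""
    g = 0.0
    # Work backwards so each step is a single multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25
```

With gamma=1.0 this is the plain sum of rewards; with gamma=0.0 only the first reward counts.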

Pattern 2: DQN (Deep Q-Network) with PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random
import gymnasium as gym

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.network(x)

class ReplayBuffer:
    """Experience replay for stable training."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones))

    def __len__(self):
        return len(self.buffer)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3):
        self.action_dim = action_dim
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.gamma = 0.99
        self.batch_size = 64

        self.q_network = DQN(state_dim, action_dim)
        self.target_network = DQN(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())

        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        self.buffer = ReplayBuffer()

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)

        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.q_network(state_t)
            return q_values.argmax().item()

    def train_step(self):
        if len(self.buffer) < self.batch_size:
            return

        states, actions, rewards, next_states, dones = \
            self.buffer.sample(self.batch_size)

        states_t = torch.FloatTensor(states)
        actions_t = torch.LongTensor(actions)
        rewards_t = torch.FloatTensor(rewards)
        next_states_t = torch.FloatTensor(next_states)
        dones_t = torch.FloatTensor(dones)

        # Current Q-values
        current_q = self.q_network(states_t).gather(1, actions_t.unsqueeze(1))

        # Target Q-values (from target network)
        with torch.no_grad():
            next_q = self.target_network(next_states_t).max(1)[0]
            target_q = rewards_t + self.gamma * next_q * (1 - dones_t)

        # Loss and update
        loss = nn.MSELoss()(current_q.squeeze(), target_q)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

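        # Decay exploration once per gradient step; with decay 0.995 this
        # reaches epsilon_min after roughly 900 updates
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)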

    def update_target(self):
        """Copy Q-network weights to target network."""
        self.target_network.load_state_dict(self.q_network.state_dict())

# Training
env = gym.make("CartPole-v1")
agent = DQNAgent(state_dim=4, action_dim=2)

for episode in range(500):
    state, _ = env.reset()
    total_reward = 0

    for t in range(500):
        action = agent.select_action(state)
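        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Store `terminated`, not `done`: a time-limit truncation is not a
        # terminal state, so the target should still bootstrap from next_state
        agent.buffer.push(state, action, reward, next_state, terminated)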
        agent.train_step()

        state = next_state
        total_reward += reward

        if done:
            break

    # Update target network periodically
    if episode % 10 == 0:
        agent.update_target()

    if episode % 50 == 0:
        print(f"Episode {episode}, Reward: {total_reward:.0f}, "
              f"Epsilon: {agent.epsilon:.3f}")
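The target computed in `train_step` can be checked by hand on a single transition. The sketch below is a scalar, pure-Python version of that target; `td_target` is an illustrative name. It assumes the bootstrap term is dropped only on true termination, not on a time-limit truncation.

```python
def td_target(reward, next_q_values, terminated, gamma=0.99):
    """One-step DQN target: r + gamma * max_a' Q_target(s', a')."""
    if terminated:
        # No future return after a terminal state
        return reward
    return reward + gamma * max(next_q_values)

print(td_target(1.0, [0.5, 2.0], terminated=False))
print(td_target(1.0, [0.5, 2.0], terminated=True))
```

In the batched version, `next_q_values` come from the target network and the `if` becomes the `(1 - dones_t)` mask.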

Pattern 3: Stable-Baselines3 (High-Level RL)

from stable_baselines3 import PPO, DQN, A2C, SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback
import gymnasium as gym

# Create vectorized environment (parallel)
env = make_vec_env("CartPole-v1", n_envs=4)

# Train with PPO (most versatile algorithm)
model = PPO(
    "MlpPolicy",                    # Neural network policy
    env,
    learning_rate=3e-4,
    n_steps=2048,                   # Steps per environment per update
    batch_size=64,
    n_epochs=10,                    # Optimization epochs per update
    gamma=0.99,                     # Discount factor
    verbose=1
)

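# Evaluation callback (saves best model)
# Monitor records episode returns so evaluation statistics are reliable.
# eval_freq is counted in callback calls, each covering n_envs=4 steps,
# so 5000 here means an evaluation every 20,000 total timesteps.
from stable_baselines3.common.monitor import Monitor

eval_callback = EvalCallback(
    eval_env=Monitor(gym.make("CartPole-v1")),
    eval_freq=5000,
    best_model_save_path="./best_model/",
    deterministic=True
)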

# Train
model.learn(total_timesteps=100_000, callback=eval_callback)

# Test the trained agent
env = gym.make("CartPole-v1", render_mode="human")
obs, _ = env.reset()

for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        obs, _ = env.reset()

env.close()

# Save / Load
model.save("ppo_cartpole")
loaded_model = PPO.load("ppo_cartpole")

Pattern 4: Algorithm Selection Guide

RL Algorithm Decision Tree:
┌──────────────────────────────────────────────────────┐
│                                                      │
│  Is the action space DISCRETE?                       │
│  ├── YES                                             │
│  │   ├── Simple problem? → Q-Learning (tabular)     │
│  │   ├── Complex states? → DQN                       │
│  │   └── Want stability? → PPO (recommended)         │
│  │                                                   │
│  └── NO (continuous actions)                         │
│      ├── Want simplicity? → PPO (works for both!)   │
│      ├── Want sample efficiency? → SAC               │
│      └── Want deterministic? → TD3                   │
│                                                      │
│  General Recommendation:                             │
│  Start with PPO → it works well almost everywhere    │
└──────────────────────────────────────────────────────┘

| Algorithm  | Action Space | Sample Efficiency | Stability | Use Case                              |
|------------|--------------|-------------------|-----------|---------------------------------------|
| Q-Learning | Discrete     | N/A (tabular)     | Good      | Simple, small state space             |
| DQN        | Discrete     | Low               | Medium    | Atari games, discrete control         |
| PPO        | Both         | Medium            | High      | General purpose, recommended default  |
| A2C        | Both         | Low               | Medium    | Simpler alternative to PPO            |
| SAC        | Continuous   | High              | High      | Robotics, continuous control          |
| TD3        | Continuous   | High              | High      | Deterministic continuous control      |
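
The decision tree above can also be written as a small lookup, which is handy when a script has to pick a default algorithm. This is only a sketch: `suggest_algorithm`, `RECOMMENDATIONS`, and the priority labels are hypothetical names that mirror the tree's branches.

```python
# Hypothetical lookup table mirroring the decision tree's branches
RECOMMENDATIONS = {
    ("discrete", "simple"): "Q-Learning",
    ("discrete", "complex_states"): "DQN",
    ("discrete", "stability"): "PPO",
    ("continuous", "simplicity"): "PPO",
    ("continuous", "sample_efficiency"): "SAC",
    ("continuous", "deterministic"): "TD3",
}

def suggest_algorithm(action_space, priority=None):
    """Return the tree's suggestion, falling back to PPO as the general default."""
    if action_space not in ("discrete", "continuous"):
        raise ValueError("action_space must be 'discrete' or 'continuous'")
    return RECOMMENDATIONS.get((action_space, priority), "PPO")

print(suggest_algorithm("continuous", "sample_efficiency"))
print(suggest_algorithm("discrete"))
```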

Pattern 5: Custom Gymnasium Environment

import gymnasium as gym
from gymnasium import spaces
import numpy as np

class TradingEnv(gym.Env):
    """Custom environment for stock trading."""

    metadata = {"render_modes": ["human"]}

    def __init__(self, prices, initial_balance=10000):
        super().__init__()
        self.prices = prices
        self.initial_balance = initial_balance

        # Action: 0=hold, 1=buy, 2=sell
        self.action_space = spaces.Discrete(3)

        # Observation: [balance, shares_held, current_price, price_change]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32
        )

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.balance = self.initial_balance
        self.shares = 0
        self.current_step = 0
        return self._get_obs(), {}

    def _get_obs(self):
        price = self.prices[self.current_step]
        prev_price = self.prices[max(0, self.current_step - 1)]
        change = (price - prev_price) / prev_price if prev_price > 0 else 0
        return np.array([
            self.balance, self.shares, price, change
        ], dtype=np.float32)

    def step(self, action):
        price = self.prices[self.current_step]
        # Portfolio value before acting (trades execute at `price`, no fees)
        prev_value = self.balance + self.shares * price

        # Execute action
        if action == 1 and self.balance >= price:    # Buy
            self.shares += 1
            self.balance -= price
        elif action == 2 and self.shares > 0:         # Sell
            self.shares -= 1
            self.balance += price

        self.current_step += 1
        done = self.current_step >= len(self.prices) - 1

        # Reward: change in portfolio value at the new price,
        # scaled by the initial balance
        new_price = self.prices[self.current_step]
        portfolio = self.balance + self.shares * new_price
        reward = (portfolio - prev_value) / self.initial_balance

        return self._get_obs(), reward, done, False, {}

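# Use with Stable-Baselines3
from stable_baselines3 import PPO

# Placeholder prices (synthetic geometric random walk); replace with real data
rng = np.random.default_rng(0)
stock_prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=500)))

env = TradingEnv(prices=stock_prices)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)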

Pattern 6: Policy Gradient (REINFORCE)

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import gymnasium as gym

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)

def reinforce(env_name="CartPole-v1", episodes=1000, gamma=0.99, lr=1e-3):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    policy = PolicyNetwork(state_dim, action_dim)
    optimizer = optim.Adam(policy.parameters(), lr=lr)

    for episode in range(episodes):
        state, _ = env.reset()
        log_probs = []
        rewards = []

        # Collect trajectory
        done = False
        while not done:
            state_t = torch.FloatTensor(state)
            probs = policy(state_t)
            dist = Categorical(probs)
            action = dist.sample()

            log_probs.append(dist.log_prob(action))
            state, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated

        # Compute discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)

        returns = torch.FloatTensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        # Policy gradient update
        # ∇J(θ) = E[∑ ∇log π(a|s) · G_t]
        loss = -torch.stack(log_probs) @ returns
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if episode % 50 == 0:
            print(f"Episode {episode}, Total Reward: {sum(rewards):.0f}")

reinforce()

Reference Navigation

For detailed content, see:

  • RL Fundamentals: reference/rl_fundamentals.md - MDP, Bellman equation, value/policy
  • Q-Learning & DQN: reference/q_learning_dqn.md - Tabular Q, DQN tricks, Double DQN
  • Policy Gradient: reference/policy_gradient.md - REINFORCE, Actor-Critic, PPO math
  • Gym Environments: reference/gym_environments.md - Custom envs, wrappers, benchmarks

Common Mistakes to Avoid

1. No Experience Replay in DQN

# WRONG: Train on single transitions (correlated, unstable)
loss = compute_loss(state, action, reward, next_state)

# CORRECT: Sample random batch from replay buffer
batch = buffer.sample(batch_size=64)  # Breaks correlation!
loss = compute_batch_loss(batch)

2. No Target Network

# WRONG: Q-network chases its own tail
target = reward + gamma * q_network(next_state).max()  # Moving target!

# CORRECT: Stable target from frozen network
target = reward + gamma * target_network(next_state).max()
# Update target_network periodically (every N episodes)

3. Reward Shaping Gone Wrong

# WRONG: Ad-hoc dense reward that can change which policy is optimal
reward = -distance_to_goal  # Agent may exploit the shaping term instead of solving the task

# BETTER: Sparse and well-defined
reward = 1.0 if reached_goal else 0.0
# Or: potential-based shaping, r + γ·Φ(s') - Φ(s), which preserves the optimal policy
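
One well-known way to shape rewards without changing the optimal policy is potential-based shaping, which adds γ·Φ(s') - Φ(s) to the environment reward. The sketch below assumes a grid world with positions as (row, col) tuples; `GOAL`, `manhattan_potential`, and `shaped_reward` are illustrative names, and the potential of a terminal state is taken to be zero.

```python
GOAL = (3, 3)  # Hypothetical goal cell for the example

def manhattan_potential(pos):
    """Potential function: higher (less negative) when closer to the goal."""
    return -(abs(GOAL[0] - pos[0]) + abs(GOAL[1] - pos[1]))

def shaped_reward(reward, state, next_state, potential,
                  gamma=0.99, terminated=False):
    """Environment reward plus the shaping term gamma*phi(s') - phi(s)."""
    next_phi = 0.0 if terminated else potential(next_state)
    return reward + gamma * next_phi - potential(state)

# One step towards the goal, then one step back (gamma=1 for easy arithmetic)
forward = shaped_reward(0.0, (0, 0), (0, 1), manhattan_potential, gamma=1.0)
backward = shaped_reward(0.0, (0, 1), (0, 0), manhattan_potential, gamma=1.0)
print(forward, backward, forward + backward)
```

The bonus for stepping forward is cancelled by stepping back, so the agent cannot collect shaping reward by walking in circles.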

Teaching Mode

RL vs Supervised Learning

Supervised Learning:          Reinforcement Learning:
┌──────────┐                 ┌──────────┐
│ Dataset  │ = Fixed         │ Agent    │ = Learns by doing
│ (X, y)   │   answers       │          │
└────┬─────┘                 └────┬─────┘
     │                            │
     ▼                            ▼
  Learn:                      Explore:
  "input → output"            "try action → see result"
                              "good result → do more"
                              "bad result → do less"

Key difference:
- Supervised: Teacher gives correct answers
- RL: Agent discovers good actions through trial and error
- RL: Reward may be delayed (not immediate feedback)

Exploration vs Exploitation

Restaurant Dilemma:
├── Exploitation: Go to your favorite restaurant (known good)
├── Exploration: Try a new restaurant (might be better... or worse)
└── Balance: Mostly exploit, sometimes explore

Epsilon-Greedy:
  With probability ε:   Random action  (explore)
  With probability 1-ε: Best known action (exploit)

  ε starts high (1.0) → lots of exploration
  ε decays over time   → more exploitation
  ε minimum (0.01)     → always a little exploration
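
The schedule above can be sketched in a few lines to see how quickly exploration fades. The defaults match the Quick Start (start 1.0, decay 0.995 per episode, minimum 0.01); the function names are illustrative.

```python
import math
import random

def epsilon_at(episode, start=1.0, decay=0.995, minimum=0.01):
    """Exploration rate used during a given episode (0-indexed)."""
    return max(minimum, start * decay ** episode)

def episodes_until_minimum(start=1.0, decay=0.995, minimum=0.01):
    """First episode at which the decayed rate hits the floor."""
    return math.ceil(math.log(minimum / start) / math.log(decay))

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the best known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit

print(epsilon_at(0), round(epsilon_at(100), 3), episodes_until_minimum())
```

With these defaults the floor is reached after about 919 episodes, so in a 1000-episode run the agent is still exploring noticeably for most of training.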

Cross-References

  • Neural networks: ../deep-learning-core/SKILL.md - NN architecture for policy/value nets
  • PyTorch: ../pytorch-mastery/SKILL.md - Training loops, GPU optimization
  • MLOps: ../mlops-experiment/SKILL.md - Tracking RL experiments
  • ML fundamentals: ../ml-fundamentals/SKILL.md - Evaluation metrics