Generative AI models: GANs, VAEs, Diffusion Models, and image generation. Covers GAN architecture (Generator/Discriminator), DCGAN, Wasserstein GAN, Variational Autoencoders, latent space interpolation, Diffusion models (DDPM), Stable Diffusion, conditional generation, and text-to-image. Use when user asks about 'GAN', 'generative adversarial', 'VAE', 'variational autoencoder', 'diffusion model', 'image generation', 'Stable Diffusion', 'DCGAN', 'Wasserstein', 'WGAN', 'latent space', 'generate images', 'text-to-image', 'DDPM', 'denoising diffusion', 'style transfer', or 'deepfake'.
Install via the SkillsCat registry:
npx skillscat add levy-n/claude-useful-skills/generative-models
Generative Models - Creating New Data
Generative models: creating new data that looks real.
Quick Start - Simple GAN in PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# Hyperparameters
latent_dim = 100
image_dim = 28 * 28 # MNIST
lr = 0.0002
# Generator: noise → image
generator = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 512),
    nn.LeakyReLU(0.2),
    nn.Linear(512, image_dim),
    nn.Tanh()  # Output [-1, 1]
)
# Discriminator: image → real/fake
discriminator = nn.Sequential(
    nn.Linear(image_dim, 512),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Linear(256, 1),
    nn.Sigmoid()  # Output [0, 1]
)
# Optimizers (Adam with beta1=0.5 is standard for GANs)
opt_g = optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
opt_d = optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
criterion = nn.BCELoss()
# Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
loader = DataLoader(dataset, batch_size=64, shuffle=True)
# Training
for epoch in range(50):
    for real_images, _ in loader:
        batch_size = real_images.size(0)
        real_images = real_images.view(batch_size, -1)
        real_labels = torch.ones(batch_size, 1)
        fake_labels = torch.zeros(batch_size, 1)
        # Train Discriminator (fake images are detached so no gradient reaches G)
        noise = torch.randn(batch_size, latent_dim)
        fake_images = generator(noise).detach()
        d_real = discriminator(real_images)
        d_fake = discriminator(fake_images)
        d_loss = criterion(d_real, real_labels) + criterion(d_fake, fake_labels)
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
        # Train Generator
        noise = torch.randn(batch_size, latent_dim)
        fake_images = generator(noise)
        g_output = discriminator(fake_images)
        g_loss = criterion(g_output, real_labels)  # Fool discriminator
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
    # Losses from the last batch of the epoch
    print(f"Epoch {epoch}: D_loss={d_loss.item():.4f}, G_loss={g_loss.item():.4f}")
When This Skill Activates
Use this skill when:
- Building GAN or VAE architectures
- Generating synthetic images or data
- Understanding latent spaces
- Working with diffusion models
- Implementing conditional generation
- Debugging GAN training instability
Core Patterns
Pattern 1: GAN Architecture
GAN Training (Adversarial Game):
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  ┌───────────┐   fake    ┌───────────────┐              │
│  │           │──────────▶│               │              │
│  │ Generator │           │ Discriminator │──▶ Real?     │
│  │   G(z)    │ ┌────────▶│     D(x)      │    Fake?     │
│  └─────┬─────┘ │         └───────────────┘              │
│        │       │                                        │
│     noise z  real data                                  │
│                                                         │
│  Generator goal: Fool discriminator (make D(G(z)) → 1)  │
│  Discriminator goal: Tell real from fake                │
│                                                         │
│  When balanced: Generator creates realistic data        │
└─────────────────────────────────────────────────────────┘
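The two goals in the diagram correspond to the standard GAN minimax objective. As a reference sketch (p_data and p_z are notation introduced here for the data and noise distributions):

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The Quick Start loop does not minimize log(1 - D(G(z))) directly for the generator. By scoring fake images against real_labels with BCE, it minimizes -log D(G(z)), the commonly used non-saturating variant.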
Training Steps (each batch):
1. Train D on real data (label=1) + fake data (label=0)
2. Train G to make D output 1 for fake data
3. Repeat!
Pattern 2: DCGAN (Convolutional GAN)
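DCGAN upsamples with transposed convolutions, and the spatial size after each layer follows from kernel size, stride, and padding. The sketch below uses a hypothetical helper (conv_transpose2d_out is not a PyTorch function) that applies the output-size formula for nn.ConvTranspose2d, assuming dilation=1 and output_padding=0, to the five (kernel, stride, padding) settings used by the generator in this pattern:

```python
def conv_transpose2d_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output height/width of a transposed convolution.

    Assumes dilation=1 and output_padding=0 (hypothetical helper for illustration).
    """
    return (size - 1) * stride - 2 * padding + kernel


# (kernel, stride, padding) for the five ConvTranspose2d layers of the generator
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

sizes = [1]  # the latent vector is reshaped to a 1x1 spatial map
for kernel, stride, padding in layers:
    sizes.append(conv_transpose2d_out(sizes[-1], kernel, stride, padding))

print(sizes)  # [1, 4, 8, 16, 32, 64]
```

The first layer expands the 1x1 latent map to 4x4, and each stride-2 layer then doubles the resolution, which gives the 4 → 8 → 16 → 32 → 64 progression noted in the generator's shape comments.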
import torch.nn as nn
class DCGenerator(nn.Module):
    """Generates 64x64 images from noise vector."""
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # latent_dim → 512 × 4 × 4
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(True),
            # 512 × 4 × 4 → 256 × 8 × 8
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(True),
            # 256 × 8 × 8 → 128 × 16 × 16
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(True),
            # 128 × 16 × 16 → 64 × 32 × 32
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            # 64 × 32 × 32 → channels × 64 × 64
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh()
        )
    def forward(self, z):
        # Reshape (batch, latent_dim) noise to (batch, latent_dim, 1, 1)
        return self.main(z.view(z.size(0), -1, 1, 1))
class DCDiscriminator(nn.Module):
    """Classifies 64x64 images as real/fake."""
    def __init__(self, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # channels × 64 × 64 → 64 × 32 × 32
            nn.Conv2d(channels, 64, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 64 × 32 × 32 → 128 × 16 × 16
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # 128 × 16 × 16 → 256 × 8 × 8
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # 256 × 8 × 8 → 512 × 4 × 4
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            # 512 × 4 × 4 → 1
            nn.Conv2d(512, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )
    def forward(self, x):
        return self.main(x).view(-1, 1)
Pattern 3: Variational Autoencoder (VAE)
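The VAE in this pattern is trained by minimizing a reconstruction term plus a KL term. For a diagonal Gaussian encoder and a standard normal prior, the KL term has a closed form. A brief statement of the two formulas the code relies on (d is the latent dimension, notation introduced here; the encoder outputs μ and log σ², named mu and log_var in the code):

```latex
\begin{aligned}
z &= \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  &= -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
\end{aligned}
```

Because the network predicts log σ² rather than σ, the standard deviation is recovered as exp(0.5 · log_var). The KL term is zero exactly when μ = 0 and σ = 1 in every dimension.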
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=20):
        super().__init__()
        # Encoder: input → hidden → (mu, log_var)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_var = nn.Linear(hidden_dim, latent_dim)  # Predicts log(σ²)
        # Decoder: latent → hidden → reconstruction
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )
    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)
    def reparameterize(self, mu, log_var):
        """Reparameterization trick: z = mu + sigma * epsilon."""
        std = torch.exp(0.5 * log_var)
        epsilon = torch.randn_like(std)
        return mu + std * epsilon
    def decode(self, z):
        return self.decoder(z)
    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        reconstruction = self.decode(z)
        return reconstruction, mu, log_var
def vae_loss(reconstruction, original, mu, log_var):
    """VAE loss = Reconstruction + KL Divergence."""
    # Reconstruction loss (how well we rebuild the input)
    recon_loss = F.binary_cross_entropy(reconstruction, original, reduction='sum')
    # KL divergence (keep latent space close to N(0,1))
    # KL(q(z|x) || p(z)) = -0.5 * sum(1 + log(σ²) - μ² - σ²)
    kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_loss
# Data (assumption: MNIST as in the Quick Start, kept in [0, 1] without Normalize
# so the pixels match the Sigmoid decoder output and the BCE loss)
train_loader = DataLoader(
    datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True),
    batch_size=64, shuffle=True
)
# Training
vae = VAE()
optimizer = optim.Adam(vae.parameters(), lr=1e-3)
for epoch in range(50):
    for images, _ in train_loader:
        images = images.view(images.size(0), -1)
        reconstruction, mu, log_var = vae(images)
        loss = vae_loss(reconstruction, images, mu, log_var)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Generate new images
with torch.no_grad():
    z = torch.randn(16, 20)  # Sample from N(0,1)
    generated = vae.decode(z)
    generated = generated.view(16, 1, 28, 28)
    # Interpolate between two latent points
    z1, z2 = torch.randn(1, 20), torch.randn(1, 20)
    alphas = torch.linspace(0, 1, 10)
    interpolations = torch.stack([vae.decode((1-a)*z1 + a*z2) for a in alphas])
VAE Architecture:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Encoder   │    │   Latent    │    │   Decoder   │
│             │───▶│    Space    │───▶│             │
│  input → h  │    │  z~N(μ,σ²)  │    │ z → output  │
└─────────────┘    └─────────────┘    └─────────────┘
               ┌─── μ (mean)
x → encoder ───┤
               └─── σ (std)
                     │
               z = μ + σ × ε   ← Reparameterization trick
                     │            (ε ~ N(0,1))
                     ▼
               decoder(z) → x̂
Loss = Reconstruction(x, x̂) + KL(q(z|x) || N(0,1))
       "rebuild accurately"    "keep latent space organized"
Pattern 4: Diffusion Model (DDPM Simplified)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
class SimpleDiffusion:
    """Denoising Diffusion Probabilistic Model (simplified)."""
    def __init__(self, timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.timesteps = timesteps
        # Noise schedule (linear)
        self.betas = torch.linspace(beta_start, beta_end, timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)
    def add_noise(self, x, t):
        """Forward process: add noise at timestep t."""
        alpha_t = self.alpha_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x)
        # x_t = sqrt(α̅_t) * x_0 + sqrt(1-α̅_t) * ε
        noisy = torch.sqrt(alpha_t) * x + torch.sqrt(1 - alpha_t) * noise
        return noisy, noise
    def sample(self, model, shape, device):
        """Reverse process: denoise from pure noise."""
        x = torch.randn(shape, device=device)
        for t in reversed(range(self.timesteps)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            # Predict noise
            predicted_noise = model(x, t_batch)
            # Denoise one step
            alpha_t = self.alphas[t]
            alpha_cumprod_t = self.alpha_cumprod[t]
            beta_t = self.betas[t]
            x = (1 / torch.sqrt(alpha_t)) * (
                x - (beta_t / torch.sqrt(1 - alpha_cumprod_t)) * predicted_noise
            )
            # Add noise (except at t=0)
            if t > 0:
                noise = torch.randn_like(x)
                x += torch.sqrt(beta_t) * noise
        return x
class NoisePredictor(nn.Module):
    """Simple U-Net-like noise predictor."""
    def __init__(self, channels=1, time_dim=256):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(1, time_dim),
            nn.SiLU(),
            nn.Linear(time_dim, time_dim),
        )
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.SiLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(128 + time_dim, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )
    def forward(self, x, t):
        # Time embedding (timestep scaled to roughly [0, 1])
        t_emb = self.time_mlp(t.float().unsqueeze(-1) / 1000)
        t_emb = t_emb.view(t_emb.size(0), -1, 1, 1).expand(-1, -1, x.size(2), x.size(3))
        # Encode
        h = self.encoder(x)
        # Concat time embedding
        h = torch.cat([h, t_emb], dim=1)
        # Decode (predict noise)
        return self.decoder(h)
# Data (assumption: MNIST scaled to [-1, 1], as in the Quick Start)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_loader = DataLoader(
    datasets.MNIST(root='./data', train=True, transform=transform, download=True),
    batch_size=64, shuffle=True
)
# Training
diffusion = SimpleDiffusion(timesteps=1000)
model = NoisePredictor(channels=1)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(100):
    for images, _ in train_loader:
        # Random timestep
        t = torch.randint(0, diffusion.timesteps, (images.size(0),))
        # Add noise
        noisy_images, noise = diffusion.add_noise(images, t)
        # Predict noise
        predicted_noise = model(noisy_images, t)
        # Loss: how well did we predict the noise?
        loss = F.mse_loss(predicted_noise, noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Generate new images
model.eval()
with torch.no_grad():
    generated = diffusion.sample(model, shape=(16, 1, 28, 28), device='cpu')
Diffusion Process:
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Forward (add noise):                                   │
│  x₀ ──▶ x₁ ──▶ x₂ ──▶ ... ──▶ x_T                       │
│  clean  +noise +noise         pure noise                │
│                                                         │
│  Reverse (denoise):                                     │
│  x_T ──▶ x_{T-1} ──▶ ... ──▶ x₁ ──▶ x₀                  │
│  noise   -predicted  ...     almost clean!              │
│           noise              clean                      │
│                                                         │
│  Model learns: "given noisy image at step t,            │
│                 predict the noise that was added"       │
│                                                         │
│  Key insight: Breaking generation into many small       │
│  denoising steps is easier than generating in one shot  │
└─────────────────────────────────────────────────────────┘
Pattern 5: Using Stable Diffusion / Diffusers
from diffusers import StableDiffusionPipeline
import torch
# Load pre-trained Stable Diffusion
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
# Text-to-Image generation
prompt = "a photo of a cat wearing a spacesuit on Mars, digital art"
negative_prompt = "blurry, low quality, distorted"
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,  # How much to follow the prompt
).images[0]
image.save("cat_astronaut.png")
# Generate multiple images
images = pipe(
    prompt=prompt,
    num_images_per_prompt=4,
    num_inference_steps=30,
).images
# Image-to-Image (modify existing image)
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")
# Assumption: "input.png" is a placeholder path; load any RGB image here
input_image = Image.open("input.png").convert("RGB")
result = img2img(
    prompt="make it look like a watercolor painting",
    image=input_image,
    strength=0.75,  # How much to change (0=no change, 1=full regeneration)
).images[0]
Pattern 6: Generative Model Comparison
Model Comparison:
┌────────────┬─────────────────┬──────────────┬──────────────┐
│            │ GAN             │ VAE          │ Diffusion    │
├────────────┼─────────────────┼──────────────┼──────────────┤
│ Quality    │ Sharp, realistic│ Blurry       │ Best quality │
│ Training   │ Unstable        │ Stable       │ Stable       │
│ Speed      │ Fast generation │ Fast         │ Slow (many   │
│            │                 │              │ denoising    │
│            │                 │              │ steps)       │
│ Diversity  │ Mode collapse   │ Good         │ Excellent    │
│ Latent     │ Hard to control │ Smooth,      │ Conditional  │
│ space      │                 │ interpolate  │ generation   │
│ Use case   │ Image synthesis │ Compression, │ Text-to-image│
│            │                 │ generation   │ editing      │
└────────────┴─────────────────┴──────────────┴──────────────┘
Reference Navigation
For detailed content, see:
- GAN Fundamentals: reference/gan_fundamentals.md - Training tricks, WGAN, mode collapse
- VAE Architecture: reference/vae_architecture.md - ELBO, reparameterization, variants
- Diffusion Models: reference/diffusion_models.md - DDPM, score matching, schedulers
- Image Generation: reference/image_generation.md - Stable Diffusion, ControlNet, practical use
Common Mistakes to Avoid
1. GAN Mode Collapse
# SYMPTOM: Generator produces same image regardless of noise
# CAUSES: Discriminator too strong, LR imbalance
# FIXES:
# 1. Use Wasserstein loss instead of BCE
# 2. Add noise to discriminator inputs
# 3. Use spectral normalization
# 4. Rebalance the D:G update ratio (fewer D steps if D dominates, more if G does)
# 5. Use label smoothing
real_labels = torch.empty(batch_size, 1).uniform_(0.8, 1.0)  # Soft targets in [0.8, 1.0) instead of a hard 1.0
2. VAE Posterior Collapse
# SYMPTOM: KL loss goes to 0, model ignores latent space
# FIX: KL annealing (gradually increase KL weight)
kl_weight = min(1.0, epoch / 10) # Linear warmup over 10 epochs
loss = recon_loss + kl_weight * kl_loss
3. Wrong Normalization for GAN
# WRONG: Tanh output but data in [0, 1]
# Generator outputs [-1, 1] but images are [0, 1]
# CORRECT: Normalize data to match generator output
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Now data is [-1, 1]
])
Teaching Mode
GAN Analogy
GAN = Counterfeiter vs Detective
Counterfeiter (Generator):
Makes fake money, tries to fool the detective.
Detective (Discriminator):
Examines money, tries to catch fakes.
Over time:
- Counterfeiter gets better at faking
- Detective gets better at detecting
- Eventually: fakes indistinguishable from real!
This is a "minimax game" from game theory.
VAE vs AE
Autoencoder (AE):
encoder: image → single point in latent space
Problem: Gaps between points → random points = garbage
VAE:
encoder: image → distribution (mean + variance)
Advantage: Continuous, smooth latent space
Random points → meaningful images!
┌─ AE latent space ──┐    ┌─ VAE latent space ──┐
│  ·          ·      │    │ ∿∿∿∿∿∿∿∿∿∿∿∿        │
│      ·   ·         │    │ ∿ smooth everywhere │
│   gap!        ·    │    │ ∿ any point works   │
│  ·       ·         │    │ ∿∿∿∿∿∿∿∿∿∿∿∿        │
└────────────────────┘    └─────────────────────┘
Cross-References
- Autoencoders basics: ../deep-learning-core/SKILL.md - AE, Denoising AE, latent space
- CNNs for images: ../cnn-vision/SKILL.md - ConvTranspose2d, image processing
- PyTorch patterns: ../pytorch-mastery/SKILL.md - Training loops, GPU optimization
- Fine-tuning Diffusion: ../fine-tuning-peft/SKILL.md - LoRA for Stable Diffusion