levy-n

generative-models

Generative AI models: GANs, VAEs, Diffusion Models, and image generation. Covers GAN architecture (Generator/Discriminator), DCGAN, Wasserstein GAN, Variational Autoencoders, latent space interpolation, Diffusion models (DDPM), Stable Diffusion, conditional generation, and text-to-image. Use when user asks about 'GAN', 'generative adversarial', 'VAE', 'variational autoencoder', 'diffusion model', 'image generation', 'Stable Diffusion', 'DCGAN', 'Wasserstein', 'WGAN', 'latent space', 'generate images', 'text-to-image', 'DDPM', 'denoising diffusion', 'style transfer', or 'deepfake'.



Install

npx skillscat add levy-n/claude-useful-skills/generative-models

Install via the SkillsCat registry.

SKILL.md

Generative Models - Creating New Data

Generative models: creating new data that looks real.

Quick Start - Simple GAN in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hyperparameters
latent_dim = 100
image_dim = 28 * 28  # MNIST
lr = 0.0002

# Generator: noise → image
generator = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 512),
    nn.LeakyReLU(0.2),
    nn.Linear(512, image_dim),
    nn.Tanh()  # Output [-1, 1]
)

# Discriminator: image → real/fake
discriminator = nn.Sequential(
    nn.Linear(image_dim, 512),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Linear(256, 1),
    nn.Sigmoid()  # Output [0, 1]
)

# Optimizers (Adam with beta1=0.5 is standard for GANs)
opt_g = optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
opt_d = optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
criterion = nn.BCELoss()

# Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Training
for epoch in range(50):
    for real_images, _ in loader:
        batch_size = real_images.size(0)
        real_images = real_images.view(batch_size, -1)

        real_labels = torch.ones(batch_size, 1)
        fake_labels = torch.zeros(batch_size, 1)

        # Train Discriminator
        noise = torch.randn(batch_size, latent_dim)
        fake_images = generator(noise).detach()

        d_real = discriminator(real_images)
        d_fake = discriminator(fake_images)
        d_loss = criterion(d_real, real_labels) + criterion(d_fake, fake_labels)

        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Train Generator
        noise = torch.randn(batch_size, latent_dim)
        fake_images = generator(noise)
        g_output = discriminator(fake_images)
        g_loss = criterion(g_output, real_labels)  # Fool discriminator

        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()

    print(f"Epoch {epoch}: D_loss={d_loss.item():.4f}, G_loss={g_loss.item():.4f}")
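
Once training finishes, sampling is a single forward pass through the generator. The sketch below rebuilds the same generator layout (untrained here, so the outputs are just noise) to show the tensor shapes and the rescaling from Tanh's [-1, 1] range back to [0, 1] for viewing. `sample_images` is a hypothetical helper name, not part of the original skill.

```python
import torch
import torch.nn as nn

latent_dim = 100
image_dim = 28 * 28

# Same layout as the Quick Start generator (weights are untrained in this sketch)
generator = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 512),
    nn.LeakyReLU(0.2),
    nn.Linear(512, image_dim),
    nn.Tanh(),
)

def sample_images(gen, n, latent_dim):
    """Hypothetical helper: draw n samples and map them to [0, 1] image tensors."""
    gen.eval()
    with torch.no_grad():
        z = torch.randn(n, latent_dim)
        flat = gen(z)                    # (n, 784), values in [-1, 1]
    images = flat.view(n, 1, 28, 28)     # back to image layout
    return (images + 1) / 2              # undo Normalize((0.5,), (0.5,))

samples = sample_images(generator, 16, latent_dim)
print(samples.shape)
```

In a real run you would call the helper on the trained `generator` from the loop above instead of a fresh one.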

When This Skill Activates

Use this skill when:

  • Building GAN or VAE architectures
  • Generating synthetic images or data
  • Understanding latent spaces
  • Working with diffusion models
  • Implementing conditional generation
  • Debugging GAN training instability

Core Patterns

Pattern 1: GAN Architecture

GAN Training (Adversarial Game):
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  ┌───────────┐  fake      ┌───────────────┐            │
│  │           │──────────▶│               │            │
│  │ Generator │            │ Discriminator │──▶ Real?   │
│  │  G(z)     │  ┌────────▶│   D(x)        │   Fake?   │
│  └─────┬─────┘  │         └───────────────┘            │
│        │        │                                       │
│    noise z   real data                                  │
│                                                         │
│  Generator goal: Fool discriminator (make D(G(z)) → 1) │
│  Discriminator goal: Tell real from fake                │
│                                                         │
│  When balanced: Generator creates realistic data        │
└─────────────────────────────────────────────────────────┘

Training Steps (each batch):
  1. Train D on real data (label=1) + fake data (label=0)
  2. Train G to make D output 1 for fake data
  3. Repeat!
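
The training steps above correspond to the standard GAN minimax objective. Written out (the symbols p_data for the data distribution and p_z for the noise prior are notation introduced here, not taken from the skill), it looks roughly like this. Note that step 2, "make D output 1 for fake data", is the commonly used non-saturating generator loss rather than a direct minimization of log(1 - D(G(z))).

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% Step 1 (train D): maximize V, i.e. minimize BCE on real (label 1) and fake (label 0)
\mathcal{L}_D = -\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
                - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% Step 2 (train G): non-saturating form, BCE of D(G(z)) against label 1
\mathcal{L}_G = -\,\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]
```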

Pattern 2: DCGAN (Convolutional GAN)

import torch.nn as nn

class DCGenerator(nn.Module):
    """Generates 64x64 images from noise vector."""
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # latent_dim → 512 × 4 × 4
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(True),
            # 512 × 4 × 4 → 256 × 8 × 8
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(True),
            # 256 × 8 × 8 → 128 × 16 × 16
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(True),
            # 128 × 16 × 16 → 64 × 32 × 32
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            # 64 × 32 × 32 → channels × 64 × 64
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, z):
        return self.main(z.view(z.size(0), -1, 1, 1))


class DCDiscriminator(nn.Module):
    """Classifies 64x64 images as real/fake."""
    def __init__(self, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # channels × 64 × 64 → 64 × 32 × 32
            nn.Conv2d(channels, 64, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 64 × 32 × 32 → 128 × 16 × 16
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # 128 × 16 × 16 → 256 × 8 × 8
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # 256 × 8 × 8 → 512 × 4 × 4
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            # 512 × 4 × 4 → 1
            nn.Conv2d(512, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.main(x).view(-1, 1)
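
The shape comments in both classes can be checked with the output-size formulas for `ConvTranspose2d` and `Conv2d` (assuming dilation=1 and output_padding=0, which is what the layers above use). The helper names below are hypothetical; the (kernel, stride, padding) triples are copied from the two classes.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output side length of ConvTranspose2d (dilation=1, output_padding=0)."""
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel, stride, padding):
    """Output side length of Conv2d (dilation=1)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) for each layer, copied from the classes above
gen_layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
disc_layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]

size, gen_sizes = 1, []          # generator starts from a 1x1 "image" of noise
for k, s, p in gen_layers:
    size = conv_transpose_out(size, k, s, p)
    gen_sizes.append(size)

size, disc_sizes = 64, []        # discriminator starts from a 64x64 image
for k, s, p in disc_layers:
    size = conv_out(size, k, s, p)
    disc_sizes.append(size)

print(gen_sizes)   # [4, 8, 16, 32, 64]
print(disc_sizes)  # [32, 16, 8, 4, 1]
```

A stride-2, kernel-4, padding-1 layer exactly doubles (or halves) the side length, which is why the two networks mirror each other.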

Pattern 3: Variational Autoencoder (VAE)

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=20):
        super().__init__()

        # Encoder: input → hidden → (mu, log_var)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_var = nn.Linear(hidden_dim, latent_dim)

        # Decoder: latent → hidden → reconstruction
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)

    def reparameterize(self, mu, log_var):
        """Reparameterization trick: z = mu + sigma * epsilon."""
        std = torch.exp(0.5 * log_var)
        epsilon = torch.randn_like(std)
        return mu + std * epsilon

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        reconstruction = self.decode(z)
        return reconstruction, mu, log_var

def vae_loss(reconstruction, original, mu, log_var):
    """VAE loss = Reconstruction + KL Divergence."""
    # Reconstruction loss (how well we rebuild the input)
    recon_loss = F.binary_cross_entropy(reconstruction, original, reduction='sum')

    # KL divergence (keep latent space close to N(0,1))
    # KL(q(z|x) || p(z)) = -0.5 * sum(1 + log(σ²) - μ² - σ²)
    kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return recon_loss + kl_loss

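# Training
# Data must stay in [0, 1] (ToTensor only, no Normalize): the Sigmoid decoder
# and the BCE reconstruction loss both assume that range.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_dataset = datasets.MNIST(root='./data', train=True,
                               transform=transforms.ToTensor(), download=True)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

vae = VAE()
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)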

for epoch in range(50):
    for images, _ in train_loader:
        images = images.view(images.size(0), -1)

        reconstruction, mu, log_var = vae(images)
        loss = vae_loss(reconstruction, images, mu, log_var)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Generate new images
with torch.no_grad():
    z = torch.randn(16, 20)  # Sample from N(0,1)
    generated = vae.decode(z)
    generated = generated.view(16, 1, 28, 28)

# Interpolate between two random latent points (decode 10 evenly spaced blends)
z1, z2 = torch.randn(1, 20), torch.randn(1, 20)
alphas = torch.linspace(0, 1, 10)
with torch.no_grad():
    interpolations = torch.stack([vae.decode((1 - a) * z1 + a * z2) for a in alphas])

VAE Architecture:
┌─────────────┐    ┌───────────┐    ┌─────────────┐
│   Encoder   │    │  Latent   │    │   Decoder   │
│             │───▶│  Space    │───▶│             │
│ input → h   │    │ z~N(μ,σ²) │    │ z → output  │
└─────────────┘    └───────────┘    └─────────────┘

                   ┌─── μ (mean)
   x → encoder ───┤
                   └─── log σ² (log-variance; σ = exp(½ · log σ²))
                          │
                   z = μ + σ × ε     ← Reparameterization trick
                          │               (ε ~ N(0,1))
                          ▼
                   decoder(z) → x̂

Loss = Reconstruction(x, x̂) + KL(q(z|x) || N(0,1))
       "rebuild accurately"    "keep latent space organized"

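The KL term in the loss has a closed form when the encoder outputs a diagonal Gaussian and the prior is N(0, 1). A minimal pure-Python sketch of that formula follows, mirroring the `kl_loss` line in `vae_loss`; `kl_to_standard_normal` is a hypothetical name used only for illustration.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))

# Posterior equals the prior: no penalty
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Mean shifted by 1 in one dimension: penalty of 0.5
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

# One dimension much narrower than the prior: also penalized
kl_narrow = kl_to_standard_normal([0.0, 0.0], [-2.0, 0.0])

print(kl_same, kl_shifted, kl_narrow)
```

The term is zero only when the posterior matches the prior, so it pulls every encoded distribution toward N(0, 1) while the reconstruction term pulls them apart.
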
Pattern 4: Diffusion Model (DDPM Simplified)

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDiffusion:
    """Denoising Diffusion Probabilistic Model (simplified)."""

    def __init__(self, timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.timesteps = timesteps

        # Noise schedule (linear)
        self.betas = torch.linspace(beta_start, beta_end, timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)

    def add_noise(self, x, t):
        """Forward process: add noise at timestep t."""
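        # α̅_t (cumulative product), moved to x's device and broadcast over (B, C, H, W)
        alpha_bar_t = self.alpha_cumprod.to(x.device)[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x)

        # x_t = sqrt(α̅_t) * x_0 + sqrt(1-α̅_t) * ε
        noisy = torch.sqrt(alpha_bar_t) * x + torch.sqrt(1 - alpha_bar_t) * noise
        return noisy, noise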

    def sample(self, model, shape, device):
        """Reverse process: denoise from pure noise."""
        x = torch.randn(shape).to(device)

        for t in reversed(range(self.timesteps)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

            # Predict noise
            predicted_noise = model(x, t_batch)

            # Denoise one step
            alpha_t = self.alphas[t]
            alpha_cumprod_t = self.alpha_cumprod[t]
            beta_t = self.betas[t]

            x = (1 / torch.sqrt(alpha_t)) * (
                x - (beta_t / torch.sqrt(1 - alpha_cumprod_t)) * predicted_noise
            )

            # Add noise (except at t=0)
            if t > 0:
                noise = torch.randn_like(x)
                x += torch.sqrt(beta_t) * noise

        return x


class NoisePredictor(nn.Module):
    """Small convolutional noise predictor (a stand-in for a U-Net: no downsampling or skip connections)."""

    def __init__(self, channels=1, time_dim=256):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(1, time_dim),
            nn.SiLU(),
            nn.Linear(time_dim, time_dim),
        )

        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.SiLU(),
        )

        self.decoder = nn.Sequential(
            nn.Conv2d(128 + time_dim, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # Time embedding (t is scaled to roughly [0, 1]; the divisor assumes timesteps=1000)
        t_emb = self.time_mlp(t.float().unsqueeze(-1) / 1000)
        t_emb = t_emb.view(t_emb.size(0), -1, 1, 1).expand(-1, -1, x.size(2), x.size(3))

        # Encode
        h = self.encoder(x)

        # Concat time embedding
        h = torch.cat([h, t_emb], dim=1)

        # Decode (predict noise)
        return self.decoder(h)


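# Training
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Data scaled to [-1, 1], matching the scale of the Gaussian noise being mixed in
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

diffusion = SimpleDiffusion(timesteps=1000)
model = NoisePredictor(channels=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)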

for epoch in range(100):
    for images, _ in train_loader:
        # Random timestep
        t = torch.randint(0, diffusion.timesteps, (images.size(0),))

        # Add noise
        noisy_images, noise = diffusion.add_noise(images, t)

        # Predict noise
        predicted_noise = model(noisy_images, t)

        # Loss: how well did we predict the noise?
        loss = F.mse_loss(predicted_noise, noise)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Generate new images
model.eval()
with torch.no_grad():
    generated = diffusion.sample(model, shape=(16, 1, 28, 28), device='cpu')

Diffusion Process:
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Forward (add noise):                                   │
│  x₀ ──▶ x₁ ──▶ x₂ ──▶ ... ──▶ x_T                   │
│  clean   +noise  +noise         pure noise              │
│                                                         │
│  Reverse (denoise):                                     │
│  x_T ──▶ x_{T-1} ──▶ ... ──▶ x₁ ──▶ x₀              │
│  noise   -predicted    ...    almost   clean!           │
│          noise                clean                     │
│                                                         │
│  Model learns: "given noisy image at step t,            │
│                 predict the noise that was added"        │
│                                                         │
│  Key insight: Breaking generation into many small        │
│  denoising steps is easier than generating in one shot  │
└─────────────────────────────────────────────────────────┘
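
To make the forward process concrete, the sketch below recomputes the cumulative product α̅_t for the same linear schedule as `SimpleDiffusion` (1000 steps, β from 1e-4 to 0.02) in plain Python and prints how much of the clean image survives at a few timesteps. `linear_alpha_bar` is a hypothetical helper name.

```python
import math

def linear_alpha_bar(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(timesteps):
        beta = beta_start + (beta_end - beta_start) * i / (timesteps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

alpha_bar = linear_alpha_bar()
for t in (0, 250, 500, 999):
    signal = math.sqrt(alpha_bar[t])        # weight on the clean image x_0
    noise = math.sqrt(1.0 - alpha_bar[t])   # weight on the Gaussian noise
    print(f"t={t:4d}  signal={signal:.4f}  noise={noise:.4f}")
```

At t=0 the image is almost untouched, and by the last step the signal weight is close to zero, which is why sampling can start from pure Gaussian noise.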

Pattern 5: Using Stable Diffusion / Diffusers

from diffusers import StableDiffusionPipeline
import torch

# Load pre-trained Stable Diffusion
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Text-to-Image generation
prompt = "a photo of a cat wearing a spacesuit on Mars, digital art"
negative_prompt = "blurry, low quality, distorted"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=7.5,         # How much to follow the prompt
).images[0]

image.save("cat_astronaut.png")

# Generate multiple images
images = pipe(
    prompt=prompt,
    num_images_per_prompt=4,
    num_inference_steps=30,
).images

# Image-to-Image (modify existing image)
from diffusers import StableDiffusionImg2ImgPipeline

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

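# "input.png" is a placeholder path; load your own starting image here
from PIL import Image
input_image = Image.open("input.png").convert("RGB").resize((768, 768))

result = img2img(
    prompt="make it look like a watercolor painting",
    image=input_image,
    strength=0.75,  # How much to change (0=no change, 1=full regeneration)
).images[0]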
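
The `guidance_scale` argument controls classifier-free guidance: at each denoising step the model predicts noise with and without the prompt, and the two predictions are combined. The toy sketch below shows that combination on plain lists with made-up numbers; it illustrates the formula and is not the diffusers implementation. `apply_guidance` is a hypothetical name.

```python
def apply_guidance(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward (and past) the prompt-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20, 0.05]   # toy noise prediction without the prompt
eps_cond = [0.30, -0.10, 0.00]     # toy noise prediction with the prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(eps_uncond, eps_cond, scale))
```

A scale of 0 ignores the prompt, 1 uses the conditional prediction as-is, and larger values push harder toward the prompt, usually at some cost in diversity.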

Pattern 6: Generative Model Comparison

Model Comparison:
┌────────────┬─────────────────┬──────────────┬──────────────┐
│            │ GAN             │ VAE          │ Diffusion    │
├────────────┼─────────────────┼──────────────┼──────────────┤
│ Quality    │ Sharp, realistic│ Blurry       │ Best quality │
│ Training   │ Unstable        │ Stable       │ Stable       │
│ Speed      │ Fast generation │ Fast         │ Slow (many   │
│            │                 │              │ denoising    │
│            │                 │              │ steps)       │
│ Diversity  │ Mode collapse   │ Good         │ Excellent    │
│ Latent     │ Hard to control │ Smooth,      │ Conditional  │
│ space      │                 │ interpolate  │ generation   │
│ Use case   │ Image synthesis │ Compression, │ Text-to-image│
│            │                 │ generation   │ editing      │
└────────────┴─────────────────┴──────────────┴──────────────┘

Reference Navigation

For detailed content, see:

  • GAN Fundamentals: reference/gan_fundamentals.md - Training tricks, WGAN, mode collapse
  • VAE Architecture: reference/vae_architecture.md - ELBO, reparameterization, variants
  • Diffusion Models: reference/diffusion_models.md - DDPM, score matching, schedulers
  • Image Generation: reference/image_generation.md - Stable Diffusion, ControlNet, practical use

Common Mistakes to Avoid

1. GAN Mode Collapse

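# SYMPTOM: Generator produces the same (or nearly the same) image for every noise vector
# CAUSES: Discriminator overpowers the generator, or the two learning rates are badly matched

# FIXES:
# 1. Use Wasserstein loss instead of BCE (drop the final Sigmoid and constrain the critic)
# 2. Add noise to discriminator inputs
# 3. Use spectral normalization
# 4. Rebalance updates: give the weaker network more steps per batch
# 5. Use label smoothing on the real labels
real_labels = torch.empty(batch_size, 1).uniform_(0.8, 1.0)  # Soft targets in [0.8, 1.0], not a hard 1.0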
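
Fix 1 can be sketched as a single WGAN update with weight clipping. The settings below (RMSprop, lr=5e-5, clip value 0.01) follow the original WGAN recipe, which also runs about 5 critic steps per generator step; the tiny networks and the random "real" batch are stand-ins chosen for this sketch, not code from the skill.

```python
import torch
import torch.nn as nn

# Critic: like the discriminator, but with no Sigmoid (it outputs an unbounded score)
critic = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
clip_value = 0.01  # weight-clipping bound

real = torch.rand(64, 784) * 2 - 1  # stand-in batch in [-1, 1]; use real data in practice

# Critic step: raise scores on real data, lower them on fakes
fake = generator(torch.randn(64, 100)).detach()
critic_loss = critic(fake).mean() - critic(real).mean()
opt_c.zero_grad()
critic_loss.backward()
opt_c.step()
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-clip_value, clip_value)  # crude Lipschitz constraint

# Generator step: raise the critic's score on fakes
gen_loss = -critic(generator(torch.randn(64, 100))).mean()
opt_g.zero_grad()
gen_loss.backward()
opt_g.step()

print(critic_loss.item(), gen_loss.item())
```

Because the critic's score is unbounded, its loss tends to track sample quality more closely than a BCE loss does, which makes training easier to monitor.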

2. VAE Posterior Collapse

# SYMPTOM: KL loss goes to ~0 and the decoder ignores z (outputs stop depending on the input)

# FIX: KL annealing (gradually increase KL weight)
kl_weight = min(1.0, epoch / 10)  # Linear warmup over 10 epochs
loss = recon_loss + kl_weight * kl_loss
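
Wrapped as small functions, the same warmup makes it easy to see when the KL term reaches full strength. The function names and the loss values (100.0 and 20.0) are hypothetical, chosen only to illustrate the schedule.

```python
def kl_weight(epoch, warmup_epochs=10):
    """Linear KL warmup: 0 at epoch 0, reaching 1.0 after warmup_epochs."""
    return min(1.0, epoch / warmup_epochs)

def annealed_loss(recon_loss, kl_loss, epoch, warmup_epochs=10):
    """Total VAE loss with the KL term scaled by the warmup weight."""
    return recon_loss + kl_weight(epoch, warmup_epochs) * kl_loss

for epoch in (0, 5, 10, 20):
    print(epoch, kl_weight(epoch), annealed_loss(100.0, 20.0, epoch))
```

At epoch 0 the model trains as a plain autoencoder, which gives the decoder time to start using z before the KL penalty applies in full.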

3. Wrong Normalization for GAN

# WRONG: Tanh output but data in [0, 1]
# Generator outputs [-1, 1] but images are [0, 1]

# CORRECT: Normalize data to match generator output
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Now data is [-1, 1]
])
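
The arithmetic behind that transform, and its inverse for viewing or saving generator outputs, can be sketched per pixel in plain Python. `normalize` and `denormalize` are hypothetical helper names.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """What transforms.Normalize((0.5,), (0.5,)) does to each pixel value."""
    return (pixel - mean) / std

def denormalize(value, mean=0.5, std=0.5):
    """Inverse mapping, for viewing or saving generator outputs."""
    return value * std + mean

for pixel in (0.0, 0.25, 1.0):
    print(pixel, "->", normalize(pixel), "->", denormalize(normalize(pixel)))
```

Forgetting the inverse step is the mirror image of this mistake: Tanh outputs saved without denormalizing come out too dark, with the negative half of the range clipped.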

Teaching Mode

GAN Analogy

GAN = Counterfeiter vs Detective

  Counterfeiter (Generator):
    Makes fake money, tries to fool the detective.

  Detective (Discriminator):
    Examines money, tries to catch fakes.

  Over time:
    - Counterfeiter gets better at faking
    - Detective gets better at detecting
    - Eventually: fakes indistinguishable from real!

  This is a "minimax game" from Game Theory.

VAE vs AE

Autoencoder (AE):
  encoder: image → single point in latent space
  Problem: Gaps between points → random points = garbage

VAE:
  encoder: image → distribution (mean + variance)
  Advantage: Continuous, smooth latent space
             Random points → meaningful images!

  ┌─ AE latent space ─┐    ┌─ VAE latent space ─┐
  │   ·    ·           │    │  ∿∿∿∿∿∿∿∿∿∿∿∿    │
  │  ·        ·        │    │  ∿ smooth everywhere│
  │     gap!      ·    │    │  ∿ any point works  │
  │   ·     ·          │    │  ∿∿∿∿∿∿∿∿∿∿∿∿    │
  └────────────────────┘    └─────────────────────┘

Cross-References

  • Autoencoders basics: ../deep-learning-core/SKILL.md - AE, Denoising AE, latent space
  • CNNs for images: ../cnn-vision/SKILL.md - ConvTranspose2d, image processing
  • PyTorch patterns: ../pytorch-mastery/SKILL.md - Training loops, GPU optimization
  • Fine-tuning Diffusion: ../fine-tuning-peft/SKILL.md - LoRA for Stable Diffusion