levy-n

cnn-vision

Implements CNN architectures for computer vision tasks. Covers convolution operations, pooling, CNN design patterns (LeNet, ResNet, VGG), transfer learning, fine-tuning pretrained models, data augmentation, and image preprocessing. Use when building image classifiers, doing object detection, or when user mentions 'CNN', 'convolution', 'pooling', 'ResNet', 'VGG', 'transfer learning', 'fine-tuning', 'image augmentation', 'ImageNet', 'feature maps', 'MNIST', 'image classification', 'multi-modal', 'image captioning', or 'multimodal network'.

Install

npx skillscat add levy-n/claude-useful-skills/cnn-vision

Install via the SkillsCat registry.

SKILL.md

CNN Vision - Computer Vision with CNNs

Convolutional Neural Networks for image processing and computer vision.

Quick Start - Image Classification

import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision import datasets, models

# Image transforms
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Load data
train_dataset = datasets.ImageFolder('data/train', transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)

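# Transfer learning with ResNet
num_classes = len(train_dataset.classes)  # One output per class folder
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = models.resnet18(weights='IMAGENET1K_V1')
model.fc = nn.Linear(model.fc.in_features, num_classes)  # Replace final layer
model = model.to(device)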

# Train
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
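The Quick Start stops after creating the loss and optimizer. A minimal sketch of the training step that would typically follow is shown here. The helper name `train_one_epoch` is hypothetical, and the tiny linear model and random tensors are stand-ins (assumptions) so the block runs without an image folder or a pretrained download; in practice `model` and `loader` would be the ResNet and `train_loader` from above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over `loader` and return the mean loss per sample."""
    model.train()  # enable dropout and BatchNorm statistic updates
    total_loss, seen = 0.0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()                 # clear gradients from the previous step
        loss = criterion(model(images), labels)
        loss.backward()                       # backpropagate
        optimizer.step()                      # update weights
        total_loss += loss.item() * images.size(0)
        seen += images.size(0)
    return total_loss / seen


# Stand-in data and model (hypothetical), only to make the sketch runnable.
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 4, (64,)))
loader = DataLoader(data, batch_size=16, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 4)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(3):
    epoch_loss = train_one_epoch(model, loader, criterion, optimizer, device)
    print(f"epoch {epoch}: loss {epoch_loss:.4f}")
```

Weighting each batch loss by its size keeps the reported mean correct when the last batch is smaller than the rest.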

When This Skill Activates

Use this skill when:

  • Building image classifiers
  • Using pretrained models (ResNet, VGG, EfficientNet)
  • Applying transfer learning
  • Designing CNN architectures
  • Implementing data augmentation
  • Working with MNIST, CIFAR, ImageNet

Core Patterns

Pattern 1: Convolution Output Size

Output Size = floor((Input - Kernel + 2×Padding) / Stride) + 1

Example: Input=28, Kernel=3, Padding=1, Stride=1
Output = floor((28 - 3 + 2×1) / 1) + 1 = 28 (same size)

"Same" padding rule (odd kernel, stride 1): padding = kernel_size // 2
# Conv2d parameters
nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=64,    # Number of filters
    kernel_size=3,      # 3x3 filter
    stride=1,           # Move 1 pixel at a time
    padding=1           # "Same" padding
)
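The output-size formula can be checked with a few lines of plain Python. This is a sketch assuming no dilation; the helper name `conv_output_size` is hypothetical. The same arithmetic applies to pooling layers, with the pool window as the kernel.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    # Integer division implements the floor in the formula.
    return (input_size - kernel_size + 2 * padding) // stride + 1


print(conv_output_size(28, 3, padding=1, stride=1))   # 28  ("same" padding)
print(conv_output_size(224, 7, padding=3, stride=2))  # 112 (7x7 stride-2 stem, as in ResNet)
print(conv_output_size(28, 2, stride=2))              # 14  (2x2 max pool)
```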

Pattern 2: CNN Architecture Pattern

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Feature extractor: Conv → ReLU → Pool
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 224 → 112

            # Block 2: 32 → 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 112 → 56

            # Block 3: 64 → 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 56 → 28
        )

        # Classifier: Flatten → FC → Output
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

Pattern 3: Transfer Learning (Feature Extraction)

import torchvision.models as models

# Load pretrained model
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer (unfrozen by default)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only train the new layer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
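A quick way to confirm that only the new layer will train is to count the parameters with `requires_grad=True`. The sketch below uses a small stand-in network instead of ResNet so it runs without downloading weights; `count_parameters` and the stand-in model are hypothetical, not part of torchvision.

```python
import torch.nn as nn


def count_parameters(model):
    """Return (trainable, total) parameter counts for a module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


# Stand-in backbone + head (hypothetical; substitute a real pretrained model).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # 3*8*3*3 weights + 8 biases = 224
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1000),                          # original "head"
)

for param in model.parameters():   # freeze everything that exists so far
    param.requires_grad = False
model[-1] = nn.Linear(8, 5)        # new head: requires_grad=True by default

trainable, total = count_parameters(model)
print(f"trainable={trainable}, total={total}")  # trainable=45, total=269
```

If the trainable count equals the total, nothing was frozen; if it is zero, the head was frozen along with the backbone.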

Pattern 4: Fine-Tuning (Unfreeze Later Layers)

# Load pretrained
model = models.resnet18(weights='IMAGENET1K_V1')

# Replace final layer
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Freeze early layers, unfreeze later
for name, param in model.named_parameters():
    if 'layer4' in name or 'fc' in name:
        param.requires_grad = True
    else:
        param.requires_grad = False

# Different learning rates
optimizer = torch.optim.Adam([
    {'params': model.layer4.parameters(), 'lr': 1e-4},  # Low LR for pretrained
    {'params': model.fc.parameters(), 'lr': 1e-3}       # High LR for new
])

Pattern 5: Data Augmentation

# Training transforms (with augmentation)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Validation transforms (NO augmentation!)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# ⚠️ CRITICAL: Only augment training data!
train_dataset = datasets.ImageFolder('train', transform=train_transform)
val_dataset = datasets.ImageFolder('val', transform=val_transform)

Pattern 6: MNIST Patterns

# MNIST specific values
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((MNIST_MEAN,), (MNIST_STD,))
])

# MNIST is 28x28 grayscale (1 channel)
# Input shape: [batch, 1, 28, 28]

import torch.nn.functional as F  # Needed for F.relu below

class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)   # 28→28
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)  # 14→14
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 28→14
        x = self.pool(F.relu(self.conv2(x)))  # 14→7
        x = x.view(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
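The constants 0.1307 and 0.3081 are the commonly quoted mean and standard deviation of the MNIST training pixels after scaling to [0, 1]. For a custom dataset, comparable statistics could be computed along these lines. This is a sketch: `mean_and_std` is a hypothetical helper, and the random tensor is a stand-in for the stacked training images.

```python
import torch


def mean_and_std(images):
    """Per-channel mean and std of a float tensor shaped [N, C, H, W]."""
    mean = images.mean(dim=(0, 2, 3))
    # Population std: average squared deviation from the channel mean.
    centered = images - mean.view(1, -1, 1, 1)
    std = centered.pow(2).mean(dim=(0, 2, 3)).sqrt()
    return mean, std


# Stand-in batch; for MNIST this would be the stacked training images
# after ToTensor(), i.e. shape [60000, 1, 28, 28] with values in [0, 1].
images = torch.rand(256, 1, 28, 28)
mean, std = mean_and_std(images)
print(mean.shape, std.shape)  # one value per channel
```

Statistics should come from the training split only, then be reused unchanged for validation and test transforms.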

Pattern 7: Feature Map Visualization

def visualize_feature_maps(model, image, layer_name):
    """Visualize what a CNN layer sees."""
    activation = {}

    def hook(module, inputs, output):  # Names avoid shadowing `model` and the builtin `input`
        activation['output'] = output.detach()

    # Register hook
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)

    # Forward pass
    model.eval()
    with torch.no_grad():
        _ = model(image.unsqueeze(0))

    handle.remove()

    # Plot feature maps (the 4x8 grid shows at most the first 32 channels)
    import matplotlib.pyplot as plt
    feat_maps = activation['output'].squeeze(0).cpu()  # Drop only the batch dimension
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        if i < feat_maps.shape[0]:
            ax.imshow(feat_maps[i], cmap='viridis')
        ax.axis('off')
    plt.show()

Pattern 8: Calculate FC Input Size

def calc_fc_input_size(model, input_size=(3, 224, 224)):
    """Calculate the flattened size after conv layers."""
    with torch.no_grad():
        x = torch.randn(1, *input_size)
        x = model.features(x)  # Pass through conv layers
        return x.view(1, -1).size(1)

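# Usage (assumes the model exposes a `features` submodule, like SimpleCNN above)
fc_input = calc_fc_input_size(model)
model.classifier = nn.Sequential(nn.Flatten(), nn.Linear(fc_input, num_classes))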

Reference Navigation

For detailed content, see:

  • Convolution Basics: reference/convolution_basics.md - Filters, Pooling, Feature maps
  • CNN Architectures: reference/cnn_architectures.md - LeNet, VGG, ResNet patterns
  • Transfer Learning: reference/transfer_learning.md - Fine-tuning, Feature extraction
  • Image Preprocessing: reference/image_preprocessing.md - Transforms, Normalization
  • MNIST Patterns: reference/mnist_patterns.md - Dataset loading, specific values
  • Multi-Modal & Captioning: reference/multimodal_captioning.md - Multi-modal networks, Image captioning, CNN+LSTM

Common Mistakes to Avoid

1. Augmenting Validation Data

# WRONG: Augmentation on validation
val_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # NO! Don't augment val/test
    transforms.ToTensor(),
])

# CORRECT: Only resize and normalize
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

2. Wrong ImageNet Normalization

# WRONG: Arbitrary values that don't match what the pretrained weights were trained with
transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# CORRECT: ImageNet statistics (for pretrained models)
transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

3. Forgetting to Unfreeze Classifier

# WRONG: Head replaced first, then everything frozen
model.fc = nn.Linear(512, 10)
for param in model.parameters():
    param.requires_grad = False  # Freezes the new layer too!

# CORRECT: Freeze first, then replace (new layers default to requires_grad=True)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, 10)

# If in doubt, explicitly unfreeze the new layer
for param in model.fc.parameters():
    param.requires_grad = True

4. Wrong Flatten Size

# WRONG: Hardcoded wrong size
self.fc = nn.Linear(512, 10)  # Error if feature maps are different!

# CORRECT: Calculate from conv output
# After 3 pools on 224x224: 224 / 2^3 = 28
# If last conv has 128 channels: 128 * 28 * 28 = 100352
self.fc = nn.Linear(128 * 28 * 28, 10)
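As an alternative to computing the flatten size by hand, PyTorch's `nn.LazyLinear` infers `in_features` from the first batch it sees. The sketch below uses a small stand-in network (an assumption, not the SimpleCNN above) and a dummy forward pass to materialize the layer. It is a convenience, not a replacement for understanding the shape arithmetic.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),          # 32 → 16
    nn.Flatten(),
    nn.LazyLinear(10),           # in_features inferred on the first forward pass
)

with torch.no_grad():
    out = model(torch.randn(2, 3, 32, 32))   # dummy batch materializes the layer

print(out.shape)                 # torch.Size([2, 10])
print(model[-1].in_features)     # 2048 = 8 * 16 * 16
```

Running the dummy forward pass before creating the optimizer is the safer order, since the lazy layer's weights have no shape until then.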

5. BatchNorm with batch_size=1

# WRONG: Train mode with a single image
model.train()
output = model(single_image)  # Noisy batch statistics; raises ValueError if a BatchNorm layer sees only one value per channel

# CORRECT: Use eval mode for single images
model.eval()
with torch.no_grad():
    output = model(single_image)

Teaching Mode

When explaining CNN concepts:

Convolution Intuition

Convolution = Sliding window that detects patterns

Image:                  Filter (3x3):         Output:
┌─────────────┐        ┌─────────┐           Multiply
│ 1 2 3 4 5   │        │ 1  0 -1 │           element-wise,
│ 6 7 8 9 10  │   *    │ 1  0 -1 │    →      sum up
│ ...         │        │ 1  0 -1 │
└─────────────┘        └─────────┘

This filter detects vertical edges!
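The sliding-window picture can be made concrete in plain Python. This sketch applies the vertical-edge filter from the diagram to a small made-up image (dark left half, bright right half). The function name `correlate2d_valid` is hypothetical; like `nn.Conv2d`, it slides the kernel without flipping it, which is strictly cross-correlation.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 4x6 image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]
kernel = [[1, 0, -1] for _ in range(3)]

result = correlate2d_valid(image, kernel)
for row in result:
    print(row)
# [0, -27, -27, 0]
# [0, -27, -27, 0]
```

The output is zero over flat regions and large in magnitude where the window straddles the edge, which is what "detects vertical edges" means in practice.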

CNN Layer Hierarchy

Layer 1: Edges, corners       ─────▶  Low-level features
Layer 2: Textures, patterns   ─────▶  Mid-level features
Layer 3: Parts (eyes, wheels) ─────▶  High-level features
Layer 4: Objects              ─────▶  Semantic features

Deeper = More abstract

Transfer Learning Analogy

Training from scratch:
"Learn to see from birth"
- Needs millions of images
- Takes days/weeks

Transfer learning:
"I already know what edges, textures, objects look like"
"Just teach me YOUR specific categories"
- Needs hundreds/thousands of images
- Takes hours