cnn-vision

Implements CNN architectures for computer vision tasks. Covers convolution operations, pooling, CNN design patterns (LeNet, ResNet, VGG), transfer learning, fine-tuning pretrained models, data augmentation, and image preprocessing. Use when building image classifiers, doing object detection, or when user mentions 'CNN', 'convolution', 'pooling', 'ResNet', 'VGG', 'transfer learning', 'fine-tuning', 'image augmentation', 'ImageNet', 'feature maps', 'MNIST', 'image classification', 'multi-modal', 'image captioning', or 'multimodal network'.

levy-n 10 1 Updated 5mo ago

Install

npx skillscat add levy-n/claude-useful-skills/cnn-vision

Install via the SkillsCat registry.

SKILL.md

CNN Vision - Computer Vision with CNNs

Convolutional Neural Networks for image processing and computer vision.

Quick Start - Image Classification

import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision import datasets, models

# Image transforms
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Load data
train_dataset = datasets.ImageFolder('data/train', transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)

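# Transfer learning with ResNet
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
num_classes = len(train_dataset.classes)  # One class per subfolder of data/train
model = models.resnet18(weights='IMAGENET1K_V1')
model.fc = nn.Linear(model.fc.in_features, num_classes)  # Replace final layer
model = model.to(device)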

# Train
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
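
The Quick Start stops once the loss and optimizer are created. A minimal sketch of the loop that could follow is shown here; `train_one_epoch` is a hypothetical helper name, not part of the skill, and it assumes the loader yields `(images, labels)` batches as `ImageFolder` with a `DataLoader` does.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over `loader`; return (mean loss, accuracy)."""
    model.train()  # Enable dropout and BatchNorm batch statistics
    running_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()             # Clear gradients from the previous step
        logits = model(images)
        loss = criterion(logits, labels)
        loss.backward()                   # Backpropagate
        optimizer.step()                  # Update the trainable weights
        running_loss += loss.item() * images.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += images.size(0)
    return running_loss / seen, correct / seen


# Usage with the Quick Start objects (the epoch count is an arbitrary assumption):
# for epoch in range(5):
#     loss, acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
#     print(f"epoch {epoch}: loss={loss:.4f} acc={acc:.3f}")
```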

When This Skill Activates

Use this skill when:

Building image classifiers
Using pretrained models (ResNet, VGG, EfficientNet)
Applying transfer learning
Designing CNN architectures
Implementing data augmentation
Working with MNIST, CIFAR, ImageNet

Core Patterns

Pattern 1: Convolution Output Size

Output Size = floor((Input - Kernel + 2×Padding) / Stride) + 1

Example: Input=28, Kernel=3, Padding=1, Stride=1
Output = (28 - 3 + 2×1) / 1 + 1 = 28 (same size)

"Same" padding rule (stride 1, odd kernel size): padding = kernel_size // 2

# Conv2d parameters
nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=64,    # Number of filters
    kernel_size=3,      # 3x3 filter
    stride=1,           # Move 1 pixel at a time
    padding=1           # "Same" padding
)
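
The output-size formula can be checked with a few lines of plain Python. This is a sketch that assumes no dilation and rounds down when the division is not exact; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a conv or pooling layer (assumes no dilation)."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


print(conv_output_size(28, 3, padding=1, stride=1))   # 28: "same" padding
print(conv_output_size(224, 7, padding=3, stride=2))  # 112: floor(223 / 2) + 1
print(conv_output_size(28, 2, padding=0, stride=2))   # 14: 2x2 max pooling
```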

Pattern 2: CNN Architecture Pattern

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Feature extractor: Conv → ReLU → Pool
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 224 → 112

            # Block 2: 32 → 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 112 → 56

            # Block 3: 64 → 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 56 → 28
        )

        # Classifier: Flatten → FC → Output
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

Pattern 3: Transfer Learning (Feature Extraction)

import torchvision.models as models

# Load pretrained model
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer (unfrozen by default)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only train the new layer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
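
A quick way to confirm the freeze worked is to count trainable parameters. The sketch below uses a tiny stand-in backbone instead of ResNet so it runs without downloading weights; `count_parameters` is a hypothetical helper name.

```python
import torch.nn as nn


def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


# Tiny stand-in for a pretrained backbone (assumption: a real one would be ResNet)
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
for param in backbone.parameters():
    param.requires_grad = False  # Freeze the backbone

head = nn.Linear(8, 10)  # New layer: requires_grad=True by default
model = nn.Sequential(backbone, head)

trainable, total = count_parameters(model)
print(trainable, total)  # 90 314: only the new head is trainable
```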

Pattern 4: Fine-Tuning (Unfreeze Later Layers)

# Load pretrained
model = models.resnet18(weights='IMAGENET1K_V1')

# Replace final layer
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Freeze early layers, unfreeze later
for name, param in model.named_parameters():
    if 'layer4' in name or 'fc' in name:
        param.requires_grad = True
    else:
        param.requires_grad = False

# Different learning rates
optimizer = torch.optim.Adam([
    {'params': model.layer4.parameters(), 'lr': 1e-4},  # Low LR for pretrained
    {'params': model.fc.parameters(), 'lr': 1e-3}       # High LR for new
])
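
Substring checks such as `'fc' in name` can match more than intended on models with similarly named submodules. A hedged alternative is to match on name prefixes; `set_trainable` and `TinyNet` below are hypothetical names used only for illustration.

```python
import torch.nn as nn


def set_trainable(model, trainable_prefixes):
    """Unfreeze parameters whose names start with a given prefix; freeze the rest."""
    prefixes = tuple(trainable_prefixes)
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
    return [name for name, param in model.named_parameters() if param.requires_grad]


class TinyNet(nn.Module):
    """Stand-in that only mimics ResNet's submodule names."""

    def __init__(self):
        super().__init__()
        self.layer3 = nn.Linear(4, 4)
        self.layer4 = nn.Linear(4, 4)
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(self.layer4(self.layer3(x)))


net = TinyNet()
print(set_trainable(net, ["layer4.", "fc."]))
# ['layer4.weight', 'layer4.bias', 'fc.weight', 'fc.bias']
```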

Pattern 5: Data Augmentation

# Training transforms (with augmentation)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Validation transforms (NO augmentation!)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# ⚠️ CRITICAL: Only augment training data!
train_dataset = datasets.ImageFolder('train', transform=train_transform)
val_dataset = datasets.ImageFolder('val', transform=val_transform)

Pattern 6: MNIST Patterns

import torch.nn.functional as F  # Needed for F.relu in MNISTNet below

# MNIST-specific normalization values
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((MNIST_MEAN,), (MNIST_STD,))
])

# MNIST is 28x28 grayscale (1 channel)
# Input shape: [batch, 1, 28, 28]

class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)   # 28→28
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)  # 14→14
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 28→14
        x = self.pool(F.relu(self.conv2(x)))  # 14→7
        x = x.view(-1, 64 * 7 * 7)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
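
The two MNIST constants are commonly described as the mean and standard deviation of the training pixels after scaling to [0, 1]. A sketch of how such statistics could be computed for any dataset, demonstrated on synthetic data; `pixel_mean_std` is a hypothetical helper name.

```python
import torch


def pixel_mean_std(images):
    """Per-channel mean and std of a float tensor shaped [N, C, H, W]."""
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3), unbiased=False)  # Population std
    return mean, std


# Synthetic stand-in for grayscale digits: one all-black and one all-white image
fake_images = torch.cat([torch.zeros(1, 1, 28, 28), torch.ones(1, 1, 28, 28)])
mean, std = pixel_mean_std(fake_images)
print(mean, std)  # tensor([0.5000]) tensor([0.5000])
```

Compute the statistics on the training split only, then reuse them for validation and test data.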

Pattern 7: Feature Map Visualization

def visualize_feature_maps(model, image, layer_name):
    """Visualize what a CNN layer sees."""
    activation = {}

    def hook(model, input, output):
        activation['output'] = output.detach()

    # Register hook
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)

    # Forward pass
    model.eval()
    with torch.no_grad():
        _ = model(image.unsqueeze(0))

    handle.remove()

    # Plot feature maps
    import matplotlib.pyplot as plt
    feat_maps = activation['output'].squeeze(0).cpu()  # Drop only the batch dim: [C, H, W]
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        if i < feat_maps.shape[0]:
            ax.imshow(feat_maps[i], cmap='viridis')
        ax.axis('off')
    plt.show()

Pattern 8: Calculate FC Input Size

def calc_fc_input_size(model, input_size=(3, 224, 224)):
    """Calculate the flattened size after conv layers."""
    with torch.no_grad():
        x = torch.randn(1, *input_size)
        x = model.features(x)  # Pass through conv layers
        return x.view(1, -1).size(1)

# Usage
fc_input = calc_fc_input_size(model)
model.classifier = nn.Linear(fc_input, num_classes)
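
The dummy-forward result can be cross-checked with arithmetic alone. The sketch below traces a square input through a list of `(kernel, padding, stride)` layers; `flattened_size` is a hypothetical helper name, and the layer list mirrors the three Conv → Pool blocks of Pattern 2.

```python
def flattened_size(input_hw, out_channels, layers):
    """Trace a square input through (kernel, padding, stride) layers."""
    size = input_hw
    for kernel, padding, stride in layers:
        size = (size - kernel + 2 * padding) // stride + 1
    return out_channels * size * size


conv = (3, 1, 1)  # 3x3 conv, padding 1, stride 1: keeps the size
pool = (2, 0, 2)  # 2x2 max pool, stride 2: halves the size

print(flattened_size(224, 128, [conv, pool] * 3))  # 100352 = 128 * 28 * 28
```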

Reference Navigation

For detailed content, see:

Convolution Basics: reference/convolution_basics.md - Filters, Pooling, Feature maps
CNN Architectures: reference/cnn_architectures.md - LeNet, VGG, ResNet patterns
Transfer Learning: reference/transfer_learning.md - Fine-tuning, Feature extraction
Image Preprocessing: reference/image_preprocessing.md - Transforms, Normalization
MNIST Patterns: reference/mnist_patterns.md - Dataset loading, specific values
Multi-Modal & Captioning: reference/multimodal_captioning.md - Multi-modal networks, Image captioning, CNN+LSTM

Common Mistakes to Avoid

1. Augmenting Validation Data

# WRONG: Augmentation on validation
val_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # NO! Don't augment val/test
    transforms.ToTensor(),
])

# CORRECT: Only resize and normalize
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

2. Wrong ImageNet Normalization

# WRONG: Arbitrary values that do not match the statistics the pretrained weights expect
transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# CORRECT: ImageNet statistics (for pretrained models)
transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

3. Forgetting to Unfreeze Classifier

# WRONG: Replacing the layer before freezing freezes the new layer too
model.fc = nn.Linear(512, 10)
for param in model.parameters():
    param.requires_grad = False  # model.fc is now frozen as well!

# CORRECT: Freeze first, then replace (new layers default to requires_grad=True)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, 10)
for param in model.fc.parameters():
    param.requires_grad = True  # Redundant here, but makes the intent explicit

4. Wrong Flatten Size

# WRONG: Hardcoded wrong size
self.fc = nn.Linear(512, 10)  # Error if feature maps are different!

# CORRECT: Calculate from conv output
# After 3 pools on 224x224: 224 / 2^3 = 28
# If last conv has 128 channels: 128 * 28 * 28 = 100352
self.fc = nn.Linear(128 * 28 * 28, 10)

5. BatchNorm with batch_size=1

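# WRONG: Train mode normalizes with batch statistics, which are unreliable for one image
model.train()
output = model(single_image)  # Noisy; raises ValueError if a BatchNorm layer sees one value per channel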

# CORRECT: Use eval mode for single images
model.eval()
with torch.no_grad():
    output = model(single_image)
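
A small reproduction, using standalone BatchNorm layers rather than a full model. In train mode the error appears only when a layer sees a single value per channel; with a larger spatial size the call runs but relies on statistics from one image.

```python
import torch
import torch.nn as nn

single = torch.randn(1, 4)  # Batch of one sample with 4 features
bn = nn.BatchNorm1d(4)

bn.train()
try:
    bn(single)
    failed = False
except ValueError as err:
    failed = True
    print(f"train mode: {err}")

# BatchNorm2d with an 8x8 feature map runs in train mode, but the
# statistics come from a single image
bn2d = nn.BatchNorm2d(4).train()
out2d = bn2d(torch.randn(1, 4, 8, 8))

bn.eval()  # Eval mode uses the stored running statistics instead
with torch.no_grad():
    out = bn(single)
print(out.shape)  # torch.Size([1, 4])
```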

Teaching Mode

When explaining CNN concepts:

Convolution Intuition

Convolution = Sliding window that detects patterns

Image:                  Filter (3x3):         Output:
┌─────────────┐        ┌─────────┐           Multiply
│ 1 2 3 4 5   │        │ 1  0 -1 │           element-wise,
│ 6 7 8 9 10  │   *    │ 1  0 -1 │    →      sum up
│ ...         │        │ 1  0 -1 │
└─────────────┘        └─────────┘

This filter detects vertical edges!
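
The "multiply element-wise, sum up" step can be written out in plain Python. This sketch slides the filter without flipping it (cross-correlation, which is what deep learning libraries call convolution) and uses no padding; `correlate2d` is a hypothetical helper name.

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image`; multiply element-wise and sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Bright left half, dark right half: one vertical edge in the middle
image = [[1, 1, 1, 0, 0, 0] for _ in range(5)]
kernel = [[1, 0, -1] for _ in range(3)]

for row in correlate2d(image, kernel):
    print(row)  # [0, 3, 3, 0] on every row
```

The response is zero over flat regions and peaks where the window straddles the edge.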

CNN Layer Hierarchy

Layer 1: Edges, corners       ─────▶  Low-level features
Layer 2: Textures, patterns   ─────▶  Mid-level features
Layer 3: Parts (eyes, wheels) ─────▶  High-level features
Layer 4: Objects              ─────▶  Semantic features

Deeper = More abstract

Transfer Learning Analogy

Training from scratch:
"Learn to see from birth"
- Needs millions of images
- Takes days/weeks

Transfer learning:
"I already know what edges, textures, objects look like"
"Just teach me YOUR specific categories"
- Needs hundreds/thousands of images
- Takes hours
