Implements CNN architectures for computer vision tasks. Covers convolution operations, pooling, CNN design patterns (LeNet, ResNet, VGG), transfer learning, fine-tuning pretrained models, data augmentation, and image preprocessing. Use when building image classifiers, doing object detection, or when user mentions 'CNN', 'convolution', 'pooling', 'ResNet', 'VGG', 'transfer learning', 'fine-tuning', 'image augmentation', 'ImageNet', 'feature maps', 'MNIST', 'image classification', 'multi-modal', 'image captioning', or 'multimodal network'.
Install via the SkillsCat registry:
npx skillscat add levy-n/claude-useful-skills/cnn-vision
CNN Vision - Computer Vision with CNNs
Convolutional Neural Networks for image processing and computer vision.
Quick Start - Image Classification
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision import datasets, models

# Image transforms (ImageNet statistics, matching the pretrained weights)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Load data
train_dataset = datasets.ImageFolder('data/train', transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)

# Transfer learning with ResNet
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
num_classes = len(train_dataset.classes)  # One output per class folder
model = models.resnet18(weights='IMAGENET1K_V1')
model.fc = nn.Linear(model.fc.in_features, num_classes)  # Replace final layer
model = model.to(device)

# Train
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

When This Skill Activates
Use this skill when:
- Building image classifiers
- Using pretrained models (ResNet, VGG, EfficientNet)
- Applying transfer learning
- Designing CNN architectures
- Implementing data augmentation
- Working with MNIST, CIFAR, ImageNet
Core Patterns
Pattern 1: Convolution Output Size
Output Size = floor((Input - Kernel + 2×Padding) / Stride) + 1
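The formula above can be checked with a small helper, with the division rounded down as in PyTorch. This is a minimal sketch in plain Python: `conv_output_size` is a hypothetical name rather than a PyTorch function, and it assumes square inputs and no dilation. The same formula covers pooling layers (kernel 2, stride 2, padding 0).

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a conv or pooling layer (square input, no dilation)."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


# A 3x3 kernel with padding 1 and stride 1 keeps a 28-pixel input at 28
same = conv_output_size(28, kernel_size=3, padding=1, stride=1)

# Three Conv(3x3, padding=1) + MaxPool(2, 2) blocks on a 224x224 input
size = 224
for _ in range(3):
    size = conv_output_size(size, kernel_size=3, padding=1)  # conv keeps the size
    size = conv_output_size(size, kernel_size=2, stride=2)   # pool halves it

print(same, size)  # 28 28
```

Tracing the sizes this way before writing the first fully connected layer is one way to avoid a mismatched flatten size.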
Example: Input=28, Kernel=3, Padding=1, Stride=1
Output = (28 - 3 + 2×1) / 1 + 1 = 28 (same size)
"Same" padding rule (odd kernel size, stride 1): padding = kernel_size // 2

# Conv2d parameters
nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # Number of filters
    kernel_size=3,    # 3x3 filter
    stride=1,         # Move 1 pixel at a time
    padding=1         # "Same" padding
)

Pattern 2: CNN Architecture Pattern
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extractor: Conv → ReLU → Pool
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 224 → 112
            # Block 2: 32 → 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 112 → 56
            # Block 3: 64 → 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 56 → 28
        )
        # Classifier: Flatten → FC → Output
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

Pattern 3: Transfer Learning (Feature Extraction)
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained model
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer (unfrozen by default)
# num_classes is the number of target classes in your dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only train the new layer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)

Pattern 4: Fine-Tuning (Unfreeze Later Layers)
# Load pretrained
model = models.resnet18(weights='IMAGENET1K_V1')

# Replace final layer
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Freeze early layers, unfreeze later
for name, param in model.named_parameters():
    if 'layer4' in name or 'fc' in name:
        param.requires_grad = True
    else:
        param.requires_grad = False

# Different learning rates
optimizer = torch.optim.Adam([
    {'params': model.layer4.parameters(), 'lr': 1e-4},  # Low LR for pretrained
    {'params': model.fc.parameters(), 'lr': 1e-3}       # High LR for new
])

Pattern 5: Data Augmentation
# Training transforms (with augmentation)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Validation transforms (NO augmentation!)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# ⚠️ CRITICAL: Only augment training data!
train_dataset = datasets.ImageFolder('train', transform=train_transform)
val_dataset = datasets.ImageFolder('val', transform=val_transform)

Pattern 6: MNIST Patterns
import torch.nn.functional as F

# MNIST specific values
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((MNIST_MEAN,), (MNIST_STD,))
])

# MNIST is 28x28 grayscale (1 channel)
# Input shape: [batch, 1, 28, 28]
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)   # 28→28
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)  # 14→14
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 28→14
        x = self.pool(F.relu(self.conv2(x)))  # 14→7
        x = x.view(-1, 64 * 7 * 7)            # Flatten to [batch, 3136]
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

Pattern 7: Feature Map Visualization
import matplotlib.pyplot as plt

def visualize_feature_maps(model, image, layer_name):
    """Visualize what a CNN layer sees (first 32 channels)."""
    activation = {}

    def hook(module, inputs, output):
        activation['output'] = output.detach()

    # Register hook
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(hook)

    # Forward pass
    model.eval()
    with torch.no_grad():
        _ = model(image.unsqueeze(0))
    handle.remove()

    # Plot feature maps; squeeze only the batch dimension → [channels, H, W]
    feat_maps = activation['output'].squeeze(0).cpu()
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        if i < feat_maps.shape[0]:
            ax.imshow(feat_maps[i], cmap='viridis')
        ax.axis('off')
    plt.show()

Pattern 8: Calculate FC Input Size
def calc_fc_input_size(model, input_size=(3, 224, 224)):
    """Calculate the flattened size after conv layers."""
    with torch.no_grad():
        x = torch.randn(1, *input_size)
        x = model.features(x)  # Pass through conv layers (assumes a `features` submodule)
    return x.view(1, -1).size(1)

# Usage
fc_input = calc_fc_input_size(model)
# Keep the Flatten step so the Linear layer receives a 2D tensor
model.classifier = nn.Sequential(nn.Flatten(), nn.Linear(fc_input, num_classes))

Reference Navigation
For detailed content, see:
- Convolution Basics: reference/convolution_basics.md - Filters, Pooling, Feature maps
- CNN Architectures: reference/cnn_architectures.md - LeNet, VGG, ResNet patterns
- Transfer Learning: reference/transfer_learning.md - Fine-tuning, Feature extraction
- Image Preprocessing: reference/image_preprocessing.md - Transforms, Normalization
- MNIST Patterns: reference/mnist_patterns.md - Dataset loading, specific values
- Multi-Modal & Captioning: reference/multimodal_captioning.md - Multi-modal networks, Image captioning, CNN+LSTM
Common Mistakes to Avoid
1. Augmenting Validation Data
# WRONG: Augmentation on validation
val_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # NO! Don't augment val/test
    transforms.ToTensor(),
])

# CORRECT: Only resize and normalize
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

2. Wrong ImageNet Normalization
# WRONG: Arbitrary values that do not match the pretraining statistics
transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# CORRECT: ImageNet statistics (for pretrained models)
transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)

3. Forgetting to Unfreeze Classifier
# WRONG: Replacing the layer first, then freezing everything
model.fc = nn.Linear(512, 10)
for param in model.parameters():
    param.requires_grad = False  # The new layer is frozen too!

# CORRECT: Freeze first, then replace (new layers are trainable by default)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, 10)
# Or unfreeze the new layer explicitly to be safe
for param in model.fc.parameters():
    param.requires_grad = True

4. Wrong Flatten Size
# WRONG: Hardcoded wrong size
self.fc = nn.Linear(512, 10) # Error if feature maps are different!
# CORRECT: Calculate from conv output
# After 3 pools on 224x224: 224 / 2^3 = 28
# If last conv has 128 channels: 128 * 28 * 28 = 100352
self.fc = nn.Linear(128 * 28 * 28, 10)

5. BatchNorm with batch_size=1
# WRONG: Train mode with batch_size=1
model.train()
output = model(single_image)  # Noisy batch statistics; errors if only one value per channel

# CORRECT: Use eval mode for single images
model.eval()
with torch.no_grad():
    output = model(single_image)

Teaching Mode
When explaining CNN concepts:
Convolution Intuition
Convolution = Sliding window that detects patterns
Image: Filter (3x3): Output:
┌─────────────┐ ┌─────────┐ Multiply
│ 1 2 3 4 5 │ │ 1 0 -1 │ element-wise,
│ 6 7 8 9 10 │ * │ 1 0 -1 │ → sum up
│ ... │ │ 1 0 -1 │
└─────────────┘ └─────────┘
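The multiply-and-sum step in the diagram can be sketched in plain Python. This is an illustrative sketch, not how PyTorch implements `nn.Conv2d`: `convolve2d_valid` is a hypothetical helper, and the 4x4 images are made-up data. Like `nn.Conv2d`, it slides the kernel without flipping it.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), multiply element-wise, and sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# The 3x3 filter from the diagram
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

# Hypothetical image: dark (0) on the left, bright (9) on the right
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]

# A flat image with no edges, for comparison
flat = [[5] * 4 for _ in range(4)]

print(convolve2d_valid(image, vertical_edge))  # [[-27, -27], [-27, -27]]
print(convolve2d_valid(flat, vertical_edge))   # [[0, 0], [0, 0]]
```

The response is large in magnitude wherever the window straddles the dark-to-bright boundary and zero on the flat image.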
This filter detects vertical edges!

CNN Layer Hierarchy
Layer 1: Edges, corners ─────▶ Low-level features
Layer 2: Textures, patterns ─────▶ Mid-level features
Layer 3: Parts (eyes, wheels) ─────▶ High-level features
Layer 4: Objects ─────▶ Semantic features
Deeper = More abstract

Transfer Learning Analogy
Training from scratch:
"Learn to see from birth"
- Needs millions of images
- Takes days/weeks
Transfer learning:
"I already know what edges, textures, objects look like"
"Just teach me YOUR specific categories"
- Needs hundreds/thousands of images
- Takes hours
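
One contributing reason transfer learning gets by with less data: when the backbone is frozen (as in Pattern 3), only the new final layer is trained. A rough sketch of the arithmetic follows, assuming torchvision's ResNet-18 (11,689,512 parameters with its original 1000-class head and a 512-feature input to `fc`) and a hypothetical 10-class dataset. `head_params` is an illustrative helper, not a library function.

```python
# Parameter counts for torchvision's ResNet-18 with its original 1000-class head
TOTAL_PARAMS = 11_689_512
FC_IN_FEATURES = 512
ORIGINAL_HEAD = FC_IN_FEATURES * 1000 + 1000  # weights + biases
BACKBONE_PARAMS = TOTAL_PARAMS - ORIGINAL_HEAD


def head_params(num_classes, in_features=FC_IN_FEATURES):
    """Parameters in a new nn.Linear(in_features, num_classes) head."""
    return in_features * num_classes + num_classes


num_classes = 10  # hypothetical dataset
trainable = head_params(num_classes)

print(f"frozen backbone: {BACKBONE_PARAMS:,}")
print(f"trainable head:  {trainable:,}")
print(f"trainable share: {trainable / (BACKBONE_PARAMS + trainable):.4%}")
```

On a real model the same numbers can be read with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.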