Lecture 2b: Computer Vision Architectures¶
ImageNet¶
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a large-scale visual recognition challenge that started in 2010.
- 1.2M images
- 1000 categories
- Task: classification (but also localization and detection)
Convnet Architectures for image recognition¶
AlexNet¶
AlexNet's performance on the ImageNet challenge in 2012, using a deep neural network, launched the CNN revolution. Progress since then can be tracked through the ImageNet winners.
Similar to LeNet, but introduced many innovations:
- ReLU
- Dropout
- Data augmentation
ZFNet¶
Very similar to AlexNet. Introduced "deconvolutional visualizations".
- Each filter can be thought of as detecting an image patch. The paper introduced visualizations that show how the detected patterns become more specific in deeper layers.
VGG¶
Simple deep architecture.
Innovation:
- Stacking smaller convolutional filters to increase receptive field with fewer parameters
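A quick parameter count makes the saving concrete. This is a minimal sketch, assuming stride-1 convolutions without biases and the same number of input and output channels in every layer; the helper names and the channel count of 64 are chosen for illustration only.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in one square convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of num_layers stacked stride-1 convs with equal kernel size."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # assumed channel count, for illustration only

# Three stacked 3x3 layers see the same 7x7 window as a single 7x7 layer.
three_3x3 = 3 * conv_params(3, channels, channels)
one_7x7 = conv_params(7, channels, channels)

print(stacked_receptive_field(3, 3))  # 7
print(three_3x3, one_7x7)             # 110592 200704
```

Under these assumptions the stack needs 27 C² weights against 49 C² for the single large filter, and it also applies a non-linearity after each of the three layers.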
GoogLeNet/Inception¶
As deep as VGG but with roughly 3% of the parameters.
- No fully connected layers
- A stack of inception modules, which differ from the standard (conv-relu-conv-relu-pool) module
- Classification outputs at several intermediate points of the network
Inception module¶
Inputs go through different filters or pooling simultaneously and the outputs get concatenated.
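The concatenation step can be followed by tracking tensor shapes alone. The sketch below is shape bookkeeping only, not an implementation of the convolutions; the `Shape` type, the helper names, the branch widths and the 1x1 reduction layers are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Shape(NamedTuple):
    channels: int
    height: int
    width: int


def conv_same(x: Shape, out_channels: int) -> Shape:
    """Stride-1, 'same'-padded convolution: keeps height/width, sets channels."""
    return Shape(out_channels, x.height, x.width)


def pool_same(x: Shape) -> Shape:
    """Stride-1, 'same'-padded pooling: the shape is unchanged."""
    return x


def concat_channels(branches: List[Shape]) -> Shape:
    """Concatenate along the channel axis; spatial sizes must agree."""
    first = branches[0]
    if any((b.height, b.width) != (first.height, first.width) for b in branches):
        raise ValueError("branches must share height and width")
    return Shape(sum(b.channels for b in branches), first.height, first.width)


def inception_module(x: Shape) -> Shape:
    """Run four parallel branches on the same input and concatenate them."""
    branches = [
        conv_same(x, 64),                  # 1x1 conv
        conv_same(conv_same(x, 96), 128),  # 1x1 reduce, then 3x3 conv
        conv_same(conv_same(x, 16), 32),   # 1x1 reduce, then 5x5 conv
        conv_same(pool_same(x), 32),       # 3x3 max pool, then 1x1 conv
    ]
    return concat_channels(branches)


print(inception_module(Shape(192, 28, 28)))
# Shape(channels=256, height=28, width=28)
```

Every branch keeps the spatial size, so only the channel counts add up: 64 + 128 + 32 + 32 = 256.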
ResNet¶
- Very deep architecture with 152 layers
- Most commonly used now
- Lowered the error rate below human performance
- The authors observed that a deeper model should be able to perform at least as well as a shallower network of the same architecture, but in practice it does not, due to vanishing gradients
Innovation:
- If a layer makes the gradient vanish, then skip around the layer by adding shortcuts
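The shortcut amounts to adding the block's input back onto its output, y = relu(F(x) + x). The sketch below is a toy version on plain Python lists: small linear layers stand in for the convolutions, and all names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def linear(v: Vector, weights: Matrix) -> Vector:
    """Matrix-vector product; stands in for a convolution in this sketch."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """y = relu(F(x) + x), where F is two linear layers with a ReLU in between."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
```

Because the identity path bypasses the weights, the block can fall back to copying its input, and the gradient always has a direct route around the layers.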
ResNet variants¶
- DenseNet: multiple skip connections
- ResNeXt: combines Inception and ResNet
- SENet: adds a module of global pooling and fully connected layers to adaptively reweight the feature maps
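The SENet reweighting can be sketched in three steps: pool each channel to a single number, pass those numbers through small fully connected layers, and scale each channel by the resulting gate. This is a toy version on nested lists; the function names, the ReLU/sigmoid choice and the weight shapes are assumptions for illustration.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # one channel: rows of values


def squeeze(feature_maps: List[FeatureMap]) -> List[float]:
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0])) for fm in feature_maps]


def dense(v: List[float], weights: List[List[float]]) -> List[float]:
    """Fully connected layer without bias."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def excite(z: List[float], w_reduce, w_expand) -> List[float]:
    """FC -> ReLU -> FC -> sigmoid, giving one gate in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in dense(z, w_reduce)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w_expand)]


def se_block(feature_maps: List[FeatureMap], w_reduce, w_expand) -> List[FeatureMap]:
    """Scale every channel by its learned gate."""
    gates = excite(squeeze(feature_maps), w_reduce, w_expand)
    return [[[g * v for v in row] for row in fm] for fm, g in zip(feature_maps, gates)]


maps = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]  # 2 channels, 2x2 each
w_reduce = [[0.5, 0.5]]    # 2 channels -> 1 hidden unit (hypothetical weights)
w_expand = [[0.0], [0.0]]  # 1 hidden unit -> 2 gates; zeros give a gate of 0.5

print(squeeze(maps))                       # [2.5, 2.0]
print(se_block(maps, w_reduce, w_expand))  # every value halved
```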
SqueezeNet¶
AlexNet-level accuracy with 50x fewer parameters.
Comparison of architectures¶
Applications¶
- Classification: output the class of the image
- Localization: show where the object is in the image
- Detection: output every object's class and location
- Segmentation: label every pixel belonging to an object
- Instance segmentation: differentiate between objects of the same class
Classification¶
Localization¶
Predict bounding box coordinates as well as the class using the same network.
- Does not scale to detection of multiple objects
Detection¶
- slide a classifier over the image at multiple scales
- very computationally expensive
Non-maximum suppression (NMS): when bounding boxes overlap, keep the one with the highest score.
- OverFeat (2013) used the approach of turning FC layers into convolutional layers
- YOLO/SSD: You Only Look Once, Single Shot MultiBox Detector
- Multiple versions, evaluated against Microsoft COCO (Common Objects in Context), a large-scale object detection, segmentation and captioning dataset
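The non-maximum suppression rule from this section can be sketched as a greedy loop. This is a minimal version, assuming boxes in `(x1, y1, x2, y2)` format and measuring overlap by intersection over union (IoU); the threshold of 0.5 is an arbitrary choice for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5
) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68 and is dropped; the third box does not overlap anything and is kept.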
Region proposal methods¶
Look only at the interesting regions of an image and classify those.
- R-CNN (Region-CNN)
- Faster R-CNN: adds a region proposal network (RPN)
- Mask R-CNN
- U-Net / fully convolutional nets
Segmentation¶
- Label every pixel belonging to an object
- Instance segmentation: differentiate between objects of the same class
Mesh R-CNN - predicts a 3D mesh from a 2D image¶
Facial Landmark Detection¶
Pose estimation¶
Adversarial attacks on CNNs¶
- White box - with access to model parameters
- Black box - without access to model parameters
Examples:
- Panda -> gibbon example, created by introducing well-crafted "noise"
- Pictures of real-world physical objects that mess up models
Attack:
- Some form of modifying the input in the direction of the loss gradient
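One common instance of this idea is the fast gradient sign method (FGSM): take a single step of size epsilon in the direction of the sign of the loss gradient with respect to the input. The sketch below is a white-box attack on a toy logistic-regression model rather than a CNN; the weights, the input and epsilon are all hypothetical values picked for illustration.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: List[float], w: List[float], b: float) -> float:
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_gradient_wrt_input(x: List[float], y: int, w: List[float], b: float) -> List[float]:
    """Gradient of binary cross-entropy with respect to x: (p - y) * w."""
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(x: List[float], y: int, w: List[float], b: float, epsilon: float) -> List[float]:
    """Move every input feature by epsilon in the direction that increases the loss."""
    grad = loss_gradient_wrt_input(x, y, w, b)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


w, b = [2.0, -1.0], 0.0  # hypothetical, already-trained model
x, y = [1.0, 0.5], 1     # input that the model classifies correctly

x_adv = fgsm(x, y, w, b, epsilon=0.6)
print(predict(x, w, b))      # above 0.5: class 1
print(predict(x_adv, w, b))  # below 0.5: the prediction flips
```

The attack needs the model parameters to compute the gradient, which is what makes it white box.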
Defence:
- Training with adversarial examples - not very effective in practice
- Smooth class decision boundaries (defensive distillation): train a second network to reproduce the output probabilities of the first network instead of learning from the raw labels
Style Transfer¶
GANs¶
Produce lifelike fake images.