Lecture 2b: Computer Vision Architectures

ImageNet

ImageNet is the dataset behind the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which started in 2010.

  • 1.2M images
  • 1000 categories
  • Task: classification (but also localization and detection)

Convnet Architectures for image recognition

AlexNet

AlexNet's performance on the ImageNet challenge in 2012, using a deep neural network, launched the CNN revolution. Progress since then can be tracked through the ImageNet winners.

Similar to LeNet, but it introduced many innovations:

  • ReLU
  • Dropout
  • Data augmentation

ZFNet

Very similar to AlexNet. Introduced "deconvolutional visualizations".

  • Each filter can be thought of as detecting an image patch. The paper introduced visualisations that show how detection becomes more specific in deeper layers

VGG

Simple deep architecture.

Innovation:

  • Stacking smaller convolutional filters to increase receptive field with fewer parameters
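The parameter saving from stacking small filters can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels (64 here, chosen only for illustration) and ignores biases; the helper names are hypothetical.

```python
def conv_params(kernel_size, channels, layers=1):
    """Weights in `layers` stacked square convolutions (biases ignored)."""
    return layers * kernel_size * kernel_size * channels * channels


def receptive_field(kernel_size, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel_size - 1)


C = 64  # assumed channel count, identical for input and output

# Three stacked 3x3 layers cover the same window as one 7x7 layer.
stacked_rf = receptive_field(3, layers=3)
single_rf = receptive_field(7, layers=1)

# Weight counts for the two options.
stacked = conv_params(3, C, layers=3)
single = conv_params(7, C)

print("receptive fields:", stacked_rf, single_rf)
print("weights:", stacked, "vs", single)
```

Under these assumptions the stack needs 27 * C * C weights against 49 * C * C for the single large filter, and it also applies a non-linearity between each pair of layers.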

GoogLeNet/Inception

As deep as VGG but with only about 3% of the parameters

  • No large fully connected layers: global average pooling precedes the final classifier
  • Stack of inception modules, which differ from the standard (conv-relu-conv-relu-pool) module
  • Auxiliary classification outputs at intermediate points of the network

Inception module

Inputs go through different filters or pooling simultaneously and the outputs get concatenated.
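A minimal sketch of the "parallel branches, then concatenate" idea, in plain Python on nested lists laid out as [channel][row][column]. It is a simplification: two 1x1 convolution branches stand in for the differently sized filters of a real inception module, and all function names and weights are made up for illustration.

```python
def conv1x1(x, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(w * x[c][i][j] for c, w in enumerate(row)) for j in range(width)]
         for i in range(height)]
        for row in weights
    ]


def max_pool3x3(x):
    """3x3 max pooling, stride 1; windows are clipped at the border so H and W are kept."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[max(ch[a][b]
              for a in range(max(i - 1, 0), min(i + 2, height))
              for b in range(max(j - 1, 0), min(j + 2, width)))
          for j in range(width)]
         for i in range(height)]
        for ch in x
    ]


def inception_module(x, w_a, w_b):
    """Run parallel branches on the same input and concatenate along channels."""
    branch_a = conv1x1(x, w_a)
    branch_b = conv1x1(x, w_b)
    branch_pool = max_pool3x3(x)
    return branch_a + branch_b + branch_pool  # list concat == channel concat


# Toy input: 2 channels, 3x3 spatial size.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]
w_a = [[1.0, 0.0]]               # branch with 1 output channel
w_b = [[0.5, 0.5], [1.0, -1.0]]  # branch with 2 output channels

out = inception_module(x, w_a, w_b)
print(len(out), len(out[0]), len(out[0][0]))
```

Every branch keeps the spatial size, which is what makes concatenation along the channel axis possible: here 1 + 2 + 2 channels come out.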

ResNet

  • Very deep architecture with 152 layers
  • Most commonly used now
  • Lowered error rate below human performance
  • The authors observed that deeper models should be able to perform at least as well as shallower networks of the same architecture, but in practice they don't, due to vanishing gradients

Innovation:

  • If a layer makes the gradient vanish, then skip around the layer by adding shortcuts
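A toy illustration of why a shortcut helps the gradient, assuming each layer is a scalar linear map f(x) = w * x. This is a deliberately simplified stand-in for a real residual block; the function name, layer count and slope value are assumptions for the sketch.

```python
def gradient_through_stack(layer_slopes, residual):
    """d(output)/d(input) for a chain of scalar linear layers f(x) = w * x.

    Plain layer:    y = w * x       -> dy/dx = w
    Residual layer: y = x + w * x   -> dy/dx = 1 + w
    """
    grad = 1.0
    for w in layer_slopes:
        grad *= (1.0 + w) if residual else w
    return grad


slopes = [0.01] * 50  # assumed: 50 layers, each with a tiny slope

plain = gradient_through_stack(slopes, residual=False)
skip = gradient_through_stack(slopes, residual=True)

print("plain stack:   ", plain)
print("with shortcuts:", skip)
```

In the plain stack the gradient is a product of 50 small numbers and is effectively zero. With shortcuts, each factor is 1 + w, so the identity path keeps the gradient on the order of 1.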

ResNet variants

  • DenseNet
    • Multiple skip connections
  • ResNeXt
    • Combines Inception and ResNet
  • SENet
    • Adds a module of global pooling and fully connected layers to adaptively reweight the output feature maps
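The SENet reweighting can be sketched in a few lines of plain Python. The sketch assumes the usual squeeze-and-excitation layout (global average pooling, two small fully connected layers, a sigmoid gate per channel); the weights and the function name are made up, and feature maps are nested lists laid out as [channel][row][column].

```python
import math


def squeeze_excite(x, w1, w2):
    """Reweight the channels of x (a list of HxW maps) with per-channel gates."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excite: FC -> ReLU -> FC -> sigmoid gives one gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(x, gates)]
    return scaled, gates


# Toy input: 2 channels of size 2x2.
x = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w1 = [[1.0, 1.0]]     # 2 channels -> 1 hidden unit (illustrative values)
w2 = [[1.0], [-1.0]]  # 1 hidden unit -> 2 channel gates

scaled, gates = squeeze_excite(x, w1, w2)
print(gates)
```

With these toy weights the first channel is kept almost unchanged while the second is strongly damped, which is the "adaptive reweighting" the notes refer to.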

SqueezeNet

AlexNet-level accuracy with 50x fewer parameters

Comparison of architectures

DAWNBench

Applications

  • classification
    • output the class of the image
  • localization
    • show where the object is in the image
  • detection
    • output every object's class and location
  • segmentation
    • label every pixel belonging to an object
  • instance segmentation
    • differentiate objects of the same class

Classification

Localization

Predict bounding box coordinates as well as the class using the same network.

  • Does not scale to detecting multiple objects

Detection

  • slide a classifier over the image at multiple scales
    • very computationally expensive
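To get a feel for the cost, the sketch below enumerates the windows a sliding-window detector would have to classify. The image size, window sizes and 50% stride are assumptions chosen for illustration, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(image_w, image_h, window_sizes, stride_fraction=0.5):
    """Yield (x, y, size) for square windows at several scales."""
    for size in window_sizes:
        stride = max(1, int(size * stride_fraction))
        for y in range(0, image_h - size + 1, stride):
            for x in range(0, image_w - size + 1, stride):
                yield x, y, size


# A small 256x256 image scanned at two scales.
windows = list(sliding_windows(256, 256, window_sizes=[64, 128]))
print(len(windows), "classifier evaluations for one small image")
```

Each window needs its own forward pass through the classifier, so the count grows quickly with image size, number of scales and finer strides.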

Non-Maximum Suppression (NMS): when bounding boxes overlap, keep the one with the highest score.
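A minimal greedy sketch of this suppression step. It assumes boxes are given as (x1, y1, x2, y2) and that "overlap" means intersection over union (IoU) above a threshold, with 0.5 as an arbitrary choice; the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]

kept = non_max_suppression(boxes, scores)
print(kept)
```

The first two boxes cover nearly the same region, so only the higher-scoring one survives; the third box does not overlap anything and is kept.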

  • OverFeat (2013) used the approach of turning FC layers into convolutional layers
  • YOLO/SSD
    • You Only Look Once (YOLO) and Single Shot MultiBox Detector (SSD)
    • Multiple versions, evaluated against Microsoft COCO (Common Objects in Context), a large-scale object detection, segmentation and captioning dataset
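The idea of turning FC layers into convolutional layers can be illustrated on a toy case. The sketch assumes a single-channel image and a fully connected layer with one output trained on 2x2 inputs; the weights are made up. Reshaping the FC weights into a 2x2 kernel and sliding it over a larger image gives the same score the FC layer would produce at every position, in one pass.

```python
def fc_layer(patch, weights):
    """Fully connected layer with one output, applied to a flattened k x k patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat))


def conv2d_valid(image, kernel):
    """Single-channel 'valid' cross-correlation with stride 1."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [sum(kernel[a][b] * image[i + a][j + b]
             for a in range(k) for b in range(k))
         for j in range(out_w)]
        for i in range(out_h)
    ]


fc_weights = [1.0, -1.0, 0.5, 2.0]           # toy weights for 2x2 inputs
kernel = [fc_weights[0:2], fc_weights[2:4]]  # same weights reshaped to 2x2

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
score_map = conv2d_valid(image, kernel)

# The FC layer on the top-left patch matches the top-left entry of the map.
top_left_patch = [row[0:2] for row in image[0:2]]
print(fc_layer(top_left_patch, fc_weights), score_map[0][0])
```

The convolutional form shares computation between overlapping windows, which is why it is cheaper than running the classifier separately on each crop.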

Region proposal methods

Look at and classify only the interesting portions of an image

  • R-CNN (Region-CNN)
  • Faster R-CNN
  • RPN (Region Proposal Network)
  • Mask R-CNN
  • U-Net / fully convolutional nets

Segmentation

  • label every pixel belonging to an object
  • instance segmentation
    • differentiate objects of the same class

Mesh R-CNN: predicts a 3D mesh from a 2D image

Facial Landmark Detection

Pose estimation

Adversarial attacks on CNNs

  • White box - with access to model parameters
  • Black box - without access to model parameters

Examples:

Attack:

  • Some form of modifying the input in the direction of the loss gradient
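One concrete instance of this idea is the fast gradient sign method (FGSM), sketched below on a toy logistic-regression "model" rather than a CNN. The weights, input and epsilon are made-up values. It is a white-box attack, since computing the gradient requires the model parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(x, y, w, b, epsilon):
    """Move each input feature by epsilon in the sign of the loss gradient."""
    p = predict(x, w, b)
    # For binary cross-entropy, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]


w, b = [2.0, -3.0, 1.0], 0.0  # assumed, already-trained weights
x, y = [0.5, -0.2, 0.3], 1    # an input correctly classified as class 1

x_adv = fgsm(x, y, w, b, epsilon=0.4)
print("clean:      ", predict(x, w, b))
print("adversarial:", predict(x_adv, w, b))
```

Each feature moves by at most epsilon, yet the predicted probability for the true class drops below 0.5, so the toy model's decision flips.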

Defence:

  • Training with adversarial examples - not very effective in practice
  • Smooth class decision boundaries: defensive distillation trains a second network to reproduce the softened output probabilities of the first network instead of the hard labels
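Defensive distillation relies on softened class probabilities, produced by a softmax with a temperature, as training targets for the second network. The sketch below shows only that softening step; the logits and the temperature of 20 are made-up values, and the training of the second network is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values give softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [8.0, 2.0, 0.0]  # made-up logits from the first network

hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=20.0)

print("T=1: ", [round(p, 3) for p in hard])
print("T=20:", [round(p, 3) for p in soft])
```

At temperature 1 nearly all the probability mass sits on one class. At the higher temperature the targets still rank the classes the same way but carry information about how similar they are.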

Style Transfer

GANs

Produce lifelike fake images