
Lecture 1: Deep Learning Fundamentals

Mathematical model of a neuron

  • Axons: Input(s) x_i (axon_i from a neighbouring neuron)
  • Synapse: Weights w_i
  • Cell body:
    • sums the inputs w_i x_i and adds a bias b
    • the output is passed through an activation function, which determines whether the neuron "fires" or not (see the sketch below)
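A minimal sketch of this model in Python/NumPy, assuming a sigmoid activation; the inputs, weights, and bias below are made-up illustrative values, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    # squashes the pre-activation into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # cell body: weighted sum of inputs plus bias, then activation
    z = np.dot(w, x) + b
    return sigmoid(z)

# illustrative values (assumed)
x = np.array([0.5, -1.0, 2.0])   # inputs x_i from neighbouring neurons
w = np.array([0.8, 0.2, -0.4])   # synaptic weights w_i
b = 0.1                          # bias
print(neuron(x, w, b))
```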

Activation Functions:

  • Sigmoid
  • Hyperbolic Tangent
  • ReLU
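Quick NumPy sketches of these three activations (my own illustrative implementations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)

def tanh(z):
    return np.tanh(z)                 # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negative inputs, identity otherwise
```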

Universality of neural networks:

Hornik's Theorem - any continuous function can be approximated arbitrarily well by a 2-layer neural network with enough hidden units
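In symbols (my notation, not the lecture's), the 2-layer network in the theorem is a single hidden layer of N units with activation σ feeding a linear output:

```latex
f(x) \approx \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right)
```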

Types of learning

  • Supervised
    • Learn a mapping X -> Y from labelled pairs (x, y)
  • Unsupervised
    • Learn the structure of X (no labels)
  • Reinforcement
    • Learn to interact with an environment

Unsupervised Learning examples

  • Predict next character (charRNN - Andrej Karpathy)
    • Radford et al - 2017
  • Predict "nearby" words (word2vec)
    • Mikolov et al - 2013
  • Predict next pixel (pixelCNN)
    • van den Oord et al - 2016
  • Variational Autoencoders (VAE) - Encode image down to latent vector/variables and then decode back - used to learn complex images from compressed representations
    • Kingma and Welling - 2014
  • Generative Adversarial Networks - sample a latent vector z and use it to generate an x that is indistinguishable from real data
    • Goodfellow et al - 2014

Linear Regression - line fitting

Can be thought of as a prediction problem - given a number of inputs X and outputs Y, what is the output corresponding to a new x we have not seen before?

Achieved by fitting a line, i.e. finding parameters w and b of the line y = wx + b that minimize the squared-error loss function: min_{w,b} sum_i (w x_i + b - y_i)^2
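A minimal sketch in Python/NumPy, assuming 1-D inputs and using the closed-form least-squares solution (gradient descent, covered below, would reach the same answer); the data here is made up for illustration:

```python
import numpy as np

# made-up data: y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)

# design matrix [x, 1] so the second parameter is the bias b
A = np.stack([x, np.ones_like(x)], axis=1)
w, b = np.linalg.lstsq(A, y, rcond=None)[0]   # minimizes sum_i (w*x_i + b - y_i)^2

x_new = 0.3
print(w * x_new + b)   # prediction for an unseen x
```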

Loss functions examples

  • Mean Squared Error (MSE)
  • Cross Entropy Loss
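Illustrative NumPy versions of these two losses (my own sketch; the cross-entropy form here assumes a binary-classification setting with predicted probabilities):

```python
import numpy as np

def mse(y_pred, y_true):
    # mean of squared differences
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(p_pred, y_true, eps=1e-12):
    # y_true in {0, 1}, p_pred is the predicted probability of class 1
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```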

Optimizing loss functions - gradient descent

All "learning" can be thought of as the problem of optimization loss functions

  • choose weights randomly
  • calculate the gradient (derivative) of the loss function with respect to the current weights, using the observed data
  • update weights by subtracting (learning rate * gradient) from current weight
  • repeat until convergence (see the sketch below)
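A minimal sketch of these steps in Python/NumPy for the linear-regression loss above; the learning rate and step count are arbitrary illustrative choices:

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, steps=500):
    rng = np.random.default_rng(0)
    w, b = rng.normal(), rng.normal()     # choose weights randomly
    for _ in range(steps):
        err = w * x + b - y               # residuals (w*x_i + b - y_i)
        grad_w = 2 * np.mean(err * x)     # d/dw of the mean squared error
        grad_b = 2 * np.mean(err)         # d/db of the mean squared error
        w -= lr * grad_w                  # subtract learning rate * gradient
        b -= lr * grad_b
    return w, b
```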

Stochastic/Batch Gradient Descent

Essentially, each gradient step is computed on a randomly chosen subset (mini-batch) of the data rather than on the full dataset.

  • noisy gradient estimates
  • more efficient per update (see the sketch below)
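A sketch of the mini-batch variant of the update loop above (the batch size of 8 is an arbitrary choice):

```python
import numpy as np

def sgd(x, y, lr=0.1, steps=500, batch_size=8):
    rng = np.random.default_rng(0)
    w, b = rng.normal(), rng.normal()
    for _ in range(steps):
        idx = rng.choice(len(x), size=batch_size, replace=False)  # random subset of the data
        xb, yb = x[idx], y[idx]
        err = w * xb + b - yb
        w -= lr * 2 * np.mean(err * xb)   # noisy gradient estimate from the mini-batch
        b -= lr * 2 * np.mean(err)
    return w, b
```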

Encoding data in neural networks

  • Computer Vision: Convolutional Neural Networks - leverage "spatial translation invariance" (an object looks the same wherever it appears in the image)
  • Natural Language Processing (sequence processing more generally): Recurrent Neural Networks - leverage "temporal invariance" (the rules of language do not change depending on where we are in a sentence); minimal sketches of both below
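Minimal sketches of both layer types, assuming PyTorch is available (the layer sizes are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

# Convolutional layer: the same small filter is slid over every spatial
# position, which is what encodes spatial translation invariance
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image = torch.randn(1, 3, 32, 32)        # (batch, channels, height, width)
feature_map = conv(image)                # shape (1, 16, 32, 32)

# Recurrent layer: the same cell (same weights) is applied at every time
# step, which is what encodes temporal invariance
rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
sequence = torch.randn(1, 5, 10)         # (batch, time steps, features)
outputs, hidden = rnn(sequence)          # outputs: (1, 5, 20)
```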