#+TITLE: Deep Learning Coursera
#+AUTHOR: Yann Esposito
#+STARTUP: latexpreview
#+TODO: TODO IN-PROGRESS WAIT | DONE CANCELED
#+COLUMNS: %TODO %3PRIORITY %40ITEM(Task) %17EFFORT(Estimated Effort){:} %CLOCKSUM %8TAGS(TAG)

* Plan

5 courses:

** Neural Network and Deep Learning
*** Week 1: Introduction
*** Week 2: Basics of Neural Network programming
*** Week 3: One hidden layer Neural Networks
*** Week 4: Deep Neural Networks
** Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
** Structuring your Machine Learning project
** Convolutional Neural Networks
** Natural Language Processing: Building sequence models
* DONE Neural Network and Deep Learning
CLOSED: [2017-08-22 Tue 13:43]
** Introduction

*** What is a neural network?

*** Supervised Learning with Neural Networks

- Lucrative application: ads, showing the ad you're most likely to click on
- Photo tagging
- Speech recognition
- Machine translation
- Autonomous driving

**** Convolutional NNs are good for images

**** Structured data (databases) vs unstructured data

- Structured data: tables
- Unstructured data: audio, images, text...

Computers have become much better at interpreting unstructured data.

*** Why is Deep Learning taking off?

[[///Users/yaesposi/Library/Mobile%20Documents/com~apple~CloudDocs/deft/img/Scale%20drives%20deep%20learning%20progress.png]]

- Data (lots of data)
- Computation (faster learning loop)
- Algorithms (e.g., use ReLU instead of sigmoid)
** Geoffrey Hinton interview
** Binary Classification

\[ (x,y), \quad x \in \mathbb{R}^{n_x}, \quad y \in \{0,1\} \]

$m$ training examples: $$ \{(x^{(1)},y^{(1)}), \dots, (x^{(m)},y^{(m)})\} $$

$$ m = m_{train}, \quad m_{test} = \#\text{test examples} $$

$X = [ x^{(1)} \dots x^{(m)} ]$ is an $n_x \times m$ matrix, so =X.shape= is $(n_x, m)$.

$Y = [ y^{(1)} \dots y^{(m)} ]$, so =Y.shape= is $(1, m)$.
** Logistic Regression

Given $x \in \mathbb{R}^{n_x}$ you want $\hat{y} = P(y=1 \mid x)$.

Parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$

Output: $\hat{y} = \sigma(w^Tx + b) = \sigma(z)$

$$\sigma(z)= \frac{1}{1 + e^{-z}}$$

If $z \rightarrow \infty$ then $\sigma(z) \approx 1$.
If $z \rightarrow -\infty$ then $\sigma(z) \approx 0$.
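A minimal NumPy sketch of this forward computation (variable names are illustrative, not from the course):

#+BEGIN_SRC python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, X):
    """w: (n_x, 1), b: scalar, X: (n_x, m) -> y_hat: (1, m) of probabilities."""
    z = np.dot(w.T, X) + b      # broadcasting adds b to every column
    return sigmoid(z)
#+END_SRC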
Alternative notation, not used in this course:

$x_0=1$, $x\in\mathbb{R}^{n_x+1}$
$\hat{y} = \sigma(\Theta^Tx)$
...

** Logistic Regression Cost Function

We want a convex loss function:

$L(\hat{y},y) = - (y\log(\hat{y}) + (1-y)\log(1-\hat{y}))$

If $y = 1$: $L(\hat{y},y) = -\log\hat{y}$ ← we want $\log\hat{y}$ large, so we want $\hat{y}$ large.
If $y = 0$: $L(\hat{y},y) = -\log(1-\hat{y})$ ← we want $\log(1-\hat{y})$ large, so we want $\hat{y}$ small.

Cost function: $$ J(w,b) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) $$
** Gradient Descent

Minimize $J(w,b)$:

1. Initialize $w, b$ (generally to zeros)
2. Take a step in the steepest-descent direction
3. Repeat 2 until reaching the (global) optimum

Repeat {
$w := w - \alpha\frac{dJ(w)}{dw} = w - \alpha\,\text{dw}$
}
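A vectorized sketch of one batch gradient-descent step for logistic regression (shapes as in the Binary Classification section; names are illustrative):

#+BEGIN_SRC python
import numpy as np

def gradient_step(w, b, X, Y, alpha):
    """One step of batch gradient descent for logistic regression.
    X: (n_x, m), Y: (1, m), w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))  # y_hat, shape (1, m)
    dZ = A - Y                                       # dL/dz for every example
    dw = np.dot(X, dZ.T) / m                         # shape (n_x, 1)
    db = np.sum(dZ) / m
    return w - alpha * dw, b - alpha * db
#+END_SRC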
** Derivatives
** More Derivative Examples
** Computation Graph
** Computing Derivatives
** Computing Derivatives for multiple examples
** Vectorization

Getting rid of explicit for loops in your code.
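A small illustration of the speed-up (timing code of this kind is demoed in the course; exact numbers depend on your machine):

#+BEGIN_SRC python
import numpy as np, time

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

tic = time.time()
c = np.dot(a, b)                 # vectorized
print("vectorized:", 1000 * (time.time() - tic), "ms")

tic = time.time()
c = 0.0
for i in range(len(a)):          # explicit for loop
    c += a[i] * b[i]
print("for loop:  ", 1000 * (time.time() - tic), "ms")
#+END_SRC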
** Vectorizing Logistic Regression
** Vectorizing Logistic Regression's Gradient Computation
** Broadcasting in Python
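A quick reminder of how NumPy broadcasting behaves (a minimal sketch along the lines of the course's calories example):

#+BEGIN_SRC python
import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])   # 3x4 matrix
cal = A.sum(axis=0)                        # column sums, shape (4,)
percentage = 100 * A / cal.reshape(1, 4)   # (3,4) / (1,4): the row is broadcast
print(percentage)
#+END_SRC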
** Quick Tour of Jupyter / iPython notebooks
** Neural Network Basics

J = a*b + a*c - (b+c) = a(b + c) - (b + c) = (a - 1)(b + c)
* DONE Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
CLOSED: [2017-09-01 Fri 09:52]
** DONE Week 1: Setting up your Machine Learning application
CLOSED: [2017-08-22 Tue 13:43]
*** Recipe

If *high bias* (bad training set performance)?
Then try:
- Bigger network
- Training longer
- (NN architecture search)

Else if *high variance* (bad dev set performance)?
Then try:
- More data
- Regularization
- (NN architecture search)

In deep learning there is not much of a bias/variance tradeoff if you have a lot of
computing power (bigger network) and a lot of data.
*** Regularization
**** Regularization: reduces variance

- L2 regularization: add $\frac{\lambda}{2m} \|w\|_2^2$ to the cost
- L1 regularization: same, with $\|w\|_1$ instead of $\|w\|_2^2$

λ is the regularization parameter (named =lambd= in code, since =lambda= is reserved in Python).

Cost: $J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_i L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^L \|W^{[l]}\|_F^2$

$\|W^{[l]}\|_F^2$ is called the "Frobenius norm" (squared).

$dW^{[l]}$ = (term from backprop) $+ \frac{\lambda}{m} W^{[l]}$

The update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$ still works.

Because of that extra shrinking term, L2 regularization is sometimes called "weight decay".
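A minimal sketch of the regularized cost term and the weight-decay update (illustrative helper; assumes =params= and =grads= are dicts keyed "W1", "b1", ..., with =grads= holding the unregularized backprop gradients):

#+BEGIN_SRC python
import numpy as np

def l2_regularized_update(params, grads, lambd, m, alpha):
    """Add the L2 penalty to each dW and apply one gradient step in place."""
    L = len(params) // 2
    penalty = 0.0
    for l in range(1, L + 1):
        W = params["W" + str(l)]
        penalty += (lambd / (2 * m)) * np.sum(np.square(W))   # Frobenius-norm term
        grads["dW" + str(l)] += (lambd / m) * W                # weight-decay term
        params["W" + str(l)] -= alpha * grads["dW" + str(l)]
        params["b" + str(l)] -= alpha * grads["db" + str(l)]
    return penalty                                             # add this to the data cost
#+END_SRC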
**** Dropout Regularization

Randomly eliminates nodes in a layer, independently for each training example.

Implementation ("inverted dropout"), for layer 3:
- generate a random boolean mask:
  d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # new mask at each iteration
  a3 = np.multiply(a3, d3)    # zero out the dropped units
  a3 /= keep_prob             # scale up so the expected value of a3 stays the same; avoids test-time surprises

Making predictions at test time: no dropout.

**** Other regularization methods

- Data augmentation (flipping images, random crops, random distortions, etc.)
- Early stopping: stop at an earlier iteration
*** Setting up your optimization problem

**** Normalizing Inputs

- $\mu = \frac{1}{m}\sum_i x^{(i)}$
- $x := x - \mu$ (center the data)
- $\sigma^2 = \frac{1}{m}\sum_i (x^{(i)})^2$ (element-wise, after centering)
- $x := x / \sigma$ (scale to unit variance)
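A small sketch of this normalization (one assumption worth stressing: use the training-set μ and σ on the test set too):

#+BEGIN_SRC python
import numpy as np

def normalize_train_test(X_train, X_test):
    """X_*: (n_x, m). Compute mu/sigma on the training set and reuse them on the test set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + 1e-8   # avoid division by zero
    return (X_train - mu) / sigma, (X_test - mu) / sigma, mu, sigma
#+END_SRC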
**** Gradient Checking
***** Don't use grad check in training, only to debug
***** If the algorithm fails grad check, look at the components (is it db? dW? dW on a certain layer? etc.)
***** Remember regularization (include the regularization term in the gradients you check)
***** Doesn't work with dropout; turn dropout off (keep_prob = 1.0), then check
***** Run at random initialization; perhaps again after some training
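A minimal numerical gradient check on a flattened parameter vector (illustrative; =cost_fn= is assumed to return $J(\theta)$):

#+BEGIN_SRC python
import numpy as np

def grad_check(theta, analytic_grad, cost_fn, eps=1e-7):
    """Compare the analytic gradient with a two-sided numerical estimate."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * eps)
    diff = (np.linalg.norm(analytic_grad - approx)
            / (np.linalg.norm(analytic_grad) + np.linalg.norm(approx)))
    return diff   # ~1e-7 is great, ~1e-5 is suspicious, ~1e-3 probably means a bug
#+END_SRC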
** DONE Week 2: Optimization Algorithms
CLOSED: [2017-08-22 Tue 13:43]
*** Mini-batch

$X = [x^{(1)} \dots x^{(m)}]$

Split $X, Y$ into mini-batches $X^{\{t\}}, Y^{\{t\}}$, where $X^{\{t\}}$ holds the examples from index $t \cdot \text{batch-size}$ up to $(t+1) \cdot \text{batch-size}$.

*** Mini-batch size

- if mini-batch size = m ⇒ batch gradient descent: $(X^{\{1\}},Y^{\{1\}}) = (X,Y)$
- if mini-batch size = 1 ⇒ stochastic gradient descent, every example is its own mini-batch
- in practice, somewhere in between: m ⇒ each iteration takes too long, 1 ⇒ you lose the speedup from vectorization
  + with vectorization, around 1000 examples per batch is reasonable

1. If the training set is small (m ≤ 2000), use batch gradient descent
2. Typical mini-batch sizes: 64, 128, 256, 512, ... (powers of 2, so the batch fits in CPU/GPU memory)

*** Exponentially weighted averages

$v_t = \beta v_{t-1} + (1-\beta)\theta_t$
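A tiny sketch of this running average, with the bias-corrected variant discussed in the course:

#+BEGIN_SRC python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence of scalars."""
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out
#+END_SRC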
** DONE Week 3: Hyperparameter tuning, Batch Normalization, Programming Frameworks
CLOSED: [2017-09-01 Fri 09:52]
*** Video 1: use random sampling, not a grid, to search for the best hyperparameter values
*** Video 2: choose an appropriate scale to pick hyperparameters
- n^[l] (number of hidden units in layer l) or L (number of layers): uniform random sampling is fine
- alpha: between 0.0001 and 1 — don't sample linearly, use a log scale instead:
  r = -4 * np.random.rand()   # r in [-4, 0]
  α = 10^r                    # 10^-4 ... 10^0
- β: 0.9 ... 0.999 (0.9 ≈ averaging over the last 10 values, 0.999 ≈ averaging over the last 1000 values)
  1-β = 0.1 ... 0.001
  r ∈ [-3, -1]
  1-β = 10^r   (see the sampling sketch below)
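The two log-scale samplings together, as a short sketch:

#+BEGIN_SRC python
import numpy as np

def sample_hyperparams():
    r = -4 * np.random.rand()          # r in [-4, 0]
    alpha = 10 ** r                    # learning rate on a log scale: 1e-4 ... 1
    r = np.random.uniform(-3, -1)      # r in [-3, -1]
    beta = 1 - 10 ** r                 # EWA/momentum parameter: 0.9 ... 0.999
    return alpha, beta
#+END_SRC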
*** Hyperparameter tuning in practice: Panda vs. Caviar
- Babysitting one model (panda): when you have few computational resources
- Training many models in parallel (caviar): when you have lots of computational resources
*** Batch normalization
**** Batch norm in a network
**** Fitting batch norm into a deep network
**** Why does batch norm work?
- don't use batch norm as a regularizer, even if it sometimes has that effect
**** Batch norm at test time

$\mu = \frac{1}{m}\sum_i z^{(i)}$

$\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2$

$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$

$\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$

At test time, estimate $\mu$ and $\sigma^2$ with an exponentially weighted average across the mini-batches seen during training.
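A minimal sketch of this normalization and of the running statistics used at test time (names are illustrative; =running= is assumed to hold pre-initialized 'mu' and 'var' arrays):

#+BEGIN_SRC python
import numpy as np

def batchnorm_forward(Z, gamma, beta, running, momentum=0.9, train=True, eps=1e-8):
    """Z: (n, m). `running` is a dict with 'mu' and 'var' estimates kept for test time."""
    if train:
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mu"], running["var"]
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta
#+END_SRC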
*** Multi-class classification
**** Softmax regression

Notation: C = #classes (labelled 0, 1, 2, ..., C-1)

The output layer has C neurons: $n^{[L]} = C$.

$z^{[L]} = W^{[L]}a^{[L-1]} + b^{[L]}$, of shape $(C, 1)$

Activation function:

$t = e^{z^{[L]}}$ (element-wise)

$a^{[L]}_i = \frac{t_i}{\sum_{j=1}^{C} t_j}$
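A small vectorized version of this activation (the max subtraction is a standard numerical-stability trick, not from the notes):

#+BEGIN_SRC python
import numpy as np

def softmax(z):
    """z: (C, m) logits -> (C, m) probabilities; each column sums to 1."""
    t = np.exp(z - z.max(axis=0, keepdims=True))   # shift for numerical stability
    return t / t.sum(axis=0, keepdims=True)
#+END_SRC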
**** Training a softmax classifier
*** Introduction to programming frameworks

**** Deep learning frameworks
* Structuring your Machine Learning project
** Week 1
*** Introduction to ML Strategy
**** Why ML Strategy

Try to find quick and effective ways to choose a strategy.

Ways of analyzing ML problems.

**** Orthogonalization

***** Chain of assumptions in ML

- Fit training set well on cost function ⇒ bigger network, Adam, ...
- Fit dev set well on cost function ⇒ regularization, bigger training set
- Fit test set well on cost function ⇒ bigger dev set
- Perform well in the real world ⇒ change the dev set or the cost function

Try not to use early stopping, as it simultaneously affects the fit on the training set and the dev set (it is not "orthogonal").
*** Setting up your goal

**** Single number evaluation metric

***** First example

| Classifier | Precision | Recall |
|------------+-----------+--------|
| A          | 95%       | 90%    |
| B          | 98%       | 85%    |

Rather than juggling two numbers, define a single evaluation metric:

| Classifier | Precision | Recall | F1 Score |
|------------+-----------+--------+----------|
| A          | 95%       | 90%    | 92.4%    |
| B          | 98%       | 85%    | 91.0%    |

F1 score = $\frac{2}{\frac{1}{P} + \frac{1}{R}}$, the "harmonic mean" of precision and recall.
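As a quick check of the table:

#+BEGIN_SRC python
p, r = 0.95, 0.90
print(2 / (1 / p + 1 / r))   # 0.9243... -> 92.4% for classifier A
#+END_SRC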
So: having a good dev set plus a single evaluation metric really speeds up iterating.

***** Another example

| Algorithm | US  | China | India | Other | *Average* |
|-----------+-----+-------+-------+-------+-----------|
| A         | 3%  | 7%    | 5%    | 9%    |           |
| ...       |     |       |       |       |           |
| F         | ... | ...   |       |       |           |

Try to improve the average.
**** Satisficing and optimizing metrics

It's not always easy to combine everything into one metric to optimize.

***** Another cat classification example

| Classifier | Accuracy | Running Time |
|------------+----------+--------------|
| A          | 90%      | 80ms         |
| B          | 92%      | 95ms         |
| C          | 95%      | 1500ms       |

You could define: cost = accuracy - 0.5 × running time

Or better: maximize accuracy s.t. running time < 100ms

Accuracy ← optimizing metric
Running time ← satisficing metric

If you have N metrics, pick one to optimize and let all the others be satisficing.

**** Train/dev/test distributions

How to set up these datasets to speed up your work.
***** Cat classification dev/test sets

Make sure the dev and test sets come from the same distribution.

***** True story (details changed)

A team optimized on a dev set of loan approvals for medium-income zip codes
(will the person repay the loan?), then tested on low-income zip codes.

They lost 3 months.

***** Guideline

Choose dev and test sets that reflect the data you expect to get in the future
and consider important to do well on.

**** Size of dev and test sets

***** Old way of splitting

70% train / 30% test, or
60% train / 20% dev / 20% test.

Fine for up to ~10^4 examples.

But in the new era, with ~10^6 examples:

train 98%, dev 1%, test 1%.

***** Size of the test set

Set your test set to be big enough to give high confidence in the overall
performance of your system. That can be far less than 30% of your data.

For some applications you don't need a test set, only a dev set —
for example when you have a very large dev set.
**** When to change dev/test sets and metrics?

Metric: classification error.
Algorithm A: 3% error, but lets through a lot of porn images.
Algorithm B: 5% error, and doesn't let porn images through.

So your metric + evaluation prefer A, but you and your users prefer B.

When this happens, your metric is no longer ranking the algorithms correctly: it says A is better when B actually is. Change the metric.

Error: $\frac{1}{m_{dev}}\sum_{i=1}^{m_{dev}} \mathbb{1}\{y_{pred}^{(i)} \neq y^{(i)}\}$

This treats porn and non-porn images equally, which is not what you want.

Add a weight to the formula: $w^{(i)} = 1$ if $x^{(i)}$ is non-porn, $w^{(i)} = 10$ if $x^{(i)}$ is porn (and normalize by $\sum_i w^{(i)}$ instead of $m_{dev}$).

**** Orthogonalization for cat pictures: anti-porn

1. So far we've only discussed how to define a metric to evaluate classifiers
2. Worry separately about how to do well on this metric

1. is placing the target, 2. is aiming at the target.

**** Another example

Alg A: 3% err
Alg B: 5% err

But B does better in production. You notice that users upload blurrier images:
your dev/test sets are not using the same kind of images as your users.

Change your metric and/or your dev/test sets.
*** Comparing to human-level performance

**** Why human-level performance?

Human-level performance vs Bayes optimal error.

Humans are generally very close to Bayes error for a lot of tasks.

As long as ML is worse than humans, you can:
- get labelled data from humans
- gain insight from manual error analysis (why did a person get this right?)
- get a better analysis of bias/variance

**** Avoidable bias

***** Cat classification example

| Humans         | 1%            | 7.5%              |
| Training error | 8%            | 8%                |
| Dev error      | 10%           | 10%               |
|                | focus on bias | focus on variance |

Human-level error is used as a proxy (estimate) for Bayes error.

*Difference between human error and training error = avoidable bias*
*Difference between training error and dev error = variance*
**** Understanding human-level performance

***** Human-level error as a proxy for Bayes error

Medical image classification example. Suppose:
(a) Typical human: 3% error
(b) Typical doctor: 1% error
(c) Experienced doctor: 0.7% error
(d) Team of experienced doctors: 0.5% error

What is "human-level" error?

Bayes error is ≤ 0.5%, so use 0.5% as the proxy to aim for, as before.

For a paper or a deployed system, (b) may be good enough to talk about.
***** Error analysis example

| Human (proxy for Bayes err) | 1, 0.7, 0.5% | 1, 0.7, 0.5% | 1, 0.7, 0.5% |
| Train err                   | 5%           | 1%           | 0.7%         |
| Dev err                     | 6%           | 5%           | 0.8%         |

Case 1: the exact human-level number doesn't matter, because the avoidable bias (5% - 1%)
is bigger than the variance (6% - 5%). Focus on bias.

Case 2: focus on variance.

Case 3: it is very important to use 0.5% as your "human-level" error, because it shows
you should focus on bias (0.2%) rather than variance (0.1%).

These distinctions only start to matter once you're already doing very well.

***** Summary of bias/variance with human-level performance

Human-level error (proxy for Bayes err)

    ^
    |  "avoidable bias"
    v

Training error

    ^
    |  "variance"
    v

Dev error
**** Surpassing human-level performance

***** Surpassing human-level performance

| Team of humans  | 0.5%  | 0.5%       |
| One human       | 1%    | 1%         |
| Training error  | 0.6%  | 0.3%       |
| Dev error       | 0.8%  | 0.4%       |
|-----------------+-------+------------|
| Avoidable bias? | ~0.1% | can't know |

In the second column the training error is already below the best human proxy, so you can
no longer tell how much avoidable bias is left.

***** Problems where ML significantly surpasses human-level performance

- Online advertising
- Product recommendations
- Logistics (predicting transit time)
- Loan approvals

All those examples:
+ come from structured data
+ are not natural perception problems
+ have lots of data

Also: speech recognition, some image recognition, medical tasks (ECG, skin cancer), etc.
**** Improving your model performance

A set of guidelines.

***** The two fundamental assumptions of supervised learning

1. You can fit the training set pretty well (~ low avoidable bias)
2. The training set performance generalizes pretty well to the dev/test set

***** Reducing (avoidable) bias and variance

Human-level error (proxy for Bayes err)

    ^
    |  "avoidable bias" => train a bigger model,
    |                      train longer / better optimization algorithms (momentum, RMSprop, Adam),
    |                      NN architecture / hyperparameter search (RNN, CNN, ...)
    v

Training error

    ^
    |  "variance" => more data,
    |                regularization (L2, dropout, data augmentation),
    |                NN architecture / hyperparameter search
    v

Dev error

These concepts are easy to learn but hard to master.
Applying them deliberately will make you more systematic than most ML teams.
** Week 2
*** Error Analysis
**** Error Analysis
***** Carrying out error analysis
- Imagine your cat algorithm doesn't work as well as expected.
- One of your collaborators thinks you should focus on dog images being misclassified as cats.
- Manually analyze ~100 mislabeled dev set examples.
- Count how many are dogs.
- Suppose 5% are dogs: at best you could go from 10% error to 9.5%, so it's not very useful.
- Suppose instead that 50% of the errors are dogs: you could go from 10% down to 5%,
  so you can be much more confident it's worth the effort.
***** Evaluate multiple ideas in parallel
- fix pictures of dogs being recognized as cats
- fix great cats (lions, panthers, ...)
- improve performance on blurry images

Create a spreadsheet:

| Image      | Dog | Great cats | Blurry |
| 1          | ok  |            |        |
| 2          |     |            | ok     |
| 3          |     | ok         | ok     |
| ...        |     |            |        |
| % of total | 8%  | 43%        | 61%    |

You sometimes notice other dimensions while doing this (e.g. Instagram filters).

This quickly tells you where improvements would pay off most.
**** Cleaning up incorrectly labeled data
***** Incorrectly labeled examples
What if you have incorrectly labeled data?
First consider the training set.

As long as there aren't too many errors, DL is quite robust to *random* label errors.

It is a problem for *systematic* errors, though.
***** Error analysis

| Image      | Dog | Great cats | Blurry | Comments                         |
| ...        |     |            |        |                                  |
| 98         | ok  |            |        | labeler missed cat in background |
| 99         |     |            | ok     |                                  |
| 100        |     | ok         | ok     | drawing of a cat, not a real cat |
| % of total | 8%  | 43%        | 61%    |                                  |

1st case:

Overall dev set error: 10%
Errors due to incorrect labels: 0.6%
Errors due to other causes: 9.4%

2nd case:

Overall dev set error: 2%
Errors due to incorrect labels: 0.6%
Errors due to other causes: 1.4%

In the 2nd case, take the time to fix the mislabeled examples; they now dominate the error.
***** Correcting incorrect dev/test set examples

- Apply the same correction process to your dev and test sets, so they continue to
  come from the same distribution.
- Consider examining examples your algorithm got right as well as ones it got wrong.
- Train and dev/test data may now come from slightly different distributions.
**** Build your first system quickly, then iterate
***** Speech recognition example
- noisy background
  - café noise
  - car noise
- accented speech
- far from microphone
- young children's speech
- stuttering, uh, ah, um...

There are ~50 directions you could go; which should you focus on?

1. Set up dev/test sets and a metric
2. Build an initial system quickly
3. Use bias/variance analysis & error analysis to prioritize next steps

Guideline: *Build your first system quickly, then iterate.*

Do not overthink it; build something quick and dirty first.
*** Mismatched training and dev/test sets
**** Training and testing on different distributions
***** Cat app example
Two sources of data:
- data from webpages
- data from the mobile app

Say you don't have a lot of users yet (~10k images from mobile, 200k from the web).

You care about doing well on mobile images. You don't want to use only the 10k,
but the dilemma is that the 200k web images aren't from the same distribution.

Option 1: take the 210k images and shuffle them into train/dev/test (205k / 2.5k / 2.5k)
- advantage: same distribution everywhere
- disadvantage: the dev/test sets mostly measure performance on web images, not the mobile images you care about
- on average only ~119 of the 2.5k dev images would come from the mobile app
Option 1 is not recommended.

Option 2:
- training set: the 200k web images plus 5k mobile images
- dev and test sets: all mobile app images
- advantage: you are now aiming the target where you want it to be
- disadvantage: the training distribution differs from the dev/test distribution
Over the long term this gives better performance.
***** Speech recognition example
- Speech-activated rearview mirror (a real product in China)
1. Training: all the speech data you have; purchased data, smart speaker control, voice keyboard... (500k utterances)
2. Dev/test: speech-activated rearview mirror data (20k utterances)

Set your training set to be the 500k from 1. and dev/test from 2.

Or: training set of 510k (500k from 1. plus 10k from 2.) and dev/test sets of 5k + 5k from the rest of 2.

Much bigger training set this way.
**** Bias and variance with mismatched data distributions
***** Cat classifier example
Assume humans get ~0% error.

| Training error | 1%  |
| Dev error      | 10% |

Maybe there isn't a variance problem; the distributions are simply different.

Training-dev set: same distribution as the training set, but not used for training.

Train / dev / test ⇒ split the training set into a smaller training set and a training-dev set.

Now train only on the smaller training set, and evaluate on training-dev, dev and test.

| Train err%     | 1%          | 1%               |
| Train-dev err% | 9%          | 1.5%             |
| Dev err%       | 10%         | 10%              |
|                | variance pb | data mismatch pb |

Other examples:

| Human err%     | 0%      | 0%                      |
| Train err%     | 10%     | 10%                     |
| Train-dev err% | 11%     | 11%                     |
| Dev err%       | 12%     | 20%                     |
|                | bias pb | bias + data mismatch pb |
***** Bias/variance on mismatched training and dev/test sets

| Human level             | 4%  |
    avoidable bias
| Training set error      | 7%  |
    variance
| Training-dev set error  | 10% |
    data mismatch
| Dev error               | 12% |
    degree of overfitting to the dev set
| Test error              | 12% |

Example where the training distribution is much harder than the dev/test distribution:

| Human level             | 4%  |
| Training set error      | 7%  |
| Training-dev set error  | 10% |
| Dev error               | 6%  |
| Test error              | 6%  |

***** More general formulation

The numbers can be placed in a table:

|                            | General speech recognition tasks | Rearview mirror speech data |
|----------------------------+----------------------------------+-----------------------------|
| Human level                | "Human-level err" (4%)           | 6%                          |
| Err on data trained on     | "Training err" (7%)              | 6%                          |
| Err on data not trained on | "Training-dev err" (10%)         | "Dev/Test err" (6%)         |
**** Addressing data mismatch

There isn't any systematic way to address data mismatch,
but there are things you can try.

***** Addressing data mismatch
- Carry out manual error analysis to try to understand the differences between the
  training and dev/test sets.
  e.g. you might find that a lot of the dev set is noisy (car noise).
- Make the training data more similar, or collect more data similar to the dev/test sets.
  e.g. simulate noisy in-car data.

***** Artificial data synthesis

- Clean audio + car noise = synthesized in-car audio

This creates more data and can be a reasonable approach.

But say you have 10,000 hours of clean speech and only 1 hour of car noise:
there is a risk your algorithm will overfit to that 1 hour of car noise.

***** Artificial data synthesis (2)
Car recognition:

Using cars rendered by a computer (e.g. from a video game) vs real photos: you might overfit
to the synthesized cars. A video game might have only 20 car models, so you overfit to those 20 cars.
*** Learning from multiple tasks

**** Transfer learning

Example: learning to recognize cats can help a network learn to read x-ray scans.

***** Transfer learning

Create the new NN by swapping just the last (output) layer.

(X, Y) now becomes (radiology images, diagnosis).

Re-initialize and retrain the last layer's parameters $W^{[L]}, b^{[L]}$.

You might retrain just the last layer, or all the layers.

Rule of thumb: with little data, retrain just the last layer;
with a lot of data, retrain all the layers.

This is called pre-training and fine-tuning.

A lot of low-level features learned from a very large dataset can help.

- Another example, speech recognition:

X (audio) → y (transcript), transferred to a wake-word / trigger-word detector ("OK Google", "Hey Siri", etc.)

You could add several new layers, and retrain just the new layers or even more of the network.

Transfer makes sense when the tasks have very different numbers of examples:

- 10^6 images for image recognition, but only 100 radiology images.
- 10k hours of audio, but only 1 hour of wake-word data...

You transfer from the task with a lot of data to the task with little data.

It doesn't make sense to transfer the other way around.

***** When transfer learning makes sense

Transferring from task A to task B:

- Task A and B have the same input X
- You have a lot more data for task A than for task B
- Low-level features from A could be helpful for learning B
**** Multi-task learning

Simultaneously learn multiple tasks.

***** Simplified autonomous driving example

|                | y^(i) |   (a (4,1) vector)
|----------------+-------|
| pedestrians    | 0     |
| cars           | 1     |
| stop signs     | 1     |
| traffic lights | 0     |

$Y = [ y^{(1)} y^{(2)} \dots y^{(m)} ]$

***** Neural network architecture

x → [] → [] ... → ŷ ∈ ℝ^4

Loss: $\frac{1}{m}\sum_{i=1}^m \sum_{j=1}^4 L(\hat{y}^{(i)}_j, y^{(i)}_j)$

where L is the usual logistic loss.

Unlike softmax regression, one image can have multiple labels.

- One NN doing 4 things is usually better than training 4 separate NNs, one per task.

Some examples might not be fully labelled (some entries are "?").
You can still train by summing only over the entries labelled 0/1 and skipping the "?" (unlabeled) entries.

That lets you use more of the information you have, as sketched below.
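A minimal sketch of that masked loss (an assumption here: missing "?" labels are encoded as -1; names are illustrative):

#+BEGIN_SRC python
import numpy as np

def multitask_loss(Y_hat, Y):
    """Y_hat, Y: (4, m). Entries of Y are 1, 0, or -1 for 'unknown'.
    Sum the logistic loss only over the labelled (0/1) entries."""
    mask = (Y != -1)
    eps = 1e-12
    per_entry = -(Y * np.log(Y_hat + eps) + (1 - Y) * np.log(1 - Y_hat + eps))
    return np.sum(per_entry * mask) / Y.shape[1]
#+END_SRC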
***** When multi-task learning makes sense

- Training on a set of tasks that could benefit from having shared lower-level features
- Usually: the amount of data you have for each task is quite similar
- You can train a big enough neural network to do well on all the tasks

In practice, transfer learning is used a lot more often than multi-task learning.
*** End-to-end deep learning
**** What is end-to-end deep learning?
***** What is end-to-end deep learning?
Speech recognition example:

audio -- MFCC --> features -- ML --> phonemes --> words --> transcript

audio ----------------------------------------------------> transcript

End-to-end needs a lot of data:
with ~3k hours of data the classical pipeline works better;
with 10k to 100k hours the end-to-end approach generally shines.
***** Face recognition

A multi-stage approach works better:

1. detect the face, zoom in and crop to center it
2. feed the cropped image to a network that identifies the person (typically by comparing against all employees)

Why?

- You have a lot of data for task 1
- You have a lot of data for task 2

If you tried to learn everything at once, you wouldn't have enough (image, identity) data.
***** More examples
Machine translation:

English -> text analysis -> ... -> French
English --------------------------> French   (end-to-end works well here, because we have lots of (x, y) pairs)

Estimating a child's age from a hand x-ray:

Image -> bones -> age
Image ----------> age   (not enough data for end-to-end)
**** Whether to use end-to-end deep learning
***** Pros and cons of end-to-end learning
Pros:
- lets the data speak (no human preconceptions)
- less hand-designing of components needed

Cons:
- may need a large amount of (input end, output end) = (x, y) data
- excludes potentially useful hand-designed components (your two sources of knowledge: data and hand-design)
***** Applying end-to-end deep learning
Key question: do you have sufficient data to learn a function of the complexity
needed to map x to y?

- choose the X -> Y mapping carefully
- a pure deep learning approach is not appropriate if end-to-end examples are hard to obtain
* Convolutional Neural Networks
** Week 1
*** Computer Vision
Input sizes: 64x64x3 -> 12,288 features; 1000x1000x3 -> 3 million features.

With x = [x_1 ... x_{3,000,000}] and a first hidden layer of 1000 units, W^[1] would be a 1000 x 3,000,000 matrix: far too many parameters.
*** Edge Detection Example

Take a 6x6 image and "convolve it" with the 3x3 filter [[1,1,1],[0,0,0],[-1,-1,-1]] (this one detects horizontal edges).

python: conv_forward
tensorflow: tf.nn.conv2d
keras: Conv2D
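A naive NumPy sketch of this "valid" convolution (really cross-correlation, as noted further down), just to make the sliding window explicit:

#+BEGIN_SRC python
import numpy as np

def conv2d_valid(image, kernel):
    """image: (n, n), kernel: (f, f) -> output: (n-f+1, n-f+1). No padding, stride 1."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

edge_filter = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]])
#+END_SRC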
*** More edge detection

Sobel filter: [[1,2,1],[0,0,0],[-1,-2,-1]]
Scharr filter: [[3,10,3],[0,0,0],[-3,-10,-3]]
*** Padding

The image shrinks at every convolution because of the borders.
If the filter has size f and the image size n, the output after filtering is (n - f + 1) x (n - f + 1).
**** First solution: put a border of p pixels around the image (padding)

"valid": n x n * f x f -> (n-f+1) x (n-f+1)
"same": pad so the output size is the same as the input size:
n + 2p - f + 1 = n ⇒ p = (f-1)/2

3x3 filter -> p = (3-1)/2 = 1
5x5 filter -> p = (5-1)/2 = 2
- f is usually odd: padding is easier and the filter has a central position.
- Common sizes: 1x1, 3x3, 5x5, 7x7.
*** Strided Convolutions

Instead of sliding the filter over every column/row, move it s columns/rows at a time.

**** Summary of convolutions

n x n image
f x f filter
padding p
stride s

Output size:
\[ \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor \]
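The same formula as a small helper (illustrative):

#+BEGIN_SRC python
import math

def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

assert conv_output_size(6, 3) == 4            # 6x6 * 3x3 -> 4x4
assert conv_output_size(7, 3, p=1, s=2) == 4  # with padding and stride
#+END_SRC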
**** Convolution in math textbooks (flip the filter vertically and horizontally first)

Cross-correlation vs convolution:

by convention, in deep learning we call the cross-correlation operation "convolution".

True convolution (with the flip) is associative:
(A * B) * C = A * (B * C), which matters in signal processing but not here.
*** Convolutions over volumes

6x6x3 * 3x3x3 -> 4x4
(height x width x #channels)

By convention, the number of channels of the filter matches the number of channels of the image.
**** Multiple filters

6x6x3 * 3x3x3 ---\
                  --> two 4x4 outputs, stacked ==> 4x4x2, with n_c = #filters = 2
      * 3x3x3  ---/
*** One layer of a convolutional neural network
If one layer has 10 filters that are 3x3x3, how many parameters does that layer have?

3x3x3 = 27 weights + 1 bias = 28 params per filter
28 x 10 = 280 params
**** Summary of notation

If layer l is a convolutional layer:
f^[l] = filter size
p^[l] = padding
s^[l] = stride
n_c^[l] = number of filters

Each filter is: f^[l] x f^[l] x n_c^[l-1]
Activations: a^[l] -> n_H^[l] x n_W^[l] x n_c^[l]
             A^[l] -> m x n_H^[l] x n_W^[l] x n_c^[l]
Weights: f^[l] x f^[l] x n_c^[l-1] x n_c^[l]   (n_c^[l] = #filters in layer l)
Bias: n_c^[l] values, stored as (1, 1, 1, n_c^[l])

Input:  n_H^[l-1] x n_W^[l-1] x n_c^[l-1]
Output: n_H^[l] x n_W^[l] x n_c^[l]

n^[l] = floor( (n^[l-1] + 2p^[l] - f^[l]) / s^[l] ) + 1   (same formula for n_H and n_W)
*** A simple convolutional network example

39x39x3   (n_H^[0] = n_W^[0] = 39, n_c^[0] = 3)
   --- f^[1]=3, s^[1]=1, p^[1]=0, 10 filters --->
37x37x10
   --- f^[2]=5, s^[2]=2, p^[2]=0, 20 filters --->
17x17x20
   --- f^[3]=5, s^[3]=2, p^[3]=0, 40 filters --->
7x7x40 --- flatten to 1960 units ---> logistic regression / softmax -> ŷ

**** Types of layer in a CNN

- Convolution (CONV)
- Pooling (POOL)
- Fully connected (FC)
*** Pooling Layer

**** Max pooling

4x4 --- max over each 2x2 region --> 2x2

Hyperparameters: f=2, s=2.
No parameters to learn!

In practice it works well.

**** Example

5x5 input with f=3, s=1 gives a 3x3 output:

1 3 2 1 3
2 9 1 1 5        9 9 5
1 3 2 3 2  ====> 9 9 5
8 3 5 1 0        8 6 9
5 6 1 2 9

Over the channel dimension, the output has the same number of channels as the input:
max pooling is applied to each channel independently (see the sketch below).
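A tiny NumPy sketch of max pooling on a single channel (illustrative; filter size and stride as arguments):

#+BEGIN_SRC python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """x: (n_H, n_W), a single channel -> pooled output."""
    n_H, n_W = x.shape
    out_H, out_W = (n_H - f) // s + 1, (n_W - f) // s + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.array([[1,3,2,1,3],[2,9,1,1,5],[1,3,2,3,2],[8,3,5,1,0],[5,6,1,2,9]])
print(max_pool2d(x, f=3, s=1))   # reproduces the 3x3 table above
#+END_SRC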
**** Average pooling

Same as above, but take the average instead of the max.

**** Summary
Hyperparameters:
- f: filter size
- s: stride
- max or average pooling
- p: padding (almost never used for pooling; p=0 in general)

n_H x n_W x n_c ----> (⌊(n_H - f)/s⌋ + 1) x (⌊(n_W - f)/s⌋ + 1) x n_c
*** CNN Example

Input image: 32x32x3 (say, recognizing the digit 7)
  --- CONV, f=5, s=1 --->
CONV1: 28x28x6 --- max pool, f=2, s=2 ---> POOL1: 14x14x6     } LAYER 1
  --- CONV, f=5, s=1 --->
CONV2: 10x10x16 --- max pool, f=2, s=2 ---> POOL2: 5x5x16     } LAYER 2

flatten to 400 ---> FC3: 120 ---> FC4: 84 ---> softmax (10 outputs)
W^[3]: (120, 400)
b^[3]: (120,)

In general:

- n_H, n_W decrease as you go deeper
- n_c increases as you go deeper
- common pattern: CONV - POOL - CONV - POOL - FC - FC - FC - SOFTMAX

**** Sizes

Activation sizes go down as you go deeper; parameter counts are small for CONV layers, zero for POOL layers, and large for FC layers.

*** Why convolutions?

Two advantages:
- parameter sharing and sparsity of connections.

A fully connected layer from 32x32x3 to 28x28x6 would need millions of parameters;
the CONV layer has only 156 parameters.

- *Parameter sharing*: a feature detector that's useful in one part of the image
  is probably useful in another part of the image.

- *Sparsity of connections*: in each layer, each output value depends only on a
  small number of inputs.
** Week 2
*** Case studies
**** Why look at case studies?
***** Outline
Classic networks:
- LeNet-5
- AlexNet
- VGG

ResNet (residual network), a 152-layer network

Inception
**** Classic Networks
***** LeNet-5
Recognizes handwritten digits.
32x32x1      --- conv 5x5, s=1 --->
28x28x6      --- avg pool, f=2, s=2 --->
14x14x6      --- conv 5x5, s=1 --->
10x10x16     --- avg pool, f=2, s=2 --->
5x5x16 (400) --> FC 120 --> FC 84 --> softmax ŷ

About 60k parameters; today's networks have hundreds of millions of parameters.

n_H, n_W decrease, n_c increases.

Pattern: conv, pool, conv, pool, fc, fc, output.

LeCun et al. 1998, Gradient-based learning applied to document recognition.
***** AlexNet
Alex Krizhevsky et al. 2012, ImageNet classification with deep convolutional neural networks.

227x227x3  --- conv 11x11, s=4 --->
55x55x96   --- max pool 3x3, s=2 --->
27x27x96   --- conv 5x5, same --->
27x27x256  --- max pool 3x3, s=2 --->
13x13x256  --- conv 3x3, same --->
13x13x384  --- conv 3x3 ---> 13x13x384 --- conv 3x3 ---> 13x13x256 --- max pool 3x3, s=2 --->
6x6x256 = 9216 --- flatten ---> FC 4096 --> FC 4096 --> softmax 1000

Similar in spirit to LeNet, but MUCH bigger:

~60 million parameters.

Also used ReLU.
***** VGG-16

CONV = 3x3 filter, s=1, same padding
MAX-POOL = 2x2, s=2

224x224x3  --- [CONV 64]x2  ---> 224x224x64  --- POOL --->
112x112x64 --- [CONV 128]x2 ---> 112x112x128 --- POOL --->
56x56x128  --- [CONV 256]x3 ---> 56x56x256   --- POOL --->
28x28x256  --- [CONV 512]x3 ---> 28x28x512   --- POOL --->
14x14x512  --- [CONV 512]x3 ---> 14x14x512   --- POOL --->
7x7x512 --- flatten ---> FC 4096 --> FC 4096 ---> softmax 1000

Simonyan & Zisserman 2015. Very deep convolutional networks for large-scale image recognition.

About 138 million parameters.

There is also VGG-19, an even bigger network, but VGG-16 performs about as well as VGG-19.
**** Residual Networks (ResNets)
Very deep networks, over 100 layers.
***** Residual block
a^[l] ---> a^[l+1] ---> a^[l+2]

a^[l] --+--> linear --> ReLU --> a^[l+1] --> linear --+--> ReLU --> a^[l+2]
        |                                             |
        +------------------ shortcut -----------------+
                         (skip connection)

a^[l+2] = g(z^[l+2] + a^[l])

The shortcut passes information deeper into the NN.

He et al. 2015, Deep residual learning for image recognition.

x -> [] -> [] -> [] ... -> a^[l]

Helps with the vanishing/exploding gradient problems of very deep networks.
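A bare-bones sketch of the block's forward pass (fully connected version, illustrative names; assumes the dimensions match):

#+BEGIN_SRC python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """a_l: (n, m). Two linear+ReLU steps with a skip connection added before the last ReLU."""
    z1 = W1 @ a_l + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)      # a^[l+2] = g(z^[l+2] + a^[l])
#+END_SRC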
**** Why ResNets Work
***** Why do residual networks work so well?
In practice, making a plain network too deep makes the *training* error worse,
which prevents you from using very deep networks.

This is much less true for ResNets.

X ---> Big NN ---> a^[l]

X ---> Big NN ---> a^[l] -+-> [] --> [] -+-> a^[l+2]
       (ReLU, so a >= 0)  |              |
                          +--------------+

a^[l+2] = g(z^[l+2] + a^[l])
        = g(W^[l+2] a^[l+1] + b^[l+2] + a^[l])

If we're using L2 regularization, it shrinks W^[l+2] (and b^[l+2]).
If W^[l+2] = 0 and b^[l+2] = 0, then a^[l+2] = g(a^[l]) = a^[l] (since a^[l] >= 0 and g is ReLU).

So the identity function is easy for a residual block to learn:
adding the shortcut doesn't hurt performance, and when the block does learn something useful it helps.
It is better or the same, almost never worse.

Remark: this assumes z^[l+2] and a^[l] have the same dimension.
If they have different dimensions, add an extra matrix W_s and use (W_s a^[l]) in the sum.
***** ResNet example
Plain network ----> ResNet
**** Networks in Networks and 1x1 Convolutions
Using a 1x1 convolution.
***** What does a 1x1 convolution do?

On a 6x6x1 image, convolving with a 1x1 filter is just multiplication by a scalar.

On a 6x6x32 volume, a 1x1x32 filter computes, at each position, a dot product across the 32 channels: 6x6x32 * 1x1x32 ---> 6x6x#filters.

- Lin et al. 2013, Network in Network.
***** Using 1x1 convolutions

28x28x192 --- ReLU, CONV 1x1, 32 filters ---> 28x28x32

Lets you shrink n_c (the number of channels).

It also adds a non-linearity; you can keep the number of channels the same if you want.

*A 1x1 conv does a non-trivial operation.*
**** Inception Network Motivation
Instead of having to pick a filter size (1x1, 3x3, 5x5) or pooling, do them all.
***** Motivation for the inception network
Do all the transformations at the same time and stack the results:

28x28x192 --- 1x1 ---------------------> 28x28x64
         \--- 3x3, same ---------------> 28x28x128
          \-- 5x5, same ---------------> 28x28x32
           \- max pool, same, s=1 -----> 28x28x32
------------------------------------------------------
                     channel concat:     28x28x256

Szegedy et al. 2014, Going deeper with convolutions.

Problem: computational cost.
***** The problem of computational cost
28x28x192 --- conv 5x5, same, 32 filters ---> 28x28x32

32 filters, each filter is 5x5x192.

Number of multiplications: 28x28x32 x 5x5x192 ≈ 120 million (costly).
***** Using a 1x1 conv (bottleneck layer)
28x28x192 --- conv 1x1, 16 filters (1x1x192) --->
28x28x16  --- conv 5x5, 32 filters (5x5x16)  ---> 28x28x32
     (the 28x28x16 volume is the "bottleneck layer")

Cost of the 1st conv layer: 28x28x16 x 1x1x192 ≈ 2.4 million
Cost of the 2nd conv layer: 28x28x32 x 5x5x16 ≈ 10 million
Total cost: ≈ 12.4 million (about 10x less than before)

You can shrink the representation substantially without hurting the performance of the NN,
while greatly reducing the computation.
**** Inception Network
***** Inception module
Previous activation: 28x28x192

1x1 conv -------------------------------------------> 28x28x64  -\
1x1 conv ----------------> 3x3 conv ----------------> 28x28x128 -- channel concat:
1x1 conv ----------------> 5x5 conv ----------------> 28x28x32  -/    28x28x256
MAXPOOL 3x3, s=1, same --> 28x28x192 --> 1x1 CONV --> 28x28x32  /

The inception network is this same block connected and repeated over many layers.

Inception block --> Inception block ---> ..... ---> softmax output layer
            \-> side softmax branch \-> side softmax branch

Also known as GoogLeNet.
***** Fun fact
"We need to go deeper" (the name comes from the Inception meme).

Since then, there have been newer versions of the inception module:
Inception v1, v2, v3, ...
*** Practical advice for using ConvNets
**** Using open-source implementations
Many of these networks are difficult to reproduce or replicate from the paper alone.

Look for an online implementation instead of re-implementing from scratch.

Demo:
1. Google the architecture
2. find a GitHub repository
3. check the license (MIT for example)
4. git clone ...

Another advantage: pre-trained networks.
Starting from an open-source implementation is faster.
**** Transfer Learning
There are lots of datasets on the Internet.
You can often download a pretrained NN to start from.
***** Transfer Learning
Classification problem: is it Tigger, Misty, or neither? (recognizing your own cats)

Take a network trained on ImageNet.

Get rid of its softmax layer and create your own, with outputs Tigger / Misty / neither.
Freeze all the other parameters.
Train just the new softmax layer.

You might get pretty good results.

Depending on the framework, freezing looks like:
- trainableParameters = 0
- freeze = 1
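For example, in Keras it might look like the following minimal sketch (not from the notes; model choice and layer sizes are illustrative):

#+BEGIN_SRC python
# Reuse a pretrained ImageNet model and train only a new softmax head.
from tensorflow import keras

base = keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                             # freeze all pretrained parameters

model = keras.Sequential([
    base,
    keras.layers.Dense(3, activation="softmax"),   # Tigger / Misty / neither
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
#+END_SRC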
You can pre-compute and save the activations of the last frozen layer for your whole training set,
so you don't need to recompute them at every epoch.

*If you have a larger dataset*:

Freeze fewer layers (just the early ones),
or freeze a few layers and replace the rest with new layers of your own architecture.

*If you have a lot of data*:

Retrain the whole network, using the pretrained weights as initialization.

You should almost always do transfer learning, unless you have an exceptionally
large dataset and can train from scratch.
**** Data Augmentation
For the majority of computer vision problems, having more data almost always helps.
***** Common augmentations
- Mirroring images
- Random cropping: not perfect, but works well in practice
- Also: rotation, shearing, local warping... but these are used less in practice
***** Color shifting
Add different distortions to the R, G, B channels.
E.g. +20, -20, +20 ---> a more mauve tint
etc.

Advanced: PCA color augmentation (described in the AlexNet paper)
***** Implementing distortions during training
Hard disk ---> CPU thread loads images and applies distortions / color shifts ---> mini-batches ---> training (CPU/GPU)

The distortions have their own hyperparameters, so consider starting from an
open-source data-augmentation implementation.
**** State of Computer Vision
***** Data vs hand-engineering
Most ML problems fall somewhere on this spectrum:

Little data <--------------------------------------------> Lots of data

Speech recognition: lots of data
Image recognition: a decent amount of data
Object detection: less data

Lots of data: simpler algorithms, less hand-engineering.
Little data: more hand-engineering, "hacks".

Two sources of knowledge:
- labeled data (x, y)
- hand-engineered features / network architecture / other components

When you have very little data: transfer learning.
***** Tips for doing well on benchmarks / winning competitions
- Ensembling
  - Train several networks independently and average their *outputs* (not their weights)
  - 3-15 networks (almost never used in production, because it is costly for a small benefit)
- Multi-crop at test time
  - Run the classifier on multiple versions of the test images and average the results
  - 10-crop: center + 4 corners, plus the same on the mirrored image

Do not do this in production systems.
***** Use open source code
- Use architectures of networks published in the literature
- Use open source implementations if possible
- Use pretrained models and fine-tune on your dataset