#+TITLE: Deep Learning Coursera
#+AUTHOR: Yann Esposito
#+STARTUP: latexpreview
#+TODO: TODO IN-PROGRESS WAITING | DONE CANCELED
#+COLUMNS: %TODO %3PRIORITY %40ITEM(Task) %17EFFORT(Estimated Effort){:} %CLOCKSUM %8TAGS(TAG)
* Plan
5 courses
** Neural Networks and Deep Learning
*** Week 1: Introduction
*** Week 2: Basics of Neural Network programming
*** Week 3: One hidden layer Neural Networks
*** Week 4: Deep Neural Network
** Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
** Structuring your Machine Learning project
** Convolutional Neural Networks
** Natural Language Processing: Building sequence models
* DONE Neural Networks and Deep Learning
CLOSED: [2017-08-22 Tue 13:43]
** Introduction
*** What is a neural network?
*** Supervised Learning with Neural Networks
- Lucrative application: ads, showing the ad you're most likely to click on
- Photo tagging
- Speech recognition
- Machine translation
- Autonomous driving
**** Convolutional NN good for images
**** Structured data (databases) vs Unstructured data
- Structured data: Tables
- Unstructured data: Audio, image, text...
Computers are now much better at interpreting unstructured data (thanks to deep learning).
*** Why is Deep Learning taking off?
[[///Users/yaesposi/Library/Mobile%20Documents/com~apple~CloudDocs/deft/img/Scale%20drives%20deep%20learning%20progress.png]]
- Data (lot of data)
- Computation (faster learning loop)
- Algorithms (e.g., use ReLU instead of sigmoid)
** Geoffrey Hinton interview
** Binary Classification
\[ (x,y), \quad x \in \mathbb{R}^{n_x}, \quad y \in \{0,1\} \]
$m$ training examples: $$ \{(x^{(1)},y^{(1)}), \dots, (x^{(m)},y^{(m)})\} $$
$m = m_{\text{train}}$; $m_{\text{test}}$ = number of test examples
$X = [x^{(1)} \dots x^{(m)}]$ is an $n_x \times m$ matrix; =X.shape= is $(n_x, m)$
$Y = [y^{(1)} \dots y^{(m)}]$; =Y.shape= is $(1, m)$
** Logistic Regression
Given $X \in \mathbb{R}^{n_x}$ you want $\hat{y} = P(y=1 | X)$
Parameters: $w \in \mathbb{R}^{n_x}, b \in \mathbb{R}$
Output: $\hat{y} = \sigma(w^Tx + b) = \sigma(z)$
$$\sigma(z)= \frac{1}{1 + e^{-z}}$$
If $z \rightarrow \infty$ then $\sigma(z) \approx 1$
If $z \rightarrow -\infty$ then $\sigma(z) \approx 0$
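A minimal numpy sketch of this activation (my own illustration, not code from the course):
#+BEGIN_SRC python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^-z); works on scalars and arrays."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))      # 0.5
print(sigmoid(100))    # ~1.0, matching the z -> +inf limit
print(sigmoid(-100))   # ~0.0, matching the z -> -inf limit
#+END_SRC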
Alternative notation not used in this course:
$x_0 = 1,\ x \in \mathbb{R}^{n_x+1}$
$\hat{y} = \sigma(\Theta^Tx)$
...
** Logistic Regression Cost Function
We want a convex loss function:
$$L(\hat{y},y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$$
If $y = 1$: $L(\hat{y},y) = -\log\hat{y}$ ← want $\log\hat{y}$ large, so want $\hat{y}$ large
If $y = 0$: $L(\hat{y},y) = -\log(1-\hat{y})$ ← want $\log(1-\hat{y})$ large, so want $\hat{y}$ small
Cost function: $$ J(w,b) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) $$
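A sketch of this cost in numpy, assuming predictions and labels are stored as $(1, m)$ row vectors as defined above:
#+BEGIN_SRC python
import numpy as np

def logistic_cost(y_hat, y):
    """Cross-entropy cost J averaged over m examples.
    y_hat, y: shape (1, m); y_hat in (0, 1), y in {0, 1}."""
    m = y.shape[1]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return float(np.sum(losses) / m)

y_hat = np.array([[0.9, 0.2, 0.7]])
y     = np.array([[1,   0,   1]])
print(logistic_cost(y_hat, y))   # small, since predictions match the labels
#+END_SRC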
** Gradient Descent
Minimize $J(w,b)$:
1. Initialize $w, b$ (generally to zeros)
2. Take a step in the steepest descent direction
3. Repeat 2 until reaching the optimum
Repeat {
$w := w - \alpha\frac{\partial J(w,b)}{\partial w} = w - \alpha\,\mathrm{d}w$
$b := b - \alpha\frac{\partial J(w,b)}{\partial b} = b - \alpha\,\mathrm{d}b$
}
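A minimal sketch of this loop; =grad_fn= is a hypothetical callback standing in for whatever computes $(\mathrm{d}w, \mathrm{d}b)$:
#+BEGIN_SRC python
def gradient_descent(w, b, grad_fn, alpha=0.1, num_iters=100):
    """Repeatedly step w and b in the steepest-descent direction."""
    for _ in range(num_iters):
        dw, db = grad_fn(w, b)   # dw = dJ/dw, db = dJ/db
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Toy usage: minimize J(w,b) = (w-3)^2 + (b+1)^2, whose gradient is (2(w-3), 2(b+1)).
w, b = gradient_descent(0.0, 0.0, lambda w, b: (2 * (w - 3), 2 * (b + 1)))
print(w, b)   # converges toward w=3, b=-1
#+END_SRC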
** Derivatives
** More Derivative Examples
** Computation Graph
** Computing Derivatives
** Computing Derivatives for multiple examples
** Vectorization
getting rid of explicit for loops in your code
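The lecture demonstrates this with a dot product; a sketch along those lines (exact timings will vary by machine, but the vectorized version is typically orders of magnitude faster):
#+BEGIN_SRC python
import time
import numpy as np

n = 1_000_000
a, b = np.random.rand(n), np.random.rand(n)

tic = time.time()
c = 0.0
for i in range(n):          # explicit for loop
    c += a[i] * b[i]
loop_ms = 1000 * (time.time() - tic)

tic = time.time()
c_vec = np.dot(a, b)        # vectorized: same result, no Python-level loop
vec_ms = 1000 * (time.time() - tic)

print(c, c_vec, loop_ms, vec_ms)
#+END_SRC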
** Vectorizing Logistic Regression
** Vectorizing Logistic Regression's Gradient Computation
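A sketch of the vectorized forward and backward pass these two sections cover, assuming the $(n_x, m)$ data layout defined earlier:
#+BEGIN_SRC python
import numpy as np

def propagate(w, b, X, Y):
    """One vectorized forward/backward pass of logistic regression.
    w: (n_x, 1), b: scalar, X: (n_x, m), Y: (1, m)."""
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))          # predictions, shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                     # (1, m); derivative of the loss wrt z
    dw = np.dot(X, dZ.T) / m       # (n_x, 1): gradient for w, no loop over examples
    db = np.sum(dZ) / m            # scalar: gradient for b
    return cost, dw, db
#+END_SRC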
** Broadcasting in Python
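A small sketch in the spirit of the lecture's food-calories example: dividing a $(3,4)$ matrix by a $(1,4)$ row broadcasts the row down every row of the matrix.
#+BEGIN_SRC python
import numpy as np

A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])   # (3, 4)

cal = A.sum(axis=0)                         # (4,): total per column
percentage = 100 * A / cal.reshape(1, 4)    # (3,4) / (1,4): row broadcast down
print(percentage)

# General pattern: (m,n) op (1,n) -> (m,n); (m,n) op (m,1) -> (m,n)
#+END_SRC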
** Quick Tour of Jupyter / ipython notebooks
** Neural Network Basics
$J = ab + ac - (b+c) = a(b+c) - (b+c) = (a-1)(b+c)$
* DONE Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
CLOSED: [2017-09-01 Fri 09:52]
** DONE Week 1: Setting up your Machine Learning Application
CLOSED: [2017-08-22 Tue 13:43]
*** Recipe
If *High bias*? (bad training set performance?)
Then try:
- Bigger network
- Training longer
- (NN architecture search)
Else if *High variance*? (bad dev set performance?)
Then try:
- More data
- Regularization
- (NN architecture search)
In deep learning there is not much of a bias/variance tradeoff if we have a
large amount of compute (bigger network) and a lot of data.
*** Regularization
**** Regularization: reduce variance
- L2 regularization: add $\frac{\lambda}{2m}\|w\|_2^2$ to the cost
- L1 regularization: same with $\|w\|_1$ instead of $\|w\|_2^2$
$\lambda$ is the regularization parameter (named =lambd= in code, since =lambda= is reserved in Python)
$$ J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^L \|W^{[l]}\|_F^2 $$
where $\|W^{[l]}\|_F$ is called the "Frobenius norm".
$dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m}W^{[l]}$
The update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$ still works.
Since this shrinks the weights a little at each step, L2 regularization is sometimes called "weight decay".
**** Dropout Regularization
Randomly eliminates nodes in each layer, independently for each training example.
Implementing it (inverted dropout), illustrated for layer 3:
  d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # fresh random boolean mask each iteration
  a3 = np.multiply(a3, d3)                                   # zero out the dropped units
  a3 /= keep_prob  # "inverted" part: rescale so the expected value of a3 is unchanged, which simplifies test time
Making predictions at test time: no dropout.
**** Other regularization methods
- Data augmentation (flipping images, random crops, random distortions, etc.)
- Early stopping: stop gradient descent at an earlier iteration
*** Setting up your optimization problem
**** Normalizing Inputs
- $\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)}$
- $x := x - \mu$ (zero-center)
- $\sigma^2 = \frac{1}{m}\sum_{i=1}^m (x^{(i)})^2$ (element-wise, after centering)
- $x := x / \sigma$ (normalize the variance)
Use the same $\mu$ and $\sigma$ on the test set.
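A sketch of these steps, assuming the usual $(n_x, m)$ layout (one example per column):
#+BEGIN_SRC python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Zero-center and variance-normalize X of shape (n_x, m).
    Returns mu and sigma2 so the *same* transform can be applied to the test set."""
    mu = np.mean(X, axis=1, keepdims=True)            # (n_x, 1) feature means
    Xc = X - mu                                       # zero-center
    sigma2 = np.mean(Xc ** 2, axis=1, keepdims=True)  # (n_x, 1) feature variances
    return Xc / np.sqrt(sigma2 + eps), mu, sigma2
#+END_SRC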
**** Gradient Checking
***** Don't use grad check in training, only while debugging
***** If the algorithm fails grad check, look at the components (is it db? dW? dW on a certain layer? etc.) to localize the bug
***** Remember regularization (include the regularization term in the gradient)
***** Doesn't work with dropout; turn dropout off (set keep_prob = 1.0), then check
***** Run at random initialization; perhaps again after training
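A sketch of the two-sided difference check behind these bullets, for a flattened parameter vector:
#+BEGIN_SRC python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta against a numerical estimate at theta.
    J: cost as a function of the flattened parameter vector theta."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)   # two-sided difference
    diff = np.linalg.norm(approx - dtheta) / (np.linalg.norm(approx) + np.linalg.norm(dtheta))
    return diff   # ~1e-7 great, ~1e-5 suspicious, ~1e-3 almost surely a bug

# Toy usage: J(theta) = sum(theta^2), gradient 2*theta.
theta = np.array([1.0, -2.0, 3.0])
print(grad_check(lambda t: np.sum(t ** 2), theta, 2 * theta))
#+END_SRC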
** DONE Week 2: Optimization Algorithms
CLOSED: [2017-08-22 Tue 13:43]
*** Mini batch
$X = [x^{(1)} \dots x^{(m)}]$
Split $X, Y$ into mini-batches $X^{\{t\}}, Y^{\{t\}}$, where $X^{\{t\}}$ holds columns $t \cdot \text{batch\_size}$ to $(t+1) \cdot \text{batch\_size}$
*** Minibatch size
- If mini-batch size = m: batch gradient descent, $(X^{\{1\}}, Y^{\{1\}}) = (X, Y)$
- If mini-batch size = 1: stochastic gradient descent, every example is its own mini-batch
- In practice, somewhere between 1 and m: m makes each step too slow, 1 loses the speedup from vectorization
  + a mini-batch of ~1000 examples still benefits from vectorization
1. If the training set is small (m <= 2000), use batch gradient descent
2. Typical mini-batch sizes: 64, 128, 256, 512, ... (powers of 2, chosen so a batch fits in CPU/GPU memory)
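A sketch of the mini-batch split, assuming the $(n_x, m)$ / $(1, m)$ layout used throughout:
#+BEGIN_SRC python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns of (X, Y) together, then slice into mini-batches.
    X: (n_x, m), Y: (1, m); the last batch may be smaller than batch_size."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    Xs, Ys = X[:, perm], Y[:, perm]
    return [(Xs[:, t:t + batch_size], Ys[:, t:t + batch_size])
            for t in range(0, m, batch_size)]
#+END_SRC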
*** Exponentially weighted average
$$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$
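A direct sketch of that recurrence (no bias correction here):
#+BEGIN_SRC python
def ewa(thetas, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0.
    beta = 0.9 roughly averages over the last 10 values."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

print(ewa([10, 10, 10, 10, 10]))  # ramps up toward 10 from the v_0 = 0 start
#+END_SRC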
** DONE Week 3: Hyperparameter
CLOSED: [2017-09-01 Fri 09:52]
*** Video 1: use random sampling, not a grid, to search for the best hyperparameter values
*** Video 2: choose an appropriate scale to pick hyperparameters
- $n^{[l]}$ (number of neurons in layer $l$) or $L$ (number of layers): sample uniformly at random
- $\alpha$: between 0.0001 and 1; don't sample on a linear scale, use a log scale instead:
  r = -4 * np.random.rand()   # r in [-4, 0]
  α = 10^r                    # α in [10^-4, 10^0]
- $\beta$: from 0.9 to 0.999 (0.9 averages over ~10 values, 0.999 over ~1000 values);
  sample $1-\beta$ from 0.1 to 0.001 on a log scale:
  r in [-3, -1]
  1 - β = 10^r
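Putting both sampling rules together in a runnable sketch:
#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng()

# alpha in [1e-4, 1], log-uniform:
r = -4 * rng.random()          # r in (-4, 0]
alpha = 10 ** r

# beta in [0.9, 0.999]: sample 1-beta log-uniformly in [1e-3, 1e-1]:
r = -3 + 2 * rng.random()      # r in [-3, -1)
beta = 1 - 10 ** r

print(alpha, beta)
#+END_SRC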
*** Hyperparameter tuning in practice: Panda vs Caviar
- Babysitting one model (panda): when compute resources are scarce
- Training many models in parallel (caviar): when compute resources are plentiful
*** Batch normalization
**** In a network
**** Fitting Batch norm into a deep network
**** Why Batch Normalizing?
- Don't use batch norm as a regularizer, even if it sometimes has that
  effect
**** Batch Norm at test time
$$\mu = \frac{1}{m}\sum_i z^{(i)}$$
$$\sigma^2 = \frac{1}{m}\sum_i (z^{(i)} - \mu)^2$$
$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$
$$\tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta$$
At test time, estimate $\mu$ and $\sigma^2$ with an exponentially weighted average across mini-batches.
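A minimal sketch of the test-time computation, assuming =mu_ewa= and =sigma2_ewa= were accumulated during training:
#+BEGIN_SRC python
import numpy as np

def batchnorm_at_test(z, mu_ewa, sigma2_ewa, gamma, beta, eps=1e-8):
    """Batch norm at test time: normalize with the exponentially weighted
    averages of mu and sigma^2 gathered across training mini-batches."""
    z_norm = (z - mu_ewa) / np.sqrt(sigma2_ewa + eps)
    return gamma * z_norm + beta
#+END_SRC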
*** Multi-class classification
**** Softmax Regression
Notation: $C$ = number of classes $(0, 1, 2, \dots, C-1)$
The number of neurons in the output layer equals $C$: $n^{[L]} = C$
$$z^{[L]} = W^{[L]}a^{[L-1]} + b^{[L]} \quad \text{(shape } (C,1)\text{)}$$
Activation function:
$$t = e^{z^{[L]}}$$
$$a^{[L]}_i = \frac{t_i}{\sum_{j=1}^{C} t_j}$$
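A sketch of this activation; subtracting =max(z)= is a standard stability trick not mentioned in the notes, and it doesn't change the result:
#+BEGIN_SRC python
import numpy as np

def softmax(z):
    """Softmax activation for z of shape (C, 1)."""
    t = np.exp(z - np.max(z))   # shift for numerical stability
    return t / np.sum(t)        # normalize so the outputs sum to 1

print(softmax(np.array([[5.0], [2.0], [-1.0], [3.0]])))  # sums to 1
#+END_SRC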
**** Training a softmax classifier
*** Introduction to programming frameworks
**** Deep learning frameworks
* Structuring your Machine Learning project
** Week 1
*** Introduction to ML Strategy
**** Why ML Strategy
Try to find quick and effective ways to choose a strategy.
Ways of analyzing ML problems.
**** Orthogonalization
***** Chain of assumptions in ML
- Fit training set well on cost function => bigger network, Adam, ...
- Fit dev set well on cost function => Regularization, Bigger training set
- Fit test set well on cost function => Bigger dev set
- Perform well in real world => Change the devset or cost function
Try not to use early stopping, as it simultaneously affects the fit on the training set and the dev set performance (it is less orthogonal).
*** Setting up your goal
**** Single number evaluation metric
***** First
| Classifier | Precision | Recall |
|------------+-----------+--------|
| A | 95% | 90% |
| B | 98% | 85% |
Rather than using two numbers, find a single new evaluation metric:
| Classifier | Precision | Recall | F1 Score |
|------------+-----------+--------+----------|
| A | 95% | 90% | 92.4% |
| B          | 98%       | 85%    | 91.0%    |
F1 score $= \frac{2}{\frac{1}{P} + \frac{1}{R}}$: the "harmonic mean" of precision and recall.
So:
Having a good dev set + a single evaluation metric really speeds up iterating.
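A one-liner to check the table's numbers:
#+BEGIN_SRC python
def f1_score(p, r):
    """Harmonic mean of precision P and recall R: 2 / (1/P + 1/R)."""
    return 2 / (1 / p + 1 / r)

print(f1_score(0.95, 0.90))  # ≈ 0.924 (classifier A)
print(f1_score(0.98, 0.85))  # ≈ 0.910 (classifier B)
#+END_SRC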
***** Another example
| Algorithm | US  | China | India | Other | *Average* |
|-----------+-----+-------+-------+-------+-----------|
| A         | 3%  | 7%    | 5%    | 9%    | 6%        |
| ...       |     |       |       |       |           |
| F         | ... | ...   |       |       |           |
Try to improve the average.
**** Satisficing and Optimizing metric
It's not always easy to select one metric to optimize.
***** Another cat classification example
| Classifier | Accuracy | Running Time |
|------------+----------+--------------|
| A | 90% | 80ms |
| B | 92% | 95ms |
| C | 95% | 1500ms |
A combined metric like $\text{cost} = \text{accuracy} - 0.5 \times \text{running time}$ feels artificial; instead:
maximize accuracy subject to running time < 100ms
Accuracy ← optimizing metric
Running time ← satisficing metric
If you have n metrics, pick one to optimize and let all the others be satisficing.
**** Train/dev/test distribution
How you can set up these datasets to speed up your work.
***** Cat classification dev/test sets
Try to ensure the dev and test sets come from the same distribution.
***** True story (details changed)
Optimized on a dev set of loan approvals from medium-income zip codes
(will they repay the loan?), then tested on low-income zip codes.
Lost 3 months.
***** Guideline
Choose a dev set and test set to reflect data you expect to get in the future
and consider important to do well on.
**** Size of dev and test sets
***** Old way of splitting
70% train, 30% test
60% train, 20% dev, 20% test
Reasonable for at most ~10^4 examples.
But in the new era, with 10^6 examples:
98% train, 1% dev, 1% test.
***** Size of test set
Set your test set to be big enough to give high confidence in the overall
performance of your system. Can be far less than 30% of your data.
For some applications you don't need a test set, only a dev set
(for example if you have a very large dev set).
**** When to change dev/test sets and metrics?
Metric: classification error
Algorithm A: 3% error → but lets through a lot of porn images
Algorithm B: 5% error → doesn't let porn images through
So your metric + evaluation prefers A, but you and your users prefer B.
When this happens (the metric ranks the wrong algorithm higher), change your metric:
$$\text{Error} = \frac{1}{m_{dev}}\sum_{i=1}^{m_{dev}} \mathcal{I}\{y_{pred}^{(i)} \neq y^{(i)}\}$$
treats porn and non-porn images equally, but you don't want that.
Add a weight to the formula: $w^{(i)} = 1$ if $x^{(i)}$ is non-porn, much larger (e.g. 10) if porn, and normalize by $\sum_i w^{(i)}$.
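A sketch of such a weighted metric; the specific weight values are an illustration, not prescribed by the notes:
#+BEGIN_SRC python
import numpy as np

def weighted_error(y_pred, y, w):
    """Weighted misclassification error, normalized by sum(w) so it stays in [0, 1].
    w upweights the examples (here: porn images) whose mistakes matter more."""
    mistakes = (y_pred != y).astype(float)
    return float(np.sum(w * mistakes) / np.sum(w))
#+END_SRC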
**** Orthogonalization for cat pictures: anti-porn
1. So far we've only discussed how to define a metric to evaluate classifiers (placing the target)
2. Worry separately about how to do well on this metric (aiming at the target)
**** Another example
Alg A: 3% err
Alg B: 5% err
But B does better in practice: your users upload blurrier images,
so your dev/test sets don't use the same kind of images as your users.
Change your metric and/or dev/test set.
** Comparing to Human-level performance
*** Why human-level performance
Human-level performance vs Bayes optimal error.
Humans are generally very close to Bayes performance for many tasks.
As long as ML is worse than humans, you can:
- get labeled data from humans
- gain insight from manual error analysis (why did a person get this right?)
- do better analysis of bias/variance
*** Avoidable bias
**** Cat classification example
|                | Scenario A | Scenario B |
|----------------+------------+------------|
| Humans         | 1%         | 7.5%       |
| Training error | 8%         | 8%         |
| Dev error      | 10%        | 10%        |
| Focus on       | bias       | variance   |
Human-level error is used as a proxy (estimate) for Bayes error.
*Difference between human error and training error = avoidable bias*
*Difference between training and dev error = variance*
*** Understanding Human-level performance
**** Human-level error as proxy for Bayes error
Medical image classification example; suppose:
(a) Typical human: 3% err
(b) Typical doctor: 1% err
(c) Experienced doctor: 0.7% err
(d) Team of experienced doctors: 0.5% err
What is "human-level" error?
Bayes error is ≤ 0.5%, so we use 0.5% as the target to aim for, as seen before.
(For a paper, surpassing (b) may be good enough to publish.)
**** Error analysis example
| Human (proxy for Bayes err) | 1%, 0.7%, 0.5% | 1%, 0.7%, 0.5% | 1%, 0.7%, 0.5% |
| Train err                   | 5%             | 1%             | 0.7%           |
| Dev err                     | 6%             | 5%             | 0.8%           |
Case 1: the choice of human-level value doesn't matter, because avoidable bias
(5% - 1%) is bigger than variance (6% - 5%): focus on bias.
Case 2: focus on variance.
Case 3: it is very important to use 0.5% as your "human-level" error, because it
shows that you should focus on bias (0.7% - 0.5%) and not on variance (0.8% - 0.7%).
This problem only arises once you're already doing very well.
**** Summary of bias/variance with human-level perf
Human-level error (proxy for Bayes err)
^
| "Avoidable bias"
v
Training error
^
| "Variance"
v
Dev error
*** Surpassing human-level performance
**** Surpassing human-level performance
|                 | Scenario A | Scenario B |
|-----------------+------------+------------|
| Team of humans  | 0.5%       | 0.5%       |
| One human       | 1%         | 1%         |
| Training error  | 0.6%       | 0.3%       |
| Dev error       | 0.8%       | 0.4%       |
| Avoidable bias? | ~0.1%      | can't know |
In scenario B the training error is below the best human proxy for Bayes error,
so you can no longer estimate avoidable bias.
**** Problems where ML significantly surpasses human-level performance
- Online advertising
- Product recommendations
- Logistics (predicting transit time)
- Loan approvals
All those examples:
+ come from structured data
+ are not natural perception problems
+ have lots of data
Also: speech recognition, some image recognition, medical tasks (ECG, skin
cancer), etc.
*** Improving your model performance
Set of guidelines
**** The two fundamental assumptions of supervised learning
1. You can fit the training set pretty well (i.e. achieve low avoidable bias)
2. The training set performance generalizes pretty well to the dev/test set (i.e. low variance)
**** Reducing (avoidable) bias and variance
Human-level error (proxy for Bayes err)
    ^
    |                     train a bigger model
    | "Avoidable bias" => train longer / better optimization algorithms (momentum, RMSprop, Adam)
    |                     NN architecture/hyperparameter search (RNN, CNN, ...)
    v
Training error
    ^
    |                     more data
    | "Variance"       => regularization (L2, dropout, data augmentation)
    |                     NN architecture/hyperparameter search
    v
Dev error
These concepts are easy to learn but hard to master.
Applying them will make you more systematic than most ML teams.