2017-08-24 18:58:13 +00:00
#+TITLE: Deep Learning Coursera
#+AUTHOR: Yann Esposito
#+STARTUP: latexpreview
|
2018-01-02 12:30:18 +00:00
|
|
|
|
#+TODO: TODO IN-PROGRESS WAIT | DONE CANCELED
|
2017-09-02 21:54:37 +00:00
|
|
|
|
#+COLUMNS: %TODO %3PRIORITY %40ITEM(Task) %17EFFORT(Estimated Effort){:} %CLOCKSUM %8TAGS(TAG)
|
2017-08-24 18:58:13 +00:00
|
|
|
|
|
|
|
|
|
* Plan
|
|
|
|
|
|
|
|
|
|
5 courses
|
|
|
|
|
|
|
|
|
|
** Neural Network and Deep Learning
|
|
|
|
|
*** Week 1: Introduction
|
|
|
|
|
*** Week 2: Basics of Neural Network programming
|
|
|
|
|
*** Week 3: One hidden layer Neural Networks
|
|
|
|
|
*** Week 4: Deep Neural Network
|
|
|
|
|
** Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
|
|
|
|
|
** Structuring your Machine Learning project
|
|
|
|
|
** Convolutional Neural Networks
|
|
|
|
|
** Natural Language Processing: Building sequence models
|
|
|
|
|
* DONE Neural Network and Deep Learning
|
|
|
|
|
CLOSED: [2017-08-22 Tue 13:43]
|
|
|
|
|
** Introduction
|
|
|
|
|
|
|
|
|
|
*** What is a neural network?
|
|
|
|
|
|
|
|
|
|
*** Supervised Learning with Neural Networks
|
|
|
|
|
|
|
|
|
|
- Lucrative application: ads, showing the ad you're most likely to click on
|
|
|
|
|
- Photo tagging
|
|
|
|
|
- Speech recognition
|
|
|
|
|
- Machine translation
|
|
|
|
|
- Autonomous driving
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
***** Convolutional NN good for images
|
|
|
|
|
|
|
|
|
|
***** Structured data (db of data) vs Unstructured data
|
|
|
|
|
|
|
|
|
|
- Structured data: Tables
|
|
|
|
|
- Unstructured data: Audio, image, text...
|
|
|
|
|
|
|
|
|
|
Thanks to neural networks, computers are now much better at interpreting unstructured data than they were a few years ago.
|
|
|
|
|
|
|
|
|
|
*** Why is Deep Learning taking off?
|
|
|
|
|
|
|
|
|
|
[[///Users/yaesposi/Library/Mobile%20Documents/com~apple~CloudDocs/deft/img/Scale%20drives%20deep%20learning%20progress.png]]
|
|
|
|
|
|
|
|
|
|
- Data (lots of data)
- Computation (faster learning loop)
- Algorithms (e.g., using ReLU instead of sigmoid)
|
|
|
|
|
** Geoffrey Hinton interview
|
|
|
|
|
** Binary Classification
|
|
|
|
|
|
|
|
|
|
\[ (x, y), \quad x \in \mathbb{R}^{n_x}, \quad y \in \{0, 1\} \]

$m$ training examples: $$ \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\} $$

$$ m = m_{train}, \quad m_{test} = \text{number of test examples} $$

$$ X = [ x^{(1)} \ldots x^{(m)} ] \text{ is an } n_x \times m \text{ matrix} $$

=X.shape= is =(n_x, m)=

$$ Y = [ y^{(1)} \ldots y^{(m)} ] $$

=Y.shape= is =(1, m)=
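As a small illustration of this layout, the sketch below stacks examples as columns. The toy values and the names =examples= and =labels= are made up for the example.

```python
import numpy as np

# Hypothetical toy data: m = 4 examples, each with n_x = 3 features.
examples = [np.array([0.1, 0.2, 0.3]),
            np.array([0.4, 0.5, 0.6]),
            np.array([0.7, 0.8, 0.9]),
            np.array([1.0, 1.1, 1.2])]
labels = [0, 1, 1, 0]

# Each example becomes one column of X; the labels form a single row.
X = np.stack(examples, axis=1)        # shape (n_x, m)
Y = np.array(labels).reshape(1, -1)   # shape (1, m)

n_x, m = X.shape
```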
|
|
|
|
|
|
|
|
|
|
** Logistic Regression
|
|
|
|
|
|
|
|
|
|
Given $x \in \mathbb{R}^{n_x}$ you want $\hat{y} = P(y=1 | x)$

Parameters: $w \in \mathbb{R}^{n_x}, b\in \mathbb{R}$
|
|
|
|
|
|
|
|
|
|
Output: $\hat{y} = \sigma(w^Tx + b) = \sigma(z)$
|
|
|
|
|
|
|
|
|
|
$$\sigma(z)= \frac{1}{1 + e^{-z}}$$
|
|
|
|
|
|
|
|
|
|
If $z \rightarrow \infty \Rightarrow \sigma(z) \approx 1$

If $z \rightarrow -\infty \Rightarrow \sigma(z) \approx 0$
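A minimal NumPy sketch of the output formula above. The function names =sigmoid= and =predict_proba= are illustrative choices, not names from the course.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for a column vector x of shape (n_x, 1)."""
    z = np.dot(w.T, x) + b
    return sigmoid(z)
```

For large positive =z= the result approaches 1, and for large negative =z= it approaches 0, matching the two limits above.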
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Alternative notation not used in this course:
|
|
|
|
|
|
|
|
|
|
$x_0 = 1, x \in \mathbb{R}^{n_x+1}$
|
|
|
|
|
$\hat{y} = \sigma(\Theta^Tx)$
|
|
|
|
|
...
|
|
|
|
|
|
|
|
|
|
** Logistic Regression Cost Function
|
|
|
|
|
|
|
|
|
|
We look for a convex loss function:
|
|
|
|
|
|
|
|
|
|
$L(\hat{y},y) = - (y\log(\hat{y}) + (1-y)\log(1-\hat{y}))$
|
|
|
|
|
|
|
|
|
|
If $y = 1$: $L(\hat{y},y) = -\log\hat{y}$, so we want $\log\hat{y}$ large, i.e. $\hat{y}$ large.

If $y = 0$: $L(\hat{y},y) = -\log(1-\hat{y})$, so we want $\log(1-\hat{y})$ large, i.e. $\hat{y}$ small.
|
|
|
|
|
|
|
|
|
|
Cost function: $$ J(w,b) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) = \ldots $$
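The loss and cost can be written directly in NumPy. This is a sketch that assumes =Y_hat= and =Y= both have shape (1, m) and that predictions lie strictly between 0 and 1. The function names are illustrative.

```python
import numpy as np

def logistic_loss(y_hat, y):
    """Loss L(y_hat, y) for one example (or element-wise on arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost J: average of the per-example losses over the m columns."""
    m = Y.shape[1]
    return float(np.sum(logistic_loss(Y_hat, Y)) / m)
```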
|
|
|
|
|
|
|
|
|
|
** Gradient Descent
|
|
|
|
|
|
|
|
|
|
Minimize $J(w,b)$

1. Initialize $w, b$ (generally to zero)
2. Take a step in the steepest descent direction
3. Repeat step 2 until convergence; since $J$ is convex, this reaches the global optimum
|
|
|
|
|
|
|
|
|
|
Repeat {
|
|
|
|
|
$w := w - \alpha\frac{dJ(w)}{dw} = w - \alpha\,\mathrm{dw}$
|
|
|
|
|
}
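The loop above can be sketched for logistic regression as follows. The gradient formulas =dw = X (A - Y)^T / m= and =db = mean(A - Y)= are the standard ones for the cross-entropy cost. The default values for =alpha= and =num_iterations= are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.1, num_iterations=1000):
    """Fit logistic regression on X of shape (n_x, m) and Y of shape (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))   # step 1: initialize to zero
    b = 0.0
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)   # predictions, shape (1, m)
        dZ = A - Y
        dw = np.dot(X, dZ.T) / m          # dJ/dw, shape (n_x, 1)
        db = float(np.sum(dZ)) / m        # dJ/db
        w = w - alpha * dw                # step 2: move against the gradient
        b = b - alpha * db
    return w, b
```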
|
|
|
|
|
|
|
|
|
|
** Derivatives
|
|
|
|
|
** More Derivative Examples
|
|
|
|
|
** Computation Graph
|
|
|
|
|
** Computing Derivatives
|
|
|
|
|
** Computing Derivatives for multiple examples
|
|
|
|
|
** Vectorization
|
|
|
|
|
Vectorization means getting rid of explicit for loops in your code.
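To make the idea concrete, here is the same dot product written with an explicit loop and with a single NumPy call. Both function names are illustrative.

```python
import numpy as np

def dot_loop(w, x):
    """Dot product with an explicit Python for loop (slow for large vectors)."""
    total = 0.0
    for i in range(len(w)):
        total += w[i] * x[i]
    return total

def dot_vectorized(w, x):
    """Same computation delegated to NumPy, with no explicit loop."""
    return float(np.dot(w, x))
```

Both return the same value. The vectorized version is usually much faster on large arrays because the loop runs in optimized native code.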
|
|
|
|
|
** Vectorizing Logistic Regression
|
|
|
|
|
** Vectorizing Logistic Regression's Gradient Computation
|
|
|
|
|
** Broadcasting in Python
|
|
|
|
|
** Quick Tour of Jupyter / ipython notebooks
|
|
|
|
|
** Neural Network Basics
|
|
|
|
|
J = a*b + a*c - (b+c) = a (b + c) - (b + c) = (a - 1) (b + c)
|
2017-09-02 21:54:37 +00:00
|
|
|
|
* DONE Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
|
|
|
|
|
CLOSED: [2017-09-01 Fri 09:52]
|
2017-08-24 18:58:13 +00:00
|
|
|
|
** DONE Week 1: Setting up your Machine
|
|
|
|
|
CLOSED: [2017-08-22 Tue 13:43]
|
|
|
|
|
*** Recipe
|
|
|
|
|
|
|
|
|
|
If *High bias*? (bad training set performance?)
|
|
|
|
|
Then try:
|
|
|
|
|
- Bigger network
|
|
|
|
|
- Training longer
|
|
|
|
|
- (NN architecture search)
|
|
|
|
|
Else if *High variance*? (bad dev set performance?)
|
|
|
|
|
Then try:
|
|
|
|
|
- More data
|
|
|
|
|
- Regularization
|
|
|
|
|
- (NN architecture search)
|
|
|
|
|
|
|
|
|
|
In deep learning there is not much of a bias/variance tradeoff, provided we have a lot of
computing power (for a bigger network) and a lot of data.
|
|
|
|
|
*** Regularization
|
|
|
|
|
**** Regularization: reduce variance
|
|
|
|
|
- L2 regularization
|
|
|
|
|
|
|
|
|
|
  $\frac{\lambda}{2m} \|w\|_2^2$

- L1 regularization: same with $\|w\|_1$ instead of $\|w\|_2^2$
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
λ is a regularization parameter (in code named =lambd=)
|
|
|
|
|
|
|
|
|
|
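Cost: $$ J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^L \|W^{[l]}\|_F^2 $$

This matrix norm is called the "Frobenius norm".

$dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]}$

The update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$ still works.

L2 regularization is sometimes called "weight decay", because the update multiplies $W^{[l]}$ by $(1 - \frac{\alpha\lambda}{m})$ before subtracting the backprop gradient.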
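A sketch of the two formulas above: the penalty added to the cost and the extra term added to each gradient. The names =l2_penalty= and =regularized_grad= are illustrative, and =weights= is assumed to be a list holding one weight matrix per layer.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambd / 2m) times the sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def regularized_grad(dW_backprop, W, lambd, m):
    """Gradient from backprop plus the (lambd / m) * W regularization term."""
    return dW_backprop + (lambd / m) * W
```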
|
|
|
|
|
|
|
|
|
|
**** Dropout Regularization
|
|
|
|
|
|
|
|
|
|
Eliminates nodes by layer randomly for each training example.
|
|
|
|
|
|
|
|
|
|
- Implementation (inverted dropout), shown for layer 3:
  - Generate a random boolean mask, a new one at each iteration:
    ~d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob~
  - Apply the mask: ~a3 = np.multiply(a3, d3)~
  - Rescale: ~a3 /= keep_prob~ (so that the expected value of ~a3~ stays the same, which reduces scaling problems at test time)
|
|
|
|
|
|
|
|
|
|
Making predictions at test time: no dropout.
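The three steps can be wrapped in one function. This is a sketch: passing an explicit NumPy =Generator= as =rng= is a choice made here for reproducibility, not something the course prescribes.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Apply inverted dropout to activations `a`; returns (new_a, mask)."""
    d = rng.random(a.shape) < keep_prob   # boolean mask, True = keep the unit
    a = a * d                             # zero out the dropped units
    a = a / keep_prob                     # rescale so the expected value is unchanged
    return a, d
```

At test time, skip the call entirely (or use =keep_prob = 1.0=, which keeps every unit).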
|
|
|
|
|
|
|
|
|
|
**** Other regularization methods
|
|
|
|
|
|
|
|
|
|
- Data augmentation, (flipping images for example, random crops, random distortions, etc...)
|
|
|
|
|
- Early stopping: stop training at an earlier iteration
|
|
|
|
|
|
|
|
|
|
*** Setting up your optimization problem
|
|
|
|
|
|
|
|
|
|
**** Normalizing Inputs
|
|
|
|
|
|
|
|
|
|
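- $\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$
- $x := x - \mu$ (center the data)
- $\sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)})^2$ (element-wise, computed on the centered data)
- $x := x / \sigma$ (divide by the standard deviation, so that each feature has unit variance)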
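A sketch of these steps for a data matrix =X= of shape (n_x, m). The function names and the small =eps= guard against division by zero are additions made for this example.

```python
import numpy as np

def fit_normalizer(X):
    """Compute the per-feature mean and standard deviation over the m columns."""
    mu = np.mean(X, axis=1, keepdims=True)
    sigma = np.sqrt(np.mean((X - mu) ** 2, axis=1, keepdims=True))
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Center and scale X; eps avoids dividing by zero for constant features."""
    return (X - mu) / (sigma + eps)
```

Keeping =mu= and =sigma= separate from the transformation makes it easy to apply the statistics computed on the training set to the dev and test sets as well.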
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
**** Gradient Checking
|
|
|
|
|
***** Don't use grad check in training, only to debug
***** If the algorithm fails grad check, look at the components (is it db? dW? dW on a certain layer? etc.) to locate the bug
***** Remember to include the regularization term
***** Doesn't work with dropout: turn dropout off (set keep_prob to 1.0), then check
|
|
|
|
|
***** Run at random initialization; perhaps again after training
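A sketch of gradient checking. The two-sided difference, the =epsilon= value and the relative-difference formula are the usual choices and are assumptions here, since these notes do not spell them out. =f= is any function mapping a parameter vector to a scalar cost.

```python
import numpy as np

def numerical_gradient(f, theta, epsilon=1e-7):
    """Approximate the gradient of f at theta with two-sided differences."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        plus = theta.astype(float)    # astype returns a fresh copy
        minus = theta.astype(float)
        plus.flat[i] += epsilon
        minus.flat[i] -= epsilon
        grad.flat[i] = (f(plus) - f(minus)) / (2 * epsilon)
    return grad

def relative_difference(grad, grad_approx):
    """Small values (around 1e-7) suggest the analytic gradient is correct."""
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator
```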
|
|
|
|
|
|
|
|
|
|
** DONE Week 2: Optimization Algorithms
|
|
|
|
|
CLOSED: [2017-08-22 Tue 13:43]
|
|
|
|
|
*** Mini batch
|
|
|
|
|
|
|
|
|
|
$X = [x^{(1)} \ldots x^{(m)}]$

$X, Y \rightarrow X^{\{i\}}, Y^{\{i\}}$, where $X^{\{i\}}$ holds the columns of $X$ from index =i * batch_size= up to (but excluding) =(i + 1) * batch_size=, counting from 0
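The split can be sketched with plain slicing. The function name is illustrative, and shuffling the examples before splitting (common in practice) is deliberately left out to keep the sketch short.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size):
    """Split X of shape (n_x, m) and Y of shape (1, m) into mini-batches by column."""
    m = X.shape[1]
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size   # slicing clips the last, possibly smaller, batch
        batches.append((X[:, start:end], Y[:, start:end]))
    return batches
```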
|
|
|
|
|
|
|
|
|
|
*** Minibatch size
|
|
|
|
|
|
|
|
|
|
- If mini-batch size = m => batch gradient descent: $(X^{\{1\}}, Y^{\{1\}}) = (X, Y)$.
- If mini-batch size = 1 => stochastic gradient descent: every example is its own mini-batch.
- In practice, choose a size in between 1 and m: with m each iteration takes too long, with 1 you lose the speedup from vectorization.
  + An in-between size (~1000) still benefits from vectorization.

1. If the training set is small (m <= 2000), use batch gradient descent.
2. Typical mini-batch sizes: 64, 128, 256, 512, ... (powers of two, $2^k$). Make sure a mini-batch fits in CPU/GPU memory.
|
|
|
|
|
|
|
|
|
|
*** Exponentially weighted average
|
|
|
|
|
|
|
|
|
|
v_t = βv_{t-1} + (1-β)θ_t
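A direct translation of the recurrence, assuming the usual initialization $v_0 = 0$ (not stated in these notes). The function name is illustrative.

```python
def exponentially_weighted_average(values, beta):
    """Return the list of v_t for the sequence theta_1, theta_2, ..."""
    v = 0.0   # assumed initial value v_0
    averages = []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages
```

Because of the zero initialization, the first few averages start well below the actual values.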
|
2017-09-02 21:54:37 +00:00
|
|
|
|
** DONE Week 3: Hyperparameter
|
|
|
|
|
CLOSED: [2017-09-01 Fri 09:52]
|
|
|
|
|
*** Video 1: use random sampling rather than a grid to search for the best hyperparameter values
|
|
|
|
|
*** Video 2: choose appropriate scale to pick hyperparameter
|
|
|
|
|
- Sample uniformly at random for $n^{[l]}$ (number of neurons in layer $l$) or $L$ (number of layers).
- $\alpha$: between 0.0001 and 1. Don't sample on a linear scale, use a log scale instead:
  ~r = -4 * np.random.rand()~ (so $r \in (-4, 0]$)
  $\alpha = 10^r$ (so $\alpha$ ranges over $10^{-4} \ldots 10^0$)
- $\beta$: between 0.9 and 0.999 (0.9 averages over about 10 values, 0.999 over about 1000 values):
  $1 - \beta$ ranges over $0.1 \ldots 0.001$
  sample $r \in [-3, -1]$
  $1 - \beta = 10^r$, i.e. $\beta = 1 - 10^r$
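Both sampling schemes above can be sketched as follows. The function names are illustrative, and an explicit NumPy =Generator= is used instead of the global =np.random= state so that runs can be reproduced.

```python
import numpy as np

def sample_alpha(rng):
    """Sample a learning rate on a log scale between 1e-4 and 1."""
    r = -4 * rng.random()      # r in (-4, 0]
    return 10 ** r

def sample_beta(rng):
    """Sample beta so that 1 - beta is log-uniform between 1e-3 and 1e-1."""
    r = rng.uniform(-3, -1)    # r in [-3, -1)
    return 1 - 10 ** r
```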
|
|
|
|
|
*** Hyperparameter tuning in practice: Pandas vs. Caviar
- Babysitting one model (panda), when you have few computing resources
- Training many models in parallel (caviar), when you have a lot of computing resources
|
|
|
|
|
*** Batch normalization
|
|
|
|
|
**** In a network
|
|
|
|
|
**** Fitting Batch norm into a deep network
|
|
|
|
|
**** Why Batch Normalizing?
|
|
|
|
|
- Don't use batch norm as a regularizer, even if it can sometimes have a slight
  regularization effect.
|
|
|
|
|
**** Batch Norm at test time
|
|
|
|
|
$\mu = \frac{1}{m} \sum_i z^{(i)}$

$\sigma^2 = \frac{1}{m} \sum_i (z^{(i)} - \mu)^2$

$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$

$\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$

At test time, estimate $\mu$ and $\sigma^2$ with an exponentially weighted average across mini-batches.
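A sketch of how the running estimates could be maintained during training and reused at test time. The =running= dictionary, the =momentum= value and both function names are assumptions made for this example.

```python
import numpy as np

def batch_norm_train(Z, gamma, beta, running, momentum=0.9, eps=1e-8):
    """Normalize a mini-batch Z of shape (n, m) and update the running statistics."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    # Exponentially weighted averages across mini-batches.
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * Z_norm + beta

def batch_norm_test(Z, gamma, beta, running, eps=1e-8):
    """At test time, use the running statistics instead of batch statistics."""
    Z_norm = (Z - running["mu"]) / np.sqrt(running["var"] + eps)
    return gamma * Z_norm + beta
```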
|
|
|
|
|
*** Multi-class classification
|
|
|
|
|
**** Softmax Regression
|
|
|
|
|
Notation: $C$ = number of classes $(0, 1, 2, \ldots, C-1)$

The output layer has $C$ units: $n^{[L]} = C$

$z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}$, of shape $(C, 1)$

Activation function:

$t = e^{z^{[L]}}$ (element-wise)

$a^{[L]} = \frac{e^{z^{[L]}}}{\sum_{j=0}^{C-1} t_j}$

$a^{[L]}_i = \frac{t_i}{\sum_{j=0}^{C-1} t_j}$
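A NumPy sketch of the softmax activation. Subtracting the column maximum before exponentiating is a standard numerical-stability trick added here. It does not change the result, because the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Softmax over the class axis for z of shape (C, m)."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))   # shift for stability
    return t / np.sum(t, axis=0, keepdims=True)
```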
|
|
|
|
|
|
|
|
|
|
**** Training a softmax classifier
|
|
|
|
|
*** Introduction to programming frameworks
|
|
|
|
|
|
|
|
|
|
**** Deep learning frameworks
|
|
|
|
|
* Structuring your Machine Learning project
|
|
|
|
|
** Week 1
|
|
|
|
|
*** Introduction to ML Strategy
|
|
|
|
|
**** Why ML Strategy
|
|
|
|
|
Try to find quick and effective ways to choose a strategy.
|
|
|
|
|
|
|
|
|
|
Ways of analyzing ML problems
|
|
|
|
|
|
|
|
|
|
**** Orthogonalization
|
|
|
|
|
|
|
|
|
|
***** Chain of assumptions in ML
|
|
|
|
|
|
|
|
|
|
- Fit training set well on cost function => bigger network, Adam, ...
|
|
|
|
|
- Fit dev set well on cost function => Regularization, Bigger training set
|
|
|
|
|
- Fit test set well on cost function => Bigger dev set
|
|
|
|
|
- Perform well in the real world => Change the dev set or the cost function

Try not to use early stopping, as it simultaneously affects the cost on the training set and on the dev set.
|
|
|
|
|
|
|
|
|
|
*** Setting up your goal
|
|
|
|
|
|
|
|
|
|
**** Single number evaluation metric
|
|
|
|
|
|
|
|
|
|
***** First
|
|
|
|
|
|
|
|
|
|
| Classifier | Precision | Recall |
|------------+-----------+--------|
| A          | 95%       | 90%    |
| B          | 98%       | 85%    |
|
|
|
|
|
|
|
|
|
|
Rather than using two numbers, find a single new evaluation metric.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| Classifier | Precision | Recall | F1 Score |
|------------+-----------+--------+----------|
| A          | 95%       | 90%    | 92.4%    |
| B          | 98%       | 85%    | 91.0%    |
|
|
|
|
|
|
|
|
|
|
$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$ :: "Harmonic mean" of precision $P$ and recall $R$.
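The formula is short enough to check against the table. In this sketch, precision and recall are passed as fractions between 0 and 1.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 / (1 / precision + 1 / recall)
```

With the values from the table, classifier A scores about 0.924 and classifier B about 0.910, so A wins on this single metric.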
|
|
|
|
|
|
|
|
|
|
So:
|
|
|
|
|
|
|
|
|
|
Having a good dev set plus a single evaluation metric really speeds up iterating.
|
|
|
|
|
|
|
|
|
|
***** Another example
|
|
|
|
|
|
|
|
|
|
| Algorithm | US  | China | India | Other | *Average* |
|-----------+-----+-------+-------+-------+-----------|
| A         | 3%  | 7%    | 5%    | 9%    |           |
| ...       |     |       |       |       |           |
| F         | ... | ...   |       |       |           |
|
|
|
|
|
|
|
|
|
|
Try to improve the average.
|
|
|
|
|
|
|
|
|
|
**** Satisficing and Optimizing metric
|
|
|
|
|
|
|
|
|
|
It's not always easy to select one metric to optimize.
|
|
|
|
|
|
|
|
|
|
***** Another cat classification example
|
|
|
|
|
|
|
|
|
|
| Classifier | Accuracy | Running Time |
|------------+----------+--------------|
| A          | 90%      | 80ms         |
| B          | 92%      | 95ms         |
| C          | 95%      | 1500ms       |
|
|
|
|
|
|
|
|
|
|
One option is to combine both into a single number: cost = accuracy - 0.5 × running time

A more natural option: maximize accuracy subject to running time < 100 ms
|
|
|
|
|
|
|
|
|
|
Accuracy <- Optimizing
|
|
|
|
|
Running time <- Satisficing
|
|
|
|
|
|
|
|
|
|
If you have n metrics, pick one to be the optimizing metric, and let all the others be satisficing.
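Applied to the table above, the rule "optimize accuracy, satisfice on running time" can be sketched like this. The dictionary layout and the function name are illustrative.

```python
classifiers = [
    {"name": "A", "accuracy": 0.90, "running_time_ms": 80},
    {"name": "B", "accuracy": 0.92, "running_time_ms": 95},
    {"name": "C", "accuracy": 0.95, "running_time_ms": 1500},
]

def pick_best(candidates, max_running_time_ms):
    """Keep candidates meeting the satisficing constraint, then maximize accuracy."""
    feasible = [c for c in candidates if c["running_time_ms"] < max_running_time_ms]
    if not feasible:
        return None   # no candidate satisfies the constraint
    return max(feasible, key=lambda c: c["accuracy"])
```

With a 100 ms limit, C is excluded despite its higher accuracy, and B is selected.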
|
|
|
|
|
|
|
|
|
|
**** Train/dev/test distribution
|
|
|
|
|
|
|
|
|
|
How to set up these datasets to speed up your work.
|
|
|
|
|
|
|
|
|
|
***** Cat classification dev/test sets
|
|
|
|
|
|
|
|
|
|
Try to find a way that dev and test set come from the same distribution.
|
|
|
|
|
|
|
|
|
|
***** True story (detail changed)
|
|
|
|
|
|
|
|
|
|
Optimizing on a dev set of loan approvals for medium income zip codes.
|
|
|
|
|
|
|
|
|
|
(repay loan?)
|
|
|
|
|
|
|
|
|
|
Tested on low income zip codes.
|
|
|
|
|
|
|
|
|
|
Lost 3 months
|
|
|
|
|
|
|
|
|
|
***** Guideline
|
|
|
|
|
|
|
|
|
|
Choose a dev set and test set to reflect data you expect to get in the future
|
|
|
|
|
and consider important to do well on.
|
|
|
|
|
|
|
|
|
|
**** Size of dev and test sets
|
|
|
|
|
|
|
|
|
|
***** Old way of splitting
|
|
|
|
|
70% train, 30% test
|
|
|
|
|
60% train, 20% dev, 20% test
|
|
|
|
|
|
|
|
|
|
These splits are reasonable for at most ~10^4 examples.
|
|
|
|
|
|
|
|
|
|
But in new era, 10^6 examples:
|
|
|
|
|
|
|
|
|
|
train: 98%, Dev 1%, Test 1%.
|
|
|
|
|
|
|
|
|
|
***** Size of test set
|
|
|
|
|
|
|
|
|
|
Set your test set to be big enough to give high confidence in the overall
|
|
|
|
|
performance of your system. Can be far less than 30% of your data.
|
|
|
|
|
|
|
|
|
|
For some applications, you don't need a test set and a dev set alone is enough.
For example, if you have a very large dev set.
|
|
|
|
|
|
|
|
|
|
**** When to change dev/test sets and metrics?
|
|
|
|
|
|
|
|
|
|
Metric: classification error
|
|
|
|
|
Algorithm A: 3% error → but lets through a lot of porn images
Algorithm B: 5% error → but doesn't let porn images through
|
|
|
|
|
|
|
|
|
|
So your metric + evaluation prefer A, but you and your users prefer B.
|
|
|
|
|
|
|
|
|
|
When this happens, your metric mispredicts which algorithm is better (here, B), so you should change the metric.
|
|
|
|
|
|
|
|
|
|
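Error: $\frac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} I\{y_{pred}^{(i)} \neq y^{(i)}\}$

This metric treats porn and non-porn images equally, but you don't want that.

We add a weight $w^{(i)}$ to each term of the sum: $w^{(i)} = 1$ if $x^{(i)}$ is non-porn, and a much larger value (e.g. 10) if it is porn, so that mistakes on such images cost more.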
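A sketch of a weighted classification error. Dividing by the sum of the weights (rather than by the number of examples) is an assumption made here so that the result stays between 0 and 1. The function name is illustrative.

```python
import numpy as np

def weighted_error(y_pred, y_true, weights):
    """Classification error where each example counts according to its weight."""
    mistakes = (y_pred != y_true).astype(float)
    return float(np.sum(weights * mistakes) / np.sum(weights))
```

With all weights equal to 1 this reduces to the usual classification error.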
|
|
|
|
|
|
|
|
|
|
**** Orthogonalization for cat pictures: anti-porn
|
|
|
|
|
|
|
|
|
|
1. So far we've only discussed how to define a metric to evaluate classifier
|
|
|
|
|
2. Worry separately about how to do well on this metric
|
|
|
|
|
|
|
|
|
|
Step 1 is placing the target, and step 2 is aiming at the target.
|
|
|
|
|
|
|
|
|
|
**** Another example
|
|
|
|
|
Alg A: 3% err
|
|
|
|
|
Alg B: 5% err
|
|
|
|
|
|
|
|
|
|
But B does better in practice. You see that users are uploading blurrier images.
Your dev/test sets are not using the same kind of images.
|
|
|
|
|
|
|
|
|
|
Change your metric and/or dev/test set.
|
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
*** Comparing to Human-level performance
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
**** Why human-level performance
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
Human-level perf vs Bayes optimal error
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
Humans are generally very close to Bayes performance for a lot of tasks.
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
- get labeled data from humans
|
|
|
|
|
- gain insight from manual error analysis (why did a person get this right?)
|
|
|
|
|
- better analysis of bias/variance
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
**** Avoidable bias
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
***** Cat classification example
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
| Humans         | 1%            | 7.5%              |
| Training error | 8%            | 8%                |
| Dev error      | 10%           | 10%               |
|                | focus on bias | focus on variance |
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
Human level error as a proxy (estimate) for Bayes error.
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
*Diff between Human err and Training err = avoidable bias*
|
|
|
|
|
*Diff between Train and Dev err = variance*
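The two differences can be computed directly from the numbers in the table. This is a sketch: the function name is illustrative, and breaking ties in favor of variance is an arbitrary choice.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus) from three error rates in percent."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus
```

For the first column of the table (1%, 8%, 10%) the avoidable bias is 7 points against 2 points of variance. For the second column (7.5%, 8%, 10%) it is 0.5 against 2.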
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
**** Understanding Human-level performance
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
***** Human-level error as proxy for Bayes error
|
|
|
|
|
Medical image classification example:
|
|
|
|
|
suppose
|
|
|
|
|
(a) Typical human 3% err
|
|
|
|
|
(b) Typical doctor 1% err
|
|
|
|
|
(c) Experienced doctor 0.7% err
|
|
|
|
|
(d) and team of experienced doctors 0.5% err
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
What is "human-level" error?
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
Bayes error is <= 0.5%.
So we use 0.5% as the estimate of Bayes error to aim for, as seen before.
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
For publishing a paper, surpassing (b) may be good enough to claim human-level performance.
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
***** Error analysis example
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
|                             | Case 1        | Case 2        | Case 3        |
|-----------------------------+---------------+---------------+---------------|
| Human (proxy for Bayes err) | 1, 0.7, 0.5%  | 1, 0.7, 0.5%  | 1, 0.7, 0.5%  |
| Train err                   | 5%            | 1%            | 0.7%          |
| Dev err                     | 6%            | 5%            | 0.8%          |
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
Case 1:
|
|
|
|
|
For this example it doesn't matter which human-level figure you pick, because the avoidable
bias (5% minus 0.5 to 1%, so at least 4%) is bigger than the variance (6% - 5% = 1%).
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
Case 2: focus on variance
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
Case 3: it is very important that you use 0.5% as your "human-level" error, because it shows
that you should focus on bias (0.7% - 0.5% = 0.2%) and not on variance (0.8% - 0.7% = 0.1%).
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
This problem arises only when you are already doing very well.
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
***** Summary of bias/variance with human-level perf
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
Human-level error (proxy for Bayes err)

  ^
  | "Avoidable bias"
  v

Training error

  ^
  | "Variance"
  v

Dev error
|
2017-09-02 21:54:37 +00:00
|
|
|
|
|
|
|
|
|
**** Surpassing human-level performance
|
|
|
|
|
|
2017-09-13 06:55:56 +00:00
|
|
|
|
***** Surpassing human-level performance
|
|
|
|
|
|
|
|
|
|
| Team            | 0.5%  | 0.5%       |
| One human       | 1%    | 1%         |
| Training error  | 0.6%  | 0.3%       |
| Dev error       | 0.8%  | 0.4%       |
|-----------------+-------+------------|
| Avoidable bias? | ~0.5% | can't know |
|
|
|
|
|
|
|
|
|
|
***** Problems where ML significantly surpasses human-level performance
|
|
|
|
|
|
|
|
|
|
- Online advertising
|
|
|
|
|
- Product recommendations
|
|
|
|
|
- Logistics (predicting transit time)
|
|
|
|
|
- Loan approvals
|
|
|
|
|
|
|
|
|
|
All these examples:
|
|
|
|
|
+ come from structured data
|
|
|
|
|
+ not natural perception problems
|
|
|
|
|
+ Lots of data
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Also: speech recognition, some image recognition, some medical tasks (ECG, skin cancer),
etc.
|
|
|
|
|
|
|
|
|
|
**** Improving your model performance
|
|
|
|
|
|
|
|
|
|
Set of guidelines
|
|
|
|
|
|
|
|
|
|
***** The two fundamental assumptions of supervised learning
|
|
|
|
|
|
|
|
|
|
1. You can fit the training set pretty well (~ avoidable bias)
|
|
|
|
|
2. The training set performance generalizes pretty well to the dev/test set
|
|
|
|
|
|
|
|
|
|
***** Reducing (avoidable) bias and variance
|
|
|
|
|
|
|
|
|
|
Human-level error (proxy for Bayes err)
|
|
|
|
|
|
|
|
|
|
^
|
|
|
|
|
  | train bigger model
  | "Avoidable bias" => train longer/better optimization algorithms (momentum, RMSprop, Adam)
  | NN architecture/hyperparameters search (RNN, CNN...)
|
|
|
|
|
v
|
|
|
|
|
|
|
|
|
|
Training error
|
|
|
|
|
|
|
|
|
|
^
|
|
|
|
|
  | More data
  | "Variance" => Regularization (L2, dropout, data augmentation)
  | NN architecture/hyperparameters search
|
|
|
|
|
v
|
|
|
|
|
|
|
|
|
|
Dev error
|
|
|
|
|
|
|
|
|
|
These concepts are easy to learn, hard to master.
|
|
|
|
|
If you apply them, you'll be more systematic than most ML teams.
|
|
|
|
|
|
|
|
|
|
** Week 2
|
|
|
|
|
*** Error Analysis
|
|
|
|
|
**** Error Analysis
|
|
|
|
|
***** Carrying out error analysis
|
|
|
|
|
- Imagine your cat classifier doesn't work as well as expected.
- One of your collaborators thinks you should focus on dog pictures that are misclassified as cats.
- Manually analyze 100 mislabeled dev set examples.
- Count how many are dogs.
- Suppose 5% are dogs. Then at most you could go from 10% error to 9.5%, so it is not very useful.

- Suppose instead that 50% of them are dogs. Then you could go down from 10% to 5% error,
  so you can be more confident that the work is worthwhile.
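The "ceiling" reasoning above is a one-line computation. The function name is illustrative: it returns the best error you could reach if every mistake in one category were fixed.

```python
def error_ceiling(current_error, fraction_of_errors_in_category):
    """Best achievable error if all errors in the given category were eliminated."""
    return current_error * (1 - fraction_of_errors_in_category)
```

With 10% error, fixing a category holding 5% of the mistakes leaves 9.5%, while fixing one holding 50% of them leaves 5%.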
|
|
|
|
|
***** Evaluate multiple ideas in parallel
|
|
|
|
|
- fix pictures of dogs
|
|
|
|
|
- fix great cats (lion, panthers, ...)
|
|
|
|
|
- improve performance of blurry images
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Create spreadsheet:
|
|
|
|
|
|
|
|
|
|
| Image      | Dog | Great cats | Blurry |
|------------+-----+------------+--------|
| 1          | ok  |            |        |
| 2          |     |            | ok     |
| 3          |     | ok         | ok     |
| ...        |     |            |        |
| % of total | 8%  | 43%        | 61%    |
|
|
|
|
|
|
|
|
|
|
While going through the examples, you sometimes notice other categories, like Instagram filters...

You can then easily see where you should improve.
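The spreadsheet's last row can be computed with a small tally. In this sketch each misclassified example is represented as a set of category tags. The tag names and the function name are made up for the example.

```python
from collections import Counter

def tally(tagged_examples):
    """Fraction of examples carrying each tag; an example may carry several tags."""
    counts = Counter(tag for tags in tagged_examples for tag in tags)
    total = len(tagged_examples)
    return {tag: n / total for tag, n in counts.items()}
```

Because one image can fall into several categories, the fractions do not need to sum to 1.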
|
|
|
|
|
**** Cleaning up incorrectly labeled dataset
|
|
|
|
|
***** Incorrectly labeled examples
|
|
|
|
|
What to do if you have incorrectly labeled data?
First, let's consider the training set.

As long as there are not too many errors, DL is quite robust to random errors.

But systematic errors are a problem.
|
|
|
|
|
***** Error analysis
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| Image      | Dog | Great cats | Blurry | Comments                         |
|------------+-----+------------+--------+----------------------------------|
| ...        |     |            |        |                                  |
| 98         | ok  |            |        | labeler missed cat in background |
| 99         |     |            | ok     |                                  |
| 100        |     | ok         | ok     | drawing of a cat not a real cat  |
| % of total | 8%  | 43%        | 61%    |                                  |
|
|
|
|
|
|
|
|
|
|
1st case:
|
|
|
|
|
|
|
|
|
|
Overall dev set error: 10%
|
|
|
|
|
Errors due to incorrect labels: 0.6%
|
|
|
|
|
Errors due to other causes: 9.4%
|
|
|
|
|
|
|
|
|
|
2nd case:
|
|
|
|
|
|
|
|
|
|
Overall dev set error: 2%
|
|
|
|
|
Errors due to incorrect labels: 0.6%
|
|
|
|
|
Errors due to other causes: 1.4%
|
|
|
|
|
|
|
|
|
|
In the 2nd case, incorrect labels account for 30% of the errors (0.6 / 2), so take the time to fix the mislabeled examples.
|
|
|
|
|
***** Correcting incorrect dev/test set examples
|
|
|
|
|
|
|
|
|
|
- Apply the same process to your dev and test sets to make sure they continue to
  come from the same distribution.
- Consider examining examples your algorithm got right as well as ones it got
  wrong.
- Train and dev/test data may now come from slightly different distributions.
|
|
|
|
|
**** Build your first system quickly then iterate
|
|
|
|
|
***** Speech recognition example
|
|
|
|
|
- noisy background
|
|
|
|
|
- café noise
|
|
|
|
|
- car noise
|
|
|
|
|
- Accented speech
|
|
|
|
|
- Far from microphone
|
|
|
|
|
- young children's speech
|
|
|
|
|
- stuttering, uh, ah, um...
|
|
|
|
|
|
|
|
|
|
There are 50 directions you could go in. Which one should you focus on?
|
|
|
|
|
|
|
|
|
|
1. Set up dev/test set and metric
|
|
|
|
|
2. Build initial system quickly
|
|
|
|
|
3. Use Bias/Variance analysis & Error Analysis to prioritize next steps
|
|
|
|
|
|
|
|
|
|
Guideline: *Build your first system quickly then iterate*
|
|
|
|
|
|
|
|
|
|
Do not overthink: build something quick and dirty first.
|
|
|
|
|
*** Mismatched training and dev/test set
|
|
|
|
|
**** Training and testing on different distributions
|
|
|
|
|
***** Cat app example
|
|
|
|
|
Two sources of data:
|
|
|
|
|
- data from webpages
|
|
|
|
|
- data from mobile app
|
|
|
|
|
|
|
|
|
|
Let's say you don't have a lot of users yet (~10k images from mobile, 200k from the web).

You care about doing well on mobile images. You don't want to train on only the 10k,
but the dilemma is that the 200k aren't from the same distribution.
|
|
|
|
|
|
|
|
|
|
Option 1: pool the 210k images and split them between train/dev/test (205k / 2.5k / 2.5k)
- advantage: train, dev and test all come from the same distribution
- disadvantage: dev/test are dominated by web images, so you optimize for web instead of mobile
- on average only about 119 of the 2.5k dev images will come from mobile
Option 1 is not recommended.
|
|
|
|
|
|
|
|
|
|
Option 2:
- the train set has the 200k images from the web plus 5k from the mobile app
- dev and test are all mobile app images
- advantage: you are aiming the target where you want it to be
- disadvantage: your training distribution differs from your dev/test distribution
But over the long term it will get you better performance.
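The numbers in both options can be checked with a few lines. The function names are illustrative, and splitting the remaining mobile images evenly between dev and test is an assumption.

```python
def expected_mobile_in_dev(n_web, n_mobile, dev_size):
    """Option 1: expected number of mobile images in a randomly drawn dev set."""
    return dev_size * n_mobile / (n_web + n_mobile)

def option_2_split(n_web, n_mobile, n_mobile_train):
    """Option 2: all web images plus some mobile images go to train; the rest to dev/test."""
    remaining = n_mobile - n_mobile_train
    dev = remaining // 2
    test = remaining - dev
    return {"train": n_web + n_mobile_train, "dev": dev, "test": test}
```

With 200k web and 10k mobile images, a random 2.5k dev set contains about 119 mobile images on average, which is why Option 1 aims at the wrong target.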
|
|
|
|
|
***** Speech recognition example
|
|
|
|
|
- Speech-activated rearview mirror (a real product in China).
  1. Training: take all the speech data you have: purchased data, smart speaker control, voice keyboard... (500k)
  2. Dev/test: speech-activated rearview mirror data (20k)

Either set your training set to be the 500k from 1. and take Dev/Test from 2.

Or the training set could be 510k (500k from 1. and 10k from 2.), with the Dev/Test sets (5k + 5k) taken from the rest of 2.
|
|
|
|
|
|
|
|
|
|
Much bigger training set.
|
|
|
|
|
**** Bias and Variance with mismatched data distribution
|
|
|
|
|
***** Cat classifier example
|
|
|
|
|
Assume humans get ~0% error.
|
|
|
|
|
|
|
|
|
|
| Training error | 1%  |
| Dev error      | 10% |
|
|
|
|
|
|
|
|
|
|
Maybe there isn't a variance problem at all: the gap could come from the dev distribution being different.
|
|
|
|
|
|
|
|
|
|
Training-dev set: Same distrib as training set but not used for training.
|
|
|
|
|
|
|
|
|
|
Train / dev / test ==> Train split in train-2 and train-dev
|
|
|
|
|
|
|
|
|
|
So now you learn only on train-2 and check on train-dev and dev and test.
|
|
|
|
|
|
|
|
|
|
| Train err%     | 1%     | 1%               |
| Train-dev err% | 9%     | 1.5%             |
| dev err%       | 10%    | 10%              |
|                | Var pb | data mismatch pb |
|
|
|
|
|
|
|
|
|
|
Other examples:
|
|
|
|
|
|
|
|
|
|
| Human err%     | 0%      | 0%                      |
| Train err%     | 10%     | 10%                     |
| Train-dev err% | 11%     | 11%                     |
| dev err%       | 12%     | 20%                     |
|                | Bias pb | Bias + data mismatch pb |
|
|
|
|
|
***** Bias/variance on mismatched training and dev/test sets
|
|
|
|
|
|
|
|
|
|
| Error                  | Value | Gap to the next row measures     |
|------------------------+-------+----------------------------------|
| Human level            | 4%    | avoidable bias                   |
| Training set error     | 7%    | variance                         |
| Training-dev set error | 10%   | data mismatch                    |
| Dev error              | 12%   | degree of overfitting to dev set |
| Test error             | 12%   |                                  |
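The four gaps can be read off with a small helper. The function and key names are illustrative, and the errors are given in percent.

```python
def decompose_errors(human, train, train_dev, dev, test):
    """Gaps between consecutive error levels, each pointing at a different problem."""
    return {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
        "dev_overfitting": test - dev,
    }
```

A gap can also be negative, which happens when the dev/test distribution is easier than the training distribution.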
|
|
|
|
|
|
|
|
|
|
Example where the training distribution is much harder than the dev/test distribution:
|
|
|
|
|
|
|
|
|
|
| Human level            | 4%  |
| Training set error     | 7%  |
| Training-dev set error | 10% |
| Dev error              | 6%  |
| Test error             | 6%  |
|
|
|
|
|
|
|
|
|
|
***** More general formulation
|
|
|
|
|
|
|
|
|
|
The numbers can be placed in a table:
|
|
|
|
|
|
|
|
|
|
|                                  | General speech rec. tasks | Rearview mirror speech data |
|----------------------------------+---------------------------+-----------------------------|
| Human level                      | "Human level err" (4%)    | 6%                          |
| Error on examples trained on     | "Training err" (7%)       | 6%                          |
| Error on examples not trained on | "Training-dev err" (10%)  | "Dev/Test err" (6%)         |
|
|
|
|
|
|
|
|
|
|
**** Addressing data mismatch
|
|
|
|
|
|
|
|
|
|
There is no completely systematic way to address data mismatch.
|
|
|
|
|
But there are things you can try.
|
|
|
|
|
|
|
|
|
|
***** Addressing data mismatch
|
|
|
|
|
- Carry out manual error analysis to try to understand difference between
|
|
|
|
|
training and dev/test sets.
|
|
|
|
|
ex: you might find that a lot of dev set is noisy (car noise)
|
|
|
|
|
- Make training data more similar, or collect more data similar to dev/test sets.
|
|
|
|
|
ex: simulate noisy in-car data.
|
|
|
|
|
|
|
|
|
|
***** Artificial data synthesis
|
|
|
|
|
|
|
|
|
|
- Clean audio + car noise = synthesized in-car audio

This creates more data, and can be a reasonable process.
|
|
|
|
|
|
|
|
|
|
Let's say you have 10k hrs of sound and only 1hr of car noise.
|
|
|
|
|
|
|
|
|
|
There is a risk that your algorithm will overfit to that 1 hour of car noise.
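A sketch of the mixing step, treating audio as 1-D NumPy arrays of samples. The function name and the =noise_gain= parameter are hypothetical. Looping the short noise clip with =np.tile= is exactly the kind of repetition that creates the overfitting risk described above.

```python
import numpy as np

def synthesize_in_car_audio(clean, car_noise, noise_gain=0.5):
    """Overlay car noise on clean speech, looping the noise if it is shorter."""
    repeats = int(np.ceil(len(clean) / len(car_noise)))
    noise = np.tile(car_noise, repeats)[:len(clean)]   # same noise repeated
    return clean + noise_gain * noise
```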
|
|
|
|
|
|
|
|
|
|
***** Artificial data synthesis (2)
|
|
|
|
|
Car recognition
|
|
|
|
|
|
|
|
|
|
Using computer-generated cars vs. just photos: you might overfit to the generated
cars. A video game might have only 20 distinct cars, so the model would overfit to these 20 cars.
|
|
|
|
|
|
|
|
|
|
*** Learning from multiple tasks
|
|
|
|
|
|
|
|
|
|
**** Transfer learning
|
|
|
|
|
|
|
|
|
|
Learning to recognize cats can help with learning to read x-ray scans.
|
|
|
|
|
|
|
|
|
|
***** Transfer learning
|
|
|
|
|
|
|
|
|
|
Create new NN by changing just the last layer (the output).
|
|
|
|
|
|
|
|
|
|
(X,Y) now become (radiology images, diagnosis)
|
|
|
|
|
|
|
|
|
|
Retrain $W^{[L]}, b^{[L]}$ (the parameters of the last layer).

You might want to retrain just the last layer, or all the layers.

Rule of thumb: retrain just the last layer if you have little data.
Rule of thumb: retrain all the layers if you have a lot of data.

Training on the first task is called pre-training, and retraining on the new task is called fine-tuning.
|
|
|
|
|
|
|
|
|
|
A lot of low-level features learning from a very large data set might help.
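A toy NumPy sketch of retraining only the last layer. It assumes a network with one pre-trained hidden layer stored in a dictionary =params= under the hypothetical keys =W1= and =b1=, a ReLU hidden activation, and a new binary output unit trained by gradient descent. None of these choices come from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune_last_layer(params, X, Y, alpha=0.1, num_iterations=500, seed=0):
    """Keep the pre-trained W1, b1 frozen; initialize and train a new output layer."""
    rng = np.random.default_rng(seed)
    W1, b1 = params["W1"], params["b1"]        # frozen, pre-trained
    A1 = np.maximum(0, np.dot(W1, X) + b1)     # hidden features, computed once
    m = X.shape[1]
    W2 = rng.standard_normal((1, W1.shape[0])) * 0.01   # new last layer
    b2 = 0.0
    for _ in range(num_iterations):
        A2 = sigmoid(np.dot(W2, A1) + b2)
        dZ2 = A2 - Y
        W2 = W2 - alpha * np.dot(dZ2, A1.T) / m   # only the last layer is updated
        b2 = b2 - alpha * float(np.sum(dZ2)) / m
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```

Since the frozen layer never changes, its activations can be computed once and reused, which makes fine-tuning on a small dataset cheap.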
|
|
|
|
|
|
|
|
|
|
- Another example. Speech recognition system:
|
|
|
|
|
|
2017-11-28 20:23:44 +00:00
|
|
|
|
  $X$ (audio) $\rightarrow$ $y$ (transcript), transferred to wake word / trigger word detection (ok google, hey siri, etc.)
|
2017-09-13 06:55:56 +00:00
|
|
|
|
|
|
|
|
|
You could add several new layers, and retrain the new layers or even more layers.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Transfer learning makes sense when you have very different numbers of examples for the two tasks.
|
|
|
|
|
|
2017-11-28 20:23:44 +00:00
|
|
|
|
- 10^6 image recognition, but only 100 radiology data.
|
2017-09-13 06:55:56 +00:00
|
|
|
|
- 10k hrs sounds, but only 1h data for wake words...
|
|
|
|
|
|
|
|
|
|
Transfer from a task with a lot of data to a task with little data.
|
|
|
|
|
|
|
|
|
|
It doesn't make sense to transfer the other way.
|
|
|
|
|
|
|
|
|
|
***** When transfer learning makes sense
|
|
|
|
|
|
|
|
|
|
Task from A to B
|
|
|
|
|
|
|
|
|
|
- Task A and B have the same input X
|
|
|
|
|
- You have a lot more data for Task A than Task B
|
|
|
|
|
- Low level features from A could be helpful for learning B
|
|
|
|
|
|
|
|
|
|
**** Multi-task learning
|
|
|
|
|
|
|
|
|
|
Simultaneously learn multiple tasks.
|
|
|
|
|
|
|
|
|
|
***** Simplified autonomous driving example
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|                | y^(i) (4,1) |
|----------------+-------------|
| pedestrians    |           0 |
| cars           |           1 |
| stop signs     |           1 |
| traffic lights |           0 |
|
|
|
|
|
|
|
|
|
|
Y = [ y^(1) y^(2) .... y^(m) ]
|
|
|
|
|
|
|
|
|
|
***** Neural network architecture
|
|
|
|
|
|
|
|
|
|
x -> [] -> [] ... -> y^hat in R^4
|
|
|
|
|
|
|
|
|
|
Loss: \[ J = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{4} L(\hat{y}^{(i)}_j, y^{(i)}_j) \]
|
|
|
|
|
|
|
|
|
|
L is the usual loss function.
|
|
|
|
|
|
|
|
|
|
Unlike softmax regression, one image can have multiple labels.
|
|
|
|
|
|
|
|
|
|
- One NN doing 4 things often works better than training 4 separate NNs, one per task.
|
|
|
|
|
|
|
|
|
|
Some examples might not be fully labelled.
|
|
|
|
|
You can still train by summing the loss only over the entries labeled 0 or 1, skipping the "?" entries (unlabeled values).
|
|
|
|
|
|
|
|
|
|
So you can use more of the available data.
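A minimal sketch of this masked loss, assuming the logistic loss for L and using `None` to stand for a "?" entry. The function names are illustrative and not from the course.

```python
import math

def logistic_loss(y_hat, y):
    """Binary cross-entropy for one predicted probability and one 0/1 label."""
    eps = 1e-12  # avoids log(0)
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

def multitask_loss(Y_hat, Y):
    """Average over the m examples of the summed per-task losses.

    Y_hat[i][j] is the predicted probability for task j on example i.
    Y[i][j] is 0, 1, or None when the label is missing ("?").
    """
    m = len(Y)
    total = 0.0
    for y_hat_i, y_i in zip(Y_hat, Y):
        for y_hat_ij, y_ij in zip(y_hat_i, y_i):
            if y_ij is None:  # unlabeled entry: it contributes nothing
                continue
            total += logistic_loss(y_hat_ij, y_ij)
    return total / m

# Two examples, four tasks (pedestrians, cars, stop signs, traffic lights).
Y = [[0, 1, 1, 0], [1, None, None, 0]]
Y_hat = [[0.1, 0.8, 0.9, 0.2], [0.7, 0.5, 0.5, 0.1]]
print(multitask_loss(Y_hat, Y))
```

The second example still contributes through its two known labels, which is the point of the masking.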
|
|
|
|
|
|
|
|
|
|
***** When multi-task learning makes sense
|
|
|
|
|
|
|
|
|
|
- Training on a set of tasks that could benefit from having shared lower-level features
|
|
|
|
|
- Usually: amount of data you have for each task is quite similar
|
|
|
|
|
- Can train a big enough neural network to do well on all the tasks
|
|
|
|
|
|
|
|
|
|
In practice, transfer learning is used a lot more than multi-task learning.
|
|
|
|
|
|
|
|
|
|
*** End-to-end deep learning
|
|
|
|
|
**** What is end-to-end deep learning?
|
|
|
|
|
***** What is end-to-end deep learning?
|
|
|
|
|
Speech recognition example
|
|
|
|
|
|
|
|
|
|
audio - MFCC -> features -- ML --> phonemes -> words -> transcript
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
audio ------------------------------------------------> transcript
|
|
|
|
|
You might need a lot of data.
|
|
|
|
|
With about 3k hours of data, the classical pipeline approach works better.
|
|
|
|
|
With 10k to 100k hours, the end-to-end approach generally shines.
|
|
|
|
|
***** Face recognition
|
|
|
|
|
A multi-stage approach works better:
|
|
|
|
|
1. detect face, zoom-in and crop to center the face
|
|
|
|
|
2. then feed this cropped image to a network that finds the identity, generally by comparing it to all employees.
|
|
|
|
|
Why?
|
|
|
|
|
- Have a lot of data for task 1
|
|
|
|
|
- Have a lot of data for task 2
|
|
|
|
|
If you were to try to learn everything at the same time you wouldn't have enough
|
|
|
|
|
data.
|
|
|
|
|
***** More examples
|
|
|
|
|
Machine translation:
|
|
|
|
|
English -> text analysis -> ... -> French
|
|
|
|
|
English -------------------------> French
|
|
|
|
|
End-to-end works well here because we have a lot of (x, y) examples.
|
|
|
|
|
Estimating child's age from scan of the hand:
|
|
|
|
|
Image -> bones -> age
|
|
|
|
|
Image ----------> age (there is not enough data)
|
|
|
|
|
**** Whether to use end-to-end deep learning
|
|
|
|
|
***** Pros and cons of end-to-end learning
|
|
|
|
|
Pros:
|
|
|
|
|
- let the data speak (no human preconception)
|
|
|
|
|
- Less hand-designing of components needed
|
|
|
|
|
Cons:
|
|
|
|
|
- May need a large amount of data: input end ----> output end (x,y)
|
|
|
|
|
- Excludes potentially useful hand-designed components. Knowledge comes from two sources, data and hand-design; with little data, hand-design matters more.
|
|
|
|
|
***** Applying end-to-end deep learning
|
|
|
|
|
Key question: do you have sufficient data to learn a function of the complexity
|
|
|
|
|
needed to map x to y?
|
|
|
|
|
- choose X->Y mapping
|
|
|
|
|
- A pure deep learning approach is not appropriate if end-to-end examples are hard to find.
|
|
|
|
|
* Convolutional Neural Networks
|
|
|
|
|
** Week 1
|
|
|
|
|
*** Computer Vision
|
|
|
|
|
size: 64x64x3 -> 12288
|
|
|
|
|
size: 1000x1000x3 -> 3 million
|
|
|
|
|
|
|
|
|
|
x = [x_1 ... x_{3 million}]; with 1000 hidden units, W^[1] has shape (1000, 3e6), i.e. 3 billion parameters.
|
|
|
|
|
*** Edge Detection Example
|
|
|
|
|
|
|
|
|
|
Take the 6x6 image and "convolve" it with the 3x3 filter [[1,1,1],[0,0,0],[-1,-1,-1]] (a horizontal edge detector); the output is 4x4.
|
|
|
|
|
|
|
|
|
|
Python (course assignment): conv_forward
|
|
|
|
|
TensorFlow: tf.nn.conv2d
|
|
|
|
|
Keras: Conv2D
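A minimal pure-Python sketch of the operation these functions implement, assuming a square image, no padding and stride 1. The function name `conv2d_valid` and the test image are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    This is cross-correlation, which deep learning calls "convolution".
    """
    n, f = len(image), len(kernel)
    out_size = n - f + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(f) for b in range(f)))
        output.append(row)
    return output

# 6x6 image: bright (10) top half, dark (0) bottom half.
image = [[10] * 6] * 3 + [[0] * 6] * 3
kernel = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

for row in conv2d_valid(image, kernel):
    print(row)
```

The two middle rows of the 4x4 output are 30 and the outer rows are 0: the filter responds only where the brightness changes from top to bottom.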
|
|
|
|
|
*** More edge detection
|
|
|
|
|
|
|
|
|
|
Sobel filter: [[1,2,1],[0,0,0],[-1,-2,-1]]
|
|
|
|
|
Scharr filter: [[3,10,3],[0,0,0],[-3,-10,-3]]
|
|
|
|
|
*** Padding
|
|
|
|
|
|
|
|
|
|
The image shrinks at every convolution, and border pixels are used less than central ones.
|
|
|
|
|
If the filter has size f and the image has size n, the output after filtering has size n - f + 1.
|
|
|
|
|
**** Solution: put a border around the image (padding)
|
|
|
|
|
|
|
|
|
|
"valid" (no padding): nxn * fxf -> (n-f+1) x (n-f+1)
|
|
|
|
|
"same": Pad so the output size is the same as the input size
|
|
|
|
|
n + 2p - f + 1 = n => p = (f-1)/2
|
|
|
|
|
|
|
|
|
|
3x3 -> p = (3-1)/2 = 1
|
|
|
|
|
5x5 -> p = (5-1)/2 = 2
|
|
|
|
|
- f is usually odd, easier for padding + the filter has a central position.
|
|
|
|
|
- 1x1, 3x3, 5x5, 7x7.
|
|
|
|
|
*** Strided Convolutions
|
|
|
|
|
|
|
|
|
|
Skip some columns/rows: instead of sliding the filter over every column/row, move it s columns/rows at a time (s is the stride).
|
|
|
|
|
|
|
|
|
|
nxn * fxf, padding: p, stride: s
|
|
|
|
|
|
|
|
|
|
(floor((n + 2p - f) / s) + 1) x (floor((n + 2p - f) / s) + 1)
|
|
|
|
|
**** Summary of convolutions
|
|
|
|
|
|
|
|
|
|
nxn image
|
|
|
|
|
fxf filter
|
|
|
|
|
|
|
|
|
|
padding p
|
|
|
|
|
stride s
|
|
|
|
|
|
|
|
|
|
output size:
|
|
|
|
|
\[ \left( \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1 \right) \]
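The output-size rule can be checked with a few lines of Python. This is a sketch; the helper names `conv_output_size` and `same_padding` are illustrative.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the output size equal to the input size (stride 1, odd f)."""
    return (f - 1) // 2

print(conv_output_size(6, 3))                     # "valid" convolution
print(conv_output_size(6, 3, p=same_padding(3)))  # "same" convolution
print(conv_output_size(7, 3, p=0, s=2))           # strided convolution
```

The integer division `//` plays the role of the floor: positions where the filter would hang over the edge are simply not computed.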
|
|
|
|
|
**** Convolution in math textbooks (flip the filter vertically and horizontally)
|
|
|
|
|
|
|
|
|
|
cross-correlation vs convolution
|
|
|
|
|
|
|
|
|
|
By convention, in deep learning we call the cross-correlation operator "convolution".
|
|
|
|
|
|
|
|
|
|
The true (flipped) convolution is associative:
|
|
|
|
|
(A * B) * C = A * (B * C)
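Written out for an image A and an f x f filter K, the two operations differ only in the direction in which the filter is indexed. The symbols A, K, i, j, a, b and the zero-based, "valid" indexing are choices made here for illustration:

\[ (A \star K)_{i,j} = \sum_{a=0}^{f-1} \sum_{b=0}^{f-1} A_{i+a,\,j+b} \, K_{a,b} \qquad \text{(cross-correlation)} \]

\[ (A * K)_{i,j} = \sum_{a=0}^{f-1} \sum_{b=0}^{f-1} A_{i+a,\,j+b} \, K_{f-1-a,\,f-1-b} \qquad \text{(convolution: filter flipped on both axes)} \]

Since the filter values are learned, skipping the flip changes nothing for a neural network.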
|
|
|
|
|
*** Convolution over volumes
|
|
|
|
|
|
|
|
|
|
6x6x3 * 3x3x3 -> 4x4
|
|
|
|
|
height x width x #channels
|
|
|
|
|
|
|
|
|
|
The number of channels must be the same in the image and in the filter.
|
|
|
|
|
**** Multiple filters
|
|
|
|
|
|
|
|
|
|
6x6x3 * 3x3x3 (filter 1) ---> 4x4 --\
6x6x3 * 3x3x3 (filter 2) ---> 4x4 --+==> 4x4x2 (n_c of the output = #filters = 2)
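A small sketch of convolution over a volume with several filters, in pure Python and under the same assumptions as before (no padding, stride 1). The name `conv_volume` and the toy values are illustrative.

```python
def conv_volume(image, filters):
    """Convolve an n x n x n_c image with a list of f x f x n_c filters.

    image[i][j][c] and filt[a][b][c] are indexed height, width, channel.
    Returns an (n-f+1) x (n-f+1) x len(filters) volume.
    """
    n, n_c = len(image), len(image[0][0])
    f = len(filters[0])
    out = n - f + 1
    return [[[sum(image[i + a][j + b][c] * filt[a][b][c]
                  for a in range(f) for b in range(f) for c in range(n_c))
              for filt in filters]
             for j in range(out)]
            for i in range(out)]

# 6x6x3 image of ones and two 3x3x3 filters (all ones, all twos).
image = [[[1] * 3 for _ in range(6)] for _ in range(6)]
filters = [[[[w] * 3 for _ in range(3)] for _ in range(3)] for w in (1, 2)]

result = conv_volume(image, filters)
print(len(result), len(result[0]), len(result[0][0]))  # height, width, channels
print(result[0][0])
```

Each filter produces one 4x4 channel, and the channels are stacked along the last axis.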
|
|
|
|
|
*** One layer of convolutional neural network
|
|
|
|
|
10 filters that are 3x3x3 in one layer of NN, how many parameters?
|
|
|
|
|
|
|
|
|
|
3x3x3 = 27 weights + 1 bias = 28 params per filter
|
|
|
|
|
28 x 10 = 280 params
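The same count as a one-line function (a sketch; `conv_layer_params` is an illustrative name):

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """Each filter has f*f*n_c_prev weights plus one bias."""
    return (f * f * n_c_prev + 1) * n_filters

print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))
```

The input height and width do not appear in the formula: the number of parameters is the same for a 64x64 image and a 1000x1000 image.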
|
|
|
|
|
**** Summary of notation
|
|
|
|
|
|
|
|
|
|
If layer l is a convolutional layer:
|
|
|
|
|
f^[l] = filter size
|
|
|
|
|
p^[l] = padding
|
|
|
|
|
s^[l] = stride
|
|
|
|
|
n_c^[l] = number of filters
|
|
|
|
|
|
|
|
|
|
each filter is f^[l] x f^[l] x n_c^[l-1]
|
|
|
|
|
Activations: a^[l] -> n_H^[l] x n_W^[l] x n_c^[l]
|
|
|
|
|
A^[l] -> m x n_H^[l] x n_W^[l] x n_c^[l]
|
|
|
|
|
Weights: f^[l] x f^[l] x n_c^[l-1] x n_c^[l] (n_c^[l]: #filters in layer l)
|
|
|
|
|
Bias: n_c^[l] - (1,1,1,n_c^[l])
|
|
|
|
|
|
|
|
|
|
Input: n_H^[l-1] x n_W^[l-1] x n_c^[l-1]
|
|
|
|
|
Output: n_H^[l] x n_W^[l] x n_c^[l]
|
|
|
|
|
|
|
|
|
|
n^[l] = floor((n^[l-1] + 2p^[l] - f^[l]) / s^[l]) + 1 (applies to both n_H and n_W)
|
|
|
|
|
*** A simple convolution neural network example
|
|
|
|
|
|
|
|
|
|
|_|/ ---------------------------->
|
|
|
|
|
39x39x3          f^[1]=3, s^[1]=1, p^[1]=0
|
|
|
|
|
n_H^[0] = n_W^[0] = 39 10 filters
|
|
|
|
|
n_c^[0]=3
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|_|/ ---------------------------->
|
|
|
|
|
37x37x10         f^[2]=5, s^[2]=2, p^[2]=0
|
|
|
|
|
20 filters
|
|
|
|
|
|
|
|
|
|
|_|/ ---------------------------->
|
|
|
|
|
17x17x20         f^[3]=5, s^[3]=2, p^[3]=0
|
|
|
|
|
40 filters
|
|
|
|
|
|
|
|
|
|
|_|/ ---- flatten: 1960 units --> logistic/softmax --> y^hat
|
|
|
|
|
7x7x40
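The shapes in the example above follow from applying the output-size formula layer by layer. A sketch, with the layer list transcribed from the diagram and illustrative variable names:

```python
def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

# (filter size f, stride s, padding p, number of filters) for each layer
layers = [(3, 1, 0, 10), (5, 2, 0, 20), (5, 2, 0, 40)]

n, n_c = 39, 3  # input is 39x39x3
shapes = [(n, n, n_c)]
for f, s, p, n_filters in layers:
    n = conv_output_size(n, f, p, s)
    n_c = n_filters
    shapes.append((n, n, n_c))

print(shapes)
print(n * n * n_c)  # size of the flattened vector fed to the classifier
```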
|
|
|
|
|
|
|
|
|
|
**** Types of layer in a CNN
|
|
|
|
|
|
|
|
|
|
- Convolution (CONV)
|
|
|
|
|
- Pooling (POOL)
|
|
|
|
|
- Fully connected (FC)
|
|
|
|
|
|
|
|
|
|
*** Pooling Layer
|
|
|
|
|
|
|
|
|
|
**** Max pooling
|
|
|
|
|
|
|
|
|
|
4x4 --- max over 2x2 region --> 2x2
|
|
|
|
|
|
|
|
|
|
Hyperparameters: f=2, s=2
|
|
|
|
|
No parameters to learn!
|
|
|
|
|
|
|
|
|
|
In practice it works well.
|
|
|
|
|
|
|
|
|
|
**** Example
|
|
|
|
|
|
|
|
|
|
5x5 input with f=3, s=1 -> 3x3 output
|
|
|
|
|
|
|
|
|
|
1 3 2 1 3
2 9 1 1 5             9 9 5
1 3 2 3 2    ====>    9 9 5
8 3 5 1 0             8 6 9
5 6 1 2 9
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
With several channels, the output has the same number of channels as the input.
|
|
|
|
|
Max pooling is applied to each channel independently.
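A sketch of max pooling on one channel, using the 5x5 matrix from the example. The name `max_pool` is illustrative; a multi-channel input would call it once per channel.

```python
def max_pool(matrix, f, s):
    """Max pooling of one channel with an f x f window and stride s (no padding)."""
    n_h, n_w = len(matrix), len(matrix[0])
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    return [[max(matrix[i * s + a][j * s + b]
                 for a in range(f) for b in range(f))
             for j in range(out_w)]
            for i in range(out_h)]

x = [[1, 3, 2, 1, 3],
     [2, 9, 1, 1, 5],
     [1, 3, 2, 3, 2],
     [8, 3, 5, 1, 0],
     [5, 6, 1, 2, 9]]

for row in max_pool(x, f=3, s=1):
    print(row)
```

There is nothing to learn here: f and s are fixed hyperparameters and the operation has no weights.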
|
|
|
|
|
|
|
|
|
|
**** Average Pooling
|
|
|
|
|
|
|
|
|
|
Same as previous, but we take the average instead of the max
|
|
|
|
|
|
|
|
|
|
**** Summary
|
|
|
|
|
Hyperparameters
|
|
|
|
|
- f: filter size
|
|
|
|
|
- s: stride
|
|
|
|
|
- Max or Average pooling
|
|
|
|
|
- p: padding (almost never used, p=0 in general)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
n_H x n_W x n_c ----> (floor((n_H - f) / s) + 1) x (floor((n_W - f) / s) + 1) x n_c
|
|
|
|
|
|
|
|
|
|
*** CNN Example
|
|
|
|
|
|
|
|
|
|
- Input Image: 32x32x3 (try to recognize a handwritten digit, e.g. a 7)
|
|
|
|
|
--- conv f=5, s=1 ---->
|
|
|
|
|
- Conv1: 28x28x6 ---- max pool, f=2, s=2 ----> POOL1 14x14x6 } LAYER 1
|
|
|
|
|
--- conv f=5 s=1 -->
|
|
|
|
|
- Conv2: 10x10x16 ---- max pool, f=2, s=2 ----> POOL2 5x5x16 } LAYER 2
|
|
|
|
|
|
|
|
|
|
Flatten: 400 ----> FC3 120 ---> FC4 84 ----> softmax (10 outputs)
|
|
|
|
|
W^[3] (120,400)
|
|
|
|
|
b^[3] (120)
|
|
|
|
|
|
|
|
|
|
In general:
|
|
|
|
|
|
|
|
|
|
- n_H, n_W will decrease as we go deeper
|
|
|
|
|
- n_c will increase as we go deeper
|
|
|
|
|
- Common pattern: CONV - POOL - CONV - POOL - FC - FC - FC - SOFTMAX
|
|
|
|
|
|
|
|
|
|
**** Sizes
|
|
|
|
|
Activation size goes down gradually through the network. Number of parameters: few for CONV, 0 for POOL, most of them in FC.
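A sketch that recomputes the activation sizes and parameter counts for the CNN example above, assuming one bias per filter and per fully connected unit. The helper names are illustrative.

```python
def conv_params(f, n_c_prev, n_filters):
    return (f * f * n_c_prev + 1) * n_filters  # weights + one bias per filter

def fc_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias vector

# (layer name, activation shape, number of parameters)
table = [
    ("Input",   (32, 32, 3),  0),
    ("CONV1",   (28, 28, 6),  conv_params(5, 3, 6)),
    ("POOL1",   (14, 14, 6),  0),
    ("CONV2",   (10, 10, 16), conv_params(5, 6, 16)),
    ("POOL2",   (5, 5, 16),   0),
    ("FC3",     (120,),       fc_params(400, 120)),
    ("FC4",     (84,),        fc_params(120, 84)),
    ("Softmax", (10,),        fc_params(84, 10)),
]

for name, shape, params in table:
    size = 1
    for dim in shape:
        size *= dim
    print(f"{name:8s} shape={shape!s:14s} activations={size:5d} params={params}")
```

The two fully connected layers hold the large majority of the parameters, while the convolutional layers hold only a few thousand.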
|
|
|
|
|
|
|
|
|
|
*** Why convolutions?
|
|
|
|
|
|
|
|
|
|
Two advantages:
|
|
|
|
|
- parameter sharing and sparsity of connections.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A fully connected layer between 32x32x3 (3072 units) and 28x28x6 (4704 units) would need 3072 x 4704, about 14 million parameters.
|
|
|
|
|
Conv has only (5x5x3 + 1) x 6 = 456 parameters (the 156 often quoted is (5x5 + 1) x 6, which ignores the 3 input channels).
|
|
|
|
|
|
|
|
|
|
- *Parameter sharing*: A feature detector that's useful in one part of the image
|
|
|
|
|
is probably useful in another part of the image.
|
|
|
|
|
|
|
|
|
|
- *Sparsity of connections*: In each layer, each output value depends only on a
|
|
|
|
|
small number of inputs.
|
|
|
|
|
** Week 2
|
|
|
|
|
*** Case studies
|
|
|
|
|
**** Why look at case studies?
|
|
|
|
|
***** Outline
|
|
|
|
|
Classic networks:
|
|
|
|
|
- LeNet-5
|
|
|
|
|
- AlexNet
|
|
|
|
|
- VGG
|
|
|
|
|
|
|
|
|
|
ResNet (residual network), a 152-layer network
|
|
|
|
|
|
|
|
|
|
Inception
|
|
|
|
|
**** Classic Networks
|
|
|
|
|
***** LeNet - 5
|
|
|
|
|
Recognize handwritten digits,
|
|
|
|
|
32x32x1 --- conv 5x5 s = 1 --->
|
|
|
|
|
28x28x6 --- avg pool, f=2, s=2 -->
|
|
|
|
|
14x14x6 --- conv 5x5,s=1 -->
|
|
|
|
|
10x10x16 --- avg pool, f=2 s=2 --->
|
|
|
|
|
5x5x16 (400) --> FC 120 --> FC 84 --> Softmax y^
|
|
|
|
|
|
|
|
|
|
About 60k parameters. Today's networks often have tens to hundreds of millions of parameters.
|
|
|
|
|
|
|
|
|
|
n_H, n_W decrease, n_C increase
|
|
|
|
|
|
|
|
|
|
Conv, pool, Conv, pool, fc, fc, output
|
|
|
|
|
|
|
|
|
|
LeCun et al. 1998, Gradient-based learning applied to document recognition.
|
|
|
|
|
***** AlexNet
|
|
|
|
|
Alex Krizhevsky et al. 2012, ImageNet classification with deep convolutional neural networks.
|
|
|
|
|
|
|
|
|
|
227x227x3 --- conv 11x11, s=4 --->
|
|
|
|
|
55x55x96 --- MAX Pool, 3x3, s=2 ---->
|
|
|
|
|
27x27x96 --- conv 5x5, same ------>
|
|
|
|
|
27x27x256 --- MAX POOL, 3x3, s=2 ---->
|
|
|
|
|
13x13x256 --- conv 3x3, same ----->
|
|
|
|
|
13x13x384 --- conv 3x3, same ---> 13x13x384 --- conv 3x3, same ---> 13x13x256 --- MAX POOL, 3x3, s=2 --->
|
|
|
|
|
6x6x256 --- flatten: 9216 --> FC 4096 --> FC 4096 --> Softmax 1000
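The spatial sizes in the AlexNet pipeline can be re-derived with the output-size formula. A sketch, with the step list transcribed from the lines above and illustrative helper names:

```python
def out_size(n, f, s=1, p=0):
    return (n + 2 * p - f) // s + 1

def same(f):
    return (f - 1) // 2  # padding for a "same" convolution with stride 1

n = 227
sizes = [n]
# (filter size, stride, padding) for each conv / max-pool step listed above
steps = [
    (11, 4, 0),        # conv 11x11, s=4
    (3, 2, 0),         # max pool 3x3, s=2
    (5, 1, same(5)),   # conv 5x5, same
    (3, 2, 0),         # max pool 3x3, s=2
    (3, 1, same(3)),   # conv 3x3, same
    (3, 1, same(3)),   # conv 3x3, same
    (3, 1, same(3)),   # conv 3x3, same
    (3, 2, 0),         # max pool 3x3, s=2
]
for f, s, p in steps:
    n = out_size(n, f, s, p)
    sizes.append(n)

print(sizes)
print(n * n * 256)  # flattened size before the fully connected layers
```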
|
|
|
|
|
|
|
|
|
|
Similar to previous but MUCH bigger:
|
|
|
|
|
|
|
|
|
|
About 60 million parameters
|
|
|
|
|
|
|
|
|
|
Also uses ReLU activations.
|
|
|
|
|
***** VGG - 16
|
|
|
|
|
|
|
|
|
|
CONV = 3x3 filter, s=1, same
|
|
|
|
|
MAX-POOL = 2x2, s=2
|
|
|
|
|
|
|
|
|
|
224x224x3 --- [CONV 64]x2 ---> 224x224x64 --- POOL --->
|
|
|
|
|
112x112x64 --- [CONV 128]x2 ---> 112x112x128 --- POOL --->
|
|
|
|
|
56x56x128 --- [CONV 256]x3 ---> 56x56x256 --- POOL --->
|
|
|
|
|
28x28x256 --- [CONV 512]x3 ---> 28x28x512 --- POOL --->
|
|
|
|
|
14x14x512 --- [CONV 512]x3 ---> 14x14x512 --- POOL --->
|
|
|
|
|
7x7x512 ---> FC 4096 --> FC 4096 ---> Softmax 1000
|
|
|
|
|
|
|
|
|
|
Simonyan & Zisserman 2015. Very deep convolutional networks for large-scale image recognition.
|
|
|
|
|
|
|
|
|
|
About 138 million parameters
|
|
|
|
|
|
|
|
|
|
VGG-19 is an even bigger network, but VGG-16 performs about as well as VGG-19.
|
|
|
|
|
**** Residual Networks (ResNets)
|
|
|
|
|
Very, very deep neural networks, over 100 layers.
|
|
|
|
|
***** Residual block
|
|
|
|
|
a^[l] ---> a^[l+1] --> a^[l+2]
|
|
|
|
|
|
|
|
|
|
a^[l] --+--> linear --> ReLU --> a^[l+1] -----> linear -----> ReLU --> a^[l+2]
|
|
|
|
|
| |
|
|
|
|
|
+-------------------------------------------------+
|
|
|
|
|
shortcut (skip connection)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
a^[l+2] = g (z^[l+2] + a^[l])
|
|
|
|
|
|
|
|
|
|
passes information deeper in the NN.
|
|
|
|
|
|
|
|
|
|
He et al. 2015, Deep residual learning for image recognition.
|
|
|
|
|
|
|
|
|
|
x -> [] -> [] -> [] .... -> a^[l]
|
|
|
|
|
|
|
|
|
|
Skip connections help with the vanishing and exploding gradient problems.
|
|
|
|
|
**** Why ResNets Work
|
|
|
|
|
***** Why do residual networks work so well?
|
|
|
|
|
In practice, training a plain network that is too deep gives worse results on the training set.
|
|
|
|
|
That prevents using plain networks that are too deep.
|
|
|
|
|
|
|
|
|
|
But this is much less true when training ResNets.
|
|
|
|
|
|
|
|
|
|
X ---> Big NN ---> a^[l]
|
|
|
|
|
|
|
|
|
|
X ----> Big NN ---> a^[l] -+-> [] --> [] -+->a ^[l+2]
|
|
|
|
|
ReLU a >= 0 | |
|
|
|
|
|
+--------------+
|
|
|
|
|
|
|
|
|
|
a^[l+2] = g (z^[l+2] + a^[l])
|
|
|
|
|
= g ( W^[l+2]a^[l+1] + b^[l+2] + a^[l] )
|
|
|
|
|
|
|
|
|
|
If we're using L2 regularization that would shrink the value of W^[l+2], also b^[l+2]
|
|
|
|
|
If W^[l+2] = 0 and b^[l+2] = 0, then a^[l+2] = g(a^[l]) = a^[l], because a^[l] >= 0 (it is itself a ReLU output).
|
|
|
|
|
|
|
|
|
|
So the identity function is easy for a residual block to learn.
|
|
|
|
|
So adding the two extra layers with a shortcut doesn't hurt performance.
|
|
|
|
|
And if the extra layers learn something useful, performance improves; it is almost never worse.
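A minimal sketch of a residual block on plain Python lists, showing that zero weights in the second layer turn the block into the identity. The fully connected form, the helper names and the numbers are illustrative assumptions.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, a, b):
    """z = W a + b for a weight matrix W (list of rows), vector a, bias b."""
    return [sum(w * x for w, x in zip(row, a)) + b_i for row, b_i in zip(W, b)]

def residual_block(a_l, W1, b1, W2, b2):
    """Two layers with a skip connection: a^[l+2] = g(z^[l+2] + a^[l])."""
    a_l1 = relu(linear(W1, a_l, b1))           # a^[l+1]
    z_l2 = linear(W2, a_l1, b2)                # z^[l+2]
    return relu([z + a for z, a in zip(z_l2, a_l)])

a_l = [0.5, 0.0, 2.0]                 # a ReLU output, so every entry is >= 0
W1 = [[0.3, -0.2, 0.1]] * 3           # arbitrary first-layer weights
b1 = [0.1, 0.1, 0.1]
zeros_W = [[0.0] * 3 for _ in range(3)]
zeros_b = [0.0] * 3

# With W^[l+2] = 0 and b^[l+2] = 0 the block returns its input unchanged.
print(residual_block(a_l, W1, b1, zeros_W, zeros_b))
```

Whatever the first layer computes, the output equals a^[l], so the deeper network can always fall back to what the shallower one did.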
|
|
|
|
|
|
|
|
|
|
Remark: We're assuming z^[l+2] and a^[l] are of the same dimension (hence the frequent use of "same" convolutions).
|
|
|
|
|
In case they have different dimensions, add an extra matrix W_s so we have:
|
|
|
|
|
|
|
|
|
|
a^[l+2] = g (z^[l+2] + W_s a^[l])
|
|
|
|
|
***** ResNet example
|
|
|
|
|
Plain ----> ResNet
|
|
|
|
|
**** Networks in Networks and 1x1 Convolutions
|
|
|
|
|
Using a 1x1 convolution.
|
|
|
|
|
***** What does a 1x1 convolution do?
|
|
|
|
|
|
|
|
|
|
6x6x1 * (1x1 filter with value 2) ---> simple multiplication of every pixel by 2
|
|
|
|
|
|
|
|
|
|
6x6x32 * 1x1x32 ---> 6x6x#filters (at each position, a dot product across the 32 channels, then ReLU)
|
|
|
|
|
|
|
|
|
|
- Lin et al. 2013, Network in Network.
|
|
|
|
|
***** Using 1x1 conv
|
|
|
|
|
|
|
|
|
|
28x28x192 ---- ReLU, CONV 1x1, 32 filters ---> 28x28x32
|
|
|
|
|
|
|
|
|
|
Lets you shrink n_C (pooling only shrinks n_H and n_W).
|
|
|
|
|
|
|
|
|
|
It also adds non-linearity: we could keep the number of channels unchanged and it would still be useful.
|
|
|
|
|
|
|
|
|
|
*1x1 conv does a non trivial operation.*
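A sketch of a 1x1 convolution on a tiny volume: at every position it is a small fully connected layer applied across the channels. The name `conv_1x1` and the toy values are illustrative.

```python
def conv_1x1(volume, filters, biases):
    """1x1 convolution followed by ReLU.

    volume[i][j] is the vector of n_c input channels at position (i, j).
    filters[k] is the 1x1xn_c weight vector of output channel k.
    """
    return [[[max(0.0, sum(w * x for w, x in zip(filt, pixel)) + b)
              for filt, b in zip(filters, biases)]
             for pixel in row]
            for row in volume]

# 2x2 volume with 3 channels, reduced to 2 channels.
volume = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
          [[2.0, 2.0, 2.0], [1.0, 0.0, -1.0]]]
filters = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]
biases = [0.0, 0.0]

out = conv_1x1(volume, filters, biases)
print(out)
```

Height and width are untouched; only the channel dimension goes from 3 to the number of filters.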
|
|
|
|
|
**** Inception Network Motivation
|
|
|
|
|
Normally we have to pick one: conv 1x1, 3x3, 5x5, or a pooling layer, etc.
|
|
|
|
|
***** Motivation for inception network.
|
|
|
|
|
Do all transformations at the same time:
|
|
|
|
|
|
|
|
|
|
28x28x192 --- 1x1 ----> 28x28x64
|
|
|
|
|
\\\--- 3x3,same ----> 28x28x128
|
|
|
|
|
\\--- 5x5,same ----> 28x28x32
|
|
|
|
|
\--- max pool, same, s=1 ----> 28x28x32
|
|
|
|
|
------------------------------------------------------
|
|
|
|
|
28x28x256
|
|
|
|
|
|
|
|
|
|
Szegedy et al. 2014, Going deeper with convolutions.
|
|
|
|
|
|
|
|
|
|
Problem: computational cost.
|
|
|
|
|
***** The problem of computational cost
|
|
|
|
|
28x28x192 --- conv 5x5, same, 32 ----> 28x28x32
|
|
|
|
|
|
|
|
|
|
32 filters, filters are 5x5x192
|
|
|
|
|
|
|
|
|
|
Number of multiplications: 28x28x32 x 5x5x192 = about 120 million (costly)
|
|
|
|
|
***** Using 1x1 conv
|
|
|
|
|
28x28x192 ------- conv 1x1, 16, 1x1x192 --->
|
|
|
|
|
28x28x16 -------- conv 5x5, 32, 5x5x16 ---> 28x28x32
|
|
|
|
|
Bottleneck Layer
|
|
|
|
|
|
|
|
|
|
cost of 1st conv layer: 28x28x16 x 192 = 2.4 million
|
|
|
|
|
cost of 2nd conv layer: 28x28x32 x 5x5x16 = 10 million
|
|
|
|
|
total cost: 12.4 million (about 10x less than before)
|
|
|
|
|
|
|
|
|
|
With a reasonably sized bottleneck layer, you can substantially reduce the computational cost without hurting the performance of the NN.
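The two costs can be recomputed exactly. A sketch that counts one multiplication per weight per output value, as in the figures above; the function name is illustrative.

```python
def conv_multiplications(n_h, n_w, n_filters, f, n_c_in):
    """Multiplications for a "same" convolution: one f x f x n_c_in dot
    product per output value."""
    return n_h * n_w * n_filters * f * f * n_c_in

direct = conv_multiplications(28, 28, 32, 5, 192)

bottleneck = (conv_multiplications(28, 28, 16, 1, 192)    # 1x1 conv to 16 channels
              + conv_multiplications(28, 28, 32, 5, 16))  # 5x5 conv to 32 channels

print(direct, bottleneck, round(direct / bottleneck, 1))
```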
|
|
|
|
|
**** Inception Network
|
|
|
|
|
***** Inception module
|
|
|
|
|
Previous Activation: 28x28x192
|
|
|
|
|
|
|
|
|
|
1x1 conv --------------------------------------------> 28x28x64 -\
|
|
|
|
|
1x1 conv ----------------> 3x3 conv ----------------> 28x28x128 -- Channel concat:
|
|
|
|
|
1x1 conv ----------------> 5x5 conv ----------------> 28x28x32 // 28x28x256
|
|
|
|
|
MAXPOOL,3x3,s=1, same --> 28x28x192 --> 1x1 CONV ---> 28x28x32 /
|
|
|
|
|
|
|
|
|
|
An Inception network is the same pattern (block) repeated across many layers.
|
|
|
|
|
|
|
|
|
|
Inception-block --> Inception block ---> ..... ---> Softmax layer output
|
|
|
|
|
\-> softmax layer \-> softmax layer
|
|
|
|
|
|
|
|
|
|
GoogLeNet
|
|
|
|
|
***** Fun fact
|
|
|
|
|
We need to go deeper (from the Inception meme)
|
|
|
|
|
|
|
|
|
|
Since the original version, newer versions of the Inception module have been developed.
|
|
|
|
|
Inception v1, v2, v3 ....
|
|
|
|
|
*** Practical advice for using ConvNet
|
|
|
|
|
**** Using Open-Source Implementation
|
|
|
|
|
A lot of these networks are difficult to replicate from the paper alone.
|
|
|
|
|
|
|
|
|
|
Look for an open-source implementation online instead of re-implementing from scratch.
|
|
|
|
|
|
|
|
|
|
Demo:
|
|
|
|
|
1. google
|
|
|
|
|
2. github repository
|
|
|
|
|
3. check license (MIT for example)
|
|
|
|
|
4. git clone ...
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Another advantage: pre-trained networks are often available.
|
|
|
|
|
Starting from open-source implementation is faster.
|
|
|
|
|
**** Transfer Learning
|
|
|
|
|
There are a lot of datasets on the Internet.
|
|
|
|
|
You can often download an already pretrained NN to start from.
|
|
|
|
|
***** Transfer Learning
|
|
|
|
|
Classification Problem: is it Tigger, Misty or Neither (recognize cats)
|
|
|
|
|
|
|
|
|
|
Start from a network trained on ImageNet.
|
|
|
|
|
|
|
|
|
|
Get rid of the softmax layer and create your own, which outputs Tigger, Misty or Neither.
|
|
|
|
|
And freeze the other parameters.
|
|
|
|
|
Just learn the Softmax layer.
|
|
|
|
|
|
|
|
|
|
You might get pretty good results.
|
|
|
|
|
|
|
|
|
|
Depending on the framework, freezing is set with an option such as:
|
|
|
|
|
- TrainableParameters = 0
|
|
|
|
|
- freeze = 1
|
|
|
|
|
|
|
|
|
|
Pre-compute and save to disk the activations of the last frozen layer for every training example.
|
|
|
|
|
Then you don't need to recompute these activations at every epoch.
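A toy sketch of this idea. `frozen_features` is a hypothetical stand-in for the frozen pretrained layers, and the new output layer is a single logistic unit rather than a real softmax; the data is made up.

```python
import math

calls = 0

def frozen_features(x):
    """Stand-in for the frozen pretrained layers (hypothetical, fixed mapping)."""
    global calls
    calls += 1
    return [x[0] + x[1], x[0] - x[1]]

def predict(w, b, feats):
    z = sum(wi * fi for wi, fi in zip(w, feats)) + b
    return 1.0 / (1.0 + math.exp(-z))

X = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]]
y = [1, 1, 0, 0]

# Run the frozen layers once per example and keep the activations.
cached = [frozen_features(x) for x in X]

# Train only the new output layer on the cached activations.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for epoch in range(200):
    for feats, label in zip(cached, y):
        error = predict(w, b, feats) - label   # gradient of the logistic loss wrt z
        w = [wi - lr * error * fi for wi, fi in zip(w, feats)]
        b -= lr * error

print(calls)  # the frozen layers ran once per example, not once per epoch
print([round(predict(w, b, f)) for f in cached])
```

In a real setting the cached activations would be written to disk, and the saving grows with the depth of the frozen part and the number of epochs.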
|
|
|
|
|
|
|
|
|
|
*If you have a larger dataset*:
|
|
|
|
|
|
|
|
|
|
Freeze fewer layers (only the first ones) and train the later ones.
|
|
|
|
|
Or freeze a few layers and replace the rest with new layers of your own architecture.
|
|
|
|
|
|
|
|
|
|
*If you have a lot of data*:
|
|
|
|
|
|
|
|
|
|
Retrain the whole network, using the pretrained weights as initialization.
|
|
|
|
|
|
|
|
|
|
You should almost always do transfer learning unless you have an exceptionally large dataset to train on.
|
|
|
|
|
**** Data Augmentation
|
|
|
|
|
For the majority of computer vision problems, having more data almost always helps.
|
|
|
|
|
***** Common augmentation
|
|
|
|
|
- mirroring images
|
|
|
|
|
- Random cropping: not perfect, but works well in practice
|
|
|
|
|
- Also: rotation, shearing, local warping... but used less in practice
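A sketch of the two most common augmentations, with the image as a list of rows. The function names are illustrative; a real pipeline would use an image library.

```python
import random

def mirror(image):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randrange(len(image) - crop_h + 1)
    left = rng.randrange(len(image[0]) - crop_w + 1)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]

print(mirror(image))
print(random_crop(image, 2, 2, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the crop reproducible, which helps when debugging an input pipeline.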
|
|
|
|
|
***** Color shifting
|
|
|
|
|
Add to RGB different distortions.
|
|
|
|
|
Ex: (R, G, B) + (+20, -20, +20) ---> more purple
|
|
|
|
|
etc...
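A sketch of a color shift on (R, G, B) pixels, assuming 8-bit values clamped to 0..255. The function name is illustrative; in training the offsets would typically be drawn at random for each image.

```python
def color_shift(image, shift):
    """Add a per-channel offset to every (R, G, B) pixel, clamped to 0..255."""
    return [[tuple(min(255, max(0, value + delta))
                   for value, delta in zip(pixel, shift))
             for pixel in row]
            for row in image]

image = [[(100, 150, 200), (250, 10, 0)]]
print(color_shift(image, (20, -20, 20)))
```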
|
|
|
|
|
|
|
|
|
|
Advanced: PCA color augmentation (in AlexNet paper)
|
|
|
|
|
***** Implementing distortions during training
|
|
|
|
|
Hard disk ---> CPU thread ---> distortions ----> training (CPU/GPU)
|
|
|
|
|
\-- load -> color
|
|
|
|
|
minibatch
|
|
|
|
|
|
|
|
|
|
Data augmentation has its own hyperparameters, so a good starting point is an open-source implementation of data augmentation.
|
|
|
|
|
**** State of Computer Vision
|
|
|
|
|
***** Data vs hand-engineered
|
|
|
|
|
Most ML problems fall somewhere on this spectrum:
|
|
|
|
|
|
|
|
|
|
Little data <--------------------------------------------> Lots of data
|
|
|
|
|
|
|
|
|
|
Speech recognition: a lot of data
|
|
|
|
|
Image recognition: a reasonable amount of data
|
|
|
|
|
Object detection: less data
|
|
|
|
|
|
|
|
|
|
A lot of data: simpler algorithms, less hand-engineering.
|
|
|
|
|
Little data: more hand-engineering ("hacks").
|
|
|
|
|
|
|
|
|
|
Two sources of knowledge:
|
|
|
|
|
- Labeled dataset (x,y)
|
|
|
|
|
- Hand engineering features/network architecture/other components
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
When you have very little data, transfer learning helps a lot.
|
|
|
|
|
***** Tips for doing well on benchmarks/winning competitions
|
|
|
|
|
- Ensembling
|
|
|
|
|
- Train several networks independently and average their outputs (not their weights)
|
|
|
|
|
- Typically 3 to 15 networks (but almost never used in production, because it is costly for little benefit)
|
|
|
|
|
- Multi-crop at test time
|
|
|
|
|
- run classifier on multiple versions of test images and average results
|
|
|
|
|
- 10-crop: central + 4 corner + same on mirrored
|
|
|
|
|
|
|
|
|
|
Do not do this in production systems.
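Both tricks come down to averaging several output vectors for the same image. A sketch with illustrative names and made-up model outputs:

```python
def average_outputs(outputs):
    """Average class-probability vectors coming from several models or crops."""
    n = len(outputs)
    return [sum(values) / n for values in zip(*outputs)]

def ten_crop_boxes(height, width, crop):
    """(top, left, mirrored) for the center crop and the 4 corner crops, each
    taken on the image and on its mirrored version."""
    positions = [((height - crop) // 2, (width - crop) // 2),  # center
                 (0, 0), (0, width - crop),                    # top corners
                 (height - crop, 0), (height - crop, width - crop)]
    return [(top, left, mirrored)
            for mirrored in (False, True)
            for top, left in positions]

# Outputs of three hypothetical models for one image and three classes.
outputs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
print(average_outputs(outputs))
print(len(ten_crop_boxes(256, 256, 224)))
```

Averaging the outputs costs one forward pass per model or crop, which is why it is rarely worth it outside benchmarks.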
|
|
|
|
|
***** Use open source code
|
|
|
|
|
- Use architectures of networks published in the literature
|
|
|
|
|
- Use open source implementations if possible
|
|
|
|
|
- Use pretrained models and fine-tune on your dataset
|