2017-08-24 18:58:13 +00:00
|
|
|
|
#+TITLE: Deep Learning Coursera
|
|
|
|
|
#+AUTHOR: Yann Esposito
|
|
|
|
|
#+STARTUP: latexpreview
|
|
|
|
|
#+TODO: TODO IN-PROGRESS WAITING | DONE CANCELED
|
2017-09-02 21:54:37 +00:00
|
|
|
|
#+COLUMNS: %TODO %3PRIORITY %40ITEM(Task) %17EFFORT(Estimated Effort){:} %CLOCKSUM %8TAGS(TAG)
|
|
|
|
|
|
|
|
|
|
* Plan
|
|
|
|
|
|
|
|
|
|
5 courses
|
|
|
|
|
|
|
|
|
|
** Neural Networks and Deep Learning
|
|
|
|
|
*** Week 1: Introduction
|
|
|
|
|
*** Week 2: Basics of Neural Network programming
|
|
|
|
|
*** Week 3: One hidden layer Neural Networks
|
|
|
|
|
*** Week 4: Deep Neural Network
|
|
|
|
|
** Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
|
|
|
|
|
** Structuring your Machine Learning project
|
|
|
|
|
** Convolutional Neural Networks
|
|
|
|
|
** Natural Language Processing: Building sequence models
|
|
|
|
|
* DONE Neural Networks and Deep Learning
|
|
|
|
|
CLOSED: [2017-08-22 Tue 13:43]
|
|
|
|
|
** Introduction
|
|
|
|
|
|
|
|
|
|
*** What is a neural network?
|
|
|
|
|
|
|
|
|
|
*** Supervised Learning with Neural Networks
|
|
|
|
|
|
|
|
|
|
- Lucrative application: ads, showing the ad you're most likely to click on
|
|
|
|
|
- Photo tagging
|
|
|
|
|
- Speech recognition
|
|
|
|
|
- Machine translation
|
|
|
|
|
- Autonomous driving
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
***** Convolutional NN good for images
|
|
|
|
|
|
|
|
|
|
***** Structured data (db of data) vs Unstructured data
|
|
|
|
|
|
|
|
|
|
- Structured data: Tables
|
|
|
|
|
- Unstructured data: Audio, image, text...
|
|
|
|
|
|
|
|
|
|
Thanks to neural networks, computers are now much better at interpreting unstructured data than they were a few years ago.
|
|
|
|
|
|
|
|
|
|
*** Why is Deep Learning taking off?
|
|
|
|
|
|
|
|
|
|
[[///Users/yaesposi/Library/Mobile%20Documents/com~apple~CloudDocs/deft/img/Scale%20drives%20deep%20learning%20progress.png]]
|
|
|
|
|
|
|
|
|
|
- Data (lots of data)
|
|
|
|
|
- Computation (faster learning loop)
|
|
|
|
|
- Algorithms (e.g., use ReLU instead of sigmoid)
|
|
|
|
|
** Geoffrey Hinton interview
|
|
|
|
|
** Binary Classification
|
|
|
|
|
|
|
|
|
|
\[ (x,y), \quad x\in \mathbb{R}^{n_x}, \quad y \in \{0,1\} \]
|
|
|
|
|
|
|
|
|
|
$m$ training examples: $$ \{(x^{(1)},y^{(1)}), \dots, (x^{(m)},y^{(m)})\} $$
|
|
|
|
|
|
|
|
|
|
$$ m = m_{train}, \quad m_{test} = \text{number of test examples} $$
|
|
|
|
|
|
|
|
|
|
$$ X = [ x^{(1)} \dots x^{(m)} ] \in \mathbb{R}^{n_x \times m} $$
$$ \text{X.shape} = (n_x, m) $$
|
|
|
|
|
|
|
|
|
|
$$ Y = [ y^{(1)} \dots y^{(m)} ] $$
$$ \text{Y.shape} = (1, m) $$
|
|
|
|
|
|
|
|
|
|
** Logistic Regression
|
|
|
|
|
|
|
|
|
|
Given $x \in \mathbb{R}^{n_x}$ you want $\hat{y} = P(y=1 | x)$
|
|
|
|
|
|
|
|
|
|
Parameters: $w \in \mathbb{R}^{n_x}, b\in \mathbb{R}$
|
|
|
|
|
|
|
|
|
|
Output: $\hat{y} = \sigma(w^Tx + b) = \sigma(z)$
|
|
|
|
|
|
|
|
|
|
$$\sigma(z)= \frac{1}{1 + e^{-z}}$$
|
|
|
|
|
|
|
|
|
|
If $z \rightarrow \infty$ then $\sigma(z) \approx 1$
If $z \rightarrow - \infty$ then $\sigma(z) \approx 0$
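A minimal sketch of the output computation above, assuming NumPy. The helper names =sigmoid= and =predict_proba= and the toy values of =w=, =b= and =x= are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigma(w^T x + b) for one example x of shape (n_x,)."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical toy values, only for illustration.
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 4.0])
y_hat = predict_proba(w, b, x)  # z = 0.5*2 - 0.25*4 + 0.1 = 0.1
```

Large positive z pushes the result toward 1 and large negative z pushes it toward 0, matching the two limits above.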
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Alternative notation not used in this course:
|
|
|
|
|
|
|
|
|
|
$x_0=1, x\in\mathbb{R}^{n_x+1}$
|
|
|
|
|
$\hat{y} = \sigma(\Theta^Tx)$
|
|
|
|
|
...
|
|
|
|
|
|
|
|
|
|
** Logistic Regression Cost Function
|
|
|
|
|
|
|
|
|
|
We look for a convex loss function:
|
|
|
|
|
|
|
|
|
|
$L(\hat{y},y) = - (y\log(\hat{y}) + (1-y)\log(1-\hat{y}))$
|
|
|
|
|
|
|
|
|
|
If y = 1 : $L(\hat{y},y) = -\log\hat{y}$ <- want $\log\hat{y}$ large, so want $\hat{y}$ large
If y = 0 : $L(\hat{y},y) = -\log(1-\hat{y})$ <- want $\log(1-\hat{y})$ large, so want $\hat{y}$ small
|
|
|
|
|
|
|
|
|
|
Cost function: $$ J(w,b) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) = \dots $$
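As an illustration, the cost can be computed by averaging the loss $L$ defined above over the $m$ examples. This is a sketch assuming NumPy; the function name =cross_entropy_cost= and the toy arrays are hypothetical, with =y= and =y_hat= of shape (1, m).

```python
import numpy as np

def cross_entropy_cost(y_hat, y):
    """Average of L(y_hat, y) = -(y*log(y_hat) + (1-y)*log(1-y_hat)) over m examples."""
    m = y.shape[1]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return float(np.sum(losses) / m)

# Hypothetical labels and predictions, shape (1, m) with m = 3.
y = np.array([[1, 0, 1]])
y_hat = np.array([[0.9, 0.2, 0.6]])
cost = cross_entropy_cost(y_hat, y)
```

The sketch assumes every prediction is strictly between 0 and 1, otherwise the logarithm is undefined.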
|
|
|
|
|
|
|
|
|
|
** Gradient Descent
|
|
|
|
|
|
|
|
|
|
Minimize $J(w,b)$
|
|
|
|
|
|
|
|
|
|
1. initialize w,b (generally to zero)
|
|
|
|
|
2. Take a step in the steepest descent direction
|
|
|
|
|
3. repeat 2 until reaching global optimum
|
|
|
|
|
|
|
|
|
|
Repeat {
|
|
|
|
|
$w := w - \alpha\frac{dJ(w)}{dw} = w - \alpha\,\mathrm{dw}$
|
|
|
|
|
}
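The loop above can be sketched on a toy one-dimensional problem. The function $J(w) = (w-3)^2$, the learning rate and the iteration count are assumptions made for this example and do not come from the course.

```python
def gradient_descent(grad, w0, alpha=0.1, num_iters=200):
    """Repeat w := w - alpha * dw for a fixed number of iterations."""
    w = w0
    for _ in range(num_iters):
        w = w - alpha * grad(w)
    return w

# Toy convex function J(w) = (w - 3)^2, whose derivative is dw = 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Because this toy $J$ is convex, the iterates approach its single minimum at w = 3.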
|
|
|
|
|
|
|
|
|
|
** Derivatives
|
|
|
|
|
** More Derivative Examples
|
|
|
|
|
** Computation Graph
|
|
|
|
|
** Computing Derivatives
|
|
|
|
|
** Computing Derivatives for multiple examples
|
|
|
|
|
** Vectorization
|
|
|
|
|
getting rid of explicit for loops in your code
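A small sketch of the idea, assuming NumPy: the same dot product $w^Tx$ is computed once with an explicit loop and once with a vectorized call. The array size and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
x = rng.standard_normal(1000)

# Explicit for loop: one multiply-add per Python iteration.
z_loop = 0.0
for i in range(len(w)):
    z_loop += w[i] * x[i]

# Vectorized version: a single call, the loop runs inside NumPy.
z_vec = np.dot(w, x)
```

Both versions give the same value up to floating point rounding; the vectorized one is usually much faster on large arrays.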
|
|
|
|
|
** Vectorizing Logistic Regression
|
|
|
|
|
** Vectorizing Logistic Regression's Gradient Computation
|
|
|
|
|
** Broadcasting in Python
|
|
|
|
|
** Quick Tour of Jupyter / ipython notebooks
|
|
|
|
|
** Neural Network Basics
|
|
|
|
|
J = a*b + a*c - (b+c) = a (b + c) - (b + c) = (a - 1) (b + c)
|
|
|
|
|
* DONE Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
|
|
|
|
|
CLOSED: [2017-09-01 Fri 09:52]
|
|
|
|
|
** DONE Week 1: Setting up your Machine
|
|
|
|
|
CLOSED: [2017-08-22 Tue 13:43]
|
|
|
|
|
*** Recipe
|
|
|
|
|
|
|
|
|
|
If *High bias*? (bad training set performance?)
|
|
|
|
|
Then try:
|
|
|
|
|
- Bigger network
|
|
|
|
|
- Training longer
|
|
|
|
|
- (NN architecture search)
|
|
|
|
|
Else if *High variance*? (bad dev set performance?)
|
|
|
|
|
Then try:
|
|
|
|
|
- More data
|
|
|
|
|
- Regularization
|
|
|
|
|
- (NN architecture search)
|
|
|
|
|
|
|
|
|
|
In deep learning there is not much of a bias/variance tradeoff, provided we have
a lot of computing power (to train a bigger network) and a lot of data.
|
|
|
|
|
*** Regularization
|
|
|
|
|
**** Regularization: reduce variance
|
|
|
|
|
- L2 regularization
|
|
|
|
|
|
|
|
|
|
$$ \frac{\lambda}{2m} \| w \|_2^2 $$
|
|
|
|
|
|
|
|
|
|
- L1 regularization: same with $\|w\|_1$ instead of $\|w\|_2^2$
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
λ is a regularization parameter (in code named =lambd=)
|
|
|
|
|
|
|
|
|
|
Cost: $$ J(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \| W^{[l]} \|_F^2 $$
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This matrix norm is called the "Frobenius norm".
|
|
|
|
|
|
|
|
|
|
$$ dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]} $$
|
|
|
|
|
|
|
|
|
|
The update $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$ still works.
|
|
|
|
|
|
|
|
|
|
L2 regularization is sometimes called "weight decay", because the update multiplies W by (1 - αλ/m) before applying the usual gradient step.
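A minimal sketch of the two pieces above, assuming NumPy: the penalty added to the cost and the extra term added to the gradient. The helper names =l2_penalty= and =regularized_grad= and the toy matrices are hypothetical; =lambd= follows the naming mentioned above.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambd / (2m)) times the sum over layers of the squared Frobenius norm."""
    return (lambd / (2.0 * m)) * sum(np.sum(np.square(W)) for W in weights)

def regularized_grad(dW_backprop, W, lambd, m):
    """Gradient from backprop plus the (lambd / m) * W regularization term."""
    return dW_backprop + (lambd / m) * W

# Hypothetical weights of a tiny two-layer network.
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lambd=0.1, m=10)  # about 0.07625
```

With a zero backprop gradient the update only shrinks W by a constant factor, which is where the name "weight decay" comes from.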
|
|
|
|
|
|
|
|
|
|
**** Dropout Regularization
|
|
|
|
|
|
|
|
|
|
Randomly eliminates nodes in each layer, with a different random choice for each training example.
|
|
|
|
|
|
|
|
|
|
- implementing, (inverted dropout)
|
|
|
|
|
- gen random boolean vector:
|
|
|
|
|
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob # (for each iteration)
|
|
|
|
|
a3 = np.multiply(a3, d3)
a3 /= keep_prob (normalization, so that the expected value of a3 stays the same; this avoids scaling problems at test time)
|
|
|
|
|
|
|
|
|
|
When making predictions at test time: no dropout.
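The steps above can be put together in a short runnable sketch, assuming NumPy. The function name =inverted_dropout=, the use of =np.random.default_rng= and the all-ones activation matrix are choices made for this example only.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob, then rescale by 1 / keep_prob."""
    d = rng.random(a.shape) < keep_prob   # random boolean mask, redrawn at every call
    return (a * d) / keep_prob, d

rng = np.random.default_rng(0)
a3 = np.ones((50, 200))                   # hypothetical layer-3 activations
a3_dropped, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)
```

Dividing by =keep_prob= keeps the expected value of the activations unchanged, so no rescaling is needed at test time, where the mask is simply not applied.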
|
|
|
|
|
|
|
|
|
|
**** Other regularization methods
|
|
|
|
|
|
|
|
|
|
- Data augmentation, (flipping images for example, random crops, random distortions, etc...)
|
|
|
|
|
- Early stopping: stop training at an earlier iteration
|
|
|
|
|
|
|
|
|
|
*** Setting up your optimization problem
|
|
|
|
|
|
|
|
|
|
**** Normalizing Inputs
|
|
|
|
|
|
|
|
|
|
- μ = 1/m Sum X^(i)
- x := x - μ (center the data)
- σ^2 = 1/m Sum (X^(i))^2 (element-wise, computed after centering)
- x /= σ (scale each feature to unit variance)
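A minimal sketch of input normalization, assuming NumPy and a data matrix of shape (n_x, m). This version divides by the standard deviation, so that each feature ends up with zero mean and unit variance. The function name =normalize_inputs= and the toy matrix are hypothetical.

```python
import numpy as np

def normalize_inputs(X):
    """Center each feature (row) of X and scale it to unit variance."""
    mu = np.mean(X, axis=1, keepdims=True)
    X_centered = X - mu
    sigma = np.sqrt(np.mean(X_centered ** 2, axis=1, keepdims=True))
    return X_centered / sigma, mu, sigma

# Two features on very different scales, m = 4 examples.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
X_norm, mu, sigma = normalize_inputs(X)
```

The returned =mu= and =sigma= can be reused to normalize the dev and test sets in the same way as the training set.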
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
**** Gradient Checking
|
|
|
|
|
***** Don't use grad check in training, only to debug
***** If the algorithm fails grad check, look at the components to find the bug (is it db? dW? dW on a certain layer, etc.)
|
|
|
|
|
***** Remember regularization
|
|
|
|
|
***** Doesn't work with dropout: turn off dropout (set keep_prob to 1.0), then check
|
|
|
|
|
***** Run at random initialization; perhaps again after training
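The notes above do not spell out the check itself. A common version, sketched here as an assumption, compares the analytic gradient with a two-sided finite difference and looks at the relative difference. The names =grad_check=, =J= and =eps= and the toy cost are hypothetical.

```python
import numpy as np

def grad_check(J, grad_J, theta, eps=1e-7):
    """Relative difference between grad_J(theta) and a two-sided numerical gradient."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy()
        minus = theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    analytic = grad_J(theta)
    num = np.linalg.norm(approx - analytic)
    den = np.linalg.norm(approx) + np.linalg.norm(analytic)
    return num / den

# Toy cost J(theta) = sum(theta^2), whose true gradient is 2 * theta.
J = lambda t: float(np.sum(t ** 2))
theta = np.array([1.0, -2.0, 3.0])
good = grad_check(J, lambda t: 2 * t, theta)   # correct gradient
bad = grad_check(J, lambda t: 3 * t, theta)    # deliberately wrong gradient
```

A tiny relative difference suggests the gradient code is right; a large one points to a bug in one of the components.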
|
|
|
|
|
|
|
|
|
|
** DONE Week 2: Optimization Algorithms
|
|
|
|
|
CLOSED: [2017-08-22 Tue 13:43]
|
|
|
|
|
*** Mini batch
|
|
|
|
|
|
|
|
|
|
X :: X^(1) ... X^(m)
|
|
|
|
|
|
|
|
|
|
X,Y -> X^{i},Y^{i} where X^{i} holds the examples X^((i-1)*batch-size + 1) ... X^(i*batch-size)
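A sketch of the split, assuming NumPy and data stored column-wise as (n_x, m) and (1, m). Shuffling the examples before slicing is an addition not stated above, and the function name =make_mini_batches= is hypothetical.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, rng):
    """Shuffle the columns of X and Y together, then cut them into mini batches."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

rng = np.random.default_rng(0)
X = np.arange(20.0).reshape(2, 10)   # toy data: n_x = 2, m = 10
Y = np.arange(10).reshape(1, 10)
batches = make_mini_batches(X, Y, batch_size=4, rng=rng)
```

When m is not a multiple of the batch size, the last mini batch is simply smaller.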
|
|
|
|
|
|
|
|
|
|
*** Minibatch size
|
|
|
|
|
|
|
|
|
|
- if mini batch size = m => Batch gradient descent (X^{1},Y^{1}) = (X,Y)
|
|
|
|
|
- if mini batch size = 1 => Stochastic gradient descent, every example is its own mini batch.
- in practice, use a size in between 1 and m: with m each iteration takes too long, with 1 you lose the speedup from vectorization.
|
|
|
|
|
+ vectorization ~1000
|
|
|
|
|
|
|
|
|
|
1. If small training set, use batch gradient descent (m <= 2000)
|
|
|
|
|
2. Typical mini-batch sizes: 64, 128, 256, 512, ... (powers of 2); make sure a mini-batch fits in CPU/GPU memory
|
|
|
|
|
|
|
|
|
|
*** Exponentially weighted average
|
|
|
|
|
|
|
|
|
|
v_t = βv_{t-1} + (1-β)θ_t
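The recurrence above can be sketched as a short function. Starting from v_0 = 0 is an assumption, and the function name and the constant input series are hypothetical.

```python
def exp_weighted_average(thetas, beta=0.9):
    """Apply v_t = beta * v_{t-1} + (1 - beta) * theta_t, starting from v_0 = 0."""
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

# Constant series: the average climbs slowly from 0 toward 10.
averages = exp_weighted_average([10.0, 10.0, 10.0], beta=0.9)
```

The first values are biased toward zero because of the v_0 = 0 start.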
|
|
|
|
|
** DONE Week 3: Hyperparameter
|
|
|
|
|
CLOSED: [2017-09-01 Fri 09:52]
|
|
|
|
|
*** Video 1: use random not a grid to search for hyperparameter best value
|
|
|
|
|
*** Video 2: choose appropriate scale to pick hyperparameter
|
|
|
|
|
- uniformly random n^[l] (number of neurons for layer l) or L (number of layers)
- alpha: between 0.0001 and 1; don't sample on a linear scale, use a log scale instead
|
|
|
|
|
r = -4*np.random.rand() <- r in [-4,0]
|
|
|
|
|
α = 10^r <- 10^-4 ... 10^0
|
|
|
|
|
|
|
|
|
|
- β <- 0.9 ... 0.999 (0.9 averages over about 10 values, 0.999 over about 1000 values)
|
|
|
|
|
1-β = 0.1 .... 0.001
|
|
|
|
|
r <- [-3,-1]
|
|
|
|
|
1-β = 10^r
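A sketch of both log-scale samplers, assuming NumPy. The function names =sample_alpha= and =sample_beta=, the seed and the number of draws are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_alpha(rng):
    """Learning rate sampled on a log scale between 10^-4 and 10^0."""
    r = -4 * rng.random()          # r in (-4, 0]
    return 10 ** r

def sample_beta(rng):
    """Sample 1 - beta on a log scale between 10^-3 and 10^-1."""
    r = rng.uniform(-3, -1)        # r in [-3, -1)
    return 1 - 10 ** r

alphas = [sample_alpha(rng) for _ in range(1000)]
betas = [sample_beta(rng) for _ in range(1000)]
```

On a log scale each decade gets about the same number of samples, whereas uniform sampling of alpha on a linear scale would put roughly 90% of the draws between 0.1 and 1.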
|
|
|
|
|
*** Hyperparameter tuning in practice: Pandas vs. Caviar
|
|
|
|
|
- Babysitting one model (panda) when compute resources are limited
- Training many models in parallel (caviar) when a lot of compute resources are available
|
|
|
|
|
*** Batch normalization
|
|
|
|
|
**** In a network
|
|
|
|
|
**** Fitting Batch norm into a deep network
|
|
|
|
|
**** Why Batch Normalizing?
|
|
|
|
|
- don't use batch norm as a regularizer, even if it sometimes has a slight regularization effect
|
|
|
|
|
**** Batch Norm at test time
|
|
|
|
|
$$ \mu = \frac{1}{m} \sum_i z^{(i)} $$

$$ \sigma^2 = \frac{1}{m} \sum_i (z^{(i)} - \mu)^2 $$

$$ z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} $$

$$ \tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta $$
|
|
|
|
|
|
|
|
|
|
At test time, estimate μ and σ^2 with an exponentially weighted average across mini-batches.
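A minimal sketch of the batch norm forward pass and of the running estimates used at test time, assuming NumPy and pre-activations Z of shape (n, m). The function names and the =momentum= and =eps= values are assumptions for this example.

```python
import numpy as np

def batch_norm_train(Z, gamma, beta, eps=1e-8):
    """Normalize Z over the mini-batch (axis 1), then scale by gamma and shift by beta."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.mean((Z - mu) ** 2, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta, mu, var

def update_running(running, batch_value, momentum=0.9):
    """Exponentially weighted average of a statistic across mini-batches."""
    return momentum * running + (1 - momentum) * batch_value

def batch_norm_test(Z, gamma, beta, running_mu, running_var, eps=1e-8):
    """At test time, use the running estimates instead of batch statistics."""
    return gamma * (Z - running_mu) / np.sqrt(running_var + eps) + beta

Z = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
Z_tilde, mu, var = batch_norm_train(Z, gamma=1.0, beta=0.0)
running_mu = update_running(np.zeros_like(mu), mu)
running_var = update_running(np.ones_like(var), var)
```

With gamma = 1 and beta = 0 each row of the output has mean 0 and variance close to 1.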
|
|
|
|
|
*** Multi-class classification
|
|
|
|
|
**** Softmax Regression
|
|
|
|
|
notation: C = #classes (0,1,2...,C-1)
|
|
|
|
|
|
|
|
|
|
The number of units in the output layer is equal to C: n^[L] = C
|
|
|
|
|
|
|
|
|
|
z^[L] = W^[L] a^[L-1] + b^[L], of shape (C,1)
|
|
|
|
|
|
|
|
|
|
Activation function:
|
|
|
|
|
|
|
|
|
|
$$ t = e^{z^{[L]}} $$
$$ a^{[L]} = \frac{e^{z^{[L]}}}{\sum_{j=0}^{C-1} t_j} $$

$$ a^{[L]}_i = \frac{t_i}{\sum_{j=0}^{C-1} t_j} $$
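A sketch of the softmax activation, assuming NumPy and a column vector z of shape (C, 1). Subtracting the maximum before exponentiating is a common numerical safeguard added here; it is not part of the formulas above and does not change the result. The input values are hypothetical.

```python
import numpy as np

def softmax(z):
    """Softmax over the C entries of each column of z."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))   # shifted for numerical stability
    return t / np.sum(t, axis=0, keepdims=True)

# Hypothetical z^[L] for C = 4 classes.
z = np.array([[5.0], [2.0], [-1.0], [3.0]])
a = softmax(z)
```

The outputs are positive and sum to 1, so they can be read as class probabilities.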
|
|
|
|
|
|
|
|
|
|
**** Training a softmax classifier
|
|
|
|
|
|
|
|
|
|
*** Introduction to programming frameworks
|
|
|
|
|
|
|
|
|
|
**** Deep learning frameworks
|
|
|
|
|
* Structuring your Machine Learning project
|
|
|
|
|
** Week 1
|
|
|
|
|
*** Introduction to ML Strategy
|
|
|
|
|
**** Why ML Strategy
|
|
|
|
|
Try to find a quick and effective way to choose a strategy.
|
|
|
|
|
|
|
|
|
|
Ways of analyzing ML problems
|
|
|
|
|
|
|
|
|
|
**** Orthogonalization
|
|
|
|
|
|
|
|
|
|
***** Chain of assumptions in ML
|
|
|
|
|
|
|
|
|
|
- Fit training set well on cost function => bigger network, Adam, ...
|
|
|
|
|
- Fit dev set well on cost function => Regularization, Bigger training set
|
|
|
|
|
- Fit test set well on cost function => Bigger dev set
|
|
|
|
|
- Perform well in real world => Change the devset or cost function
|
|
|
|
|
|
|
|
|
|
Try not to use early stopping, as it simultaneously affects the cost on both the training and dev sets.
|
|
|
|
|
|
|
|
|
|
*** Setting up your goal
|
|
|
|
|
|
|
|
|
|
**** Single number evaluation metric
|
|
|
|
|
|
|
|
|
|
***** First
|
|
|
|
|
|
|
|
|
|
| Classifier | Precision | Recall |
|
|
|
|
|
|------------+-----------+--------|
|
|
|
|
|
| A | 95% | 90% |
|
|
|
|
|
| B | 98% | 85% |
|
|
|
|
|
|
|
|
|
|
Rather than using two numbers, find a single new evaluation metric
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| Classifier | Precision | Recall | F1 Score |
|
|
|
|
|
|------------+-----------+--------+----------|
|
|
|
|
|
| A | 95% | 90% | 92.4% |
|
|
|
|
|
| B | 98% | 85% | 91.0% |
|
|
|
|
|
|
|
|
|
|
F1 score = 2 / ((1/P) + (1/R)) :: "Harmonic mean" of precision P and recall R.
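As a quick check of the table, the harmonic mean can be computed directly. The function name =f1_score= is hypothetical; the precision and recall values are the ones from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2 / (1/P + 1/R)."""
    return 2.0 / (1.0 / precision + 1.0 / recall)

f1_a = f1_score(0.95, 0.90)   # classifier A, about 0.924
f1_b = f1_score(0.98, 0.85)   # classifier B, about 0.910
```

Classifier A gets the higher F1 score, so the single number picks A.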
|
|
|
|
|
|
|
|
|
|
So:
|
|
|
|
|
|
|
|
|
|
Having a good dev set + a single evaluation metric really speeds up iterating.
|
|
|
|
|
|
|
|
|
|
***** Another example
|
|
|
|
|
|
|
|
|
|
| Algorithm | US | China | India | Other | *Average* |
|
|
|
|
|
|-----------+-----+-------+-------+-------+-----------|
|
|
|
|
|
| A | 3% | 7% | 5% | 9% | |
|
|
|
|
|
| ... | | | | | |
|
|
|
|
|
| F | ... | ... | | | |
|
|
|
|
|
|
|
|
|
|
Try to improve the average.
|
|
|
|
|
|
|
|
|
|
**** Satisficing and Optimizing metric
|
|
|
|
|
|
|
|
|
|
It's not always easy to select one metric to optimize.
|
|
|
|
|
|
|
|
|
|
***** Another cat classification example
|
|
|
|
|
|
|
|
|
|
| Classifier | Accuracy | Running Time |
|
|
|
|
|
|------------+----------+--------------|
|
|
|
|
|
| A | 90% | 80ms |
|
|
|
|
|
| B | 92% | 95ms |
|
|
|
|
|
| C | 95% | 1500ms |
|
|
|
|
|
|
|
|
|
|
cost = accuracy - 0.5 × running time
|
|
|
|
|
|
|
|
|
|
maximize accuracy s.t. running time < 100ms
|
|
|
|
|
|
|
|
|
|
Accuracy <- Optimizing
|
|
|
|
|
Running time <- Satisficing
|
|
|
|
|
|
|
|
|
|
If you have n metrics, pick one to optimize, and let all the others be satisficing.
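The selection rule can be sketched in a few lines: filter on the satisficing metric, then maximize the optimizing one. The dictionary layout and the function name =pick_best= are hypothetical; the numbers come from the table above.

```python
# name -> (accuracy, running time in ms), taken from the table above
classifiers = {"A": (0.90, 80), "B": (0.92, 95), "C": (0.95, 1500)}

def pick_best(classifiers, max_ms=100):
    """Keep classifiers under the running-time threshold, then maximize accuracy."""
    feasible = {name: v for name, v in classifiers.items() if v[1] < max_ms}
    return max(feasible, key=lambda name: feasible[name][0])

best = pick_best(classifiers)
```

Classifier C has the best accuracy but fails the 100 ms constraint, so the rule selects B.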
|
|
|
|
|
|
|
|
|
|
**** Train/dev/test distribution
|
|
|
|
|
|
|
|
|
|
How you can set up these datasets to speed up your work.
|
|
|
|
|
|
|
|
|
|
***** Cat classification dev/test sets
|
|
|
|
|
|
|
|
|
|
Make sure that the dev and test sets come from the same distribution.
|
|
|
|
|
|
|
|
|
|
***** True story (detail changed)
|
|
|
|
|
|
|
|
|
|
Optimizing on dev set on loan approvals for medium income zip codes.
|
|
|
|
|
|
|
|
|
|
(repay loan?)
|
|
|
|
|
|
|
|
|
|
Tested on low income zip codes.
|
|
|
|
|
|
|
|
|
|
Lost 3 months
|
|
|
|
|
|
|
|
|
|
***** Guideline
|
|
|
|
|
|
|
|
|
|
Choose a dev set and test set to reflect data you expect to get in the future
|
|
|
|
|
and consider important to do well on.
|
|
|
|
|
|
|
|
|
|
**** Size of dev and test sets
|
|
|
|
|
|
|
|
|
|
***** Old way of splitting
|
|
|
|
|
70% train, 30% test
|
|
|
|
|
60% train, 20% dev, 20% test
|
|
|
|
|
|
|
|
|
|
For at most 10^4 examples
|
|
|
|
|
|
|
|
|
|
But in new era, 10^6 examples:
|
|
|
|
|
|
|
|
|
|
train: 98%, Dev 1%, Test 1%.
|
|
|
|
|
|
|
|
|
|
***** Size of test set
|
|
|
|
|
|
|
|
|
|
Set your test set to be big enough to give high confidence in the overall
|
|
|
|
|
performance of your system. Can be far less than 30% of your data.
|
|
|
|
|
|
|
|
|
|
For some applications, you don't need test set and only dev set.
|
|
|
|
|
For example if you have a very large dev set.
|
|
|
|
|
|
|
|
|
|
**** When to change dev/test sets and metrics?
|
|
|
|
|
|
|
|
|
|
Metric: classification error
|
|
|
|
|
Algorithm A: 3% error → lets through a lot of porn images
Algorithm B: 5% error → doesn't let porn images through
|
|
|
|
|
|
|
|
|
|
So your metric + evaluation prefer A, but you and your users prefer B.
|
|
|
|
|
|
|
|
|
|
When this happens, the metric mispredicts which algorithm is better (it ranks A above B), which is a sign that you should change it.
|
|
|
|
|
|
|
|
|
|
Error: $$ \frac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} I\{y_{pred}^{(i)} \neq y^{(i)}\} $$
|
|
|
|
|
|
|
|
|
|
This metric treats porn and non-porn images equally, but you don't want that.
|
|
|
|
|
|
|
|
|
|
We add a weight w^(i) to the formula: w^(i) = 1 if x^(i) is non-porn and 10 if x^(i) is porn, so that mistakes on porn images cost much more.
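A sketch of such a weighted error, assuming NumPy. The weight of 10 for porn images and the normalization by the sum of the weights are assumptions made for this example, and the toy labels are hypothetical.

```python
import numpy as np

def weighted_error(y_pred, y_true, weights):
    """Weighted misclassification rate, normalized by the sum of the weights."""
    mistakes = (y_pred != y_true).astype(float)
    return float(np.sum(weights * mistakes) / np.sum(weights))

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0])                 # one mistake, on the second image
is_porn = np.array([False, True, False, False])
weights = np.where(is_porn, 10.0, 1.0)          # assumed weights: 10 for porn, 1 otherwise
err = weighted_error(y_pred, y_true, weights)
```

The single mistake falls on a porn image, so the weighted error (10/13) is far larger than the unweighted one (1/4).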
|
|
|
|
|
|
|
|
|
|
**** Orthogonalization for cat pictures: anti-porn
|
|
|
|
|
|
|
|
|
|
1. So far we've only discussed how to define a metric to evaluate classifier
|
|
|
|
|
2. Worry separately about how to do well on this metric
|
|
|
|
|
|
|
|
|
|
Step 1 is placing the target, and step 2 is aiming at the target.
|
|
|
|
|
|
|
|
|
|
**** Another example
|
|
|
|
|
Alg A: 3% err
|
|
|
|
|
Alg B: 5% err
|
|
|
|
|
|
|
|
|
|
But B does better in practice. You see that users are uploading blurrier images.
Your dev/test sets are not using the same kind of images.
|
|
|
|
|
|
|
|
|
|
Change your metric and/or dev/test set.
|
|
|
|
|
|
|
|
|
|
** Comparing to Human-level performance
|
|
|
|
|
|
|
|
|
|
*** Why human-level performance
|
|
|
|
|
|
|
|
|
|
Human-level perf vs Bayes optimal error
|
|
|
|
|
|
|
|
|
|
Humans are generally very close to Bayes performance for a lot of tasks.
|
|
|
|
|
|
|
|
|
|
- get labeled data from humans
|
|
|
|
|
- gain insight from manual error analysis (why did a person get this right?)
|
|
|
|
|
- better analysis of bias/variance
|
|
|
|
|
|
|
|
|
|
*** Avoidable bias
|
|
|
|
|
|
|
|
|
|
**** Cat classification example
|
|
|
|
|
|
|
|
|
|
| Humans | 1% | 7.5% |
|
|
|
|
|
| Training error | 8% | 8% |
|
|
|
|
|
| Dev error | 10% | 10% |
|
|
|
|
|
| | focus on bias | focus on variance |
|
|
|
|
|
|
|
|
|
|
Human level error as a proxy (estimate) for Bayes error.
|
|
|
|
|
|
|
|
|
|
*Diff between Human err and Training err = avoidable bias*
|
|
|
|
|
*Diff between Train and Dev err = variance*
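The two differences can be computed for both columns of the cat classification table. The function name =diagnose= and the rule "focus on the larger gap" are a simplified reading of the table, given here only as an illustration.

```python
def diagnose(human_err, train_err, dev_err):
    """Return (avoidable bias, variance, which of the two gaps is larger)."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

case_1 = diagnose(0.01, 0.08, 0.10)    # humans at 1%
case_2 = diagnose(0.075, 0.08, 0.10)   # humans at 7.5%
```

With the same training and dev errors, the human-level estimate alone changes the diagnosis from bias to variance.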
|
|
|
|
|
|
|
|
|
|
*** Understanding Human-level performance
|
|
|
|
|
|
|
|
|
|
**** Human-level error as proxy for Bayes error
|
|
|
|
|
Medical image classification example:
|
|
|
|
|
suppose
|
|
|
|
|
(a) Typical human 3% err
|
|
|
|
|
(b) Typical doctor 1% err
|
|
|
|
|
(c) Experienced doctor 0.7% err
|
|
|
|
|
(d) and team of experienced doctors 0.5% err
|
|
|
|
|
|
|
|
|
|
What is "human-level" error?
|
|
|
|
|
|
|
|
|
|
Bayes error is <= 0.5%.
So we use 0.5% as the human-level error (our estimate of Bayes error) when measuring avoidable bias, as seen before.
|
|
|
|
|
|
|
|
|
|
For publishing a paper or deploying a system, beating (b) can be enough to claim "human-level" performance.
|
|
|
|
|
|
|
|
|
|
**** Error analysis example
|
|
|
|
|
|
|
|
|
|
| Human (proxy for Bayes err) | 1, 0.7 or 0.5% | 1, 0.7 or 0.5% | 1, 0.7 or 0.5% |
|
|
|
|
|
| Train err | 5% | 1% | 0.7% |
|
|
|
|
|
| Dev err | 6% | 5% | 0.8% |
|
|
|
|
|
| | | | |
|
|
|
|
|
|
|
|
|
|
Case 1:
|
|
|
|
|
For this example the choice of human-level error doesn't matter, because the avoidable
bias (at least 5% - 1% = 4%) is bigger than the variance (6% - 5% = 1%).
|
|
|
|
|
|
|
|
|
|
Case 2: focus on variance
|
|
|
|
|
|
|
|
|
|
Case 3: it is very important to use 0.5% as your "human-level" error, because it shows
that you should focus on bias (0.7% - 0.5% = 0.2%) and not on variance (0.8% - 0.7% = 0.1%).
|
|
|
|
|
|
|
|
|
|
This problem arises only when you're already doing very well.
|
|
|
|
|
|
|
|
|
|
**** Summary of bias/variance with human-level perf
|
|
|
|
|
|
|
|
|
|
Human-level error (proxy for Bayes err)
|
|
|
|
|
|
|
|
|
|
^
|
|
|
|
|
| "Avoidable bias"
|
|
|
|
|
v
|
|
|
|
|
|
|
|
|
|
Training error
|
|
|
|
|
|
|
|
|
|
^
|
|
|
|
|
| "Variance"
|
|
|
|
|
v
|
|
|
|
|
|
|
|
|
|
Dev error
|
|
|
|
|
|
|
|
|
|
*** Surpassing human-level performance
|
|
|
|
|
|
|
|
|
|
**** Surpassing human-level performance
|
|
|
|
|
|
|
|
|
|
| Team | 0.5% | 0.5% |
|
|
|
|
|
| One human | 1% | 1% |
|
|
|
|
|
| Training error | 0.6% | 0.3% |
|
|
|
|
|
| Dev error | 0.8% | 0.4% |
|
|
|
|
|
|-----------------+-------+------------|
|
|
|
|
|
| Avoidable bias? | ~0.5% | can't know |
|
|
|
|
|
|
|
|
|
|
**** Problems where ML significantly surpasses human-level performance
|
|
|
|
|
|
|
|
|
|
- Online advertising
|
|
|
|
|
- Product recommendations
|
|
|
|
|
- Logistics (predicting transit time)
|
|
|
|
|
- Loan approvals
|
|
|
|
|
|
|
|
|
|
All these examples:
|
|
|
|
|
+ come from structured data
|
|
|
|
|
+ not natural perception problems
|
|
|
|
|
+ Lots of data
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Also, Speech recognition, Some image recognition, Medical, ECG, skin cancer,
|
|
|
|
|
etc...
|
|
|
|
|
|
|
|
|
|
*** Improving your model performance
|
|
|
|
|
|
|
|
|
|
Set of guidelines
|
|
|
|
|
|
|
|
|
|
**** The two fundamental assumptions of supervised learning
|
|
|
|
|
|
|
|
|
|
1. You can fit the training set pretty well (~ avoidable bias)
|
|
|
|
|
2. The training set performance generalizes pretty well to the dev/test set
|
|
|
|
|
|
|
|
|
|
**** Reducing (avoidable) bias and variance
|
|
|
|
|
|
|
|
|
|
Human-level error (proxy for Bayes err)
|
|
|
|
|
^
|
|
|
|
|
| train bigger model
|
|
|
|
|
| "Avoidable bias" => train longer/better optimization algorithms (momentum, RMSprop, Adam)
|
|
|
|
|
| NN architecture/hyperparameters search (RNN, CNN...)
|
|
|
|
|
v
|
|
|
|
|
|
|
|
|
|
Training error
|
|
|
|
|
|
|
|
|
|
^
|
|
|
|
|
| More data
|
|
|
|
|
| "variance" => Regularization (L2, dropout, data augmentation)
|
|
|
|
|
| NN architecture/hyperparameters search
|
|
|
|
|
v
|
|
|
|
|
|
|
|
|
|
Dev error
|
|
|
|
|
|
|
|
|
|
These concepts are easy to learn, hard to master.
|
|
|
|
|
You'll be more systematic than most ML teams.
|