deft/Deep-Learning.org at 119cb5523c4074dad323b7061761476fcf2bf9ce

Yann Esposito (Yogsototh) 119cb5523c

update, finished Coursera cours 3

2017-09-13 08:55:56 +02:00

#+TITLE: Deep Learning Coursera
#+AUTHOR: Yann Esposito

Plan

5 courses

Neural Network and Deep Learning

Week 1: Introduction

Week 2: Basics of Neural Network Programming

Week 3: One hidden layer Neural Networks

Week 4: Deep Neural Network

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Structuring your Machine Learning project

Convolutional Neural Networks

Natural Language Processing: Building sequence models

DONE Neural Network and Deep Learning

CLOSED: [2017-08-22 Tue 13:43]

Introduction

What is a neural network?

Supervised Learning with Neural Networks

Lucrative application: ads, showing the ad you're most likely to click on
Photo tagging
Speech recognition
Machine translation
Autonomous driving

Convolutional NN good for images

Structured data (db of data) vs Unstructured data

Structured data: Tables
Unstructured data: Audio, image, text…

Thanks to neural networks, computers are now much better at interpreting unstructured data than they were a few years ago.

Why is Deep Learning taking off?

[Figure: Scale drives deep learning progress]

Data (lots of data)
Computation (faster learning loop)
Algorithms (e.g., using ReLU instead of sigmoid)

Geoffrey Hinton interview

Binary Classification

\[ (x,y), \quad x\in \mathbb{R}^{n_x}, \quad y \in \{0,1\} \]

$m$ training examples: $$ \{(x^{(1)},y^{(1)}), \ldots, (x^{(m)},y^{(m)})\} $$

$$ m = m_{train}, \quad m_{test} = \text{number of test examples} $$

$$ X = [ x^{(1)} \ldots x^{(m)} ] \text{ is an } n_x \times m \text{ matrix} $$ $$ X.shape = (n_x,m) $$

$$ Y = [ y^{(1)} \ldots y^{(m)} ] $$ $$ Y.shape = (1,m) $$

Logistic Regression

Given $x \in \mathbb{R}^{n_x}$ you want $\hat{y} = P(y=1 | x)$

Parameters: $w \in \mathbb{R}^{n_x}, b\in \mathbb{R}$

Output: $\hat{y} = \sigma(w^Tx + b) = \sigma(z)$

$$\sigma(z)= \frac{1}{1 + e^{-z}}$$

If $z \rightarrow \infty$ then $\sigma(z) \approx 1$.
If $z \rightarrow -\infty$ then $\sigma(z) \approx 0$.

Alternative notation, not used in this course:

$x_0=1, x\in\mathbb{R}^{n_x+1}$, $\hat{y} = \sigma(\Theta^Tx)$ …

Logistic Regression Cost Function

We look for a loss function that makes the optimization problem convex:

$L(\hat{y},y) = - (y\log(\hat{y}) + (1-y)\log(1-\hat{y}))$

If $y = 1$: $L(\hat{y},y) = -\log\hat{y}$, so we want $\log\hat{y}$ large, i.e. $\hat{y}$ large.
If $y = 0$: $L(\hat{y},y) = -\log(1-\hat{y})$, so we want $\log(1-\hat{y})$ large, i.e. $\hat{y}$ small.

Cost function: $$ J(w,b) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) = \ldots $$
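The loss and cost can be sketched in plain Python. This is only an illustration: the toy data and the function names are made up, and `X` is stored here as a list of examples (one row per example) rather than as the $(n_x, m)$ matrix used in the notation.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def loss(y_hat, y):
    """Cross-entropy loss L(y_hat, y) for a single example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(w, b, X, Y):
    """Average loss J(w, b) over the m examples in X (a list of feature lists)."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        z = sum(wj * xj for wj, xj in zip(w, x)) + b
        total += loss(sigmoid(z), y)
    return total / m

# Hypothetical toy data: two features, three examples.
X = [[0.5, 1.0], [-1.0, 2.0], [1.5, -0.5]]
Y = [1, 0, 1]
print(cost([0.0, 0.0], 0.0, X, Y))  # equals log(2) when w = 0 and b = 0
```

With all parameters at zero every prediction is 0.5, so each example contributes $-\log 0.5 = \log 2$ to the cost.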

Gradient Descent

Minimize $J(w,b)$

1. Initialize w, b (generally to zero)
2. Take a step in the steepest descent direction
3. Repeat step 2 until reaching the global optimum

Repeat { $w := w - \alpha\frac{dJ(w)}{dw} = w - \alpha\,\mathrm{dw}$ }
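A minimal NumPy sketch of this loop for logistic regression follows. It assumes the cross-entropy loss above, for which $dz = a - y$; the toy data, the step count and the learning rate are arbitrary choices for illustration. It also happens to be vectorized, with no explicit loop over examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.1, steps=1000):
    """X has shape (n_x, m) and Y has shape (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))            # zero initialization is fine for logistic regression
    b = 0.0
    for _ in range(steps):
        A = sigmoid(w.T @ X + b)      # predictions, shape (1, m)
        dZ = A - Y                    # dJ/dz for the cross-entropy loss
        dw = (X @ dZ.T) / m           # dJ/dw, shape (n_x, 1)
        db = float(np.sum(dZ)) / m    # dJ/db
        w -= alpha * dw               # w := w - alpha * dw
        b -= alpha * db               # b := b - alpha * db
    return w, b

# Hypothetical toy data: one feature, label is 1 when x > 0.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = gradient_descent(X, Y)
predictions = (sigmoid(w.T @ X + b) > 0.5).astype(int)
```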

Derivatives

More Derivative Examples

Computation Graph

Computing Derivatives

Computing Derivatives for multiple examples

Vectorization

getting rid of explicit for loops in your code

Vectorizing Logistic Regression

Vectorizing Logistic Regression's Gradient Computation

Broadcasting in Python

Quick Tour of Jupyter / ipython notebooks

Neural Network Basics

J = a*b + a*c - (b+c) = a (b + c) - (b + c) = (a - 1) (b + c)

DONE Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

CLOSED: [2017-09-01 Fri 09:52]

DONE Week 1: Setting up your Machine

CLOSED: [2017-08-22 Tue 13:43]

Recipe

If High bias? (bad training set performance?) Then try:

Bigger network
Training longer
(NN architecture search)

Else if High variance? (bad dev set performance?) Then try:

More data
Regularization
(NN architecture search)

With deep learning there is not much of a bias/variance tradeoff, provided we have a lot of compute (to train a bigger network) and a lot of data.

Regularization

Regularization: reduce variance

L2 regularization

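$\frac{\lambda}{2m} \|w\|_2^2$

L1 regularization: same with $\|w\|_1$ instead of $\|w\|_2^2$

λ is the regularization parameter (named lambd in code, since lambda is a reserved word in Python)

Cost: $J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|W^{[l]}\|_F^2$

This matrix norm is called the "Frobenius norm".

$dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]}$

The update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$ still works.

L2 regularization is sometimes called "weight decay", because the update multiplies $W^{[l]}$ by $(1 - \frac{\alpha\lambda}{m})$ before subtracting the backprop term.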
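As a rough sketch of the L2 term and its effect on the update, the snippet below uses a single made-up 2x2 weight matrix. The function names and numbers are assumptions for illustration; only `lambd` comes from the notes.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambd / 2m) times the sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def regularized_grad(dW_backprop, W, lambd, m):
    """Gradient from backprop plus the (lambd / m) * W term."""
    return dW_backprop + (lambd / m) * W

# Hypothetical values for a single layer.
W = np.array([[1.0, -2.0], [0.5, 0.0]])
dW_backprop = np.zeros_like(W)
lambd, m, alpha = 0.7, 10, 0.1

penalty = l2_penalty([W], lambd, m)
W_new = W - alpha * regularized_grad(dW_backprop, W, lambd, m)
# With a zero backprop gradient, the update only shrinks W by (1 - alpha * lambd / m).
```

The shrinking factor in the last line is where the name "weight decay" comes from.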

Dropout Regularization

Randomly eliminates nodes in each layer, with a different random selection for each training example.

Implementation (inverted dropout), shown for layer 3:
- generate a random boolean mask on each iteration: d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
- apply the mask: a3 = np.multiply(a3, d3)
- rescale: a3 /= keep_prob (so that the expected value of a3 stays the same, which reduces problems at test time)

Making predictions at test time: no dropout
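A runnable sketch of inverted dropout for one layer, assuming NumPy's `default_rng` generator instead of `np.random.rand` (a substitution made here only to get a repeatable seed). The activations are placeholder values.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, then rescale."""
    d = rng.random(a.shape) < keep_prob   # boolean mask, redrawn on every call
    a = np.multiply(a, d)                 # drop the masked-out units
    a = a / keep_prob                     # keep the expected value of a unchanged
    return a, d

rng = np.random.default_rng(0)            # seeded only to make the sketch repeatable
a3 = np.ones((4, 1000))                   # hypothetical layer-3 activations
a3_train, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)

# At test time no units are dropped, so the activations are used as they are.
a3_test = a3
```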

Other regularization methods

Data augmentation (flipping images, random crops, random distortions, etc.)
Early stopping: stop training at an earlier iteration

Setting up your optimization problem

Normalizing Inputs

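$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$
$x := x - \mu$ (center the data)
$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})^2$ (element-wise, computed after centering)
$x := x / \sigma$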
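A small NumPy sketch of input normalization, assuming the usual convention of dividing by the standard deviation so that each feature ends up with unit variance. The `eps` guard and the sample data are additions for illustration.

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """X has shape (n_x, m); returns the normalized X plus mu and sigma."""
    mu = np.mean(X, axis=1, keepdims=True)                    # per-feature mean
    X_centered = X - mu
    sigma = np.sqrt(np.mean(X_centered ** 2, axis=1, keepdims=True))
    return X_centered / (sigma + eps), mu, sigma              # eps avoids division by zero

# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 300.0, 500.0, 700.0]])
X_norm, mu, sigma = normalize_inputs(X)
```

Returning `mu` and `sigma` makes it possible to apply the same transformation to other data later.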

Gradient Checking

Don't use grad check in training, only for debugging

If the algorithm fails grad check, look at the individual components to find the bug (is it db? dW? dW on a certain layer? etc.)

Remember to include the regularization term

Grad check doesn't work with dropout: turn dropout off (set keep_prob to 1.0), then check

Run at random initialization; perhaps again after training
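One way to sketch a gradient check is to compare the analytic gradient with a two-sided finite difference and look at the relative difference. The cost function, the two candidate gradients and the value of `eps` below are made up for illustration.

```python
import math

def grad_check(f, grad_f, theta, eps=1e-7):
    """Compare an analytic gradient with a two-sided numerical estimate."""
    analytic = grad_f(theta)
    numeric = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        numeric.append((f(plus) - f(minus)) / (2 * eps))
    diff = math.sqrt(sum((a - n) ** 2 for a, n in zip(analytic, numeric)))
    norm = (math.sqrt(sum(a * a for a in analytic))
            + math.sqrt(sum(n * n for n in numeric)))
    return diff / norm   # relative difference; small values suggest a correct gradient

# Hypothetical cost J(theta) = theta_0^2 + 3 * theta_1 and two candidate gradients.
J = lambda t: t[0] ** 2 + 3 * t[1]
good_grad = lambda t: [2 * t[0], 3.0]
buggy_grad = lambda t: [t[0], 3.0]        # deliberately wrong first component

theta = [1.5, -0.5]
good = grad_check(J, good_grad, theta)
bad = grad_check(J, buggy_grad, theta)
```

Because the check returns one number for the whole parameter vector, comparing `analytic` and `numeric` entry by entry is what points to the faulty component.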

DONE Week 2: Optimization Algorithms

CLOSED: [2017-08-22 Tue 13:43]

Mini batch

X :: X^(1) … X^(m)

X, Y are split into mini-batches Xⁱ, Yⁱ, where Xⁱ contains examples i*batch-size up to (but not including) (i+1)*batch-size

Minibatch size

if mini-batch size = m => batch gradient descent, (X¹,Y¹) = (X,Y)
if mini-batch size = 1 => stochastic gradient descent, every example is its own mini-batch.
in practice, choose a size between 1 and m: with m each iteration takes too long, with 1 you lose the speedup from vectorization.
- vectorization ~1000

If the training set is small (m <= 2000), use batch gradient descent
Typical mini-batch sizes: 64, 128, 256, 512, … (powers of two, 2^k); make sure a mini-batch fits in CPU/GPU memory
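Splitting a data set into mini-batches could look like the sketch below. Shuffling before slicing is an assumption added here (the notes do not mention it), and the tiny data set is only for illustration.

```python
import numpy as np

def mini_batches(X, Y, batch_size, rng):
    """Shuffle the m columns of X and Y together, then slice them into batches."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    return [(X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

rng = np.random.default_rng(0)
X = np.arange(20.0).reshape(2, 10)   # hypothetical data: n_x = 2, m = 10
Y = np.arange(10).reshape(1, 10)
batches = mini_batches(X, Y, batch_size=4, rng=rng)
sizes = [xb.shape[1] for xb, _ in batches]   # the last batch may be smaller
```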

Exponentially weighted average

$v_t = \beta v_{t-1} + (1-\beta)\theta_t$
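The recurrence can be sketched in a few lines, assuming $v_0 = 0$ and no bias correction. The constant input signal is a made-up example.

```python
def exp_weighted_average(thetas, beta):
    """Return v_1..v_n with v_0 = 0 and v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

# Hypothetical constant signal: the average climbs from 0 towards 10.
averages = exp_weighted_average([10.0] * 50, beta=0.9)
```

Starting from zero, the early averages underestimate the signal, which is the effect that bias correction is meant to remove.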

DONE Week 3: Hyperparameter

CLOSED: [2017-09-01 Fri 09:52]

Video 1: sample hyperparameter values at random rather than on a grid

Video 2: choose appropriate scale to pick hyperparameter

Sampling uniformly at random is fine for n^[l] (number of neurons in layer l) or L (number of layers)
α: between 0.0001 and 1. Don't sample on a linear scale; use a log scale instead:
- r = -4*np.random.rand() <- r in [-4,0]
- α = 10^r <- 10^-4 … 10^0
β: between 0.9 and 0.999 (0.9 averages over about 10 values, 0.999 over about 1000 values). Sample 1-β on a log scale:
- 1-β = 0.1 … 0.001
- r <- [-3,-1]
- 1-β = 10^r
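Log-scale sampling for both hyperparameters might be sketched as follows. It assumes NumPy's `default_rng` generator rather than `np.random.rand`, and the function names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable

def sample_alpha(rng):
    """Learning rate sampled log-uniformly between 1e-4 and 1."""
    r = -4 * rng.random()        # r in [-4, 0]
    return 10 ** r

def sample_beta(rng):
    """Beta sampled so that 1 - beta is log-uniform between 1e-3 and 1e-1."""
    r = rng.uniform(-3, -1)      # r in [-3, -1]
    return 1 - 10 ** r

alphas = [sample_alpha(rng) for _ in range(1000)]
betas = [sample_beta(rng) for _ in range(1000)]
```

On a log scale each decade gets about the same number of samples, so roughly a quarter of the `alphas` fall below 0.001, whereas uniform sampling would put about 0.1% of them there.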

Hyperparameter tuning in practice: Pandas vs. Caviar

Babysitting one model (panda), when you have few computing resources
Training many models in parallel (caviar), when you have lots of computing resources

Batch normalization

In a network

Fitting Batch norm into a deep network

Why Batch Normalizing?

Don't use batch norm as a regularizer, even if it sometimes has a slight regularization effect.

Batch Norm at test time

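$\mu = \frac{1}{m} \sum_{i} z^{(i)}$

$\sigma^2 = \frac{1}{m} \sum_{i} (z^{(i)} - \mu)^2$

$z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$

$\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$

At test time, estimate μ and σ² with an exponentially weighted average across mini-batches.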
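A sketch of how the running estimates could be kept during training and reused at test time. The `running` dictionary, the `momentum` value of 0.9 and the synthetic mini-batches are assumptions for illustration, not something the course prescribes.

```python
import numpy as np

def batchnorm_train(Z, gamma, beta, running, momentum=0.9, eps=1e-8):
    """Normalize one mini-batch Z of shape (n, m) and update the running statistics."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    # Exponentially weighted averages of mu and sigma^2 across mini-batches.
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * Z_norm + beta

def batchnorm_test(Z, gamma, beta, running, eps=1e-8):
    """At test time, use the running estimates instead of batch statistics."""
    Z_norm = (Z - running["mu"]) / np.sqrt(running["var"] + eps)
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
running = {"mu": np.zeros((3, 1)), "var": np.ones((3, 1))}
gamma, beta = np.ones((3, 1)), np.zeros((3, 1))
for _ in range(200):                               # hypothetical mini-batches
    Z = 5.0 + 2.0 * rng.standard_normal((3, 64))
    Z_tilde = batchnorm_train(Z, gamma, beta, running)
single = batchnorm_test(np.full((3, 1), 5.0), gamma, beta, running)
```

The running estimates make it possible to normalize a single test example, for which a batch mean and variance would be meaningless.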

Multi-class classification

Softmax Regression

Notation: C = number of classes (0, 1, 2, …, C-1)

The output layer has one unit per class: $n^{[L]} = C$

$z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}$, of shape (C,1)

Activation function:

$t = e^{z^{[L]}}$ (element-wise), $a^{[L]} = \frac{t}{\sum_{j=0}^{C-1} t_j}$

$a^{[L]}_i = \frac{t_i}{\sum_{j=0}^{C-1} t_j}$
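The softmax activation can be sketched with NumPy as below. Subtracting the column maximum before exponentiating is a common numerical-stability trick added here; it does not change the result. The input vector is a made-up example with C = 4.

```python
import numpy as np

def softmax(z):
    """Softmax over the rows of z, which has shape (C, m)."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))   # shifting by the max avoids overflow
    return t / np.sum(t, axis=0, keepdims=True)

z_L = np.array([[5.0], [2.0], [-1.0], [3.0]])          # hypothetical z^[L] with C = 4
a_L = softmax(z_L)
predicted_class = int(np.argmax(a_L))
```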

Training a softmax classifier

Introduction to programming frameworks

Deep learning frameworks

Structuring your Machine Learning project

Week 1

Introduction to ML Strategy

Why ML Strategy

Try to find a quick and effective way to choose a strategy

Ways of analyzing ML problems

Orthogonalization

Chain of assumptions in ML

Fit training set well on cost function => bigger network, Adam, …
Fit dev set well on cost function => Regularization, Bigger training set
Fit test set well on cost function => Bigger dev set
Perform well in real world => Change the devset or cost function

Try not to use early stopping, as it simultaneously affects the cost on the training set and on the dev set.

Setting up your goal

Single number evaluation metric

First

Classifier	Precision	Recall
A	95%	90%
B	98%	85%

Rather than using two numbers, define a single evaluation metric

Classifier	Precision	Recall	F1 Score
A	95%	90%	92.4%
B	98%	85%	91.0%

F1 score = 2 / ((1/P) + (1/R)) :: "Harmonic mean" of precision and recall.
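The F1 scores in the table can be reproduced with a short sketch. The dictionary layout and names are just one way to organize the numbers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 / (1 / precision + 1 / recall)

classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}
scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: F1 = {score:.1%}")   # A: F1 = 92.4%, B: F1 = 91.0%
```

With a single number per classifier, picking the best one is a one-line `max`.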

So:

Having a good dev set plus a single evaluation metric really speeds up iterating.

Another example

Algorithm	US	China	India	Other
A	3%	7%	5%	9%
…
F	…	…

Try to improve the average.

Satisficing and Optimizing metric

It's not always easy to select one metric to optimize.

Another cat classification example

Classifier	Accuracy	Running Time
A	90%	80ms
B	92%	95ms
C	95%	1500ms

cost = accuracy - 0.5 × running time

maximize accuracy s.t. running time < 100ms

Accuracy <- Optimizing
Running time <- Satisficing

If you have n metrics, pick one to optimize and treat all the others as satisficing.
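Selecting a classifier with one optimizing and one satisficing metric could be sketched like this, using the numbers from the table. The variable names are made up.

```python
# (accuracy, running time in ms) for the three classifiers in the table.
classifiers = {"A": (0.90, 80), "B": (0.92, 95), "C": (0.95, 1500)}
MAX_RUNNING_TIME_MS = 100   # satisficing threshold

# Keep only classifiers that satisfy the constraint, then optimize accuracy.
feasible = {name: acc for name, (acc, ms) in classifiers.items()
            if ms < MAX_RUNNING_TIME_MS}
best = max(feasible, key=feasible.get)
```

Classifier C has the highest accuracy but is filtered out by the running-time constraint, so B is selected.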

Train/dev/test distribution

How you can set up these datasets to speed up your work.

Cat classification dev/test sets

Make sure the dev and test sets come from the same distribution.

True story (details changed)

Optimizing on a dev set of loan approvals for medium-income zip codes.

(repay loan?)

Tested on low-income zip codes.

Lost 3 months

Guideline

Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.

Size of dev and test sets

Old way of splitting

70% train, 30% test
60% train, 20% dev, 20% test

This was reasonable for at most about 10^4 examples

But in the new era, with 10^6 examples:

train: 98%, Dev 1%, Test 1%.

Size of test set

Set your test set to be big enough to give high confidence in the overall performance of your system. Can be far less than 30% of your data.

For some applications, you don't need a test set and only need a dev set, for example if you have a very large dev set.

When to change dev/test sets and metrics?

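Metric: classification error
Algorithm A: 3% error → lets through a lot of porn images
Algorithm B: 5% error → doesn't let porn images through

So your metric + evaluation prefer A, but you and your users prefer B.

When this happens, the metric is mispredicting which algorithm is better: B is actually better, so the metric should change.

Error: $\frac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} I\{y_{pred}^{(i)} \neq y^{(i)}\}$

This treats porn and non-porn images equally, but you don't want that.

We add a weight $w^{(i)}$ to the formula, 1 if $x^{(i)}$ is non-porn and a larger value (e.g. 10) if it is porn, and normalize by $\sum_i w^{(i)}$: $\frac{1}{\sum_i w^{(i)}} \sum_{i=1}^{m_{dev}} w^{(i)} I\{y_{pred}^{(i)} \neq y^{(i)}\}$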
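A weighted error metric might be sketched as follows. The weight of 10 for pornographic images, the normalization by the sum of weights and the five-image dev set are all assumptions chosen for illustration.

```python
def weighted_error(y_pred, y_true, weights):
    """Weighted misclassification rate: sum(w * 1[pred != true]) / sum(w)."""
    wrong = sum(w for p, t, w in zip(y_pred, y_true, weights) if p != t)
    return wrong / sum(weights)

# Hypothetical dev set of five images; the last one is pornographic.
y_true = [1, 0, 1, 0, 0]
is_porn = [False, False, False, False, True]
weights = [10 if porn else 1 for porn in is_porn]   # assumed weighting

pred_a = [1, 0, 1, 0, 1]   # lets the pornographic image through
pred_b = [0, 0, 1, 0, 0]   # one ordinary mistake

err_a = weighted_error(pred_a, y_true, weights)   # 10 / 14
err_b = weighted_error(pred_b, y_true, weights)   # 1 / 14
```

Both classifiers make exactly one mistake, so an unweighted error rates them equally; the weighted version penalizes the one that lets the pornographic image through.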

Orthogonalization for cat pictures: anti-porn

1. So far we've only discussed how to define a metric to evaluate classifiers
2. Worry separately about how to do well on this metric

1. is placing the target, and 2. is aiming at the target.

Another example

Alg A: 3% err
Alg B: 5% err

But B does better in practice. You see that users are uploading blurrier images: your dev/test sets do not contain the same kind of images.

Change your metric and/or dev/test set.

Comparing to Human-level performance

Why human-level performance

Human-level perf vs Bayes optimal error

Humans are generally very close to Bayes performance on a lot of tasks.

While ML is worse than humans, you can:

get labeled data from humans
gain insight from manual error analysis (why did a person get this right?)
better analysis of bias/variance

Avoidable bias

Cat classification example

Humans	1%	7.5%
Training error	8%	8%
Dev error	10%	10%
	focus on bias	focus on variance

Human-level error as a proxy (estimate) for Bayes error.

Diff between human err and training err = avoidable bias
Diff between training err and dev err = variance
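The two differences can be computed directly. The sketch below applies them to both columns of the cat classification table, with human-level error standing in for Bayes error; the function name and the tie-breaking rule are made up.

```python
def diagnose(human_err, train_err, dev_err):
    """Return (avoidable bias, variance, which one to focus on)."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# The two columns of the cat classification table, in percent.
print(diagnose(1.0, 8.0, 10.0))   # (7.0, 2.0, 'bias')
print(diagnose(7.5, 8.0, 10.0))   # (0.5, 2.0, 'variance')
```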

Understanding Human-level performance

Human-level error as proxy for Bayes error

Medical image classification example: suppose
(a) Typical human: 3% err
(b) Typical doctor: 1% err
(c) Experienced doctor: 0.7% err
(d) Team of experienced doctors: 0.5% err

What is "human-level" error?

Bayes error is <= 0.5%, so we use 0.5% as the proxy for Bayes error, as seen before.

For a paper, surpassing (b) may be good enough to claim human-level performance.

Error analysis example

Human (proxy for Bayes err)	1, 0.7, 0.5%	1, 0.7, 0.5%	1, 0.7, 0.5%
Train err	5%	1%	0.7%
Dev err	6%	5%	0.8%

Case 1: the choice of human-level error doesn't matter, because avoidable bias (at least 5% - 1% = 4%) is bigger than variance (6% - 5% = 1%)

Case 2: focus on variance

Case 3: it is very important to use 0.5% as your "human-level" error, because it shows that you should focus on bias (0.7% - 0.5% = 0.2%) rather than on variance (0.8% - 0.7% = 0.1%).

This problem arises only when you're already doing very well.

Summary of bias/variance with human-level perf

Human-level error (proxy for Bayes err)

"Avoidable bias"

Training error

"Variance"

Dev error

Surpassing human-level performance

Team	0.5%	0.5%
One human	1%	1%
Training error	0.6%	0.3%
Dev error	0.8%	0.4%
Avoidable bias?	~0.1%	can't know

Problems where ML significantly surpasses human-level performance

Online advertising
Product recommendations
Logistics (predicting transit time)
Loan approvals

All those examples:

come from structured data
are not natural perception problems
have lots of data

Also: speech recognition, some image recognition, medical tasks (ECG, skin cancer, etc.)

Improving your model performance

Set of guidelines

The two fundamental assumptions of supervised learning

You can fit the training set pretty well (~ avoidable bias)
The training set performance generalizes pretty well to the dev/test set

Reducing (avoidable) bias and variance

Human-level error (proxy for Bayes err)

train bigger model

"Avoidable bias" => train longer/better optimization algorithms (momentum, RMSprop, Adam)

NN architecture/hyperparameters search (RNN, CNN…)

Training error

More data

"Variance" => Regularization (L2, dropout, data augmentation)

NN architecture/hyperparameters search

Dev error

These concepts are easy to learn but hard to master. Apply them and you'll be more systematic than most ML teams.

Week 2

Error Analysis

Carrying out error analysis

Imagine your cat algorithm doesn't work as well as expected.
One of your collaborators thinks you should focus on pictures of dogs that get misclassified as cats.
Manually analyze 100 mislabeled dev set examples
Count how many are dogs
Suppose 5% are dogs. At most you could go from 10% err to 9.5%, so it is not very useful.
Suppose that 50% of them are dogs. You could go down from 10% to 5%, so you can be more confident the work is worthwhile.
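The ceiling on improvement is simple arithmetic, sketched here with a made-up helper name: if a fraction of the mistakes belongs to one category, fixing all of them removes at most that fraction of the overall error.

```python
def best_case_error(overall_err, fraction_in_category):
    """Error rate left if every mistake in the category were fixed."""
    return overall_err * (1 - fraction_in_category)

low = best_case_error(10.0, 0.05)    # 5% of mistakes are dogs: about 9.5% error left
high = best_case_error(10.0, 0.50)   # 50% of mistakes are dogs: about 5% error left
```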

Evaluate multiple ideas in parallel

fix pictures of dogs being recognized as cats
fix great cats (lions, panthers, …) being misrecognized
improve performance on blurry images

Create a spreadsheet:

Image	Dog	Great cats	Blurry
1	ok
2			ok
3		ok	ok
…
% of total	8%	43%	61%

You sometimes notice other categories along the way, such as Instagram filters…

The totals make it easy to see where you should improve.

Cleaning up incorrectly labeled dataset

Incorrectly labeled examples

Suppose you have incorrectly labeled data. First let's consider the training set.

As long as there are not too many errors, DL is quite robust to random errors.

Systematic errors, however, are a problem.

Error analysis

Image	Dog	Great cats	Blurry	Comments
…
98	ok			labeler missed cat in background
99			ok
100		ok	ok	drawing of a cat not a real cat
% of total	8%	43%	61%

1st case:

Overall dev set error: 10%
Errors due to incorrect labels: 0.6%
Errors due to other causes: 9.4%

2nd case:

Overall dev set error: 2%
Errors due to incorrect labels: 0.6%
Errors due to other causes: 1.4%

In the 2nd case, take the time to fix the mislabeled examples.

Correcting incorrect dev/test set examples

Apply the same process to your dev and test sets to make sure they continue to come from the same distribution.
Consider examining examples your algorithm got right as well as ones it got wrong.
Train and dev/test data may now come from slightly different distributions

Build your first system quickly then iterate

Speech recognition example

noisy background
- café noise
- car noise
Accented speech
Far from microphone
young children's speech
stuttering, uh, ah, um…

There are 50 directions you could go in; which one should you focus on?

Set up dev/test set and metric
Build initial system quickly
Use Bias/Variance analysis & Error Analysis to prioritize next steps

Guideline: Build your first system quickly then iterate

Do not overthink; build something quick and dirty first.

Mismatched training and dev/test set

Training and testing on different distributions

Cat app example

Two sources of data:

data from webpages
data from mobile app

Let's say you don't have a lot of users (~10k images from mobile, 200k from the web)

You care about doing well on mobile images. You don't want to use only the 10k, but the dilemma is that the 200k aren't from the same distribution.

Option 1: take the 210k images and split them randomly between train/dev/test (train 205k, dev 2.5k, test 2.5k)

advantage: same distribution
disadvantage: you optimize performance on web images instead of mobile images.
only about 119 of the 2.5k dev images will be from mobile.

Option 1 not recommended

Option 2:

the train set has 200k images from the web and 5k from the mobile app.
dev and test sets are all mobile app images.
advantage: you are aiming at the target where you want it to be.
disadvantage: your training distribution is different

But over the long term it will get you better performance
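The numbers behind the two options can be checked with a few lines of arithmetic. The variable names are made up, and the 2.5k/2.5k dev/test split for Option 2 is an assumption based on the 5k mobile images that remain.

```python
mobile, web = 10_000, 200_000
dev_size = 2_500

# Option 1: shuffle everything, so mobile images appear in proportion.
mobile_share = mobile / (mobile + web)
expected_mobile_in_dev = dev_size * mobile_share   # about 119 of 2,500

# Option 2: dev and test are drawn only from mobile images.
train_option2 = web + 5_000
dev_option2 = test_option2 = 2_500                 # all from the mobile app
```

Under Option 1 fewer than 5% of the dev set reflects the data you care about, which is why the target ends up in the wrong place.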

Speech recognition example

Speech-activated rearview mirror. (real product in China)

1. Training: take all the speech data you have: purchased data, smart speaker control, voice keyboard… (500k)
2. Dev/test: speech-activated rearview mirror (20k)

Set your training set to be the 500k from 1. and your dev/test sets from 2.

Alternatively, the training set could be 510k (500k from 1. and 10k from 2.), with the dev and test sets taken from the rest of 2. (5k + 5k).

Either way, you get a much bigger training set.

Bias and Variance with mismatched data distribution

Cat classifier example

Assume humans get ~0% error.

Training error	1%
Dev error	10%

Maybe there isn't a variance problem, since the dev set comes from a different distribution.

Training-dev set: Same distrib as training set but not used for training.

Train / dev / test ==> Train split in train-2 and train-dev

So now you learn only on train-2 and check on train-dev and dev and test.

Train err%	1%	1%
Train-dev err%	9%	1.5%
dev err%	10%	10%
	Var pb	data mismatch pb

Other examples:

Human err%	0%	0%
Train err%	10%	10%
Train-dev err%	11%	11%
dev err%	12%	20%
	Bias pb	Bias + data mismatch pb
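The reading of these tables can be sketched as three differences. The function below is a hypothetical helper, and picking the largest gap is only a rough heuristic for where to look first.

```python
def error_gaps(human, train, train_dev, dev):
    """Split the dev error into avoidable bias, variance and data mismatch."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
    }

# Columns from the tables, in percent (human-level error assumed to be about 0%).
examples = {
    "variance": error_gaps(0, 1, 9, 10),
    "data mismatch": error_gaps(0, 1, 1.5, 10),
    "bias": error_gaps(0, 10, 11, 12),
    "bias + data mismatch": error_gaps(0, 10, 11, 20),
}
for label, gaps in examples.items():
    print(label, "->", max(gaps, key=gaps.get), gaps)
```

In the last column the two largest gaps are avoidable bias (10) and data mismatch (9), which matches the "bias + data mismatch" label.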

Bias/variance on mismatched training and dev/test sets

Human level

avoidable bias

Training set error

variance

Training-dev set error

10%

data mismatch

Dev error

12%

degree of overfitting to dev set

Test error

12%

Example where the training distribution is much harder than the dev/test distribution:

Human level	4%
Training set error	7%
Training-dev set error	10%
Dev error	6%
Test error	6%

More general formulation

The numbers can be placed in a table:

	General Speech rec tasks	Rearview mirror speech data
Human lvl	"Human level err" (4%)	6%
err on trained on	"Training err" (7%)	6%
err not trained on	"Training-dev err" (10%)	"Dev/Test err" (6%)

Addressing data mismatch

There isn't any systematic way to address this, but there are things you can try.

Addressing data mismatch

Carry out manual error analysis to try to understand the differences between the training and dev/test sets. Ex: you might find that a lot of the dev set is noisy (car noise).
Make the training data more similar, or collect more data similar to the dev/test sets. Ex: simulate noisy in-car data.

Artificial data synthesis

Clean audio + car noise = synthesized in-car audio

This creates more data and can be a reasonable process.

Let's say you have 10k hrs of clean audio and only 1 hr of car noise.

There is a risk your algorithm will overfit to that 1 hr of car noise.

Artificial data synthesis (2)

Car recognition

Using computer-generated cars vs. just photos: you might overfit to the generated cars. A video game might have only 20 car models, so the network overfits to these 20 cars.

Learning from multiple tasks

Transfer learning

Learning to recognize cats can help with reading x-ray scans.

Transfer learning

Create a new NN by changing just the last layer (the output).

(X,Y) now become (radiology images, diagnosis)

Retrain W^[L], b^[L].

You might want to retrain just the last layer, or all the layers.

Rule of thumb: retrain just the last layer if you have little data, and all the layers if you have a lot of data.

This is called pre-training, then fine-tuning.

A lot of low-level features learned from a very large data set might help.

Another example. Speech recognition system:

X (audio), y (speech recognition) → new task: wake word / trigger word detection (ok google, hey siri, etc…)

You could add several new layers, and retrain the new layers or even more layers.

Transfer makes sense when the two tasks have very different numbers of examples.

10^6 images for image recognition, but only 100 radiology images.
10k hrs of audio, but only 1 hr of data for wake words…

Transfer from the task with a lot of data to the task with little data.

It doesn't make sense to transfer the other way.

When transfer learning makes sense

Task from A to B

Task A and B have the same input X
You have a lot more data for Task A than Task B
Low level features from A could be helpful for learning B

Multi-task learning

Simultaneously learn multiple tasks.

Simplified autonomous driving example

	y^(i)	(4,1)
pedestrians	0
cars	1
stop signs	1
traffic lights	0

Y = [ y^(1) y^(2) …. y^(m) ]

Neural network architecture

x -> [] -> [] …. -> ŷ in R^4

Loss: $\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{4} L(\hat{y}^{(i)}_j, y^{(i)}_j)$

L is the usual loss function.

Unlike softmax regression, one image can have multiple labels.

One NN doing 4 things often works better than training 4 separate NNs, one per task.

Some examples might not be fully labeled. You can still train on them by summing only over the entries that have a 0/1 label and skipping the ? marks (unlabeled values).

So you can use more of the available information.
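A masked multi-task loss might look like the sketch below. Representing the "?" entries as `np.nan` and dividing by m (rather than by the number of labeled entries) are assumptions made for illustration.

```python
import numpy as np

def multitask_loss(Y_hat, Y):
    """Mean over examples of the summed logistic loss, skipping unlabeled entries.

    Y_hat and Y have shape (n_tasks, m); unlabeled entries of Y are np.nan.
    """
    labeled = ~np.isnan(Y)
    Y_filled = np.where(labeled, Y, 0.0)               # placeholder value, masked out below
    losses = -(Y_filled * np.log(Y_hat) + (1 - Y_filled) * np.log(1 - Y_hat))
    m = Y.shape[1]
    return np.sum(losses * labeled) / m

# Hypothetical labels for 4 tasks (pedestrians, cars, stop signs, traffic lights)
# on m = 2 images; np.nan marks a missing ("?") label.
Y = np.array([[0, 1], [1, np.nan], [1, 0], [0, np.nan]])
Y_hat = np.full(Y.shape, 0.5)
loss = multitask_loss(Y_hat, Y)
```

With every prediction at 0.5, each of the six labeled entries contributes $\log 2$, so the loss is $3\log 2$ for the two images.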

When multi-task learning makes sense

Training on a set of tasks that could benefit from having shared lower-level features
Usually: the amount of data you have for each task is quite similar
Can train a big enough neural network to do well on all the tasks

In practice, multi-task learning is used much less often than transfer learning.

End-to-end deep learning

What is end-to-end deep learning?

Speech recognition example

audio -> (MFCC) -> features -> (ML) -> phonemes -> words -> transcript

audio ————————————————> transcript

You might need a lot of data. With 3k hrs of data, the classical approach works better. With 10k to 100k hrs, the end-to-end approach generally shines.

Face recognition

Multi-stage approach works better:

1. detect the face, zoom in and crop to center the face
2. then feed this cropped image to a network that finds the identity, generally by comparing to all employees.

Why?

Have a lot of data for task 1
Have a lot of data for task 2

If you were to try to learn everything at the same time you wouldn't have enough data.

More examples

Machine translation:

English -> text analysis -> … -> French
English ————————-> French

End-to-end works well here because we have a lot of (x,y) examples.

Estimating child's age from scan of the hand:

Image -> bones -> age
Image ———-> age (there is not enough data)

Whether to use end-to-end deep learning

Pros and cons of end-to-end learning

Pros:

let the data speak (no human preconception)
Less hand-designing of components needed

Cons:

May need a large amount of data: input end —-> output end (x,y)
Excludes potentially useful hand-designed components. Data, Hand-design

Applying end-to-end deep learning

Key question: do you have sufficient data to learn a function of the complexity needed to map x to y?

choose the X->Y mapping carefully
a pure deep learning approach is not appropriate if it is hard to find end-to-end examples.
