Coursera Deep Learning Module 1 Week 2 Notes
Week 2 - Neural Networks Basics
Binary Classification
Given a pair \((x, y)\) where \(x \in \mathbb{R}^{n_x}\) and \(y \in \{0, 1\}\). With \(m\) training examples you have \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}\). Stacking the examples as columns gives the matrices \(X \in \mathbb{R}^{n_x \times m}\) and \(Y \in \mathbb{R}^{1 \times m}\), which are much easier to train on in batches since many vectorized optimizations are readily available.
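A minimal numpy sketch of this stacking (the feature count n_x, example count m, and the data are made up for illustration):
import numpy as np

n_x, m = 4, 6                                      # hypothetical sizes
xs = [np.random.rand(n_x) for _ in range(m)]       # the m examples x^(i)
ys = [np.random.randint(0, 2) for _ in range(m)]   # the m binary labels y^(i)

X = np.stack(xs, axis=1)        # examples stacked as columns -> shape (n_x, m)
Y = np.array(ys).reshape(1, m)  # labels stacked as a row     -> shape (1, m)
assert X.shape == (n_x, m) and Y.shape == (1, m)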
Logistic Regression
- Given $x$, we want $\hat{y} = P(y=1 \mid x)$
- $x \in \mathbb{R}^{n_x}$, $0 \leq \hat{y} \leq 1$
- Parameters: $ w \in \mathbb{R}^{n_x}, b \in \mathbb{R} $
- Output $\hat{y} = \sigma(w^Tx + b)$
- $z = w^Tx + b$
- $\sigma(z) = \frac{1}{1+e^{-z}}$
- If z is large (\(\displaystyle\lim_{z\to\infty}\)) then $\sigma(z) \approx \frac{1}{1+0} = 1$
- If z is a large negative number (\(\displaystyle\lim_{z\to-\infty}\)) then $\sigma(z) \approx \frac{1}{1+\infty} \approx 0$
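A minimal numpy sketch of this forward computation for one example (the values of w, b, and x below are made up):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))        # sigma(z) = 1 / (1 + e^(-z))

w = np.array([0.5, -1.0])              # hypothetical weights, shape (n_x,)
b = 0.1                                # hypothetical bias
x = np.array([2.0, 1.0])               # one input example, shape (n_x,)

z = np.dot(w, x) + b                   # z = w^T x + b
y_hat = sigmoid(z)                     # y_hat = P(y = 1 | x), always between 0 and 1
print(y_hat)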
Logistic Regression Cost Function
- Given \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}\), we want $ \hat{y}^{(i)} \approx y^{(i)}$
- Squared Error: $\mathcal{L}(\hat{y}, y) = \frac{(\hat{y} - y)^2}{2}$ is not a good option because it makes the resulting optimization problem non-convex (gradient descent could get stuck in local optima)
- $\mathcal{L}(\hat{y}, y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$
- If $y=1: \mathcal{L}(\hat{y}, y) = -\log\hat{y}$ <- want $\log\hat{y}$ to be large meaning we want $\hat{y}$ large
- If $y=0: \mathcal{L}(\hat{y}, y) = -\log(1-\hat{y})$ <- want $\log(1-\hat{y})$ to be large meaning we want $\hat{y}$ small
- Cost function: $\mathcal{J}(w, b) = \frac{1}{m} \sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$, in other words, the cost function is the average of the loss on the given examples
- Loss function applies to a single training example
- Cost function applies to all the examples
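A minimal numpy sketch of the loss and cost as defined above (the predictions and labels are made-up values):
import numpy as np

def loss(y_hat, y):
    # cross-entropy loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    # the cost is the average of the per-example losses
    return np.mean(loss(Y_hat, Y))

# made-up predictions and labels for illustration
Y_hat = np.array([0.9, 0.2, 0.6])
Y = np.array([1, 0, 1])
print(cost(Y_hat, Y))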
Gradient Descent
- We want to find $w, b$ that minimize $\mathcal{J}(w, b)$
- This cost function is convex, so it has a single global minimum
- Gradient Descent: $w := w - \alpha \frac{\partial\mathcal{J}(w, b)}{\partial w}$ and $b := b - \alpha \frac{\partial\mathcal{J}(w, b)}{\partial b}$ where $\alpha$ is the learning rate
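A toy sketch of the update rule on a simple convex function, $J(w) = (w - 3)^2$ with derivative $\frac{dJ}{dw} = 2(w - 3)$ (this function is made up purely for illustration; it is not the logistic regression cost):
alpha = 0.1              # learning rate
w = 0.0                  # arbitrary starting point
for _ in range(100):
    dw = 2 * (w - 3)     # dJ/dw for J(w) = (w - 3)^2
    w = w - alpha * dw
print(w)                 # approaches the single minimum at w = 3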
Derivatives
- Intuition about derivatives: for $f(a) = 3a$, the slope of $f(a)$ at $a=2$ is 3 and at $a=5$ is 3 as well. Therefore $\frac{df(a)}{da} = 3 = \frac{d}{da}f(a)$
- For a line the slope doesn't change, so the derivative is constant
More Derivative Examples
- For $f(a) = a^2$, the slope of $f(a)$ at $a=2$ is 4 and at $a=5$ is 10. Therefore $\frac{df(a)}{da} = 2a$
- The slope is different at different points because the function doesn't increase linearly.
- $f(a) = a^3$ has derivative as $\frac{df(a)}{da} = 3a^2 $
- $f(a) = \ln(a)$ has derivative as $\frac{df(a)}{da} = \frac{1}{a} $
- The derivative means the slope of the function
- The slope of the function can be different on different points of the function
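The "derivative as slope" idea can be checked numerically with a small finite difference (a minimal sketch using $f(a) = a^2$; the step size eps is arbitrary):
def f(a):
    return a ** 2

a, eps = 2.0, 1e-6
numeric = (f(a + eps) - f(a)) / eps   # approximate slope at a = 2
analytic = 2 * a                      # df/da = 2a
print(numeric, analytic)              # both are close to 4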
Computation Graph
- $\mathcal{J} = 3(a+bc)$
- $ u = bc$
- $ v = a+u$
- $\mathcal{J} = 3v$
- One step of backward propagation on a computation graph yields the derivative of the final output variable.
Derivatives with a Computation Graph
- $\frac{d\mathcal{J}}{dv} = 3$
- \(\frac{d\mathcal{J}}{da} = \frac{d\mathcal{J}}{dv} \frac{dv}{da} = 3 * 1 = 3\) (chain rule)
- $\frac{d\mathcal{J}}{du} = \frac{d\mathcal{J}}{dv} \frac{dv}{du} = 3 * 1 = 3$ (chain rule)
- $\frac{d\mathcal{J}}{db} = \frac{d\mathcal{J}}{du} \frac{du}{db} = 3 * 2 = 6$ (chain rule)
- $\frac{d\mathcal{J}}{dc} = \frac{d\mathcal{J}}{du} \frac{du}{dc} = 3 * 3 = 9$ (chain rule)
- The most efficient way to compute the derivatives is to flow from right to left (backward) through the graph; a small numeric sketch follows after the quiz note below
- In this class, what does the coding convention dvar represent?
- The derivative of any variable used in the code.
- The derivative of input variables with respect to various intermediate quantities.
- The derivative of a final output variable with respect to various intermediate quantities.
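A small numeric sketch of the forward and backward passes on this graph, using the lecture's example values ($a=5$, $b=3$, $c=2$, so $\mathcal{J}=33$):
a, b, c = 5, 3, 2

# forward pass (left to right)
u = b * c            # u = 6
v = a + u            # v = 11
J = 3 * v            # J = 33

# backward pass (right to left, chain rule)
dv = 3               # dJ/dv
da = dv * 1          # dJ/da = dJ/dv * dv/da = 3
du = dv * 1          # dJ/du = dJ/dv * dv/du = 3
db = du * c          # dJ/db = dJ/du * du/db = 3 * 2 = 6
dc = du * b          # dJ/dc = dJ/du * du/dc = 3 * 3 = 9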
Logistic Regression Gradient Descent
- Suppose we only have two features \(x_1\) and \(x_2\)
- Running backward propagation through the logistic regression computation graph gives the derivatives needed to update the weights and the bias
- In this video, what is the simplified formula for the derivative of the loss with respect to z? A: a - y
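- A quick way to verify this, using $a = \sigma(z)$ and the fact that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:
- $\frac{d\mathcal{L}}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$
- $\frac{da}{dz} = a(1-a)$
- $\frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{da}\frac{da}{dz} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right)a(1-a) = -y(1-a) + (1-y)a = a - y$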
Gradient Descent on \(m\) examples
- Cost function is the mean of the loss function on all the m examples
- Let \(J=0, dw_1=0, dw_2=0, db=0\)
- For $i=1$ to $m$:
- $z^{(i)} = w^Tx^{(i)}+b$
- $a^{(i)} = \sigma(z^{(i)})$
- $\mathcal{J} += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log (1-a^{(i)})]$
- $dz^{(i)} = a^{(i)} - y^{(i)}$
- $dw_1 += x_1^{(i)} dz^{(i)}$
- $dw_2 += x_2^{(i)} dz^{(i)}$
- $db += dz^{(i)}$
- $J /= m$
- $dw_1 /= m$
- $dw_2 /= m$
- $db /= m$
- $w_1 := w_1 - \alpha dw_1$
- $w_2 := w_2 - \alpha dw_2$
- $b := b - \alpha db$
- The derivatives are accumulated across the loop and divided by $m$ at the end, which computes their mean
- Explicit for loops like this are bad for performance and parallelization; now that vectorization is readily available, the guideline is to avoid them whenever possible (a runnable version of this loop is sketched below)
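A runnable Python translation of one pass of this loop (the data, the initial parameters, and the learning rate below are made up):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# made-up data: n_x = 2 features, m = 100 examples
m = 100
X = np.random.rand(2, m)
Y = np.random.randint(0, 2, size=m)

w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.01

J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):
    z = w1 * X[0, i] + w2 * X[1, i] + b          # z^(i) = w^T x^(i) + b
    a = sigmoid(z)                               # a^(i)
    J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
    dz = a - Y[i]                                # dz^(i)
    dw1 += X[0, i] * dz
    dw2 += X[1, i] * dz
    db += dz
J /= m; dw1 /= m; dw2 /= m; db /= m              # averages over the m examples

# one gradient-descent update
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db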
Vectorization
Vectorization is basically the art of getting rid of explicit for loops in your code. – Andrew Ng
- What is vectorization? It is the ability to remove explicit for loops from your code and implement the computation mostly with matrix (or tensor) operations
import numpy as np
a = np.array([1, 2, 3, 4])
print(a)

# Vectorized
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print(c)
print("Vectorized version: " + str(1000*(toc-tic)) + "ms")  # about 1.5ms

# Non-vectorized
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()
print(c)
print("For loop: " + str(1000*(toc-tic)) + "ms")  # about 400~500ms
- The rule of thumb to remember is: whenever possible, avoid using explicit for loops
- Vectorization can be done without a GPU.
More vectorization examples
- Use built-in functions and avoid explicit for loops (this guideline also applies broadly to numpy and pandas); a small element-wise example is sketched after the pseudocode below
- Let \(J=0, db=0\)
- $dw = np.zeros((n_x, 1))$
- For $i=1$ to $m$:
- $z^{(i)} = w^Tx^{(i)}+b$
- $a^{(i)} = \sigma(z^{(i)})$
- $\mathcal{J} += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log (1-a^{(i)})]$
- $dz^{(i)} = a^{(i)} - y^{(i)}$
- $dw += x^{(i)} dz^{(i)}$
- $db += dz^{(i)}$
- $J /= m$
- $dw /= m$
- $db /= m$
- $w := w - \alpha\, dw$
- $b := b - \alpha\, db$
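As an example of the built-in-function guideline above, applying a function element-wise to a whole vector with numpy instead of an explicit loop (v is just a made-up vector):
import numpy as np

v = np.random.rand(5)

# loop version
u_loop = np.zeros(5)
for i in range(5):
    u_loop[i] = np.exp(v[i])

# vectorized version using a built-in
u = np.exp(v)   # element-wise exponential of the whole vector
# other handy element-wise built-ins: np.log(v), np.abs(v), np.maximum(v, 0), v**2, 1/v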
Vectorizing Logistic Regression
- This finally uses the $X$ and $Y$ matrices defined at the beginning of the lesson
- Z = np.dot(w.T, X) + b
- A = $\sigma(Z)$
Vectorizing Logistic Regression’s Gradient Output
- Define $dZ = [ dz^{(1)}, dz^{(2)}, …, dz^{(m)}]$
- $A = [ a^{(1)}, a^{(2)}, …, a^{(m)}]$
- $Y = [ y^{(1)}, y^{(2)}, …, y^{(m)}]$
- $dZ = A-Y = [ a^{(1)} - y^{(1)}, a^{(2)} - y^{(2)}, …, a^{(m)} - y^{(m)}]$
- $db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)} = \frac{1}{m} np.sum(dZ)$
- $dw = \frac{1}{m} X dZ^T$
- Now our algorithm looks something like this:
- Forward pass
- Z = np.dot(w.T, X) + b
- A = $\sigma(Z)$
- Backpropagation pass
- $dZ = A-Y$
- $dw = \frac{1}{m} X dZ^T$
- $db = \frac{1}{m} np.sum(dZ)$
- Weights update pass
- $w := w - \alpha dw$
- $b := b - \alpha db$
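Putting the three passes together, a minimal runnable sketch of vectorized logistic regression training (the data, the number of iterations, and the learning rate are made up):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# made-up data: n_x features, m examples
n_x, m = 4, 500
X = np.random.rand(n_x, m)                 # shape (n_x, m)
Y = np.random.randint(0, 2, size=(1, m))   # shape (1, m)

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.05

for _ in range(1000):
    # forward pass
    Z = np.dot(w.T, X) + b                 # shape (1, m)
    A = sigmoid(Z)                         # shape (1, m)
    # backward pass
    dZ = A - Y                             # shape (1, m)
    dw = (1 / m) * np.dot(X, dZ.T)         # shape (n_x, 1)
    db = (1 / m) * np.sum(dZ)
    # update
    w = w - alpha * dw
    b = b - alpha * db
Note that one for loop remains (over gradient-descent iterations); vectorization removes the loops over examples and features, not over the training iterations.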
Broadcasting in Python
import numpy as np
A = np.array([
    [56.0, 0.0, 4.4, 68.0],
    [1.2, 104.0, 52.0, 0.0],
    [1.8, 135.0, 99.0, 0.9]
])                                    # shape (3, 4)
print(A)
cal = A.sum(axis=0)                   # sum down each column -> shape (4,)
print(cal)
percentage = 100*A/cal.reshape(1, 4)  # (3, 4) divided by (1, 4): broadcasting expands the row
print(percentage)
- Given an $(m, n)$ matrix, an operation, and a factor of shape $(1, n)$ or $(m, 1)$, the factor is expanded to $(m, n)$ by copying and the operation is executed element-wise
- Given an $(m, 1)$ matrix (or a vector) and a scalar factor, the scalar is expanded to $(m, 1)$ by copying and the operation is executed element-wise
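A small numpy sketch of both broadcasting rules (the arrays are made up):
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3)
col = np.array([[100.0], [200.0]])     # shape (2, 1)

print(M + row)   # (1, 3) is expanded to (2, 3) by copying the row, then added element-wise
print(M + col)   # (2, 1) is expanded to (2, 3) by copying the column, then added element-wise
print(col * 5)   # a scalar is broadcast against the (2, 1) matrix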
A note on python/numpy vectors
import numpy as np

a = np.random.rand(5)
a.shape          # (5,), a "rank 1 array" - neither a row nor a column vector
a.T              # same as a; a.T.shape is still (5,)
np.dot(a, a.T)   # returns a single number (inner product)

a = np.random.rand(5, 1)
a.shape          # (5, 1), a column vector
a.T              # not equal to a; a.T.shape = (1, 5), a row vector
np.dot(a, a.T)   # returns a (5, 5) matrix (outer product)

# (5, 1) = column vector
# (1, 5) = row vector

# To avoid rank 1 arrays you can use assert
assert(a.shape == (5, 1))
# or reshape into an explicit column (or row) vector
a = a.reshape((5, 1))
Quiz
- What does a neuron compute?
- A neuron computes an activation function followed by a linear function (z = Wx + b)
- A neuron computes the mean of all features before applying the output to an activation function
- A neuron computes a linear function (z = Wx + b) followed by an activation function
- A neuron computes a function g that scales the input x linearly (Wx + b)
- Which of these is the “Logistic Loss”?
- $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|$
- $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = \max(0, y^{(i)} - \hat{y}^{(i)})$
- $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|^2$
- X $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -(y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}))$
- Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?
- x = img.reshape((1, 32*32, 3))
- x = img.reshape((3,32*32))
- x = img.reshape((32*32,3))
- x = img.reshape((32*32*3, 1))
- Consider the two following random arrays “a” and “b”:
a = np.random.randn(2, 3)  # a.shape = (2, 3)
b = np.random.randn(2, 1)  # b.shape = (2, 1)
c = a + b
What will be the shape of “c”?
- c.shape = (3, 2)
- c.shape = (2, 3)
- c.shape = (2, 1)
- The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
- Consider the two following random arrays “a” and “b”:
a = np.random.randn(4, 3)  # a.shape = (4, 3)
b = np.random.randn(3, 2)  # b.shape = (3, 2)
c = a*b
What will be the shape of “c”?
- c.shape = (4, 3)
- c.shape = (4,2)
- c.shape = (3, 3)
- The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
- Suppose you have $n_x$ input features per example. Recall that X = $[x^{(1)} x^{(2)} … x^{(m)}]$. What is the dimension of X?
- ($n_x$, m)
- (1,m)
- (m,$n_x$)
- (m,1)
- Recall that “np.dot(a,b)” performs a matrix multiplication on a and b, whereas “a*b” performs an element-wise multiplication. Consider the two following random arrays “a” and “b”:
a = np.random.randn(12288, 150)  # a.shape = (12288, 150)
b = np.random.randn(150, 45)     # b.shape = (150, 45)
c = np.dot(a,b)
What is the shape of c?
- The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
- c.shape = (12288, 45)
- c.shape = (12288, 150)
- c.shape = (150,150)
- Consider the following code snippet:
import numpy as np
# a.shape = (3,4) and b.shape = (4,1)
for i in range(3):
    for j in range(4):
        c[i][j] = a[i][j] + b[j]
How do you vectorize this?
- c = a.T + b.T
- c = a + b
- c = a + b.T
- c = a.T + b
- Consider the following code:
a = np.random.randn(3, 3)
b = np.random.randn(3, 1)
c = a*b
What will be c? (If you’re not sure, feel free to run this in python to find out).
- This will invoke broadcasting, so b is copied three times to become (3,3), and * is an element-wise product so c.shape will be (3, 3)
- This will invoke broadcasting, so b is copied three times to become (3, 3), and * invokes a matrix multiplication operation of two 3x3 matrices so c.shape will be (3, 3)
- This will multiply a 3x3 matrix a with a 3x1 vector, thus resulting in a 3x1 vector. That is, c.shape = (3,1).
- It will lead to an error since you cannot use “*” to operate on these two matrices. You need to instead use np.dot(a,b)
- Consider the following computation graph. What is the output J?
- J = (c - 1)*(b + a)
- J = (a - 1) * (b + c)
- J = a*b + b*c + a*c
- J = (b - 1) * (c + a)