Coursera Deep Learning Module 1 Week 2 Notes
Week 2: Neural Networks Basics
Binary Classification
A single training example is a pair \((x, y)\) where \(x \in \mathbb{R}^{n_x}\) and \(y \in \{0, 1\}\). With \(m\) training examples you have \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}\). Stacking them together gives the matrices X and Y below, which are much easier to train on in batches since many vectorized optimizations are readily available: \(X \in \mathbb{R}^{n_x \times m}, Y \in \mathbb{R}^{1 \times m}\)
Logistic Regression
 Given $x$, we want $\hat{y} = P(y=1 \mid x)$
 $x \in \mathbb{R}^{n_x}$, $0 \leq \hat{y} \leq 1$
 Parameters: $ w \in \mathbb{R}^{n_x}, b \in \mathbb{R} $
 Output $\hat{y} = \sigma(w^Tx + b)$
 $z = w^Tx + b$
 $\sigma(z) = \frac{1}{1+e^{-z}}$
 If z is large (\(\displaystyle\lim_{z\to\infty}\)) then $\sigma(z) \approx \frac{1}{1+0} = 1$
 If z is a large negative number (\(\displaystyle\lim_{z\to-\infty}\)) then $\sigma(z) \approx \frac{1}{1+\infty} \approx 0$
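A minimal NumPy sketch of this forward computation (the values of x, w, and b below are made up just to illustrate the shapes):
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

n_x = 3
x = np.array([[0.5], [-1.2], [3.0]])  # one example, shape (n_x, 1)
w = np.array([[0.1], [0.4], [-0.2]])  # parameters w, shape (n_x, 1)
b = 0.0                               # bias b, a real number
z = np.dot(w.T, x) + b                # z = w^T x + b
y_hat = sigmoid(z)                    # y_hat = sigma(z), a probability in (0, 1)
print(y_hat)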
Logistic Regression Cost Function
 Given \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}\), we want $ \hat{y}^{(i)} \approx y^{(i)}$
 Squared Error: $\mathcal{L}(\hat{y}, y) = \frac{(\hat{y} - y)^2}{2}$ might not be a good option because it makes the resulting optimization problem non-convex (gradient descent can get stuck in local optima)
 $\mathcal{L}(\hat{y}, y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$
 If $y=1$: $\mathcal{L}(\hat{y}, y) = -\log\hat{y}$, so we want $\log\hat{y}$ to be large, meaning we want $\hat{y}$ to be large (close to 1)
 If $y=0$: $\mathcal{L}(\hat{y}, y) = -\log(1-\hat{y})$, so we want $\log(1-\hat{y})$ to be large, meaning we want $\hat{y}$ to be small (close to 0)
 Cost function: $\mathcal{J}(w, b) = \frac{1}{m} \sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$, in other words, the cost function is the average of the loss on the given examples
 Loss function applies to a single training example
 Cost function applies to all the examples
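A tiny NumPy sketch of the cost computation (A and Y below are made-up predictions and labels, not course data):
import numpy as np

A = np.array([0.9, 0.2, 0.7, 0.4])  # example predictions y_hat^(i)
Y = np.array([1.0, 0.0, 1.0, 0.0])  # example labels y^(i)
m = Y.shape[0]
losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # per-example loss
J = np.sum(losses) / m                               # cost = average of the losses
print(J)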
Gradient Descent
 We want to find $w, b$ that minimize $\mathcal{J}(w, b)$
 The cost function $\mathcal{J}(w, b)$ is convex, so it has a single (global) minimum
 Gradient Descent: $w := w - \alpha \frac{\partial\mathcal{J}(w, b)}{\partial w}$ and $b := b - \alpha \frac{\partial\mathcal{J}(w, b)}{\partial b}$ where $\alpha$ is the learning rate
Derivatives
 Intuition about derivatives: $f(a) = 3a$ the slope of $f(a)$ at $a=2$ is 3 and at $a=5$ is 3 as well. Therefore $\frac{df(a)}{da} = 3 = \frac{d}{da}f(a)$
 On a straight line the slope doesn't change, so the derivative is constant
More Derivative Examples
 $f(a) = a^2$ the slope of $f(a)$ at $a=2$ is 4 and at $a=5$ is 10. Therefore $\frac{df(a)}{da} = 2a $
 The slope is different at different values of $a$ because $f(a)$ doesn't increase linearly.
 $f(a) = a^3$ has derivative as $\frac{df(a)}{da} = 3a^2 $
 $f(a) = \ln(a)$ has derivative as $\frac{df(a)}{da} = \frac{1}{a} $
 The derivative is the slope of the function
 The slope of the function can be different on different points of the function
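A quick numerical sanity check of these slopes with a small finite difference (eps is an arbitrary small step; the point a = 5 is just an example):
import numpy as np

eps = 1e-6
a = 5.0
print(((a + eps)**2 - a**2) / eps)          # ~10, matches 2a at a = 5
print(((a + eps)**3 - a**3) / eps)          # ~75, matches 3a^2 at a = 5
print((np.log(a + eps) - np.log(a)) / eps)  # ~0.2, matches 1/a at a = 5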
Computation Graph
 $\mathcal{J} = 3(a+bc)$
 $ u = bc$
 $ v = a+u$
 $\mathcal{J} = 3v$
 One step of backward propagation on a computation graph yields the derivative of the final output variable.
Derivatives with a Computation Graph
 $\frac{d\mathcal{J}}{dv} = 3$
 \(\frac{d\mathcal{J}}{da} = \frac{d\mathcal{J}}{dv} \frac{dv}{da} = 3 * 1 = 3\) (chain rule)
 $\frac{d\mathcal{J}}{du} = \frac{d\mathcal{J}}{dv} \frac{dv}{du} = 3 * 1 = 3$ (chain rule)
 $\frac{d\mathcal{J}}{db} = \frac{d\mathcal{J}}{du} \frac{du}{db} = 3 * 2 = 6$ (chain rule)
 $\frac{d\mathcal{J}}{dc} = \frac{d\mathcal{J}}{du} \frac{du}{dc} = 3 * 3 = 9$ (chain rule)
 The most efficient way to compute all the derivatives is to flow from right to left through the graph (see the small Python sketch after the question below)
 In this class, what does the coding convention dvar represent?
 The derivative of any variable used in the code.
 The derivative of input variables with respect to various intermediate quantities.
 The derivative of a final output variable with respect to various intermediate quantities.
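A tiny Python sketch of this graph, using b = 3 and c = 2 (matching the numbers above) and a = 5 as an assumed example value; the derivative variables follow the dvar naming convention from the question:
# forward pass, left to right
a, b, c = 5, 3, 2
u = b * c        # u = 6
v = a + u        # v = 11
J = 3 * v        # J = 33
# backward pass, right to left (chain rule)
dv = 3           # dJ/dv
da = dv * 1      # dJ/da = dJ/dv * dv/da = 3
du = dv * 1      # dJ/du = dJ/dv * dv/du = 3
db = du * c      # dJ/db = dJ/du * du/db = 3 * 2 = 6
dc = du * b      # dJ/dc = dJ/du * du/dc = 3 * 3 = 9
print(J, da, db, dc)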
Logistic Regression Gradient Descent
 Suppose we only have two features \(x_1\) and \(x_2\)
 Applying backward propagation to the logistic regression formulas laid out as a computation graph, we can compute the derivatives and make the corresponding updates to the weights and the bias
 In this video, what is the simplified formula for the derivative of the loss with respect to z? A: $a - y$
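Filling in the step behind that answer (a short derivation using the loss and sigmoid defined earlier): since $a = \sigma(z)$, we have $\frac{da}{dz} = a(1-a)$ and $\frac{d\mathcal{L}}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$, so by the chain rule $\frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{da}\frac{da}{dz} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a) = -y(1-a) + (1-y)a = a - y$.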
Gradient Descent on \(m\) examples
 Cost function is the mean of the loss function on all the m examples
 Let \(J=0, dw_1=0, dw_2=0, db=0\)
 For $i=1$ to $m$:
 $z^{(i)} = w^Tx^{(i)}+b$
 $a^{(i)} = \sigma(z^{(i)})$
 $\mathcal{J} += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log (1-a^{(i)})]$
 $dz^{(i)} = a^{(i)} - y^{(i)}$
 $dw_1 += x_1^{(i)} dz^{(i)}$
 $dw_2 += x_2^{(i)} dz^{(i)}$
 $db += dz^{(i)}$
 $J /= m$
 $dw_1 /= m$
 $dw_2 /= m$
 $db /= m$
 $w_1 := w_1 - \alpha\, dw_1$
 $w_2 := w_2 - \alpha\, dw_2$
 $b := b - \alpha\, db$
 The derivatives are accumulated over the loop and only divided by m at the end, which computes their mean
 Explicit for loops like this are bad for parallelization; now that vectorization is readily available, the guideline is to avoid explicit loops whenever possible (a NumPy sketch of this loop-based version follows below)
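A minimal sketch of the loop-based version above, assuming two features; the toy data X, Y and the learning rate are made up for illustration:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hypothetical toy data: n_x = 2 features, m = 4 examples (values are made up)
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [0.3, -0.7, 2.0, 1.5]])   # shape (2, m)
Y = np.array([0, 1, 0, 1])              # labels, shape (m,)
m = X.shape[1]
w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.01

J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):
    z = w1 * X[0, i] + w2 * X[1, i] + b   # z^(i) = w^T x^(i) + b
    a = sigmoid(z)                        # a^(i) = sigma(z^(i))
    J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
    dz = a - Y[i]                         # dz^(i) = a^(i) - y^(i)
    dw1 += X[0, i] * dz
    dw2 += X[1, i] * dz
    db += dz
J /= m; dw1 /= m; dw2 /= m; db /= m       # average over the m examples
w1 -= alpha * dw1                         # one step of gradient descent
w2 -= alpha * dw2
b -= alpha * db
print(J, w1, w2, b)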
Vectorization
Vectorization is basically the art of getting rid of explicit for loops in your code. – Andrew Ng
 What is vectorization? It is the ability to remove explicit for loops from your code and implement the computation mostly with matrix (or tensor) operations
import numpy as np
a = np.array([1, 2, 3, 4])
print(a)
#Vectorized
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print(c)
print("Vectorized version: " + str(1000*(toc - tic)) + " ms") # about 1.5 ms
# Non-vectorized
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print(c)
print("For loop: " + str(1000*(toc - tic)) + " ms") # about 400~500 ms
 The rule of thumb to remember: whenever possible, avoid using explicit for loops
 Vectorization can be done without a GPU.
More vectorization examples
 Use built-in functions and avoid explicit for loops (this guideline also applies largely to NumPy and pandas); a short example follows after the pseudocode below
 Let \(J=0, db=0\)
 $dw = np.zeros((n_x, 1))$
 For $i=1$ to $m$:
 $z^{(i)} = w^Tx^{(i)}+b$
 $a^{(i)} = \sigma(z^{(i)})$
 $\mathcal{J} += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log (1-a^{(i)})]$
 $dz^{(i)} = a^{(i)} - y^{(i)}$
 $dw += x^{(i)} dz^{(i)}$
 $db += dz^{(i)}$
 $J /= m$
 $dw /= m$
 $db /= m$
 $w := w - \alpha\, dw$
 $b := b - \alpha\, db$
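As promised after the "use built-in functions" guideline above, a small sketch of built-in (vectorized) NumPy functions replacing explicit element-wise loops; v is just an arbitrary example vector:
import numpy as np

v = np.array([1.0, -2.0, 3.0, -4.0])  # arbitrary example vector
print(np.exp(v))          # element-wise e^v
print(np.log(np.abs(v)))  # element-wise log of |v|
print(np.maximum(v, 0))   # element-wise max(v, 0)
print(v ** 2)             # element-wise square
print(1 / v)              # element-wise reciprocal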
Vectorizing Logistic Regression
 This finally uses the X and Y matrices discussed at the beginning of the lesson
 Z = np.dot(w.T, X) + b
 A = $\sigma(Z)$
Vectorizing Logistic Regression’s Gradient Output
 Find $dZ = [dz^{(1)}, dz^{(2)}, \dots, dz^{(m)}]$
 $A = [a^{(1)}, a^{(2)}, \dots, a^{(m)}]$
 $Y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}]$
 $dZ = A - Y = [a^{(1)} - y^{(1)}, a^{(2)} - y^{(2)}, \dots, a^{(m)} - y^{(m)}]$
 $db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)} = \frac{1}{m} np.sum(dZ)$

$dw = \frac{1}{m} X dZ^T $
 Now our algorithm looks something like this:
 Forward pass
 Z = np.dot(w.T, X) + b
 A = $\sigma(Z)$
 Backpropagation pass
 $dZ = A - Y$
 $dw = \frac{1}{m} X dZ^T$
 $db = \frac{1}{m} np.sum(dZ)$
 Weights update pass
 $w := w - \alpha\, dw$
 $b := b - \alpha\, db$
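Putting the three passes together, a minimal sketch of a full training loop in NumPy (the random data, learning rate, and iteration count are made-up placeholders, not course values):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hypothetical data: n_x features, m examples
n_x, m = 3, 5
X = np.random.randn(n_x, m)             # X has shape (n_x, m)
Y = (np.random.rand(1, m) > 0.5) * 1.0  # Y has shape (1, m)
w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.1

for it in range(1000):
    # forward pass
    Z = np.dot(w.T, X) + b              # shape (1, m)
    A = sigmoid(Z)
    # backward pass
    dZ = A - Y                          # shape (1, m)
    dw = np.dot(X, dZ.T) / m            # shape (n_x, 1)
    db = np.sum(dZ) / m
    # weights update
    w = w - alpha * dw
    b = b - alpha * db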
Broadcasting in Python
import numpy as np
A = np.array([
[56.0, 0.0, 4.4, 68.0],
[1.2, 104.0, 52.0, 0.0],
[1.8, 135.0, 99.0, 0.9]
])
print(A)
cal = A.sum(axis=0)
print(cal)
percentage = 100*A/cal.reshape(1, 4)
print(percentage)
 Given an (m, n) matrix, an element-wise operation, and an operand of shape (1, n) or (m, 1), the operand is expanded (broadcast) to (m, n) by copying and the operation is executed element-wise
 Given an (m, 1) matrix (or a vector) and a scalar factor, the scalar is expanded to (m, 1) by copying and the operation is executed element-wise
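A couple of minimal examples of those two rules (the arrays are arbitrary):
import numpy as np
M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3), broadcast to (2, 3)
col = np.array([[100.0], [200.0]])     # shape (2, 1), broadcast to (2, 3)
print(M + row)                         # row added to every row of M
print(M * col)                         # each row of M scaled by its entry of col
v = np.array([[1.0], [2.0], [3.0]])    # shape (3, 1)
print(v + 100)                         # scalar broadcast to (3, 1)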
A note on python/numpy vectors
import numpy as np
a = np.random.rand(5)
a.shape # (5, ), rank 1 array
a.T # looks the same as a, since a.T.shape == (5,)
np.dot(a, a.T) # returns a single number (inner product)
a = np.random.rand(5, 1)
a.shape # (5, 1)
a.T # now a proper row vector, a.T.shape == (1, 5)
np.dot(a, a.T) # returns a (5, 5) matrix (outer product)
#(5, 1) = column vector
#(1, 5) = row vector
# To avoid rank-1 arrays you can use assert
assert(a.shape == (5, 1))
# or reshape
a = a.reshape((5, 1))
Quiz
 What does a neuron compute?
 A neuron computes an activation function followed by a linear function (z = Wx + b)
 A neuron computes the mean of all features before applying the output to an activation function
 A neuron computes a linear function (z = Wx + b) followed by an activation function
 A neuron computes a function g that scales the input x linearly (Wx + b)
 Which of these is the “Logistic Loss”?
 $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|$
 $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = \max(0, y^{(i)} - \hat{y}^{(i)})$
 $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|^2$
 $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -(y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}))$
 Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?
 x = img.reshape((1, 32*32, 3))
 x = img.reshape((3,32*32))
 x = img.reshape((32*32,3))
 x = img.reshape((32*32*3, 1))
 Consider the two following random arrays “a” and “b”:
a = np.random.randn(2, 3) # a.shape = (2, 3)
b = np.random.randn(2, 1) # b.shape = (2, 1)
c = a + b
What will be the shape of “c”?
 c.shape = (3, 2)
 c.shape = (2, 3)
 c.shape = (2, 1)
 The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
 Consider the two following random arrays “a” and “b”:
a = np.random.randn(4, 3) # a.shape = (4, 3)
b = np.random.randn(3, 2) # b.shape = (3, 2)
c = a*b
What will be the shape of “c”?
 c.shape = (4, 3)
 c.shape = (4,2)
 c.shape = (3, 3)
 The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
 Suppose you have $n_x$ input features per example. Recall that X = $[x^{(1)} x^{(2)} … x^{(m)}]$. What is the dimension of X?
 ($n_x$, m)
 (1,m)
 (m,$n_x$)
 (m,1)
 Recall that “np.dot(a,b)” performs a matrix multiplication on a and b, whereas “a*b” performs an elementwise multiplication. Consider the two following random arrays “a” and “b”:
a = np.random.randn(12288, 150) # a.shape = (12288, 150)
b = np.random.randn(150, 45) # b.shape = (150, 45)
c = np.dot(a,b)
What is the shape of c?
 The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
 c.shape = (12288, 45)
 c.shape = (12288, 150)
 c.shape = (150,150)
 Consider the following code snippet:
import numpy as np
# a.shape = (3,4) and b.shape = (4,1)
for i in range(3):
    for j in range(4):
        c[i][j] = a[i][j] + b[j]
How do you vectorize this?
 c = a.T + b.T
 c = a + b
 c = a + b.T
 c = a.T + b
 Consider the following code:
a = np.random.randn(3, 3)
b = np.random.randn(3, 1)
c = a*b
What will be c? (If you’re not sure, feel free to run this in python to find out).
 This will invoke broadcasting, so b is copied three times to become (3,3), and * is an elementwise product so c.shape will be (3, 3)
 This will invoke broadcasting, so b is copied three times to become (3, 3), and * invokes a matrix multiplication operation of two 3x3 matrices so c.shape will be (3, 3)
 This will multiply a 3x3 matrix a with a 3x1 vector, thus resulting in a 3x1 vector. That is, c.shape = (3,1).
 It will lead to an error since you cannot use “*” to operate on these two matrices. You need to instead use np.dot(a,b)
 Consider the following computation graph. What is the output J?
 J = (c - 1)*(b + a)
 J = (a - 1) * (b + c)
 J = a*b + b*c + a*c
 J = (b - 1) * (c + a)