
Neural Networks

Week 2 - Neural Networks Basics

Binary Classification

Given a pair \((x, y)\) where \(x \in \mathbb{R}^{n_x}, y \in \{0, 1\}\). With \(m\) training examples you would have \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}\). Stacking them together gives the matrices X and Y below, which make batch training much easier since many vectorized optimizations are readily available: \(X \in \mathbb{R}^{n_x \times m}, Y \in \mathbb{R}^{1 \times m}\)

Logistic Regression

  • Given $x$, we want $\hat{y} = P(y=1 \mid x)$
  • $x \in \mathbb{R}^{n_x}, 0 \leq \hat{y} \leq 1$
  • Parameters: $ w \in \mathbb{R}^{n_x}, b \in \mathbb{R} $
  • Output $\hat{y} = \sigma(w^Tx + b)$
  • $z = w^Tx + b$
  • $\sigma(z) = \frac{1}{1+e^{-z}}$
  • If z is large (\(\displaystyle\lim_{z\to\infty}\)) then $\sigma(z) \approx \frac{1}{1+0} = 1$
  • If z is a large negative number (\(\displaystyle\lim_{z\to-\infty}\)) then $\sigma(z) \approx \frac{1}{1+\infty} \approx 0$
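
A minimal numpy sketch of the sigmoid above (the function name is my own choice):

	import numpy as np

	def sigmoid(z):
		# sigma(z) = 1 / (1 + e^(-z)); works on scalars and numpy arrays
		return 1 / (1 + np.exp(-z))

	print(sigmoid(0))    # 0.5
	print(sigmoid(10))   # close to 1 (large positive z)
	print(sigmoid(-10))  # close to 0 (large negative z)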

Logistic Regression Cost Function

  • Given \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}\), we want $ \hat{y}^{(i)} \approx y^{(i)}$
  • Squared Error: $\mathcal{L}(\hat{y}, y) = \frac{(\hat{y} - y)^2}{2}$ is not a good option because, combined with the sigmoid, it makes the optimization problem non-convex, so gradient descent can get stuck in local optima
  • $\mathcal{L}(\hat{y}, y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$
  • If $y=1: \mathcal{L}(\hat{y}, y) = -\log\hat{y}$ <- we want $\log\hat{y}$ to be large, meaning we want $\hat{y}$ to be large (close to 1)
  • If $y=0: \mathcal{L}(\hat{y}, y) = -\log(1-\hat{y})$ <- we want $\log(1-\hat{y})$ to be large, meaning we want $\hat{y}$ to be small (close to 0)
  • Cost function: $\mathcal{J}(w, b) = \frac{1}{m} \sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$, in other words, the cost function is the average of the loss on the given examples
  • Loss function applies to a single training example
  • Cost function applies to all the examples
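
A short numpy sketch of the loss and cost above (function and variable names are my own; `Y_hat` and `Y` are assumed to have shape (1, m)):

	import numpy as np

	def loss(y_hat, y):
		# Loss for a single example: -(y*log(y_hat) + (1-y)*log(1-y_hat))
		return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

	def cost(Y_hat, Y):
		# Cost = average of the loss over all m examples
		m = Y.shape[1]
		return np.sum(loss(Y_hat, Y)) / m

	Y = np.array([[1, 0, 1]])
	Y_hat = np.array([[0.9, 0.2, 0.7]])
	print(cost(Y_hat, Y))  # roughly 0.23 for these made-up predictions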

Gradient Descent

  • We want to find $w, b$ that minimize $\mathcal{J}(w, b)$
  • This cost function is convex, so it has a single (global) minimum
  • Gradient Descent: $w := w - \alpha \frac{\partial\mathcal{J}(w, b)}{\partial w}$ and $b := b - \alpha \frac{\partial\mathcal{J}(w, b)}{\partial b}$ where $\alpha$ is the learning rate
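
A toy numeric illustration of the update rule, minimizing the simple convex function $J(w) = (w - 3)^2$ (my own example, not from the lecture):

	# dJ/dw = 2*(w - 3), so gradient descent should drive w towards 3
	w = 0.0
	alpha = 0.1
	for _ in range(100):
		dw = 2 * (w - 3)
		w = w - alpha * dw
	print(w)  # very close to 3, the minimum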

Derivatives

  • Intuition about derivatives: $f(a) = 3a$ the slope of $f(a)$ at $a=2$ is 3 and at $a=5$ is 3 as well. Therefore $\frac{df(a)}{da} = 3 = \frac{d}{da}f(a)$
  • For a straight line the slope doesn’t change, therefore the derivative is constant

More Derivative Examples

  • $f(a) = a^2$ the slope of $f(a)$ at $a=2$ is 4 and at $a=5$ is 10. Therefore $\frac{df(a)}{da} = 2a $
  • The slope is different for different values of $a$ because the function doesn’t increase linearly.
  • $f(a) = a^3$ has derivative as $\frac{df(a)}{da} = 3a^2 $
  • $f(a) = \ln(a)$ has derivative as $\frac{df(a)}{da} = \frac{1}{a} $
  • The derivative means the slope of the function
  • The slope of the function can be different on different points of the function
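
These slopes can be checked numerically with a small finite difference (the helper below and the step size 1e-6 are my own choices):

	import numpy as np

	def slope(f, a, eps=1e-6):
		# Approximate derivative of f at a using a finite difference
		return (f(a + eps) - f(a)) / eps

	print(slope(lambda a: a**2, 2))  # ~4  (2a at a=2)
	print(slope(lambda a: a**2, 5))  # ~10 (2a at a=5)
	print(slope(lambda a: a**3, 2))  # ~12 (3a^2 at a=2)
	print(slope(np.log, 2))          # ~0.5 (1/a at a=2)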

Computation Graph

  • $\mathcal{J} = 3(a+bc)$
  • $ u = bc$
  • $ v = a+u$
  • $\mathcal{J} = 3v$
  • One step of backward propagation on a computation graph yields the derivative of the final output variable.
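
A forward pass through this graph in code; the input values a=5, b=3, c=2 are assumed from the lecture example (they are consistent with the derivatives in the next section):

	# Forward pass through the computation graph J = 3(a + b*c)
	a, b, c = 5, 3, 2
	u = b * c   # 6
	v = a + u   # 11
	J = 3 * v   # 33
	print(J)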

Derivatives with a Computation Graph

  • $\frac{d\mathcal{J}}{dv} = 3$
  • \(\frac{d\mathcal{J}}{da} = \frac{d\mathcal{J}}{dv} \frac{dv}{da} = 3 * 1 = 3\) (chain rule)
  • $\frac{d\mathcal{J}}{du} = \frac{d\mathcal{J}}{dv} \frac{dv}{du} = 3 * 1 = 3$ (chain rule)
  • $\frac{d\mathcal{J}}{db} = \frac{d\mathcal{J}}{du} \frac{du}{db} = 3 * 2 = 6$ (chain rule)
  • $\frac{d\mathcal{J}}{dc} = \frac{d\mathcal{J}}{du} \frac{du}{dc} = 3 * 3 = 9$ (chain rule)
  • The most efficient way to compute all the derivatives is to flow from right to left through the graph (backward)
  • In this class, what does the coding convention dvar represent?
    • The derivative of any variable used in the code.
    • The derivative of input variables with respect to various intermediate quantities.
    • The derivative of a final output variable with respect to various intermediate quantities.
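
The chain-rule results above can be verified numerically by nudging each input and watching how $\mathcal{J}$ changes (same assumed values a=5, b=3, c=2):

	def J(a, b, c):
		return 3 * (a + b * c)

	eps = 1e-6
	a, b, c = 5, 3, 2
	print((J(a + eps, b, c) - J(a, b, c)) / eps)  # ~3 = dJ/da
	print((J(a, b + eps, c) - J(a, b, c)) / eps)  # ~6 = dJ/db
	print((J(a, b, c + eps) - J(a, b, c)) / eps)  # ~9 = dJ/dc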

Logistic Regression Gradient Descent

  • Suppose we only have two features \(x_1\) and \(x_2\)
  • Using backward propagation on the logistic regression computation graph, we can compute the corresponding updates to the weights and the bias
  • In this video, what is the simplified formula for the derivative of the loss with respect to $z$? A: $dz = a - y$ (a worked sketch follows this list)
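
A minimal sketch of one gradient step for a single example with two features (all values below are made up for illustration):

	import numpy as np

	def sigmoid(z):
		return 1 / (1 + np.exp(-z))

	# One training example with two features
	x1, x2, y = 1.0, 2.0, 1
	w1, w2, b = 0.1, -0.2, 0.0
	alpha = 0.01

	# Forward pass
	z = w1 * x1 + w2 * x2 + b
	a = sigmoid(z)

	# Backward pass: dL/dz simplifies to a - y
	dz = a - y
	dw1 = x1 * dz
	dw2 = x2 * dz
	db = dz

	# Gradient descent update
	w1 = w1 - alpha * dw1
	w2 = w2 - alpha * dw2
	b = b - alpha * db
	print(w1, w2, b)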

Gradient Descent on \(m\) examples

  • Cost function is the mean of the loss function on all the m examples
  • Let \(J=0, dw_1=0, dw_2=0, db=0\)
  • For $i=1$ to $m$:
    • $z^{(i)} = w^Tx^{(i)}+b$
    • $a^{(i)} = \sigma(z^{(i)})$
    • $\mathcal{J} += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log (1-a^{(i)})]$
    • $dz^{(i)} = a^{(i)} - y^{(i)}$
    • $dw_1 += x_1^{(i)} dz^{(i)}$
    • $dw_2 += x_2^{(i)} dz^{(i)}$
    • $db += dz^{(i)}$
  • $J /= m$
  • $dw_1 /= m$
  • $dw_2 /= m$
  • $db /= m$
  • $w_1 := w_1 - \alpha dw_1$
  • $w_2 := w_2 - \alpha dw_2$
  • $b := b - \alpha db$
  • The derivatives are accumulated over the loop and divided by $m$ at the end, which computes the mean
  • Explicit for loops are bad for parallelization; with vectorization available, loops like this should be avoided whenever possible (a runnable version of this loop is sketched below)
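
A direct, deliberately non-vectorized numpy translation of the loop above (the toy data and learning rate are my own):

	import numpy as np

	def sigmoid(z):
		return 1 / (1 + np.exp(-z))

	# Toy data: two features, m = 3 examples (made-up values)
	x1 = np.array([1.0, 2.0, -1.0])
	x2 = np.array([0.5, -0.5, 1.5])
	y = np.array([1, 0, 1])
	m = len(y)

	w1, w2, b = 0.0, 0.0, 0.0
	alpha = 0.1

	J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
	for i in range(m):
		z = w1 * x1[i] + w2 * x2[i] + b
		a = sigmoid(z)
		J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
		dz = a - y[i]
		dw1 += x1[i] * dz
		dw2 += x2[i] * dz
		db += dz
	J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m

	# One gradient descent step
	w1 -= alpha * dw1
	w2 -= alpha * dw2
	b -= alpha * db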

Vectorization

Vectorization is basically the art of getting rid of explicit for loops in your code. – Andrew Ng

  • What is vectorization? It is the ability to remove the for loops from your code and implement the computation mostly with matrix (or tensor) operations
	import numpy as np
	a = np.array([1, 2, 3, 4])
	print(a)
	# Vectorized
	import time
	a = np.random.rand(1000000)  # one million random numbers
	b = np.random.rand(1000000)
	tic = time.time()
	c = np.dot(a, b)
	toc = time.time()
	print(c)
	print("Vectorized version: " + str(1000*(toc-tic)) + "ms")  # about 1.5ms
	# Non-vectorized
	c = 0
	tic = time.time()
	for i in range(1000000):
		c += a[i]*b[i]
	toc = time.time()
	print(c)
	print("For loop: " + str(1000*(toc-tic)) + "ms")  # about 400~500ms
  • The rule of thumb to remember is: whenever possible, avoid using explicit for loops
  • Vectorization can be done without a GPU.

More vectorization examples

  • Use built-in functions and avoid explicit for loops (This guideline also applies largely to numpy and pandas)
  • Let \(J=0, db=0\)
  • $dw = np.zeros((n_x, 1))$
  • For $i=1$ to $m$:
    • $z^{(i)} = w^Tx^{(i)}+b$
    • $a^{(i)} = \sigma(z^{(i)})$
    • $\mathcal{J} += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log (1-a^{(i)})]$
    • $dz^{(i)} = a^{(i)} - y^{(i)}$
    • $dw += x^{(i)} dz^{(i)}$
    • $db += dz^{(i)}$
  • $J /= m$
  • $dw /= m$
  • $db /= m$
  • $w := w - \alpha dw$
  • $b := b - \alpha db$

Vectorizing Logistic Regression

  • Finally we use the $X$ and $Y$ matrices discussed at the beginning of the lesson
  • Z = np.dot(w.T, X) + b
  • A = $\sigma(Z)$

Vectorizing Logistic Regression’s Gradient Output

  • We want $dZ = [dz^{(1)}, dz^{(2)}, \dots, dz^{(m)}]$
  • $A = [a^{(1)}, a^{(2)}, \dots, a^{(m)}]$
  • $Y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}]$
  • $dZ = A-Y = [a^{(1)} - y^{(1)}, a^{(2)} - y^{(2)}, \dots, a^{(m)} - y^{(m)}]$
  • $db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)} = \frac{1}{m} np.sum(dZ)$
  • $dw = \frac{1}{m} X dZ^T $

  • Now our algorithm looks like this:
  • Forward pass
  • Z = np.dot(w.T, X) + b
  • A = $\sigma(Z)$
  • Backpropagation pass
  • $dZ = A-Y$
  • $dw = \frac{1}{m} X dZ^T$
  • $db = \frac{1}{m} np.sum(dZ)$
  • Weights update pass
  • $w := w - \alpha dw$
  • $b := b - \alpha db$
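
Putting the whole thing together, a minimal vectorized training loop (the toy data, number of iterations, and learning rate are my own choices, not from the course):

	import numpy as np

	def sigmoid(z):
		return 1 / (1 + np.exp(-z))

	# Toy data: n_x = 2 features, m = 4 examples (made-up values)
	X = np.array([[1.0, 2.0, -1.0, 0.5],
	              [0.5, -0.5, 1.5, -2.0]])  # shape (n_x, m)
	Y = np.array([[1, 0, 1, 0]])            # shape (1, m)
	n_x, m = X.shape

	w = np.zeros((n_x, 1))
	b = 0.0
	alpha = 0.1

	for iteration in range(1000):
		# Forward pass
		Z = np.dot(w.T, X) + b  # shape (1, m)
		A = sigmoid(Z)
		J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
		# Backward pass
		dZ = A - Y
		dw = np.dot(X, dZ.T) / m  # shape (n_x, 1)
		db = np.sum(dZ) / m
		# Weight update
		w = w - alpha * dw
		b = b - alpha * db

	print(J, w.ravel(), b)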

Broadcasting in Python

import numpy as np
A = np.array([
	[56.0, 0.0, 4.4, 68.0],
	[1.2, 104.0, 52.0, 0.0],
	[1.8, 135.0, 99.0, 0.9]
	])
print(A)
cal = A.sum(axis=0)
print(cal)
percentage = 100*A/cal.reshape(1, 4)
print(percentage)
  • Given a matrix of shape (m, n), an element-wise operation, and a second operand of shape (1, n) or (m, 1), the second operand is expanded to (m, n) by copying and the operation is executed element-wise
  • Given a matrix of shape (m, 1) (or a vector) and a scalar, the scalar is expanded to (m, 1) by copying and the operation is executed element-wise
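
A couple of small examples of those broadcasting rules (all values are arbitrary):

	import numpy as np

	M = np.array([[1.0, 2.0, 3.0],
	              [4.0, 5.0, 6.0]])       # shape (2, 3)
	row = np.array([[10.0, 20.0, 30.0]])  # shape (1, 3)
	col = np.array([[100.0], [200.0]])    # shape (2, 1)

	print(M + row)  # row is copied down to shape (2, 3), then added element-wise
	print(M + col)  # col is copied across to shape (2, 3), then added element-wise
	print(col + 5)  # the scalar 5 is expanded to shape (2, 1), then added element-wise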

A note on python/numpy vectors

	import numpy as np
	a = np.random.rand(5)
	a.shape  # (5,), a rank 1 array
	a.T  # looks the same as a; a.T.shape == (5,)
	np.dot(a, a.T)  # returns a single number (inner product)

	a = np.random.rand(5, 1)
	a.shape  # (5, 1)
	a.T  # not the same as a; a.T.shape == (1, 5)
	np.dot(a, a.T)  # returns a (5, 5) matrix (outer product)

	# (5, 1) = column vector
	# (1, 5) = row vector

	# To avoid rank 1 arrays you can use assert
	assert(a.shape == (5, 1))
	# or reshape (reshape returns a new array, it does not modify a in place)
	a = a.reshape((1, 5))

Quiz

  1. What does a neuron compute?
    • A neuron computes an activation function followed by a linear function (z = Wx + b)
    • A neuron computes the mean of all features before applying the output to an activation function
    • A neuron computes a linear function (z = Wx + b) followed by an activation function
    • A neuron computes a function g that scales the input x linearly (Wx + b)
  2. Which of these is the “Logistic Loss”?
    • $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|$
    • $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = \max(0, y^{(i)} - \hat{y}^{(i)})$
    • $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|^2$
    • $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -(y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}))$
  3. Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?
    • x = img.reshape((1,32*32,3))
    • x = img.reshape((3,32*32))
    • x = img.reshape((32*32,3))
    • x = img.reshape((32*32*3,1))
  4. Consider the two following random arrays “a” and “b”:
    a = np.random.randn(2, 3) # a.shape = (2, 3)
    b = np.random.randn(2, 1) # b.shape = (2, 1)
    c = a + b
    

    What will be the shape of “c”?

    • c.shape = (3, 2)
    • c.shape = (2, 3)
    • c.shape = (2, 1)
    • The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
  5. Consider the two following random arrays “a” and “b”:
    a = np.random.randn(4, 3) # a.shape = (4, 3)
    b = np.random.randn(3, 2) # b.shape = (3, 2)
    c = a*b
    

    What will be the shape of “c”?

    • c.shape = (4, 3)
    • c.shape = (4,2)
    • c.shape = (3, 3)
    • The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
  6. Suppose you have $n_x$ input features per example. Recall that X = $[x^{(1)} x^{(2)} … x^{(m)}]$. What is the dimension of X?
    • ($n_x$, m)
    • (1,m)
    • (m,$n_x$)
    • (m,1)
  7. Recall that “np.dot(a,b)” performs a matrix multiplication on a and b, whereas “a*b” performs an element-wise multiplication. Consider the two following random arrays “a” and “b”:
    a = np.random.randn(12288, 150) # a.shape = (12288, 150)
    b = np.random.randn(150, 45) # b.shape = (150, 45)
    c = np.dot(a,b)
    

    What is the shape of c?

    • The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
    • c.shape = (12288, 45)
    • c.shape = (12288, 150)
    • c.shape = (150,150)
  8. Consider the following code snippet:
    import numpy as np # a.shape = (3,4) and b.shape = (4,1)
    for i in range(3):
        for j in range(4):
            c[i][j] = a[i][j] + b[j]
    

    How do you vectorize this?

    • c = a.T + b.T
    • c = a + b
    • c = a + b.T
    • c = a.T + b
  9. Consider the following code:
    a = np.random.randn(3, 3)
    b = np.random.randn(3, 1)
    c = a*b
    

    What will be c? (If you’re not sure, feel free to run this in python to find out).

    • This will invoke broadcasting, so b is copied three times to become (3,3), and * is an element-wise product so c.shape will be (3, 3)
    • This will invoke broadcasting, so b is copied three times to become (3, 3), and * invokes a matrix multiplication operation of two 3x3 matrices so c.shape will be (3, 3)
    • This will multiply a 3x3 matrix a with a 3x1 vector, thus resulting in a 3x1 vector. That is, c.shape = (3,1).
    • It will lead to an error since you cannot use “*” to operate on these two matrices. You need to instead use np.dot(a,b)
  10. Consider the following computation graph. What is the output J?
    • J = (c - 1)*(b + a)
    • J = (a - 1) * (b + c)
    • J = a*b + b*c + a*c
    • J = (b - 1) * (c + a)