## Week 4

### Deep L-Layer neural network

• $L = 4$ (# of layers)
• $n^{[l]} =$ # of units in layer $l$
• $n^{[0]} = 3 \text{ (input layer) }, n^{[1]} = 5, n^{[2]} = 5, n^{[3]} = 3, n^{[4]} = 1 \text{ (output layer) }$
• $a^{[l]}$ (activations in layer $l$)
• $a^{[l]} = g^{[l]}(z^{[l]}), w^{[l]} = \text{ weights for } z^{[l]}, b^{[l]} = \text{ bias for } z^{[l]}$
• $a^{[4]} = \hat{y}$

### Forward Propagation in a Deep Neural Network

• $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$ ($b^{[l]}$ is broadcast across the $m$ columns)
• $A^{[l]} = g^{[l]}(Z^{[l]})$, with $A^{[0]} = X$
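
A minimal vectorized forward pass in numpy, as a sketch: it assumes `parameters` maps `'W1'…'WL'` and `'b1'…'bL'` to arrays, and picks ReLU for the hidden layers and sigmoid for the output (the notes leave $g^{[l]}$ generic, so these activation choices are an assumption):

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def forward_propagation(X, parameters):
    """X has shape (n[0], m); parameters maps 'Wl'/'bl' to numpy arrays."""
    L = len(parameters) // 2          # number of layers (one W and one b per layer)
    A = X                             # A[0] = X
    caches = []
    for l in range(1, L + 1):
        A_prev = A
        # Z[l] = W[l] A[l-1] + b[l]; b[l] broadcasts across the m columns
        Z = parameters['W' + str(l)] @ A_prev + parameters['b' + str(l)]
        # A[l] = g[l](Z[l]); sigmoid at the output, ReLU elsewhere (assumption)
        A = sigmoid(Z) if l == L else relu(Z)
        caches.append((A_prev, Z))    # saved for the backward pass
    return A, caches
```

Each layer's inputs and $Z^{[l]}$ are cached here, which is what the "cache" in quiz question 1 below refers to.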

### Getting your matrix dimensions right

• $Z^{[l]}.shape = (n^{[l]}, m)$
• $W^{[l]}.shape = (n^{[l]}, n^{[l-1]})$
• $A^{[l]}.shape = (n^{[l]}, m)$ (so $A^{[0]} = X$ has shape $(n^{[0]}, m)$)
• $dW^{[l]}.shape = (n^{[l]}, n^{[l-1]})$
• $b^{[l]}.shape = db^{[l]}.shape = (n^{[l]}, 1)$ (broadcast across the $m$ columns)
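
A quick numpy sanity check of these shape rules, as a sketch using the example layer sizes $n^{[0]} = 3, 5, 5, 3, 1$ from above (the batch size `m = 10` is an arbitrary choice):

```python
import numpy as np

layer_dims = [3, 5, 5, 3, 1]   # n[0] .. n[4] from the example above
m = 10                          # arbitrary batch size

A = np.random.randn(layer_dims[0], m)                       # A[0] = X: (n[0], m)
for l in range(1, len(layer_dims)):
    W = np.random.randn(layer_dims[l], layer_dims[l - 1])   # (n[l], n[l-1])
    b = np.zeros((layer_dims[l], 1))                        # (n[l], 1), broadcast over m
    Z = W @ A + b
    assert Z.shape == (layer_dims[l], m)                    # (n[l], m)
    A = np.tanh(Z)                                          # any g[l] preserves the shape
    assert A.shape == (layer_dims[l], m)
```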

### Why deep representations

• Compositional representation: early layers detect simple features; deeper layers compose those simple features into increasingly complex ones, so a deep network can model much more complex functions of the input from simple building blocks
• Circuit theory: there are functions you can compute with a "small" $L$-layer deep neural network that shallower networks require exponentially more hidden units to compute. E.g. XOR detection of $n$ inputs: a deep tree of pairwise XORs (e.g. 3-2-1 units across layers) vs. a single hidden layer with on the order of $2^n$ neurons to map all combinations of the inputs (see the sketch below)
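
A toy sketch of the circuit-theory point (my own illustration, not from the lecture): parity/XOR of $n$ bits computed by a balanced tree of pairwise XORs uses $O(n)$ gates at $O(\log n)$ depth, while a depth-1 circuit must enumerate on the order of $2^n$ input patterns:

```python
from functools import reduce
from operator import xor

def parity_tree(bits):
    """XOR of all bits via a balanced tree: O(n) gates, O(log n) depth."""
    if len(bits) == 1:
        return bits[0]
    mid = len(bits) // 2
    return parity_tree(bits[:mid]) ^ parity_tree(bits[mid:])

bits = [1, 0, 1, 1, 0, 1, 0, 0]
assert parity_tree(bits) == reduce(xor, bits)  # same function, flat reduction
# A depth-1 (single-hidden-layer) network computing the same parity function
# needs on the order of 2^n hidden units, one per odd-parity input pattern.
```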

### Parameters vs Hyperparameters

• Parameters:
  • Weights $W^{[l]}$
  • Biases $b^{[l]}$
• Hyperparameters:
  • Learning rate $\alpha$ (or a learning-rate schedule $\alpha(t)$)
  • # of iterations
  • # of hidden units
  • Choice of activation function
  • Momentum
  • Mini-batch size
  • Regularization

### Quiz

1. What is the “cache” used for in our implementation of forward propagation and backward propagation?
• We use it to pass variables computed during backward propagation to the corresponding forward propagation step. It contains useful values for forward propagation to compute activations.
• We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.
• It is used to keep track of the hyperparameters that we are searching over, to speed up computation.
• It is used to cache the intermediate values of the cost function during training.
2. Among the following, which ones are “hyperparameters”? (Check all that apply.)
• activation values $a^{[l]}$
• number of iterations
• weight matrices $W^{[l]}$
• number of layers $L$ in the neural network
• learning rate $\alpha$
• size of the hidden layers $n^{[l]}$
• bias vectors $b^{[l]}$
3. Which of the following statements is true?
• The deeper layers of a neural network are typically computing more complex features of the input than the earlier layers
• The earlier layers of a neural network are typically computing more complex features of the input than the deeper layers
4. Vectorization allows you to compute forward propagation in an $L$-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers $l = 1, 2, \ldots, L$. True/False?
• True
• False
5. Assume we store the values for $n^{[l]}$ in an array called `layer_dims`, as follows: `layer_dims = [n_x, 4, 3, 2, 1]`. So layer 1 has 4 hidden units, layer 2 has 3 hidden units, and so on. Which of the following for-loops will allow you to initialize the parameters for the model?
• Option 1:
  ```python
  for i in range(1, len(layer_dims)/2):
      parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
      parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01
  ```
• Option 2:
  ```python
  for i in range(1, len(layer_dims)/2):
      parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
      parameter['b' + str(i)] = np.random.randn(layer_dims[i-1], 1) * 0.01
  ```
• Option 3:
  ```python
  for i in range(1, len(layer_dims)):
      parameter['W' + str(i)] = np.random.randn(layer_dims[i-1], layer_dims[i]) * 0.01
      parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01
  ```
• Option 4:
  ```python
  for i in range(1, len(layer_dims)):
      parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
      parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01
  ```
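
For reference, a runnable sketch of the initialization pattern these options describe, assuming numpy and a hypothetical `n_x = 2` (the quiz leaves `n_x` symbolic); the shapes follow the rules in "Getting your matrix dimensions right" above:

```python
import numpy as np

n_x = 2                         # hypothetical input size; the quiz leaves n_x symbolic
layer_dims = [n_x, 4, 3, 2, 1]  # [n[0], n[1], ..., n[L]] as in the question
parameters = {}
for i in range(1, len(layer_dims)):
    # W[i] has shape (n[i], n[i-1]), per the dimension rules above
    parameters['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i - 1]) * 0.01
    # b[i] has shape (n[i], 1); small random values as in the options
    # (initializing biases to zeros is also common)
    parameters['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01
```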
6. Consider the following neural network (figure omitted from these notes). How many layers does this network have?
• The number of layers L is 4. The number of hidden layers is 3.
• The number of layers L is 3. The number of hidden layers is 3.
• The number of layers L is 4. The number of hidden layers is 4.
• The number of layers L is 5. The number of hidden layers is 4.
7. During forward propagation, the forward function for layer $l$ needs to know which activation function (sigmoid, tanh, ReLU, etc.) that layer uses. During backpropagation, the corresponding backward function also needs to know the activation function for layer $l$, since the gradient depends on it. True/False?
• True
• False
8. There are certain functions with the following properties: (i) To compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) To compute it using a deep network circuit, you need only an exponentially smaller network. True/False?
• True
• False
9. Consider the following 2-hidden-layer neural network (figure omitted from these notes). Which of the following statements are true? (Check all that apply.)
• $W^{[1]}$ will have shape (4, 4)
• $b^{[1]}$ will have shape (4, 1)
• $W^{[1]}$ will have shape (3, 4)
• $b^{[1]}$ will have shape (3, 1)
• $W^{[2]}$ will have shape (3, 4)
• $b^{[2]}$ will have shape (1, 1)
• $W^{[2]}$ will have shape (3, 1)
• $b^{[2]}$ will have shape (3, 1)
• $W^{[3]}$ will have shape (3, 1)
• $b^{[3]}$ will have shape (1, 1)
• $W^{[3]}$ will have shape (1, 3)
• $b^{[3]}$ will have shape (3, 1)
10. Whereas the previous question used a specific network, in the general case what is the dimension of $W^{[l]}$, the weight matrix associated with layer $l$?
• $W^{[l]}$ has shape $(n^{[l-1]}, n^{[l]})$
• $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$
• $W^{[l]}$ has shape $(n^{[l]}, n^{[l+1]})$
• $W^{[l]}$ has shape $(n^{[l+1]}, n^{[l]})$