Last lecture focused on Feed Forward Neural Networks and the Universal Approximation Theorem. In today’s lecture we will learn about the Back Propagation Algorithm, the Gradient Descent Algorithm, the Stochastic Gradient Descent Algorithm, and types of functions.

** Back Propagation Algorithm **

It computes the gradient in time linear in the size of the network, i.e. the running time is O(V+E), where V is the number of nodes in the network and E is the number of edges.

Steps in the **Back Propagation Algorithm**

- Compute the partial derivative of the loss function with respect to the bias.
- Next, compute the partial derivative of the loss with respect to the weights.
- Similarly, compute the partial derivative with respect to the hidden units.
- Use the outputs of the equations above to compute the partial derivatives in the layer below.
The total time taken across all these steps is O(V+E), and the procedure works for any directed acyclic graph (DAG). For more on back propagation, refer to Chapter 6 of the book Deep Learning.
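
The steps above can be sketched for a one-hidden-layer network with a squared-error loss. This is a minimal NumPy sketch; the layer sizes and the sigmoid activation are illustrative assumptions, not specified in the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """One forward/backward pass. Each node and edge is visited a
    constant number of times, so the total cost is O(V + E)."""
    # Forward pass.
    h = sigmoid(W1 @ x + b1)          # hidden units
    y_hat = W2 @ h + b2               # linear output
    # Backward pass, starting from the loss L = 0.5 * ||y_hat - y||^2.
    delta_out = y_hat - y             # dL/d(output pre-activation)
    grad_b2 = delta_out               # partial derivative wrt output bias
    grad_W2 = np.outer(delta_out, h)  # partial derivative wrt output weights
    # Propagate to the layer below through the hidden units.
    delta_h = (W2.T @ delta_out) * h * (1 - h)  # dL/d(hidden pre-activation)
    grad_b1 = delta_h                 # partial derivative wrt hidden bias
    grad_W1 = np.outer(delta_h, x)    # partial derivative wrt hidden weights
    return grad_W1, grad_b1, grad_W2, grad_b2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
grads = backprop(x, y, W1, b1, W2, b2)
```

Note the order: bias and weight gradients at the output come first, then the hidden-unit gradient is reused for the layer below, which is exactly why no edge is ever revisited.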

** Types of Gradient Descent **

- Full Gradient Descent
- Stochastic Gradient Descent
- Mini Batch Stochastic Gradient Descent

** Gradient Descent **

The goal of a feed forward network is to approximate some function. One of the most widely used iterative algorithms for this is Gradient Descent.

Loss function: $L(w) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(w)$, where $\ell_i$ is the loss on the $i$-th training example.

Steps in Gradient Descent

- Update $w_{t+1} = w_t - \eta \nabla L(w_t)$, where $\eta$ is the learning rate.

The time taken for one full gradient descent step over the entire dataset is O(P(V+E)), where P is the number of training points. This is very slow to apply on the entire dataset, so we use Stochastic Gradient Descent to speed up the computation. Reference to Gradient Descent Proof

** Stochastic Gradient Descent **

In its simplest form, stochastic gradient descent updates the weights using the gradient of the loss on a single randomly chosen example: $w_{t+1} = w_t - \eta \nabla \ell_i(w_t)$, where $i$ is sampled uniformly at random.

Below are a few of the reasons for using stochastic gradient descent:

- Running full GD is too expensive.
- We are already minimizing a noisy proxy (the training loss) rather than the true objective, so adding noise doesn’t change much.
- We are minimizing a non-convex function and have to worry about saddle points. SGD is better at escaping saddle points. Reference to issues regarding saddle points in deep networks.
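
The saddle-point claim can be illustrated on a toy function. The function $f(x, y) = x^2 - y^2$ is my illustrative assumption, not from the lecture; its origin is a saddle point where the exact gradient vanishes, so full GD started there never moves, while the noise in a stochastic gradient pushes the iterate off the saddle:

```python
import numpy as np

def grad_f(w):
    """Gradient of f(x, y) = x^2 - y^2; the origin is a saddle point."""
    return np.array([2 * w[0], -2 * w[1]])

def descend(w0, eta=0.01, steps=500, noise=0.0, seed=0):
    """Gradient descent with optional Gaussian noise added to each gradient,
    mimicking the noise of a stochastic gradient estimate."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        g = grad_f(w) + noise * rng.normal(size=2)
        w = w - eta * g
    return w

w_gd = descend([0.0, 0.0])              # exact GD: stuck at the saddle
w_sgd = descend([0.0, 0.0], noise=0.1)  # noisy GD: drifts off along y
```

Once the noise gives the $y$-coordinate any nonzero value, the negative curvature amplifies it each step, which is the escape mechanism referenced above.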

In mini-batch gradient descent, the cost function (and therefore the gradient) is averaged over a small number of samples. The combination of mini-batching and stochastic gradient descent is commonly used in deep learning.
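
The mini-batch variant can be sketched on a least-squares problem. The synthetic data and the batch size of 8 are illustrative assumptions:

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.05, batch_size=8, epochs=200, seed=0):
    """Mini-batch SGD for least squares: the gradient is averaged over a
    small random batch instead of the full dataset."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Average gradient of 0.5 * ||Xb w - yb||^2 over the batch.
            grad = Xb.T @ (Xb @ w - yb) / len(batch)
            w -= eta * grad
    return w

# Synthetic noiseless data with known weights (illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true
w_hat = minibatch_sgd(X, y)
```

Each update costs only O(batch_size) gradient evaluations instead of O(P), which is the speed-up over full gradient descent described above.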

** Must Know Theorems **

** Theorem 1 **

If the function is convex and G-bounded (i.e. $\|\nabla f(x)\| \le G$ for all $x$), then after $T$ steps of gradient descent with a suitably chosen step size, the error satisfies $f(x) - f(x^*) \le O\!\left(\frac{G}{\sqrt{T}}\right)$.

** Theorem 2 **

If the function is convex and $\beta$-smooth (i.e. $\|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$), then after $T$ steps of gradient descent with step size $\eta = 1/\beta$, the error satisfies $f(x_T) - f(x^*) \le O\!\left(\frac{\beta}{T}\right)$.

** Notes **

1. Convex Functions

Critical Points: every critical point ($\nabla f(x) = 0$) is a global minimum.

2. Non Convex Functions

Critical Points: global min/max, local min/max, saddle point

If the function is twice differentiable, we can look at the Hessian matrix:

- local/global min: $H(x) \succeq 0$, positive semi-definite (all eigenvalues non-negative)
- local/global max: $H(x) \preceq 0$, negative semi-definite (all eigenvalues non-positive)
- saddle point: $H(x)$ has both negative and positive eigenvalues
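
The Hessian test above can be checked numerically. A minimal sketch; the example function $f(x, y) = x^2 - y^2$ is an illustrative assumption:

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the eigenvalues of its Hessian."""
    eig = np.linalg.eigvalsh(H)  # H is symmetric, so eigenvalues are real
    if np.all(eig >= 0):
        return "local/global min (positive semi-definite)"
    if np.all(eig <= 0):
        return "local/global max (negative semi-definite)"
    return "saddle point (mixed eigenvalue signs)"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at its critical point (0, 0).
H_saddle = np.diag([2.0, -2.0])
# f(x, y) = x^2 + y^2 has Hessian diag(2, 2) at its minimum.
H_min = np.diag([2.0, 2.0])
```

Note that a semi-definite Hessian with a zero eigenvalue is inconclusive on its own; the eigenvalue test here only applies the sufficient conditions listed above.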