# Deep Learning Basics II

Last lecture focused on Feed Forward Neural Networks and the Universal Approximation Theorem. In today's lecture we will learn about the Back Propagation Algorithm, the Gradient Descent Algorithm, the Stochastic Gradient Descent Algorithm, and types of functions.

## Back Propagation Algorithm

It computes the gradient in time linear in the size of the network, i.e. the running time is O(V+E), where V is the number of nodes in the network and E is the number of edges.

### Steps in the Back Propagation Algorithm

- Compute the partial derivative of the loss function with respect to the bias (note that ${\hat{y_i}}$ is linear in the bias, so ${\partial \hat{y_i}/\partial b_{l+1,1} = 1}$):

$\displaystyle \begin{array}{rcl} \frac{\partial l}{\partial b_{l+1,1}} &=& \frac{\partial (y_i-\hat{y_i})^2}{\partial b_{l+1,1}} \\ &=& \frac{\partial (y_i-\hat{y_i})^2}{\partial \hat{y_i}} \frac{\partial \hat{y_i}}{\partial b_{l+1,1}} \\ &=& -2(y_i-\hat{y_i}) \end{array}$

Next, we calculate the partial derivative of the loss with respect to the weights.

$\displaystyle \begin{array}{rcl} \frac{\partial l}{\partial w_{i,1}^{l+1}} &=& \frac{\partial (y_i-\hat{y_i})^2}{\partial w_{i,1}^{l+1}} \\ &=& \frac{\partial (y_i-\hat{y_i})^2}{\partial \hat{y_i}} \frac{\partial \hat{y_i}}{\partial w_{i,1}^{l+1}} \\ &=& -2(y_i-\hat{y_i})h_{l,i} \end{array}$

Similarly, we can compute the partial derivative with respect to a hidden unit:

$\displaystyle \begin{array}{rcl} \frac{\partial l}{\partial h_{l,i}} &=& -2(y_i-\hat{y_i})w_{i,1}^{l+1} \end{array}$

- We use the outputs of the equations above to compute the partial derivatives in the layer below:

$\displaystyle \begin{array}{rcl} \frac{\partial l}{\partial b_{l,1}} &=& \frac{\partial l}{\partial h_{l,1}} \frac{\partial h_{l,1}}{\partial b_{l,1}} \\ \frac{\partial l}{\partial w_{i,1}^{l-1}} &=& \frac{\partial l}{\partial h_{l,1}} \frac{\partial h_{l,1}}{\partial w_{i,1}^{l-1}} \\ \frac{\partial l}{\partial h_{l-1,i}} &=& \sum_{j=1}^{d} \frac{\partial l}{\partial h_{l,j}} \frac{\partial h_{l,j}}{\partial h_{l-1,i}} \end{array}$

The total time taken across all these steps is O(V+E), and the algorithm works for any directed acyclic graph (DAG). For more on back propagation, refer to Chapter 6 of the book Deep Learning.
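The steps above can be sketched in code. This is a minimal illustration, not the notes' exact setup: it assumes a one-hidden-layer network with identity activations and squared loss ${l = (y_i - \hat{y_i})^2}$, and all variable names (`W1`, `b1`, `w2`, `b2`) are illustrative. Each gradient is computed by reusing the derivative from the layer above, exactly as in the derivations, and one entry is checked against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input
y = 1.5                         # target
W1 = rng.normal(size=(4, 3))    # weights into the hidden layer
b1 = rng.normal(size=4)         # hidden biases
w2 = rng.normal(size=4)         # weights into the output unit
b2 = 0.3                        # output bias

# Forward pass.
h = W1 @ x + b1                 # hidden units h_{l,i}
y_hat = w2 @ h + b2             # prediction
loss = (y - y_hat) ** 2

# Backward pass: compute dl/dy_hat once, reuse it for every layer below.
dl_dyhat = -2.0 * (y - y_hat)
dl_db2 = dl_dyhat               # dl/db_{l+1,1}
dl_dw2 = dl_dyhat * h           # dl/dw_{i,1}^{l+1} = dl/dy_hat * h_{l,i}
dl_dh = dl_dyhat * w2           # dl/dh_{l,i}
dl_db1 = dl_dh                  # identity activation: dh/db = 1
dl_dW1 = np.outer(dl_dh, x)     # dl/dw_{i,j}^l = dl/dh_{l,i} * x_j

# Check one gradient entry against a finite difference.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = ((y - (w2 @ (W1p @ x + b1) + b2)) ** 2 - loss) / eps
print(abs(dl_dW1[0, 0] - num) < 1e-4)
```

Note that every quantity computed in the backward pass is used again one layer down, which is what makes the total cost O(V+E) rather than quadratic.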

## Mini-Batch Stochastic Gradient Descent

The goal of a feed-forward network is to approximate some function ${f^{*}}$. One of the most widely used iterative algorithms for doing so is Gradient Descent, which solves

$\displaystyle \begin{array}{rcl} \min_{x \in R^n} f(x) \end{array}$

Loss function (averaged over the P training samples):

$\displaystyle \begin{array}{rcl} L = \frac{1}{P} \sum_{i=1}^{P}L_i(y_i,\hat{y_i}) \end{array}$

1. ${x_0 = 0}$
2. ${x_{t+1} = x_t - \eta \nabla f(x_t)}$, where ${\eta}$ is the learning rate
3. ${b_{l,i}^{t+1} = b_{l,i}^{t} - \eta \frac{\partial l}{\partial b_{l,i}}}$
4. ${w_{i,j}^{l,t+1} = w_{i,j}^{l,t} - \eta \frac{\partial l}{\partial w_{i,j}^{l}}}$

A full gradient descent pass over all P samples takes O(P(V+E)) time. This is too slow for large datasets, so we use Stochastic Gradient Descent to speed up the computation. See the reference on the Gradient Descent proof.
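The update rule above can be sketched on a toy problem. This is a minimal example assuming the simple convex function ${f(x) = \|x - c\|^2}$, where the constant `c` and the step count are illustrative choices, not from the notes; the iterates converge to the minimizer.

```python
import numpy as np

def grad_f(x, c):
    return 2.0 * (x - c)        # gradient of ||x - c||^2

c = np.array([3.0, -1.0])       # minimizer of the toy objective
x = np.zeros(2)                 # step 1: x_0 = 0
eta = 0.1                       # learning rate
for t in range(200):
    x = x - eta * grad_f(x, c)  # step 2: x_{t+1} = x_t - eta * grad f(x_t)

print(x)  # close to [3, -1]
```

Here each step contracts the distance to the minimizer by a constant factor; the cost per step is one full gradient evaluation, which is what becomes O(P(V+E)) for a network trained on P samples.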

$\displaystyle \begin{array}{rcl} w_{i,j}^{l,t+1} &=& w_{i,j}^{l,t} - \eta \frac{\partial l_k}{\partial w_{i,j}^{l}}, \qquad \text{where } k \text{ is picked at random from the } P \text{ samples.} \end{array}$

Listed below are a few of the reasons for using stochastic gradient descent:

1. Running full GD is too expensive.
2. We are already minimizing a noisy empirical loss rather than the true underlying function, so adding sampling noise does not change much.
3. We are minimizing a non-convex function and have to worry about saddle points; SGD is better at escaping saddle points. See the reference on issues regarding saddle points in deep networks.

In mini-batch gradient descent, the cost function (and therefore the gradient) is averaged over a small number of samples. The combination of mini-batching and stochastic gradient descent is commonly used in deep learning.
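As a sketch of mini-batch SGD, the example below fits a least-squares linear model: each step averages the gradient over a small random batch of k samples instead of all P. All names and constants here (`P`, `k`, `w_true`, the noise level) are illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
P, d, k = 500, 3, 32                    # samples, dimension, batch size
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(P, d))
y = X @ w_true + 0.01 * rng.normal(size=P)   # noisy targets

w = np.zeros(d)
eta = 0.05
for t in range(2000):
    idx = rng.integers(0, P, size=k)    # pick a mini-batch at random
    Xb, yb = X[idx], y[idx]
    grad = -2.0 * Xb.T @ (yb - Xb @ w) / k   # batch-averaged gradient of (y - Xw)^2
    w = w - eta * grad

print(w)  # close to w_true
```

Each step costs O(k) gradient evaluations instead of O(P), while the batch average keeps the gradient noise small enough for stable progress.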

## Must-Know Theorems

### Theorem 1

If the function is convex and G-bounded (i.e. ${\| \nabla f(x) \| \leq G}$), then after T steps

$\displaystyle \begin{array}{rcl} \frac{1}{T}\sum_{t=1}^{T} f(x_t)-f(x^*) &\leq& \frac{RG}{\sqrt{T}} \leq \epsilon, \qquad \text{where } R = \|x^*\|,\ \eta = \frac{R}{G\sqrt{T}} \end{array}$
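The bound can be checked numerically (a sketch, not a proof). The example assumes the convex function ${f(x) = (x-a)^2}$, whose gradient satisfies ${\|\nabla f(x)\| \leq G = 2|a|}$ over the region the iterates visit, so f is G-bounded there; the constants `a` and `T` are illustrative.

```python
import numpy as np

a, T = 1.5, 400
R = abs(a)                      # R = ||x*||; the minimizer is x* = a (from x_0 = 0)
G = 2.0 * abs(a)                # gradient bound along the trajectory [0, a]
eta = R / (G * np.sqrt(T))      # the step size from the theorem

def f(x):
    return (x - a) ** 2

x, vals = 0.0, []
for t in range(T):
    vals.append(f(x))
    x = x - eta * 2.0 * (x - a)     # gradient step

gap = np.mean(vals) - f(a)      # (1/T) * sum_t f(x_t) - f(x*)
print(gap <= R * G / np.sqrt(T))    # True: gap ~ 0.058, bound = 0.225
```

On this instance the averaged suboptimality comes in well under the RG/√T guarantee, which is a worst-case bound over all G-bounded convex functions.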

### Theorem 2

If the function is convex and ${\beta}$-smooth (i.e. ${\| \nabla f(x) - \nabla f(y) \| \leq \beta \|x-y\|}$), then after T steps

$\displaystyle \begin{array}{rcl} f(x_T)-f(x^*) &\leq& \frac{\beta R^2}{2T}, \qquad \text{where } \eta = \frac{1}{\beta} \end{array}$
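This bound can also be checked numerically (again a sketch under assumed constants): for the β-smooth convex quadratic ${f(x) = \frac{1}{2}(x-c)^T A (x-c)}$, β is the largest eigenvalue of A, and we run gradient descent with ${\eta = 1/\beta}$ from ${x_0 = 0}$. The matrix `A`, vector `c`, and step count are illustrative.

```python
import numpy as np

A = np.diag([1.0, 4.0])
c = np.array([2.0, -1.0])           # minimizer x* = c, with f(x*) = 0
beta = 4.0                          # largest eigenvalue of A (smoothness constant)
eta = 1.0 / beta                    # the step size from the theorem
R = np.linalg.norm(c)               # ||x_0 - x*|| with x_0 = 0

def f(x):
    d = x - c
    return 0.5 * d @ A @ d

x, T = np.zeros(2), 50
for t in range(T):
    x = x - eta * A @ (x - c)       # gradient step: grad f(x) = A (x - c)

print(f(x) <= beta * R**2 / (2 * T))   # True
```

Unlike Theorem 1, this guarantee applies to the last iterate, and the 1/T rate (versus 1/√T) is what smoothness buys.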

## Notes

1. Convex Functions

Critical points: every point with ${\nabla f(x) = 0}$ is a global minimum.

2. Non-Convex Functions

Critical points: global min/max, local min/max, saddle points.

If the function is twice differentiable, we can look at the Hessian matrix

$\displaystyle H(x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$

- local/global min: ${H(x) \succeq 0}$, positive semidefinite (all eigenvalues non-negative)
- local/global max: ${H(x) \preceq 0}$, negative semidefinite (all eigenvalues non-positive)
- saddle point: H(x) has both negative and positive eigenvalues.
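This classification can be sketched in code. The helper below is illustrative (not from the notes): it inspects the eigenvalues of a symmetric Hessian, e.g. for ${f(x,y) = x^2 - y^2}$ the Hessian at the critical point (0,0) is diag(2, -2), a saddle.

```python
import numpy as np

def classify(H):
    """Classify a critical point from the eigenvalues of its Hessian."""
    eig = np.linalg.eigvalsh(H)     # eigenvalues of a symmetric matrix
    if np.all(eig >= 0):
        return "local/global min (positive semidefinite)"
    if np.all(eig <= 0):
        return "local/global max (negative semidefinite)"
    return "saddle point (mixed-sign eigenvalues)"

print(classify(np.diag([2.0, -2.0])))   # Hessian of x^2 - y^2 at the origin
print(classify(np.diag([2.0, 3.0])))    # Hessian of x^2 + 1.5 y^2
```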