# Deep Learning Basics III

In this post, we discuss two theorems that provide convergence guarantees for the Gradient Descent algorithm when minimizing a convex function. We show how making assumptions about the function we want to minimize can result in faster convergence. Then we discuss other learning algorithms for training a deep network, along with techniques to address over-fitting. Finally, we talk briefly about two popular deep network architectures: Convolutional Neural Networks and Recurrent Neural Networks.

1. Convex Minimization

The problem of convex minimization can be formally defined as:
Given a convex function ${f:R^n \rightarrow R}$, find ${x^*=\arg\min_{x\in R^n} f(x)}$

Our goal is to approximate the minimizer ${x^*}$. In other words: output ${\hat{x}}$ such that ${f(\hat{x}) \leq f(x^*)+\epsilon}$.

If we do not make any assumptions about ${f}$, we can use the interior point method to achieve the above task. Unfortunately, the number of iterations for this method depends on the dimension ${n}$ of the domain of ${f}$. Hence it is not useful in higher dimensions.

Gradient Descent consists of the following two steps:

1. ${x_0=0}$
2. ${x_{t+1}=x_t-\eta \nabla f(x_t)}$
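The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the quadratic test function, the step size `eta=0.1` and the helper name `gradient_descent` are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, eta, steps):
    """Run x_{t+1} = x_t - eta * grad(x_t) and return every iterate."""
    xs = [list(x0)]
    for _ in range(steps):
        x = xs[-1]
        g = grad(x)
        xs.append([xi - eta * gi for xi, gi in zip(x, g)])
    return xs


# Hypothetical test function: f(x) = (x_1 - 1)^2 + (x_2 + 2)^2,
# which is convex and minimized at (1, -2).
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]


iterates = gradient_descent(grad_f, x0=[0.0, 0.0], eta=0.1, steps=100)
x_hat = iterates[-1]
print(x_hat)  # close to the minimizer (1, -2)
```

Starting from ${x_0=0}$ as in step 1, the iterates approach the minimizer without the update rule ever referring to the dimension of ${x}$.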

The number of iterations in first-order methods such as Gradient Descent is independent of ${n}$. However, we need to make assumptions on ${f}$ to guarantee fast convergence. We consider the following two assumptions independently and state the convergence rate obtained in each case:

1. ${f}$ is G-bounded: ${\forall x}$, ${||\nabla f(x)|| \leq G}$
2. ${f}$ is ${\beta}$-smooth: ${\forall x,y}$, ${||\nabla f(x)-\nabla f(y)|| \leq \beta ||x-y||}$

The following theorems hold true for the above cases:

Theorem 1: If ${f}$ is G-bounded and convex, then after T steps of gradient descent:

${\frac{1}{T}\sum^{T}_{t=1} f(x_t)-f(x^*) \leq \frac{RG}{\sqrt{T}}}$
where ${R= ||x^*||}$ and the learning rate is chosen to be ${\eta=\frac{R}{G\sqrt{T}}}$

The theorem gives a guarantee only for the average value of the function over the iterates, not for the value at any particular time step. The reason is that a G-bounded function without the ${\beta}$-smooth property gives no guarantee that the gradient shrinks as we approach the minimizer. Take ${f(x) =|x|}$ for example: the (sub)gradient is either +1, 0, or -1, so near the minimizer the iterates may oscillate and never reach it. The solution is to decrease ${\eta}$ as we go along in order to explicitly decrease the term ${\eta \nabla f(x_t)}$ in the second step of gradient descent.
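The oscillation and the average-case bound can both be observed numerically. The sketch below uses a shifted version of the example, ${f(x)=|x-1|}$, so that ${R=||x^*||=1}$ is non-zero when starting from ${x_0=0}$; the shift, the horizon `T = 50` and the function names are assumptions made for illustration.

```python
import math


def f(x):
    return abs(x - 1.0)          # hypothetical example, minimized at x* = 1


def subgradient(x):
    if x > 1.0:
        return 1.0
    if x < 1.0:
        return -1.0
    return 0.0


R, G, T = 1.0, 1.0, 50           # R = |x*|, G bounds the (sub)gradient
eta = R / (G * math.sqrt(T))     # learning rate from Theorem 1

x = 0.0
gaps = []                        # f(x_t) - f(x*), with f(x*) = 0
for _ in range(T):
    x = x - eta * subgradient(x)
    gaps.append(f(x))

average_gap = sum(gaps) / T
bound = R * G / math.sqrt(T)
print(average_gap, bound)        # the average gap stays below the bound
print(gaps[-4:])                 # the last iterates keep jumping around x*
```

With a constant step size the iterates end up hopping from one side of the minimizer to the other, yet the average gap still respects the ${\frac{RG}{\sqrt{T}}}$ bound.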

Theorem 2: If ${f}$ is ${\beta}$-smooth and convex, then after T steps of gradient descent:

${f(x_T)-f(x^*)}$ ${\leq}$ ${\frac{\beta R^2}{2T}}$
where ${R= ||x^*||}$ and the learning rate is chosen to be ${\eta=\frac{1}{\beta}}$

A few points to note:

• The theorem guarantees much faster convergence than for functions that are only G-bounded (${\frac{1}{T}}$ instead of ${\frac{1}{\sqrt{T}}}$).
• The guarantee holds for the function value at a particular time step, not just on average.
• The learning rate is larger and constant, because the gradient ${\nabla f(x_t)}$ is guaranteed to become smaller with time. So we do not have to reduce the learning rate ${\eta}$ explicitly to decrease the term ${\eta \nabla f(x_t)}$ in the second step of gradient descent.

1.2. Analysis for ${\beta}$ smooth functions

Here we prove Theorem 2, which gives the convergence guarantee for ${\beta}$-smooth functions. We make use of the following two properties of the function:

1. Convexity: ${f(y)}$ ${\geq}$ ${f(x) + \nabla f(x).(y-x)}$
2. ${\beta}$-smoothness: ${f(y)}$ ${\leq}$ ${f(x) + \nabla f(x).(y-x) + \frac{\beta}{2} ||y-x||^2 }$. Note that a function does not have to be convex to be ${\beta}$-smooth.

The proof goes as follows:

Step 1: Use ${\beta}$ smoothness with ${y=x_{t+1}}$, ${x=x_t}$

${f(x_{t+1})}$ ${\leq}$ ${f(x_t) + \nabla f(x_t).(x_{t+1}-x_t) + \frac{\beta}{2}||x_{t+1}-x_t||^2}$

Substitute: ${x_{t+1}=x_{t}-\frac{1}{\beta}\nabla f(x_t)}$ (From step 2 of Gradient Descent with ${\eta=\frac{1}{\beta}}$)

${f(x_{t+1})}$ ${\leq}$ ${f(x_t) - \frac{1}{\beta}||\nabla f(x_t)||^2+\frac{1}{2\beta}||\nabla f(x_t)||^2}$

${\Rightarrow f(x_{t+1})}$ ${\leq}$ ${f(x_t) - \frac{1}{2\beta}||\nabla f(x_t)||^2}$

Note that this already guarantees that the function value does not increase at any time step (though it may not decrease at a high rate). Since this step uses only smoothness and not convexity, it partly explains why gradient descent can work for non-convex functions as well.
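The inequality from Step 1 can be checked numerically on a function that is smooth but not convex. The sketch below assumes ${f(x)=\sin(x)}$, whose second derivative is bounded by 1 in absolute value, so ${\beta=1}$; the choice of function, the starting point and the small floating-point tolerance are illustrative assumptions.

```python
import math


def f(x):
    return math.sin(x)           # hypothetical non-convex, 1-smooth function


def grad_f(x):
    return math.cos(x)


beta = 1.0                       # |f''(x)| = |sin(x)| <= 1
eta = 1.0 / beta

x = 0.0
descent_holds = []
for _ in range(50):
    g = grad_f(x)
    x_next = x - eta * g
    # Step 1: f(x_{t+1}) <= f(x_t) - ||grad f(x_t)||^2 / (2 * beta)
    # (1e-12 absorbs floating-point rounding near convergence)
    descent_holds.append(f(x_next) <= f(x) - g * g / (2.0 * beta) + 1e-12)
    x = x_next

print(all(descent_holds))        # the per-step decrease holds throughout
print(x)                         # settles at a local minimum near -pi/2
```

The per-step decrease holds even though convexity fails; what is lost without convexity is the guarantee that the point reached is a global minimizer.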

Step 2: Use convexity with ${y=x^*}$, ${x=x_t}$

${f(x^*)}$ ${\geq}$ ${f(x_t) + \nabla f(x_t).(x^*-x_t)}$

Step 3: Combine the results of step 1 and step 2:

${f(x_{t+1})-f(x^*)}$ ${\leq}$ ${\nabla f(x_t).(x_{t}-x^*) - \frac{1}{2\beta}||\nabla f(x_t)||^2}$

Adding and subtracting ${\frac{\beta}{2}||x_t-x^*||^2}$

${f(x_{t+1})-f(x^*)}$ ${\leq}$ ${\nabla f(x_t).(x_{t}-x^*) - \frac{1}{2\beta}||\nabla f(x_t)||^2-\frac{\beta}{2}||x_t-x^*||^2+\frac{\beta}{2}||x_t-x^*||^2}$

${\Rightarrow}$ ${f(x_{t+1})-f(x^*)}$ ${\leq}$ ${\frac{-\beta}{2}[||x_t-x^*||^2+\frac{1}{\beta^2}||\nabla f(x_t)||^2-\frac{2}{\beta}\nabla f(x_t).(x_t-x^*)]+\frac{\beta}{2}||x_t-x^*||^2}$

${\Rightarrow}$ ${f(x_{t+1})-f(x^*)}$ ${\leq}$ ${\frac{-\beta}{2}||(x_t-x^*)-\frac{1}{\beta} \nabla f(x_t)||^2+\frac{\beta}{2}||x_t-x^*||^2}$

From gradient descent, substitute ${x_{t}-\frac{1}{\beta}\nabla f(x_t)=x_{t+1}}$

${\Rightarrow}$ ${f(x_{t+1})-f(x^*)}$ ${\leq}$ ${\frac{-\beta}{2}||x_{t+1}-x^*||^2 + \frac{\beta}{2}||x_t-x^*||^2}$

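Notice that when summed over ${t=0,\dots,T-1}$, this becomes a telescoping sum:

${\sum_{t=0}^{T-1}}$ ${[f(x_{t+1})-f(x^*)]}$ ${\leq}$ ${\sum_{t=0}^{T-1}}$ ${[\frac{\beta}{2}||x_t-x^*||^2 - \frac{\beta}{2}||x_{t+1}-x^*||^2]}$ ${=}$ ${\frac{\beta}{2}||x_0-x^*||^2 - \frac{\beta}{2}||x_T-x^*||^2}$ ${\leq}$ ${\frac{\beta R^2}{2}}$

where the last inequality uses ${x_0=0}$, so that ${||x_0-x^*||=R}$. Dividing by ${T}$:

${\Rightarrow}$ ${\frac{1}{T} \sum_{t=1}^T}$ ${f(x_t)-f(x^*)}$ ${\leq}$ ${\frac{\beta R^2}{2T}}$

From Step 1, ${f(x_t)}$ never increases, so ${f(x_T)}$ is no larger than the average of ${f(x_1),\dots,f(x_T)}$. Hence for a ${\beta}$-smooth function, we have:

${f(x_T)-f(x^*)}$ ${\leq}$ ${\frac{\beta R^2}{2T}}$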
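As a sanity check, the final bound can be compared against gradient descent on a simple smooth convex function. The quadratic below, with curvatures 1 and 10 (so ${\beta=10}$) and minimizer ${(1,-2)}$, is a hypothetical example chosen only for illustration.

```python
def f(x):
    return 0.5 * (x[0] - 1.0) ** 2 + 5.0 * (x[1] + 2.0) ** 2


def grad_f(x):
    return [x[0] - 1.0, 10.0 * (x[1] + 2.0)]


beta = 10.0                              # largest curvature of the quadratic
x_star = [1.0, -2.0]
f_star = f(x_star)
R_squared = sum(c * c for c in x_star)   # ||x*||^2, since x_0 = 0

x = [0.0, 0.0]
bound_holds = []
for T in range(1, 51):
    g = grad_f(x)
    x = [xi - gi / beta for xi, gi in zip(x, g)]   # eta = 1 / beta
    # compare f(x_T) - f(x*) with beta * R^2 / (2T)
    bound_holds.append(f(x) - f_star <= beta * R_squared / (2.0 * T))

print(all(bound_holds))
```

On this example the actual error shrinks much faster than ${\frac{\beta R^2}{2T}}$; the theorem is a worst-case guarantee over all ${\beta}$-smooth convex functions.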

1.3. Momentum Technique

In this section, we discuss a common technique used to accelerate gradient descent. It is known as the heavy-ball method, a name that comes from its analogy with a heavy ball rolling down a hill (Figure 1). One of the advantages of this technique is that it can also help avoid getting stuck in local optima.

Figure 1: A simplified representation of heavy ball technique.

This method is a two-step procedure defined as follows:

$\displaystyle \begin{array}{rcl} y_{t+1} &=& x_t - \eta \nabla f(x_t) \\ x_{t+1} &=& (1-\gamma)y_{t+1} + \gamma y_{t} \end{array}$

The first step is a gradient descent step that takes ${x_t}$ to ${y_{t+1}}$. We denote the intermediate value with ${y}$. The second step is a linear combination of ${y_{t+1}}$ and ${y_t}$. Rewriting it as ${x_{t+1} = y_{t+1} - \gamma (y_{t+1}-y_t)}$ shows that, for negative ${\gamma}$, it takes ${y_{t+1}}$ a little further in the direction of the previous step.
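The two-step procedure can be sketched as follows. Several choices here are assumptions made for illustration: a fixed ${\gamma=-0.5}$ (accelerated methods typically use a time-varying schedule instead), the initialization ${y_0=x_0}$, and the same hypothetical quadratic with ${\beta=10}$ as the test function.

```python
import math


def grad_f(x):
    # gradient of the hypothetical f(x) = 0.5*(x_1 - 1)^2 + 5*(x_2 + 2)^2
    return [x[0] - 1.0, 10.0 * (x[1] + 2.0)]


def momentum_descent(grad, x0, eta, gamma, steps):
    x = list(x0)
    y_prev = list(x0)                      # assumption: y_0 = x_0
    for _ in range(steps):
        g = grad(x)
        y_next = [xi - eta * gi for xi, gi in zip(x, g)]          # gradient step
        x = [(1.0 - gamma) * yn + gamma * yp                      # combination step
             for yn, yp in zip(y_next, y_prev)]
        y_prev = y_next
    return y_prev


def plain_descent(grad, x0, eta, steps):
    x = list(x0)
    for _ in range(steps):
        x = [xi - eta * gi for xi, gi in zip(x, grad(x))]
    return x


def distance_to_minimum(x):
    return math.hypot(x[0] - 1.0, x[1] + 2.0)


dist_momentum = distance_to_minimum(momentum_descent(grad_f, [0.0, 0.0], 0.1, -0.5, 20))
dist_plain = distance_to_minimum(plain_descent(grad_f, [0.0, 0.0], 0.1, 20))
print(dist_momentum, dist_plain)   # momentum ends closer after the same 20 steps
```

On this example the extrapolation mainly helps along the low-curvature direction, where plain gradient descent with ${\eta=\frac{1}{\beta}}$ makes slow progress.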

The illustrations below in Figure 2 show gradient descent on the left and gradient descent with momentum on the right.

The theoretical guarantee for this algorithm is given by a theorem due to Nesterov:

Theorem 3: For a ${\beta}$-smooth and convex function ${f}$, Nesterov’s Accelerated Gradient Descent satisfies

$\displaystyle \begin{array}{rcl} f(x_T) - f(x^*) \leq \frac{\beta R^2}{T^2} \end{array}$

It is interesting to note that the right-hand side is inversely proportional to ${T^2}$, unlike the case of gradient descent (where it was inversely proportional to ${T}$). This is why momentum techniques converge faster. An interested reader can also read the proof of this theorem here. Stochastic Gradient Descent with Momentum is generally used in deep learning, as it has empirically been found to work better when training networks.

2. Techniques to counter over-fitting

When a model describes the training data too well, we have the problem of over-fitting. The model has low error on the training data but does not perform well on new inputs (i.e. it does not generalize well). As deep networks have a lot of parameters, they can model complex functions and hence tend to suffer from over-fitting. In this section, we briefly look at some of the common techniques used to overcome this problem.

Some of the techniques are:

• Weight Regularization: This is a general technique used in many machine learning models. Multiple types of regularization exist. Some common ones include ${L2}$-regularization, where a ${\lambda w^2}$ term is added to the optimization function, and ${L1}$-regularization, where a ${\lambda\vert w \vert}$ term is added to the optimization function. The intuition is to stop the weights from taking large values, so that the weight vector as a whole is regularized. This technique is not commonly used in deep learning.
• Early Stopping: For this technique, the training dataset is divided into two sets: a training set and a validation set. The training set is used for running the back propagation algorithm as usual, while the validation set is used only to calculate the validation error. Both errors are monitored as training proceeds over successive passes through the dataset (known as epochs).

Figure 3: Early stopping example.

Figure 3 shows the validation and training errors. In the early stopping technique, we stop when the validation error starts increasing. In the second part of the diagram, we also see an example where the validation error stops decreasing. In this case we can try decreasing the learning rate, which reduces the validation error to an extent.
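The stopping rule can be sketched as a small function over the recorded validation errors. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are made up for illustration; in practice the errors would come from evaluating the network after each epoch.

```python
def early_stopping_epoch(val_errors, patience=1):
    """Return (best_epoch, best_error), stopping once the validation error
    has failed to improve for `patience` consecutive epochs."""
    best_err = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                      # validation error started increasing
    return best_epoch, best_err


# Made-up validation errors, one per epoch.
val_errors = [0.90, 0.62, 0.48, 0.41, 0.43, 0.47, 0.55]
best_epoch, best_err = early_stopping_epoch(val_errors)
print(best_epoch, best_err)            # 3 0.41
```

The weights saved at `best_epoch` are the ones that would be kept. A larger `patience` makes the rule less sensitive to noise in the validation error.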

• Dropout: In this technique, neurons are dropped at random during training with probability ${p}$, independently for each training sample. After training is completed, the network is used as a whole, but the activations are scaled by the keep probability ${1-p}$. Figure 4 illustrates the concept of dropout on a 3-layer network.

Figure 4: Dropout for a 3 layer network.

The intuition behind this is somewhat similar to the concept of bagging, where different models are trained and the average of these models is taken for prediction. By implementing dropout we effectively train an exponential number of models simultaneously. A heuristic explanation that is also given is that individual nodes cannot depend on their neighbors and so function as independent units, each one contributing to the network.

Figure 5: Effect of probability on error

For a linear model, it can be shown that dropout is analogous to weight regularization. However, the same result is not currently proven for deep networks. An interested reader can read more about dropout in this paper.
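The train-time and test-time behaviour of dropout can be sketched on a plain list of activations. The sketch assumes that ${p}$ is the probability of dropping a unit, so the test-time scale factor is the keep probability ${1-p}$; the function names and the constant activations are hypothetical.

```python
import random


def dropout_train(activations, p_drop, rng):
    """Training: zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop):
    """Test: keep every unit, but scale by the keep probability 1 - p_drop."""
    return [(1.0 - p_drop) * a for a in activations]


rng = random.Random(0)                 # fixed seed so the run is repeatable
h = [1.0] * 10000                      # made-up activations of one layer
p_drop = 0.5

train_mean = sum(dropout_train(h, p_drop, rng)) / len(h)
test_mean = sum(dropout_test(h, p_drop)) / len(h)
print(train_mean, test_mean)           # the two means roughly agree
```

The scaling keeps the expected input to the next layer at test time the same as it was during training.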

• Batch Normalization: In this technique, a batch of input data is taken and the inputs to each layer are normalized to zero mean and unit variance over the batch. This is done to reduce internal covariate shift. More about this method can be read in this paper.
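The normalization step can be sketched as below for a batch stored as a list of rows. This covers only the normalization described above: the full method also learns a scale and a shift for each feature, which are omitted here, and `eps` is a small constant assumed for numerical stability.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize each feature (column) of the batch to zero mean, unit variance."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)]
            for row in batch]


# Made-up batch of 4 samples with 2 features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = batch_normalize(batch)

col_means = [sum(row[j] for row in normalized) / 4 for j in range(2)]
col_vars = [sum(row[j] ** 2 for row in normalized) / 4 for j in range(2)]
print(col_means, col_vars)             # means near 0, variances near 1
```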

3. Introduction to Deep Networks

As the network grows deeper, the number of parameters keeps increasing. This generally increases the expressiveness of the model, but at the same time it increases the computation required to train the network. To deal with this, sparse networks were introduced. We briefly look at some of these deep networks in the following sections.

3.1. Convolutional Neural Networks

These networks are similar to regular neural networks; the only difference is that explicit sparsity is introduced. In standard feed-forward networks, the output ${h_i}$ is given by:

$\displaystyle \begin{array}{rcl} h_i = \sigma(wx + b) \end{array}$

but in Convolutional Neural Networks (CNNs) the hypothesis is computed using convolution and pooling operations.

Figure 6: Convolutional Neural Network

Let us have a look at these operations:

• Convolution: This operation is inspired by the Computer Vision community, which has long applied filters to images, although those filters were fixed, e.g. the Gaussian filter. A filter can be seen as a matrix which is used for the convolution operation. The image is divided into smaller sections of the same size as the filter, and the convolution operation is performed on each section.

Figure 7: A convolution operation applied with a ${K_{3 \times 3}}$ filter

In CNNs, this filter is represented by the weights of the neural network. The weights are not fixed and are updated by learning. The same filter is used across the whole input and is slid across the image (see Figure 7).
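Sliding a filter across an image can be sketched with plain nested loops. As in most deep learning libraries, the filter is applied without flipping (strictly speaking a cross-correlation), with no padding and a stride of 1; the image and the all-ones ${3 \times 3}$ filter are made-up values.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and take the weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    s += image[i + a][j + b] * kernel[a][b]
            row.append(s)
        out.append(row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]                   # the same 9 weights at every position

feature_map = conv2d_valid(image, kernel)
print(feature_map)                     # [[54.0, 63.0], [90.0, 99.0]]
```

In a CNN the entries of `kernel` would be learned rather than fixed, but the same weights are still reused at every position.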

• Pooling: In this operation, the maximum value in a neighborhood of intermediate outputs is taken and the others are discarded.
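Max pooling over non-overlapping neighborhoods can be sketched as follows. The ${2 \times 2}$ window, the stride equal to the window size and the input values are assumptions for illustration.

```python
def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size neighborhood."""
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        out.append(row)
    return out


feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 8],
               [3, 4, 6, 5]]
pooled = max_pool2d(feature_map)
print(pooled)                          # [[6, 4], [7, 9]]
```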

For a ${d \times d}$ feed-forward network, the number of weights is ${d^2}$, whereas in a CNN, as the weights are shared for the convolution operation, we have ${dt}$ weights, where ${t}$ is the number of filters used.

3.2. Recurrent Neural Networks

These networks are used to model long-term dependencies and are commonly used for text and speech. The weights are shared in these networks as well.

Figure 8: A Recurrent Neural network.

Figure 8 shows a Recurrent Neural Network (RNN). Here ${x_t}$ is the input, ${h_t}$ is the hidden state and ${o_t}$ is the output at time step ${t}$. The hidden state at time step ${t}$ is given by:

$\displaystyle \begin{array}{rcl} h_t = \sigma ( w_1 h_{t-1} + w_2 x_t) \end{array}$
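The recurrence can be unrolled in a few lines for the scalar case. The sketch assumes ${\sigma}$ is the logistic sigmoid, an initial hidden state of 0, and made-up values for the weights and the input sequence.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_forward(xs, w1, w2, h0=0.0):
    """Apply h_t = sigmoid(w1 * h_{t-1} + w2 * x_t) along the sequence."""
    hs = []
    h = h0
    for x in xs:
        h = sigmoid(w1 * h + w2 * x)   # the same w1, w2 at every time step
        hs.append(h)
    return hs


hidden_states = rnn_forward([1.0, 0.0, -1.0], w1=0.5, w2=1.0)
print(hidden_states)
```

The same two weights are reused at every time step, which is the weight sharing mentioned above.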

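Due to this representation, one of the main problems with these networks is that the gradient vanishes or explodes as the number of time steps increases. Ignoring the non-linearity, the gradient with respect to an early hidden state scales like the recurrent weight raised to the power of the number of time steps ${T}$:

$\displaystyle \begin{array}{rcl} \frac{\partial L}{\partial h_1} \sim w_1^T \end{array}$

This causes the gradient to increase or decrease quickly with ${T}$. To overcome this problem, nowadays Long Short Term Memory (LSTM) Networks are used.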
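The effect of repeatedly multiplying by the recurrent weight can be seen in a scalar toy example. The sketch ignores the non-linearity (an assumption made to isolate the effect of the weight), and the weights 0.5 and 1.5 and the 50 time steps are arbitrary.

```python
def gradient_through_time(w, steps):
    """Magnitude of the backpropagated gradient after `steps` multiplications by w."""
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad


vanishing = gradient_through_time(0.5, 50)   # weight below 1: shrinks towards 0
exploding = gradient_through_time(1.5, 50)   # weight above 1: blows up
print(vanishing, exploding)
```

Only a weight very close to 1 keeps the gradient at a usable scale over many steps, which is the difficulty that gated architectures such as the LSTM are designed to ease.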

An interested reader is invited to read more about these networks in chapters 9 and 10 of the Deep Learning book.