Source: http://onmyphd.com/?p=gradient.descent

- Basic mathematical background
- Gradient of a function

**Gradient descent method is a way to find a local minimum of a function**. The way it works is we start with an initial guess of the solution and we take the gradient of the function at that point. We **step the solution in the negative direction of the gradient** and we repeat the process. The algorithm will eventually converge where the gradient is zero (which correspond to a local minimum). Its brother, the gradient ascent, finds the local maximum nearer the current solution by stepping it towards the positive direction of the gradient. They are both first-order algorithms because they take only the first derivative of the function.

Let's say we are trying to find the solution to the minimum of some function $f(\mathbf{x})$. Given some initial value $\mathbf{x_0}$ for $\mathbf{x}$, we can change its value in many directions (proportional to the dimension of $\mathbf{x}$: with only one dimension, we can make it higher or lower). To figure out what is the best direction to minimize $f$, we take the gradient $\nabla f$ of it (the derivative along every dimension of $\mathbf{x}$). Intuitively, the gradient will give the slope of the curve at that $\mathbf{x}$ and its direction will point to an increase in the function. So we change $\mathbf{x}$ in the opposite direction to lower the function value: $$\mathbf{x}_{k+1} = \mathbf{x}_k - \lambda \nabla f(\mathbf{x}_k)$$ The $\lambda \gt 0$ is a small number that forces the algorithm to make small jumps. That keeps the algorithm stable and its optimal value depends on the function. Given stable conditions (a certain choice of $\lambda$), it is guaranteed that $f(\mathbf{x}_{k+1}) \le f(\mathbf{x}_k)$.

$x_0$ = | |

$\lambda$ = |

The sliders above control the initial point $x_0$ and the constant $\lambda$. In the above plot you can see the function to be minimized and the points at each iteration of the gradient descent. If you increase $\lambda$ too much, the method will not converge.

There is a chronical problem to the gradient descent. For functions that have valleys (in the case of descent) or saddle points (in the case of ascent), the gradient descent/ascent algorithm zig-zags, because the gradient is nearly orthogonal to the direction of the local minimum in these regions. It is like being inside a round tube and trying to stay in the lower part of the tube. In case we are not, the gradient tells us we should go almost perpendicular to the longitudinal direction of the tube. If the local minimum is at the end of the tube, it will take a long time to reach it because we keep jumping between the sides of the tube (zig-zag). The Rosenbrock function is used to test this difficult problem: $$f(y,x) = (1-y)^2 + 100(x - y^2)^2$$

Click in the contour plot above to select the initial point from 1000 iterations of gradient descent. You easily see that as soon as the current iteration hits the valley (in dark blue), the iterations almost get stuck in the same position and move very slowly. The minimum is at (1,1).

A wrong step size $\lambda$ may not reach convergence, so a careful selection of the step size is important. Too large it will diverge, too small it will take a long time to converge. One option is to choose a fixed step size that will assure convergence wherever you start gradient descent. Another option is to choose a different step size at each iteration (adaptive step size).

Any differentiable function has a maximum derivative value, i.e., the maximum of the derivatives at all points. If this maximum is not infinite, this value is known as the Lipschitz constant and the function is Lipschitz continuous. $$\frac{\left\Vert f(\mathbf{x})-f(\mathbf{y}) \right\Vert}{\left\Vert \mathbf{x} - \mathbf{y} \right\Vert}\le L(f), \textrm{ for any } \mathbf{x},\mathbf{y} $$ This constant is important because it says that, given a certain function, any derivative will have a smaller value than the Lipschitz constant. The same can be said for the gradient of the function: if the maximum second derivative is finite, the function is Lipschitz continuous gradient and that value is the Lipschitz constant of $\nabla f$. $$\frac{\left\Vert \nabla f(\mathbf{x})-\nabla f(\mathbf{y}) \right\Vert}{\left\Vert \mathbf{x} - \mathbf{y} \right\Vert}\le L(\nabla f), \textrm{ for any } \mathbf{x},\mathbf{y} $$ For the $f(x) = x^2$ example, the derivative is $df(x)/dx = 2x$ and therefore the function is not Lipschitz continuous. But the second derivative is $d^2f(x)/dx^2 = 2$, and the function is Lipschitz continuous gradient with Lipschitz constant of $\nabla f = 2$.

Each gradient descent can be viewed as a minimization of the function: $$\mathbf{x}_{k+1} = \underset{\mathbf{x}}{\arg\min} f(\mathbf{x}_k) + (\mathbf{x} - \mathbf{x}_k)^T\nabla f(\mathbf{x}_k) + \frac{1}{2\lambda}\left\Vert\mathbf{x} - \mathbf{x}_k\right\Vert^2 $$ If we differentiate the equation with respect to $\mathbf{x}$, we get: $$0 = \nabla f(\mathbf{x}_k) + \frac{1}{\lambda}(\mathbf{x} - \mathbf{x}_k)$$ $$\mathbf{x} = \mathbf{x}_k - \lambda\nabla f(\mathbf{x}_k)$$ It can be shown that for any $\lambda \le 1/L(\nabla f)$: $$f(\mathbf{x}) \le f(\mathbf{x}_k) + (\mathbf{x} - \mathbf{x}_k)^T\nabla f(\mathbf{x}_k) + \frac{1}{2\lambda}\left\Vert\mathbf{x} - \mathbf{x}_k\right\Vert^2 $$ i.e., for any $\mathbf{x}$, $f(\mathbf{x})$ will always be equal or lower than the function that the gradient descent minimizes, if $\lambda \le 1/L(\nabla f)$. Under this view, the result we get from minimizing the right hand side is an upper bound to the real value of the function at $\mathbf{x}$. In our good example above, $L(\nabla f) = 2$, then $\lambda \le 0.5$ for good convergence. In fact, for this simple case, $\lambda = 0.5$ is a perfect value for the step size: smaller will converge slower, larger will exceed the optimum point at each iteration (although it still converges up to $\lambda = 1$).

Each method defines its own function, based on some assumptions. Exact methods accurately minimize $h(\lambda)$, while inexact methods make an approximation that just improves on the last iteration. The following are just a few examples.

One of the most obvious choices of $\lambda$ is to choose the one that minimizes the objective function: $$\lambda_k = \underset{\lambda}{\arg\min} f(\mathbf{x}_k - \lambda \nabla f(\mathbf{x}_k)) $$ This approach is conveniently called the steepest descent method. Although it seems to the best choice, it converges only linearly (error $\propto 1/k$) and is very sensitive to ill-conditioning of problems.

An approach proposed in 1988 is to find the step size that minimizes: $$\lambda_k = \underset{\lambda}{\arg\min} \left\Vert \Delta\mathbf{x} - \lambda \Delta g(\mathbf{x}) \right\Vert^2$$ where $\Delta\mathbf{x} = \mathbf{x}_k - \mathbf{x}_{k-1}$ and $\Delta g(\mathbf{x}) = \nabla f(\mathbf{x}_k) - \nabla f(\mathbf{x}_{k-1})$. This is an approximation to the secant equation, used in quasi-Newton methods. The solution to this problem is easily solved differentiating with respect to $\lambda$: $$0 = \Delta g(\mathbf{x})^T (\Delta\mathbf{x} - \lambda \Delta g(\mathbf{x})) $$ $$\lambda = \frac{\Delta g(\mathbf{x})^T \Delta\mathbf{x}}{\Delta g(\mathbf{x})^T\Delta g(\mathbf{x})} $$ This approach works really well, even for large dimensional problems.

This is considered an inexact line search, in the sense it does not optimize $\lambda$, but instead chooses one that guarantees some descent.
Start with $\lambda > 0$ and $\tau \in (0,1)$.

Repeat $\lambda = \tau \lambda$

Until the Armijo-Goldstein or Wolfe conditions are met