- The importance of gradients into finding the minimum/maximum of functions

All optimization problems are related to minimizing a function (usually termed loss, cost or error function) or maximizing a function (such as the likelihood) with respect to some variable $\mathbf{x}$. **If there are constraints in the possible values of $\mathbf{x}$, the method of Lagrange Multipliers can restrict the search of solutions in the feasible set of values of $\mathbf{x}$**. Let us limit the constraints to equalities, for now. The problem can be formulated as:
$$\mathbf{x^\star} = \underset{\mathbf{x}}{\textrm{argmin}} f(\mathbf{x}) $$
$$\textrm{subject to } h_i(\mathbf{x}) = 0, \forall i=1,..,m$$
In words, find the solution that minimizes $f(\mathbf{x})$, as long as all equalities $h_i(\mathbf{x}) = 0$ hold. It is easy to see that any equality constraint can be defined, so long as all terms are in the left side of the equation. The method of Lagrange Multipliers works as follows: **Put the cost function as well as the constraints in a single minimization problem, but multiply each constraint by a factor $\lambda_i$ (the lagrange multipliers)**. In our example, we would have $m$ lagrange multipliers. Hence the expression for the optimization problem becomes:
$$\mathbf{x^\star} = \underset{\mathbf{x}}{\textrm{argmin }} L(\mathbf{x}, \mathbf{\lambda}) = \underset{\mathbf{x}}{\textrm{argmin }} f(\mathbf{x}) + \underset{i=1}{\overset{m}{\sum}} \lambda_i h_i(\mathbf{x}),$$
where $L(\mathbf{x}, \mathbf{\lambda})$ is the Lagrangian and depends also on $\mathbf{\lambda}$, which is a vector of all multipliers.

As usual, we find the roots of the gradient of the loss function with respect to $\mathbf{x}$ to find the minimum (in case of a convex function) or the maximum (in case of a concave function). That will give us the optimality condition: $$\nabla_{\mathbf{x}}L(\mathbf{x}, \lambda) = \nabla_{\mathbf{x}}f(\mathbf{x}) + \sum_i\lambda_i \nabla_{\mathbf{x}}h_i(\mathbf{x}) = 0$$ We have number of variables equal to the elements in $\mathbf{x}$ (say $k$) plus the number of lagrange multipliers ($m$), and, as of now, we only have $k$ equations coming from the gradient with respect to $\mathbf{x}$. If we derive the loss function with respect to each lagrange multiplier, we have $m$ more equations and each new equation is: $$\frac{\partial L(\mathbf{x}, \lambda)}{\partial \lambda_i} = h_i(\mathbf{x}) = 0,$$ which means that by forcing their derivates to zero we are also limiting the possible solutions to the ones that meet the constraints. Having the same number of equations and variables makes the problem determined and can be solved. Since in the solution, the terms multiplied by the multipliers are zero, they have no contribution to the value of the loss function.

There is also an intuitive graphical explanation for the method. Referring to Fig. 1, the extremum points of the function subject to the constraint (two different examples in the figure) are points at which the gradient of function $f(x,y)$ points in the same direction as the gradient of function $h(x,y)$ (apart from a multiplication factor).

Then, we can impose for the solution: $$\nabla_{\mathbf{x}}f(\mathbf{x}) = -\lambda \nabla_{\mathbf{x}}h(\mathbf{x})$$ $$\nabla_{\mathbf{x}}f(\mathbf{x}) + \lambda \nabla_{\mathbf{x}}h(\mathbf{x}) = 0$$ This is the same result as taking the gradient of the Lagrange function with respect to $\mathbf{x}$.

Fig. 1 - Graphical explanation for the method of lagrange multipliers

However, there are still an infinite number of points where the equation above may be true. We still have to find the multiplication factor $\lambda$ in order to solve the problem. As before, we impose the constraints (equivalent to taking the gradient of the Lagrangian with respect to $\lambda$) to get more functions and have a determined system.

Let's say we want to minimize the $\ell_2$-norm of a variable, so long as it obeys to a constraint:
$$\mathbf{x^\star} = \underset{\mathbf{x}}{\textrm{argmin}} ||\mathbf{x}||^2 $$
$$\textrm{subject to } \mathbf{y} - A\mathbf{x} = 0.$$
Assuming that the number of observations in $\mathbf{y}$ are less than the number of variables in $\mathbf{x}$, the problem as an infinite number of solutions, and we are interested in finding the one with least $\ell_2$ norm. Setting the problem in Lagrangian form:
$$\mathbf{x^\star} = \underset{\mathbf{x}}{\textrm{argmin }} L(\mathbf{x}, \mathbf{\lambda}) = \underset{\mathbf{x}}{\textrm{argmin }} ||\mathbf{x}||^2 + \mathbf{\lambda}^T (\mathbf{y} - A\mathbf{x}),$$
we have a number of lagrange multipliers as the number of elements in $\mathbf{y}$. The gradient with respect to $\mathbf{x}$ is:
$$\begin{equation}2\mathbf{x} - A^T \mathbf{\lambda} = 0.\label{eq:1}\end{equation}$$
You can see $\mathbf{x}$ depending on $\mathbf{\lambda}$. From the gradient of the function with respect to $\mathbf{\lambda}$ we get:
$$\mathbf{y} = A\mathbf{x}$$
which in this case is pretty obvious. Premultiply $\eqref{eq:1}$ by $A$ to get:
$$2A\mathbf{x} = A A^T \mathbf{\lambda}$$
$$2\mathbf{y} = A A^T \mathbf{\lambda}$$
$$\mathbf{\lambda} = 2(A A^T)^{-1}\mathbf{y} $$
We now replace the lagrange multipliers back in $\eqref{eq:1}$ to get:
$$2\mathbf{x} = A^T 2(A A^T)^{-1}\mathbf{y}$$
$$\mathbf{x} = A^T (A A^T)^{-1}\mathbf{y}$$
It turns out that **the expression $A^T (A A^T)^{-1}$ is called the pseudo-inverse of A and is a closed-form solution to this problem**.

There are cases where the Lagrange multipliers simply do not exist. **We say that a feasible point is regular when the gradients of the constraints at that point are linearly independent**. If they are not, it may happen that at the optimal feasible point, the gradient of the function cannot be formed by a linear combination of the constraints, and therefore, there are no Lagrange multipliers that will meet the optimality conditions.

Let's look at this regularity issue with an example. Imagine the following problem: $$\min f(x_1, x_2) = x_1 + x_2$$ $$\mbox{subject to }h_1(x_1,x_2) = (x_1-1)^2 + x_2^2 -1 = 0$$ $$\mbox{subject to }h_1(x_1,x_2) = (x_1-2)^2 + x_2^2 -4 = 0$$ The constraints are circles with different radius, both crossing the origin $(x_1,x_2) = (0,0)$, which is the optimal solution. The optimality condition is: $$\nabla_{\mathbf{x}}f(\mathbf{x}) + \lambda_1 \nabla_{\mathbf{x}}h_1(\mathbf{x}) + \lambda_2 \nabla_{\mathbf{x}}h_2(\mathbf{x}) = 0$$ However, the gradients at the origin are $\nabla_{\mathbf{x}}f(0,0) = (1,1)$, $\nabla_{\mathbf{x}}h_1(0,0) = (-2,0)$, $\nabla_{\mathbf{x}}h_2(0,0) = (-4,0)$. Since the $x_2$ component is zero in both gradients of the constraints and not in the gradient of the function, the latter can never be formed as a linear combination of the formers. Therefore, there are no Lagrange multipliers that enforce the optimality condition.

But what if the function to be minimized is: $$\min f(x_1,x_2) = x_1?$$ In this case, the optimal solution still is $(0,0)$ and the gradient of $f(x_1,x_2)$ at the optimal point is $\nabla_{\mathbf{x}}f(0,0) = (1,0)$. Then, the gradient of the function can be formed as a linear combination of the gradients of the constraints and there are Lagrange multipliers that enforce the optimality condition.

As a recap:

- A feasible solution is regular? The Lagrange multipliers exist and they are unique.
- A feasible solution is not regular? The Lagrange multipliers may or may not exist, depending if the gradient of the function can be represented as a linear combination of the gradients of the constraints.

And what if we have inequalities as constraints, instead of equalities? The Karush-Kuhn-Tucker (KKT) conditions extend the method of Lagrange Multipliers to allow inequalities and the KKT conditions are the necessary conditions for optimality.