Proximal operator or mapping
What is the proximal operator?
The proximal operator is used to approximate a value while making a compromise between the accuracy of the approximation and a cost associated with the new value. Say we have the following minimization problem:
$$\underset{\mathbf{u}}{\min } \frac{1}{2}\left\Vert \mathbf{u} - \mathbf{x}\right\Vert _2^2.$$
It is clear that the best approximation to $\mathbf{x}$ is $\mathbf{x}$ itself ($\mathbf{u} = \mathbf{x}$). Now we add a cost to the choice we make of $\mathbf{u}$:
$$\textrm{prox}_h(\mathbf{x}) = \underset{\mathbf{u}}{\textrm{argmin }} \frac{1}{2}\left\Vert \mathbf{u} - \mathbf{x}\right\Vert _2^2 + h(\mathbf{u}).$$
Now the best approximation depends on how costly it is to choose $\mathbf{u} = \mathbf{x}$, given $h(\mathbf{u})$, and a compromise must be made. The first problem corresponds to the case $h(\mathbf{u}) = 0$. The proximal operator $\textrm{prox}_h(\mathbf{x})$ is the analytic solution to this minimization problem.
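The definition can be checked numerically with a generic solver. This is a minimal sketch (NumPy/SciPy and the function name `prox_numeric` are my own choices, not from the text), and a general-purpose smooth optimizer is only adequate when $h$ is smooth enough:

```python
import numpy as np
from scipy.optimize import minimize

def prox_numeric(h, x):
    """Numerically evaluate prox_h(x) = argmin_u 0.5*||u - x||_2^2 + h(u).

    Illustrative only: uses SciPy's generic solver, so h should be smooth;
    the closed-form cases derived below do not need this.
    """
    obj = lambda u: 0.5 * np.sum((u - x) ** 2) + h(u)
    return minimize(obj, x0=np.asarray(x, dtype=float)).x

x = np.array([3.0, -1.0, 0.5])
# With h = 0 the prox returns x itself, as argued above.
u0 = prox_numeric(lambda u: 0.0, x)
# With h(u) = 0.5*||u||^2 the minimizer is x/2, a simple sanity check.
u1 = prox_numeric(lambda u: 0.5 * np.sum(u ** 2), x)
```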
Examples

Consider $h(\mathbf{u}) = \lambda \left\Vert \mathbf{u} \right\Vert_1$.
The minimization function is
$$\frac{1}{2}\left\Vert\mathbf{u} - \mathbf{x}\right\Vert _2^2 + \lambda \left\Vert \mathbf{u} \right\Vert_1$$
To find the minimum we must solve for the root of the gradient of this function. You may note that both terms are separable in $\mathbf{u}$ and thus we can minimize each element of $\mathbf{u}$ individually. This simplifies the expression to:
$$\frac{1}{2} (u_i - x_i)^2 + \lambda |u_i|, \quad \forall i$$
Because of the absolute value, there are two situations to consider: either $u_i \gt 0$ or $u_i \lt 0$. For $u_i \gt 0$, setting the derivative with respect to $u_i$ to zero gives:
$$u_i - x_i + \lambda = 0$$
$$u_i = x_i - \lambda.$$
Since $u_i \gt 0$, this is only valid for $x_i \gt \lambda$. The same logic applies for $u_i \lt 0$, with which we get:
$$u_i = x_i + \lambda.$$
and it is only valid for $x_i \lt -\lambda$. For $-\lambda \lt x_i \lt \lambda$, we set $u_i = 0$, which places the (sub)derivative somewhere between $-\lambda$ and $\lambda$. Summing it all up:
$$u_{i}=\begin{cases} x_{i}-\lambda & x_{i}>\lambda\\
x_{i}+\lambda & x_{i} \lt -\lambda\\
0 & -\lambda \le x_{i} \le \lambda\end{cases}$$
This is equivalent to the expression:
$$u_i = \textrm{prox}_h(x_i) = \max(|x_i| - \lambda, 0)\, \textrm{sign}(x_i)$$
This particular case of the proximal operator is known as the shrinkage operator and is used in iterative soft thresholding (IST) algorithms.
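The shrinkage operator is one line in NumPy. A minimal sketch (the function name is mine; the formula is the one just derived):

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrinkage (soft-thresholding) operator: prox of h(u) = lam * ||u||_1.

    Elements with |x_i| <= lam are zeroed; the rest move lam toward zero.
    """
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -0.5, 1.2, -2.0])
u = soft_threshold(x, 1.0)  # [2.0, 0.0, 0.2, -1.0]
```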

Consider $h(\mathbf{u}) = \lambda \left\Vert \Sigma\mathbf{u} \right\Vert_1$, where $\Sigma$ is a diagonal matrix with nonnegative elements $\sigma_i$.
The minimization function is
$$\frac{1}{2}\left\Vert\mathbf{u} - \mathbf{x}\right\Vert _2^2 + \lambda \left\Vert \Sigma \mathbf{u} \right\Vert_1$$
To find the minimum we must solve for the root of the gradient of this function. Since the matrix $\Sigma$ is diagonal, both terms are again separable and we can minimize each element of $\mathbf{u}$ individually. This simplifies the expression to:
$$\frac{1}{2} (u_i - x_i)^2 + \lambda \sigma_i |u_i|, \quad \forall i$$
Because of the absolute value, there are two situations to consider: either $u_i \gt 0$ or $u_i \lt 0$. For $u_i \gt 0$, setting the derivative with respect to $u_i$ to zero gives:
$$u_i - x_i + \lambda \sigma_i = 0$$
$$u_i = x_i - \lambda \sigma_i.$$
Since $u_i \gt 0$, this is only valid for $x_i \gt \lambda \sigma_i$. The same logic applies for $u_i \lt 0$, with which we get:
$$u_i = x_i + \lambda \sigma_i.$$
and it is only valid for $x_i \lt -\lambda \sigma_i$. For $-\lambda \sigma_i \lt x_i \lt \lambda \sigma_i$, we set $u_i = 0$, which places the (sub)derivative somewhere between $-\lambda \sigma_i$ and $\lambda \sigma_i$. Summing it all up:
$$u_{i}=\begin{cases} x_{i}-\lambda \sigma_i & x_{i}>\lambda \sigma_i\\
x_{i}+\lambda\sigma_i & x_{i} \lt -\lambda\sigma_i\\
0 & -\lambda\sigma_i \le x_{i} \le \lambda\sigma_i \end{cases}$$
This is equivalent to the expression:
$$u_i = \textrm{prox}_h(x_i) = \max(|x_i| - \lambda \sigma_i, 0)\, \textrm{sign}(x_i)$$
This expression may be useful when you want to give different costs to each element of $\mathbf{x}$.
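The weighted case only changes the per-element threshold. A minimal sketch (function name is mine; the example weights are illustrative):

```python
import numpy as np

def weighted_soft_threshold(x, lam, sigma):
    """Prox of h(u) = lam * ||Sigma u||_1 with Sigma = diag(sigma), sigma_i >= 0.

    Each element gets its own threshold lam * sigma_i.
    """
    t = lam * np.asarray(sigma, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([3.0, 3.0, 3.0])
sigma = np.array([0.0, 1.0, 4.0])       # different cost per element
u = weighted_soft_threshold(x, 1.0, sigma)  # thresholds 0, 1, 4 -> [3.0, 2.0, 0.0]
```

A zero weight leaves the element untouched, while a large weight forces it to zero, which is exactly the "different costs per element" use case mentioned above.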

Consider $h(\mathbf{u}) = I_C(\mathbf{u})$, where:
$$I_C(\mathbf{u}) = \begin{cases} 0 & \mathbf{u} \in C\\
+\infty & \mathbf{u} \notin C\end{cases}$$
It is easy to see that if $\mathbf{u}$ does not belong to the set $C$, the cost is infinite and therefore such a choice is never an option. Plugging this cost function into the proximal operator gives:
$$\textrm{prox}_h(\mathbf{x}) = \underset{\mathbf{u}}{\textrm{argmin }} \frac{1}{2}\left\Vert \mathbf{u} - \mathbf{x}\right\Vert _2^2, \quad \mathbf{u} \in C$$
The closest $\mathbf{u}$ to $\mathbf{x}$ that belongs to the set $C$ is simply the projection of $\mathbf{x}$ onto $C$.
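For many simple sets the projection has a closed form. As a sketch, take $C$ to be the nonnegative orthant (this particular choice of $C$ and the function name are mine, for illustration):

```python
import numpy as np

def prox_indicator_nonneg(x):
    """Prox of the indicator of C = {u : u >= 0}.

    The Euclidean projection onto the nonnegative orthant simply
    sets every negative entry to zero.
    """
    return np.maximum(x, 0.0)

u = prox_indicator_nonneg(np.array([2.0, -1.5, 0.3]))  # [2.0, 0.0, 0.3]
```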

Consider $h(\mathbf{u}) = I_{\left\Vert \mathbf{u} \right\Vert_\infty \le \lambda}(\mathbf{u})$, where
$$I_{\left\Vert \mathbf{u} \right\Vert_\infty \le \lambda}(\mathbf{u}) = \begin{cases} 0 & \left\Vert\mathbf{u}\right\Vert_\infty = \max_i |u_i| \le \lambda\\ +\infty & \textrm{otherwise}\end{cases}$$
and $\max_i |u_i|$ is the largest element of $\mathbf{u}$ in absolute value.
This constraint ensures that no element of the approximation $\mathbf{u}$ exceeds $\lambda$ in absolute value; otherwise the cost is infinite. The proximal operator for this constraint is:
$$u_i = \textrm{prox}_h(\mathbf{x})_i = \min\left(1, \frac{\lambda}{|x_i|}\right) x_i, \quad \textrm{for each element } x_i \textrm{ of } \mathbf{x}$$
Intuitively, the proximal operator leaves any element with $|x_i| \le \lambda$ unchanged ($\lambda / |x_i| \ge 1$), and any element with $|x_i| \gt \lambda$ is shrunk to $\pm\lambda$.
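Elementwise, this projection is equivalent to clipping each entry to $[-\lambda, \lambda]$. A minimal sketch (the clipping form is mine; it matches the $\min(1, \lambda/|x_i|)\,x_i$ formula while avoiding a division by zero when $x_i = 0$):

```python
import numpy as np

def prox_inf_ball(x, lam):
    """Projection onto {u : ||u||_inf <= lam}.

    Equivalent to min(1, lam/|x_i|) * x_i per element: entries within
    [-lam, lam] pass through, larger ones are clipped to +/- lam.
    """
    return np.clip(x, -lam, lam)

u = prox_inf_ball(np.array([0.5, 2.0, -3.0]), 1.0)  # [0.5, 1.0, -1.0]
```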