4.2 Descent methods: general principles
Principle of descent algorithms

Consider the unconstrained optimization problem
$$
\begin{array}{ll}
\minimize &\quad f_0(x)\\
\st & \quad x \in \mathbb{R}^n
\end{array}
\tag{1}
$$

To solve (1), we construct a sequence $x^{(0)}, x^{(1)}, \ldots$ such that the $(k+1)$-th iteration reads

$$
x^{(k+1)} = x^{(k)} + \alpha_k d_k,
\tag{2}
$$

where $\alpha_k \geq 0$ is called the step size and $d_k$ is a direction at iteration $k$.
The goal is to construct a sequence $\lbrace x^{(k)} \rbrace$ that describes a descent algorithm, i.e.,

$$
f_0(x^{(0)}) > f_0(x^{(1)}) > \ldots > f_0(x^{(k)}) > f_0(x^{(k+1)}) > \ldots
\tag{3}
$$

so that, as $k \to \infty$, $x^{(k)} \rightarrow x^\star$, with $x^\star$ being (at least) a local minimizer of $f_0$.
Recall that the gradient $\nabla f_0(x^{(k)})$ gives, by definition, the direction of steepest ascent at the point $x^{(k)}$. Therefore, in order to satisfy the decrease condition (3), one must take $d_k$ in an opposing direction, i.e.,

$$
\nabla f_0(x^{(k)})^\top d_k < 0.
$$

In other words, $d_k$ must be taken as a descent direction.
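To fix ideas, here is a minimal sketch in Python of the generic descent iteration (2), including a check of the descent condition above. The helper names `grad_f0`, `choose_direction`, and `choose_step` are hypothetical placeholders, not functions defined in these notes.

```python
import numpy as np

def descent(x0, grad_f0, choose_direction, choose_step, n_iter=100):
    """Generic descent iteration: x^(k+1) = x^(k) + alpha_k * d_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        g = grad_f0(x)                  # gradient at the current iterate x^(k)
        d = choose_direction(x, g)      # direction d_k, method-dependent
        if g @ d >= 0:                  # d_k must satisfy grad^T d < 0
            raise ValueError("d_k is not a descent direction")
        alpha = choose_step(x, d)       # step size alpha_k >= 0
        x = x + alpha * d               # update rule (2)
    return x
```

For instance, passing `choose_direction=lambda x, g: -g` together with a small constant step size yields a valid descent direction whenever the gradient is nonzero; this is precisely the gradient descent choice introduced below.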
A zoo of descent algorithms

Simply put, strategies for choosing the direction $d_k$ define different categories of optimization algorithms.
Generally speaking, $d_k$ is chosen as
$$
d_k = -B_k^{-1}\nabla f_0(x^{(k)})
$$

where $B_k$ is a symmetric and invertible matrix. In the next sections, we will see several classical choices for the matrix $B_k$:
- $B_k = I_n$ (identity matrix). This is known as the Gradient Descent (GD) method.
- $B_k = \nabla^2 f_0(x^{(k)})$ (Hessian of $f_0$). This is known as Newton's method.
- $B_k \approx \nabla^2 f_0(x^{(k)})$ (approximate Hessian). This corresponds to quasi-Newton methods, for which many strategies for choosing $B_k$ exist.
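Purely as an illustration (these are not the implementations studied in the following sections), the three choices above translate into the following direction computations, where `grad` and `hess` denote the gradient and Hessian of $f_0$ at the current iterate, and `B_approx` stands for whatever Hessian approximation a quasi-Newton scheme maintains.

```python
import numpy as np

def direction_gd(grad):
    # B_k = I_n: the direction is simply the negative gradient
    return -grad

def direction_newton(grad, hess):
    # B_k = Hessian: solve B_k d = -grad rather than forming the inverse explicitly
    return np.linalg.solve(hess, -grad)

def direction_quasi_newton(grad, B_approx):
    # B_k ~ Hessian: same linear solve, with an approximation of the Hessian
    return np.linalg.solve(B_approx, -grad)
```

In practice, quasi-Newton methods update `B_approx` (or directly an approximation of its inverse) from one iteration to the next instead of recomputing second derivatives.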
How to define a stopping criterion

Before going over these methods in more detail, it is important to address the following question: how do we check that an algorithm has converged? In other words, when do we decide to stop the sequence (2)?

This is a wide topic, but the main idea is to stop the algorithm whenever, at the current iteration, a given criterion becomes smaller than a pre-defined (small) tolerance $\varepsilon$ (e.g., $\varepsilon = 10^{-6}$). Typical criteria include:
- Change in objective function: $\vert f_0(x^{(k+1)}) - f_0(x^{(k)})\vert \leq \varepsilon$
- Gradient norm: $\Vert \nabla f_0(x^{(k)}) \Vert \leq \varepsilon$
- Change in solution: $\Vert x^{(k+1)} - x^{(k)}\Vert \leq \varepsilon$

To avoid scaling issues, it is common to use a criterion computed in relative terms, for example:
$$
\frac{\vert f_0(x^{(k+1)}) - f_0(x^{(k)})\vert}{\vert f_0(x^{(k)}) \vert} \leq \varepsilon
\quad \text{or} \quad
\frac{\Vert x^{(k+1)} - x^{(k)}\Vert}{\Vert x^{(k)}\Vert} \leq \varepsilon
$$
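As a small sketch (the default tolerance and the safeguard against division by zero are arbitrary choices made here for illustration), such relative tests could be coded as follows.

```python
import numpy as np

def has_converged(x_new, x_old, f_new, f_old, eps=1e-6):
    """Stop when the relative change in f_0 or in the iterate falls below eps."""
    rel_f = abs(f_new - f_old) / max(abs(f_old), 1e-12)                        # relative change in objective
    rel_x = np.linalg.norm(x_new - x_old) / max(np.linalg.norm(x_old), 1e-12)  # relative change in solution
    return rel_f <= eps or rel_x <= eps
```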
Now, let us turn our attention to gradient descent.