6.2 Convergence properties for gradient descent
We consider solving the unconstrained optimization problem
$$
\begin{array}{ll}
\text{minimize} &\quad f(x)\\
\text{subject to} &\quad x \in \mathbb{R}^n
\end{array}
$$

through gradient descent with constant step size $\alpha > 0$:

$$
x^{(k+1)} = x^{(k)} - \alpha \nabla f(x^{(k)}), \quad k = 0, 1, 2, \ldots
$$

Motivations
- the simplest case, yet already illuminating;
- a common strategy for large-scale applications (e.g., machine learning);
- the results can be extended to backtracking / optimal step sizes without too much trouble.
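To fix ideas, here is a minimal NumPy sketch of gradient descent with a constant step size; the quadratic test function and the choice $\alpha = 1/L$ are illustrative assumptions, not part of any statement below.

```python
import numpy as np

def gradient_descent(grad, x0, alpha, num_iters):
    """Iterate x_{k+1} = x_k - alpha * grad(x_k) and return all iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_iters):
        x = x - alpha * grad(x)
        iterates.append(x.copy())
    return iterates

# Illustrative example: f(x) = 0.5 * x^T Q x - p^T x.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
p = np.array([1.0, -1.0])
grad_f = lambda x: Q @ x - p      # gradient of the quadratic
L = np.linalg.norm(Q, 2)          # smoothness constant: largest eigenvalue of Q

xs = gradient_descent(grad_f, x0=np.zeros(2), alpha=1.0 / L, num_iters=200)
print("last iterate:", xs[-1], "  solution:", np.linalg.solve(Q, p))
```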
Smooth gradient descent: first results

Let us exploit the upper bound property of $L$-smooth functions. Setting $y = x^{(k+1)}$ and $x = x^{(k)}$, one gets the inequality

$$
f(x^{(k+1)}) \leq f(x^{(k)}) + \nabla f(x^{(k)})^\top (x^{(k+1)} - x^{(k)}) + \frac{L}{2}\Vert x^{(k+1)} - x^{(k)}\Vert_2^2
$$

Using the gradient descent update, this simplifies to

$$
f(x^{(k+1)}) \leq f(x^{(k)}) + \alpha\left(\alpha\frac{L}{2} - 1\right)\Vert \nabla f(x^{(k)})\Vert_2^2
$$

The RHS is minimized for $\alpha = 1/L$. For this choice of step size, one gets

$$
f(x^{(k+1)}) = f\left(x^{(k)} - (1/L)\nabla f(x^{(k)})\right) \leq f(x^{(k)}) - \frac{1}{2L}\Vert \nabla f(x^{(k)})\Vert_2^2
$$

in particular, $f(x^{(k+1)}) \leq f(x^{(k)})$ for this choice of step size.
Now, summing the previous inequalities from $k = 0$ to $k+1 = K$, one gets

$$
f(x^{(K)}) \leq f(x^{(0)}) - \frac{1}{2L}\sum_{k=0}^{K-1} \Vert \nabla f(x^{(k)})\Vert_2^2
$$

Since $f(x^{(K)}) \geq f(x^\star)$,

$$
\sum_{k=0}^{K-1} \Vert \nabla f(x^{(k)})\Vert_2^2 \leq 2L\left[f(x^{(0)}) - f(x^\star)\right]
$$

This shows that the quantity $\sum_{k=0}^{K-1} \Vert \nabla f(x^{(k)})\Vert_2^2$ is bounded above by the constant $2L\left[f(x^{(0)}) - f(x^\star)\right]$, whatever the value of $K$. Therefore, one necessarily has

$$
\lim_{K\rightarrow \infty} \Vert \nabla f(x^{(K)})\Vert_2^2 = 0
$$

Hence the gradient norms vanish: gradient descent converges to a stationary point of $f$, in the sense that any limit point of the iterates is stationary.
In addition, we get that

$$
\min_{0 \leq k \leq K-1} \Vert \nabla f(x^{(k)})\Vert_2 \leq \sqrt{\frac{2L\left[f(x^{(0)}) - f(x^{(K)})\right]}{K}} \leq \sqrt{\frac{2L\left[f(x^{(0)}) - f(x^\star)\right]}{K}}
$$

After $K$ steps of gradient descent, we can therefore find a point $x$ among the iterates such that

$$
\Vert\nabla f(x)\Vert_2 \leq \sqrt{\frac{2L\left[f(x^{(0)}) - f(x^\star)\right]}{K}}
$$

- convergence can only be guaranteed for the norm of the gradient of the iterates;
- the convergence rate is very slow, in $K^{-1/2}$;
- we cannot say much more without additional assumptions on $f$.
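As a sanity check (not part of the argument above), the sketch below verifies numerically that $\min_{0 \leq k \leq K-1}\Vert\nabla f(x^{(k)})\Vert_2$ indeed stays below $\sqrt{2L[f(x^{(0)}) - f(x^\star)]/K}$; the test function $f(x) = \sum_i \log(1 + x_i^2)$, its constant $L = 2$, and the starting point are illustrative assumptions.

```python
import numpy as np

# Smooth (nonconvex) test function: f(x) = sum_i log(1 + x_i^2), with L = 2 and f(x*) = 0.
f = lambda x: np.sum(np.log1p(x ** 2))
grad_f = lambda x: 2.0 * x / (1.0 + x ** 2)
L = 2.0

x = np.array([3.0, -2.0, 1.5])       # arbitrary starting point
f0 = f(x)
grad_norms = []
K = 1000
for _ in range(K):
    grad_norms.append(np.linalg.norm(grad_f(x)))
    x = x - (1.0 / L) * grad_f(x)    # gradient descent with alpha = 1/L

bound = np.sqrt(2.0 * L * f0 / K)    # here f(x*) = 0
print("smallest gradient norm:", min(grad_norms))
print("theoretical bound     :", bound)
```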
Smooth convex case

- The objective $f$ is $L$-smooth.
- The objective $f$ is convex.
- A solution $x^\star$ exists.
With the additional assumption that $f$ is convex, we can establish the following theorem.

Theorem 1. Under the above assumptions, the gradient descent method with constant step size $\alpha_k = 1/L$ generates a sequence $\lbrace x^{(k)}\rbrace$ such that, after $K$ iterations,
$$
f(x^{(K)}) - f(x^\star) \leq \frac{L}{2K}\Vert x^{(0)} - x^\star\Vert_2^2
$$

Comments
- the convergence rate has improved, to $K^{-1}$;
- the theorem establishes convergence in objective values, but we can only show boundedness of the iterates, i.e., $\Vert x^{(k)} - x^\star\Vert \leq \Vert x^{(0)} - x^\star\Vert$; see below for an illustration in the case of convex quadratic functions.
Proof 1 (Proof of Theorem 1)
By convexity of $f$, we have, for any iterate $x^{(k)}$,
$$
f(x^\star) \geq f(x^{(k)}) + \nabla f(x^{(k)})^\top (x^\star - x^{(k)})
$$

Recall that when $f$ is $L$-smooth, gradient descent with constant step size $\alpha = 1/L$ satisfies (see above for the proof)

$$
f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{1}{2L}\Vert \nabla f(x^{(k)})\Vert^2
$$

Combining these two inequalities, one obtains
$$
\begin{align*}
f(x^{(k+1)}) & \leq f(x^\star) + \nabla f(x^{(k)})^\top (x^{(k)} - x^\star) - \frac{1}{2L}\Vert \nabla f(x^{(k)})\Vert^2 & \text{(convexity)}\\
&= f(x^\star) + \frac{L}{2}\left( \Vert x^{(k)} - x^\star\Vert^2 - \Vert x^{(k)} - x^\star - (1/L)\nabla f(x^{(k)})\Vert^2\right) & \text{(completing the square)}\\
&= f(x^\star) + \frac{L}{2}\left( \Vert x^{(k)} - x^\star\Vert^2 - \Vert x^{(k+1)} - x^\star\Vert^2\right) & \text{(GD update definition)}
\end{align*}
$$

Summing over $k = 0, 1, \ldots, K-1$, we get
$$
\begin{align*}
\sum_{k=0}^{K-1} \left(f(x^{(k+1)}) - f(x^\star)\right) &\leq \frac{L}{2}\sum_{k=0}^{K-1}\left( \Vert x^{(k)} - x^\star\Vert^2 - \Vert x^{(k+1)} - x^\star\Vert^2\right)\\
&= \frac{L}{2}\left( \Vert x^{(0)} - x^\star\Vert^2 - \Vert x^{(K)} - x^\star\Vert^2\right) \quad \text{(the sum telescopes)}\\
&\leq \frac{L}{2}\Vert x^{(0)} - x^\star\Vert^2
\end{align*}
$$

Recall that, by $L$-smoothness and this choice of step size, the sequence $\lbrace f(x^{(k)})\rbrace$ is non-increasing. Thus

$$
f(x^{(K)}) - f(x^\star) \leq \frac{1}{K} \sum_{k=0}^{K-1} \left(f(x^{(k+1)}) - f(x^\star)\right) \leq \frac{L}{2K} \Vert x^{(0)} - x^\star\Vert^2
$$

as expected.
To illustrate why we can only guarantee boundedness of the iterates, let us consider the special (but important) case of convex quadratic functions.
Let $f: x \mapsto \frac{1}{2} x^\top Q x - p^\top x$ with $Q \succeq 0$, so that $f$ is convex. Recall that $\nabla f(x) = Qx - p$; moreover, $f$ is $L$-smooth with $L = \Vert Q\Vert_2 = \max_{\Vert x \Vert = 1} \Vert Q x\Vert_2 = \max_{i} \vert\lambda_i(Q)\vert$.

Since $Q \succeq 0$, a solution exists whenever $p$ lies in the range of $Q$ (it is not necessarily unique), and is given by $x^\star = Q^\dagger p$, where $Q^\dagger$ is the pseudo-inverse of $Q$ ($Q^\dagger = Q^{-1}$ when $Q \succ 0$).
The iterates read

$$
x^{(k+1)} = x^{(k)} - \frac{1}{L}(Qx^{(k)} - p) = x^{(k)} - \frac{1}{L}Q(x^{(k)} - x^\star)
$$

Therefore

$$
x^{(k+1)} - x^\star = \left[I_n - \frac{1}{L}Q\right](x^{(k)} - x^\star) = \left[I_n - \frac{1}{L}Q\right]^{k+1}(x^{(0)} - x^\star)
$$

Since $Q \succeq 0$, the eigenvalues of $I_n - \frac{1}{L}Q$ lie in $[0, 1]$. Therefore $\Vert I_n - \frac{1}{L}Q \Vert_2 \leq 1$, and we obtain boundedness of the iterates:

$$
\Vert x^{(k+1)} - x^\star \Vert_2 \leq \Vert x^{(0)} - x^\star\Vert_2
$$

In particular, no more than boundedness can be expected: if $Q$ has a zero eigenvalue, $\Vert I_n - Q/L\Vert_2 = 1$ and the algorithm gets “stuck” along the directions corresponding to that eigenvalue.
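The sketch below illustrates this with a singular $Q$ (an illustrative choice): the component of $x^{(k)} - x^\star$ along the zero-eigenvalue direction never moves, so the iterates remain bounded but do not converge to $x^\star$.

```python
import numpy as np

# Singular Q >= 0: eigenvalues 2 and 0; the second coordinate is a flat direction.
Q = np.diag([2.0, 0.0])
p = np.array([1.0, 0.0])           # p lies in the range of Q, so a minimizer exists
x_star = np.linalg.pinv(Q) @ p     # x* = Q^+ p = (0.5, 0)
L = 2.0                            # largest eigenvalue of Q

x0 = np.array([5.0, 5.0])
x = x0.copy()
for _ in range(500):
    x = x - (1.0 / L) * (Q @ x - p)

print("final iterate    :", x)     # first coordinate reaches 0.5, second stays at 5.0
print("distance to x*   :", np.linalg.norm(x - x_star))
print("initial distance :", np.linalg.norm(x0 - x_star))   # final distance stays below this
```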
Smooth and strongly convex case

We now assume that $f$ is $L$-smooth and $m$-strongly convex (so that a unique solution $x^\star$ exists). Strong convexity allows us to refine Theorem 1 with a better rate.
Theorem 2. Under the above assumptions, the gradient descent method with $\alpha_k = 1/L$ generates a sequence $\lbrace x^{(k)}\rbrace$ such that, after $K$ iterations,
$$
\begin{align*}
f(x^{(K)}) - f(x^\star) &\leq \left(1-\frac{m}{L}\right)^K\left(f(x^{(0)}) - f(x^\star)\right)\\
\Vert x^{(K)} - x^\star \Vert_2^2 &\leq \left(1-\frac{m}{L}\right)^K \Vert x^{(0)} - x^\star \Vert_2^2
\end{align*}
$$

Comments
- convergence holds both in objective values and in iterates;
- the rate is exponential (also known as linear convergence);
- the rate depends on the ratio $m/L$ (yet it is not the best achievable in this case; see further below).
Before giving the proof of Theorem 2, we can give a simple proof of this result in the case of strongly convex quadratics.

Let $f(x) = \frac{1}{2} x^\top Q x - p^\top x$ with $Q \succ 0$. Denote $m = \lambda_{\min}(Q)$ and $L = \lambda_{\max}(Q)$; then $f$ is $m$-strongly convex and $L$-smooth.
Then
$$
\begin{align*}
x^{(k+1)} - x^\star &= x^{(k)} - \frac{1}{L}(Qx^{(k)} - p) - x^\star\\
&= x^{(k)} - x^\star - \frac{1}{L}Q(x^{(k)} - x^\star)\\
&= \left(I_n - \frac{1}{L}Q\right)(x^{(k)} - x^\star)
\end{align*}
$$

Since $Q \succ 0$, the eigenvalues of $I_n - \frac{1}{L}Q$ are in $[0, 1 - (m/L)]$.
Therefore, by recursion
$$
\Vert x^{(K)} - x^\star\Vert_2^2 \leq \left(1-\frac{m}{L}\right)^K \Vert x^{(0)} - x^\star\Vert_2^2.
$$

(A similar argument gives the bound on objective values.)
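A quick numerical check of this linear rate, on a randomly generated strongly convex quadratic (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + np.eye(5)            # Q > 0, so f is strongly convex
p = rng.standard_normal(5)

eigvals = np.linalg.eigvalsh(Q)
m, L = eigvals.min(), eigvals.max()
x_star = np.linalg.solve(Q, p)

x = np.zeros(5)
d0 = np.linalg.norm(x - x_star) ** 2
for k in range(1, 51):
    x = x - (1.0 / L) * (Q @ x - p)             # gradient step with alpha = 1/L
    dk = np.linalg.norm(x - x_star) ** 2
    assert dk <= (1 - m / L) ** k * d0 + 1e-12  # Theorem 2 bound on the iterates
print("all 50 iterates satisfy the (1 - m/L)^K bound")
```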
Proof 2 (Proof of Theorem 2, convergence in objective values)
Suppose that $f$ is $L$-smooth and $m$-strongly convex. From the definition of strong convexity, we have
$$
f(z) \geq f(x) + \nabla f(x)^\top (z - x) + \frac{m}{2}\Vert z - x \Vert^2 \quad \text{for any } z, x \in \mathbb{R}^n
$$

Let us minimize this inequality with respect to $z$.

The LHS is clearly minimized for $z = x^\star$. On the other hand, the RHS is minimized for $z = x - (1/m)\nabla f(x)$. Thus,
$$
f(x^\star) \geq f(x) - \nabla f(x)^\top \frac{1}{m}\nabla f(x) + \frac{m}{2}\left\Vert \frac{1}{m}\nabla f(x)\right\Vert^2 = f(x) - \frac{1}{2m}\Vert \nabla f(x)\Vert^2
$$

Rearranging terms, we obtain the famous Polyak–Łojasiewicz (PL) inequality
$$
\Vert \nabla f(x)\Vert^2 \geq 2m\left[f(x) - f(x^\star)\right] \quad \text{for all } x \in \mathbb{R}^n \tag{PL}
$$

Now, recall that for $L$-smooth functions, we have shown that (see above)

$$
f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{1}{2L}\Vert \nabla f(x^{(k)})\Vert^2
$$

Bounding the gradient norm from below through the PL inequality, we get

$$
f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{2m}{2L}\left[f(x^{(k)}) - f(x^\star)\right]
$$

Subtracting $f(x^\star)$ from both sides gives

$$
f(x^{(k+1)}) - f(x^\star) \leq \left(1 - \frac{m}{L}\right)\left[f(x^{(k)}) - f(x^\star)\right]
$$

showing linear convergence.
After $K$ iterations, we get the result

$$
f(x^{(K)}) - f(x^\star) \leq \left(1 - \frac{m}{L}\right)^K\left[f(x^{(0)}) - f(x^\star)\right].
$$

Proof 3 (Proof of Theorem 2, convergence in iterates)
Let $\alpha = 1/L$.
Computing the distance between the current iterate and the optimum, we get
$$
\begin{align*}
\Vert x^{(k+1)} - x^\star\Vert_2^2 &= \Vert x^{(k)} - \alpha\nabla f(x^{(k)}) - x^\star\Vert_2^2 & \text{(GD update)}\\
&= \Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 - 2\alpha\nabla f(x^{(k)})^\top (x^{(k)} - x^\star) & \\
&= \Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 + 2\alpha\nabla f(x^{(k)})^\top (x^\star - x^{(k)}) & \\
&\leq \Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 + 2\alpha\left(f(x^\star) - f(x^{(k)})\right) - \alpha m \Vert x^{(k)} - x^\star\Vert^2 & \text{(strong convexity)}\\
&= (1-\alpha m)\Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 + 2\alpha\left(f(x^\star) - f(x^{(k)})\right) & \\
&\leq (1-\alpha m)\Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 - \frac{\alpha}{L}\Vert \nabla f(x^{(k)})\Vert^2 & \text{($L$-smoothness)}\\
&= (1-\alpha m)\Vert x^{(k)} - x^\star \Vert^2 + \alpha\left(\alpha - \frac{1}{L}\right)\Vert \nabla f(x^{(k)})\Vert^2 & \\
&\leq (1-\alpha m)\Vert x^{(k)} - x^\star \Vert^2 & \text{($\alpha(\alpha - \tfrac{1}{L}) \leq 0$ for $\alpha \leq 1/L$)}
\end{align*}
$$

As a consequence, for $\alpha = 1/L$,
$$
\Vert x^{(K)} - x^\star\Vert_2^2 \leq \left(1-\frac{m}{L}\right)^K \Vert x^{(0)} - x^\star\Vert_2^2
$$

Refining the rate

It is possible to refine the rate in Theorem 2 using a property of strongly convex functions, known as the coercivity of the gradient.
Property 1. Let $f$ be $L$-smooth and $m$-strongly convex on $\mathbb{R}^n$. Then for all $x, y \in \mathbb{R}^n$, one has
$$
(\nabla f(x) - \nabla f(y))^\top(x-y) \geq \frac{m L}{m+L}\Vert x - y \Vert^2 + \frac{1}{L+m}\Vert \nabla f(x) - \nabla f(y)\Vert^2
$$

See for instance Bubeck (2015, Lemma 3.11) or Nesterov (2013, Theorem 2.1.12) for a proof of this property.
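For a quadratic $f(x) = \frac{1}{2}x^\top Qx - p^\top x$ we have $\nabla f(x) - \nabla f(y) = Q(x - y)$, so the property can be checked numerically on random pairs of points; the matrix below is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 0.5 * np.eye(4)          # Q > 0: f is m-strongly convex and L-smooth
eigvals = np.linalg.eigvalsh(Q)
m, L = eigvals.min(), eigvals.max()

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    g = Q @ (x - y)                    # gradient difference for the quadratic
    lhs = g @ (x - y)
    rhs = (m * L / (m + L)) * np.dot(x - y, x - y) + (1.0 / (L + m)) * np.dot(g, g)
    assert lhs >= rhs - 1e-9           # coercivity of the gradient
print("coercivity inequality verified on 1000 random pairs")
```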
Theorem 3. Under the above assumptions, the gradient descent method with $\alpha_k = 2/(m+L)$ generates a sequence $\lbrace x^{(k)}\rbrace$ such that, after $K$ iterations,
$$
\begin{align*}
f(x^{(K)}) - f(x^\star) &\leq \frac{L}{2}\left(\frac{L-m}{L+m}\right)^{2K}\Vert x^{(0)} - x^\star \Vert_2^2\\
\Vert x^{(K)} - x^\star \Vert_2^2 &\leq \left(\frac{L-m}{L+m}\right)^{2K} \Vert x^{(0)} - x^\star \Vert_2^2
\end{align*}
$$

Comments
- improved rate: still linear, but with a better constant than in Theorem 2;
- the step size incorporates our knowledge of strong convexity (compare with the step size $\alpha = 1/L$ used in the previous theorems).
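The sketch below compares the two step sizes on an ill-conditioned quadratic (an illustrative choice): with $\alpha = 2/(L+m)$ the distance to $x^\star$ contracts by $(L-m)/(L+m)$ per iteration, noticeably faster than the $1 - m/L$ contraction obtained with $\alpha = 1/L$.

```python
import numpy as np

Q = np.diag([100.0, 1.0])          # L = 100, m = 1: ill-conditioned quadratic
p = np.array([1.0, 1.0])
L, m = 100.0, 1.0
x_star = np.linalg.solve(Q, p)

def distance_after(alpha, iters=100):
    x = np.zeros(2)
    for _ in range(iters):
        x = x - alpha * (Q @ x - p)
    return np.linalg.norm(x - x_star)

print("alpha = 1/L     :", distance_after(1.0 / L))        # contraction ~ 0.99 per step
print("alpha = 2/(L+m) :", distance_after(2.0 / (L + m)))  # contraction ~ 0.98 per step
```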
Proof 4 (Proof of Theorem 3)

The proof is adapted from Bubeck (2015, Theorem 3.12); it directly exploits the coercivity of the gradient (Property 1).
Setting $x = x^{(k)}$ and $y = x^\star$ in Property 1, and using $\nabla f(x^\star) = 0$, we get the bound
$$
\nabla f(x^{(k)})^\top(x^{(k)} - x^\star) \geq \frac{m L}{m+L}\Vert x^{(k)} - x^\star \Vert^2 + \frac{1}{L+m}\Vert \nabla f(x^{(k)})\Vert^2
$$

Let us consider the case of an arbitrary step size $\alpha$.
Computing the distance between the current iterate and the optimum, we get
$$
\begin{align*}
\Vert x^{(k+1)} - x^\star\Vert_2^2 &= \Vert x^{(k)} - \alpha\nabla f(x^{(k)}) - x^\star\Vert_2^2 & \text{(GD update)}\\
&= \Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 - 2\alpha\nabla f(x^{(k)})^\top (x^{(k)} - x^\star) & \\
&\leq \left(1-\frac{2\alpha m L}{m+L}\right) \Vert x^{(k)} - x^\star \Vert^2 + \alpha \left(\alpha-\frac{2}{L+m}\right)\Vert \nabla f(x^{(k)})\Vert^2 & \text{(coercivity of the gradient)}
\end{align*}
$$

The last term is nonpositive if and only if $\alpha \leq 2/(L+m)$. Hence, for $0 < \alpha \leq 2/(L+m)$, one has
$$
\Vert x^{(K)} - x^\star\Vert_2^2 \leq \left(1-\frac{2\alpha m L}{m+L}\right)^K \Vert x^{(0)} - x^\star \Vert^2
$$

In particular, for $\alpha = 2/(L+m)$, we get

$$
\Vert x^{(K)} - x^\star\Vert_2^2 \leq \left(\frac{L-m}{L+m}\right)^{2K} \Vert x^{(0)} - x^\star \Vert^2
$$

which proves the convergence of the iterates.
To show that the objective converges as well, use the fact that, by $L$-smoothness of $f$,
$$
f(x^{(K)}) - f(x^\star) \leq \frac{L}{2}\Vert x^{(K)} - x^\star\Vert_2^2
$$

hence

$$
f(x^{(K)}) - f(x^\star) \leq \frac{L}{2}\left(\frac{L-m}{L+m}\right)^{2K} \Vert x^{(0)} - x^\star \Vert^2.
$$

Summary

For gradient descent with constant step size, the convergence properties depend on the assumptions made on the function $f$.
| assumption on $f$ | step size $\alpha$ | rate | convergence of iterates? |
|---|---|---|---|
| smooth | $\alpha = \frac{1}{L}$ | $\min_{0 \leq k \leq K-1}\Vert\nabla f(x^{(k)})\Vert = \mathcal{O}(1/\sqrt{K})$ | – |
| convex, smooth | $\alpha = \frac{1}{L}$ | $f(x^{(K)}) - f(x^\star) = \mathcal{O}(1/K)$ | bounded |
| strongly convex, smooth | $\alpha = \frac{1}{L}$ | $f(x^{(K)}) - f(x^\star) = \mathcal{O}(c^{K})$ | yes |
| strongly convex, smooth | $\alpha = \frac{2}{L+m}$ | $f(x^{(K)}) - f(x^\star) = \mathcal{O}(d^{2K})$ | yes |

with $c = 1 - \frac{m}{L}$ and $d = \frac{L-m}{L+m}$.
Most of the results for constant step size can be adapted to other step size strategies.
For instance, let $f$ be $L$-smooth and $m$-strongly convex. We have:
- Optimal step size (exact line search): the same result as for $\alpha = \frac{1}{L}$, i.e.,

$$
f(x^{(K)}) - f(x^\star) \leq \left(1-\frac{m}{L}\right)^K\left(f(x^{(0)}) - f(x^\star)\right)
$$

- Backtracking line search: a similar result, with a different constant,

$$
f(x^{(K)}) - f(x^\star) \leq c^K\left(f(x^{(0)}) - f(x^\star)\right)
$$

where $c = 1 - \min(2ms, 2\eta s m/L) < 1$, with $(s, \eta)$ the backtracking parameters.
See Boyd & Vandenberghe (2004, Section 9.3.1) for more detail.
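As an illustration, here is a sketch of gradient descent with Armijo-style backtracking; the interpretation of the parameters ($s$ as the sufficient-decrease fraction, $\eta$ as the step shrink factor), the starting step $t = 1$, and the test problem are assumptions made for this example.

```python
import numpy as np

def backtracking_gd(f, grad, x0, s=0.25, eta=0.5, iters=100):
    """Gradient descent with backtracking: start from t = 1 and shrink by eta
    until the sufficient-decrease condition f(x - t g) <= f(x) - s * t * ||g||^2 holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        t = 1.0
        while f(x - t * g) > f(x) - s * t * np.dot(g, g):
            t *= eta
        x = x - t * g
    return x

# Illustrative strongly convex quadratic.
Q = np.diag([10.0, 1.0])
p = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ Q @ x - p @ x
grad = lambda x: Q @ x - p

x_hat = backtracking_gd(f, grad, x0=np.zeros(2))
print("backtracking solution:", x_hat, "  exact:", np.linalg.solve(Q, p))
```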
Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4), 231–357.

Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course (Vol. 87). Springer Science & Business Media.

Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.