
We consider solving the unconstrained optimization problem

$$
\begin{array}{ll}
\text{minimize} &\quad f(x)\\
\text{subject to} &\quad x \in \mathbb{R}^n
\end{array}
$$

through gradient descent with constant stepsize $\alpha > 0$:

$$
x^{(k+1)} = x^{(k)} - \alpha \nabla f(x^{(k)}), \quad k = 0, 1, 2, \ldots
$$
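For concreteness, here is a minimal sketch of this iteration in Python/NumPy; the objective, its gradient, the stepsize, and the iteration count are supplied by the user, and the names below are placeholders rather than anything from the text.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_iters):
    """Gradient descent with constant stepsize: x^(k+1) = x^(k) - alpha * grad_f(x^(k)).

    grad_f    : callable returning the gradient of f at a given point
    x0        : starting point x^(0)
    alpha     : constant stepsize (alpha > 0)
    num_iters : number of iterations K
    Returns the list of iterates [x^(0), ..., x^(K)].
    """
    iterates = [np.asarray(x0, dtype=float)]
    for _ in range(num_iters):
        x = iterates[-1]
        iterates.append(x - alpha * grad_f(x))
    return iterates
```

For instance, on $f(x) = \frac{1}{2}\Vert x\Vert^2$ (whose gradient is $x$), `gradient_descent(lambda x: x, np.ones(2), 0.5, 10)` drives the iterates towards the origin.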

Motivations

Smooth gradient descent: first results

Let us exploit the upper bound property of smooth functions. Setting $y = x^{(k+1)}$ and $x = x^{(k)}$, one gets the inequality

$$
f(x^{(k+1)}) \leq f(x^{(k)}) + \nabla f(x^{(k)})^\top (x^{(k+1)} - x^{(k)}) + \frac{L}{2}\Vert x^{(k+1)} - x^{(k)}\Vert_2^2
$$

Using the gradient descent update, this simplifies to

$$
f(x^{(k+1)}) \leq f(x^{(k)}) + \alpha \left(\alpha\frac{L}{2} - 1\right)\Vert \nabla f(x^{(k)})\Vert_2^2
$$

The RHS is minimized for $\alpha = 1/L$. For this choice of stepsize, one gets

$$
f(x^{(k+1)}) = f\big(x^{(k)} - (1/L)\nabla f(x^{(k)})\big) \leq f(x^{(k)}) - \frac{1}{2L} \Vert \nabla f(x^{(k)})\Vert_2^2
$$
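To spell out the two steps above (a routine computation, included only for completeness): substituting $x^{(k+1)} - x^{(k)} = -\alpha \nabla f(x^{(k)})$ into the upper bound gives

$$
f(x^{(k+1)}) \leq f(x^{(k)}) - \alpha \Vert \nabla f(x^{(k)})\Vert_2^2 + \frac{L\alpha^2}{2}\Vert \nabla f(x^{(k)})\Vert_2^2 = f(x^{(k)}) + \alpha\left(\alpha\frac{L}{2} - 1\right)\Vert \nabla f(x^{(k)})\Vert_2^2,
$$

and the factor $\alpha\left(\alpha\frac{L}{2} - 1\right)$, a convex quadratic in $\alpha$, attains its minimum value $-\frac{1}{2L}$ at $\alpha = 1/L$, which yields the sufficient decrease inequality above.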

In particular, $f(x^{(k+1)}) \leq f(x^{(k)})$ for this choice of stepsize. Now, summing the previous inequalities from $k = 0$ to $k + 1 = K$, the function values on the left-hand side telescope and one gets

$$
f(x^{(K)}) \leq f(x^{(0)}) - \frac{1}{2L}\sum_{k=0}^{K-1} \Vert \nabla f(x^{(k)})\Vert_2^2
$$

Since $f(x^{(K)}) \geq f(x^\star)$,

$$
\sum_{k=0}^{K-1} \Vert \nabla f(x^{(k)})\Vert_2^2 \leq 2L \left[f(x^{(0)}) - f(x^\star)\right]
$$

This shows that, for any value of $K$, the quantity $\sum_{k=0}^{K-1} \Vert \nabla f(x^{(k)})\Vert_2^2$ is upper bounded by the constant $2L \left[f(x^{(0)}) - f(x^\star)\right]$. Since the terms of this sum are nonnegative and its partial sums are bounded, the series converges, and therefore one has necessarily

$$
\lim_{K\rightarrow \infty} \Vert \nabla f(x^{(K)})\Vert_2^2 = 0
$$

Hence, gradient descent converges to a stationary point of $f$.

In addition, we get that

$$
\min_{0 \leq k \leq K-1} \Vert \nabla f(x^{(k)})\Vert_2 \leq \sqrt{\frac{2L\left[f(x^{(0)}) - f(x^{(K)})\right]}{K}} \leq \sqrt{\frac{2L\left[f(x^{(0)}) - f(x^\star)\right]}{K}}
$$

After $K$ steps of gradient descent, we can find a point $x$ such that

$$
\Vert \nabla f(x)\Vert_2 \leq \sqrt{\frac{2L\left[f(x^{(0)}) - f(x^\star)\right]}{K}}
$$
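As a quick numerical sanity check (not part of the argument above), one can run gradient descent with $\alpha = 1/L$ on a small smooth test problem and compare the smallest observed gradient norm with this bound; the test function, its smoothness constant, and the starting point below are illustrative assumptions.

```python
import numpy as np

# Illustrative smooth test problem: f(x) = 0.5 * x^T A x, with L = largest eigenvalue of A
# and minimal value f_star = 0.
A = np.diag([1.0, 3.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
L, f_star = 10.0, 0.0

x = np.array([5.0, -2.0, 1.0])
K = 50
f0 = f(x)

min_grad_norm = np.inf
for _ in range(K):
    g = grad_f(x)
    min_grad_norm = min(min_grad_norm, np.linalg.norm(g))
    x = x - (1.0 / L) * g                     # gradient descent step with alpha = 1/L

bound = np.sqrt(2 * L * (f0 - f_star) / K)    # the O(1/sqrt(K)) bound above
print(min_grad_norm <= bound)                 # expected: True
```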

Smooth convex case

With the addition of the convexity assumption on $f$, we can establish the following theorem (Theorem 1): if $f$ is convex and $L$-smooth, gradient descent with constant stepsize $\alpha = 1/L$ satisfies $f(x^{(K)}) - f(x^\star) \leq \frac{L}{2K}\Vert x^{(0)} - x^\star\Vert^2$ for every $K \geq 1$.


Proof 1 (Proof of Theorem 1)

By convexity of $f$, we have, for any iterate $x^{(k)}$,

$$
f(x^\star) \geq f(x^{(k)}) + \nabla f(x^{(k)})^\top (x^\star - x^{(k)})
$$

Recall that when $f$ is $L$-smooth, gradient descent with constant stepsize $\alpha = 1/L$ satisfies (see above for the proof)

$$
f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{1}{2L}\Vert \nabla f(x^{(k)})\Vert^2
$$

Combining these two inequalities, one obtains

$$
\begin{align*}
f(x^{(k+1)}) & \leq f(x^\star) + \nabla f(x^{(k)})^\top (x^{(k)} - x^\star) - \frac{1}{2L}\Vert \nabla f(x^{(k)})\Vert^2 & \text{(convexity)}\\
&= f(x^\star) + \frac{L}{2}\left( \Vert x^{(k)} - x^\star\Vert^2 - \Vert x^{(k)} - x^\star - (1/L)\nabla f(x^{(k)})\Vert^2\right) & \text{(completing the squares)}\\
&= f(x^\star) + \frac{L}{2}\left( \Vert x^{(k)} - x^\star\Vert^2 - \Vert x^{(k+1)} - x^\star\Vert^2\right) & \text{(GD update definition)}
\end{align*}
$$

Summing over $k = 0, 1, \ldots, K-1$, we get

$$
\begin{align*}
\sum_{k=0}^{K-1} \left(f(x^{(k+1)}) - f(x^\star)\right) &\leq \frac{L}{2}\sum_{k=0}^{K-1}\left( \Vert x^{(k)} - x^\star\Vert^2 - \Vert x^{(k+1)} - x^\star\Vert^2\right)\\
&= \frac{L}{2}\left( \Vert x^{(0)} - x^\star\Vert^2 - \Vert x^{(K)} - x^\star\Vert^2\right)\\
&\leq \frac{L}{2}\Vert x^{(0)} - x^\star\Vert^2
\end{align*}
$$

Recall that by $L$-smoothness and this choice of stepsize, the sequence $\lbrace f(x^{(k)})\rbrace$ is non-increasing. Thus

$$
f(x^{(K)}) - f(x^\star) \leq \frac{1}{K} \sum_{k=0}^{K-1} \left(f(x^{(k+1)}) - f(x^\star)\right) \leq \frac{L}{2K} \Vert x^{(0)} - x^\star\Vert^2
$$

as expected.
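To see this $\mathcal{O}(1/K)$ rate numerically, here is a minimal sketch on a least-squares problem; the data, dimensions, and iteration count are illustrative assumptions, and any smooth convex $f$ admitting a minimizer would do.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

f = lambda x: 0.5 * np.linalg.norm(B @ x - y) ** 2   # smooth convex objective
grad_f = lambda x: B.T @ (B @ x - y)
L = np.linalg.eigvalsh(B.T @ B).max()                # smoothness constant of f
x_star = np.linalg.lstsq(B, y, rcond=None)[0]        # a minimizer
f_star = f(x_star)

x0 = np.zeros(5)
x, K = x0.copy(), 200
for _ in range(K):
    x = x - (1.0 / L) * grad_f(x)                    # gradient descent with alpha = 1/L

print(f(x) - f_star <= L * np.linalg.norm(x0 - x_star) ** 2 / (2 * K))  # expected: True
```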

To illustrate why we can only guarantee boundedness of the iterates, let us consider the special (but important) case of convex quadratic functions.
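The original worked example is not reproduced here, but a minimal sketch of the phenomenon, assuming the quadratic $f(x) = \frac{1}{2}x^\top A x$ with $A$ positive semidefinite and singular, goes as follows: gradient descent never changes the component of the iterate lying in the null space of $A$, so the iterates remain bounded, but the particular minimizer they approach depends entirely on the initialization.

```python
import numpy as np

# Illustrative singular convex quadratic: f(x) = 0.5 * x^T A x with A = diag(1, 0), L = 1.
# Every point of the form (0, t) is a minimizer; the second coordinate lies in the null
# space of A, so gradient descent never updates it.
A = np.diag([1.0, 0.0])
grad_f = lambda x: A @ x
L = 1.0

x = np.array([4.0, 7.0])
for _ in range(100):
    x = x - (1.0 / L) * grad_f(x)

print(x)  # approximately [0., 7.]: bounded, approaching the minimizer (0, 7) selected by x^(0)
```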

Smooth and strongly convex case

Strong convexity allows us to refine Theorem 1 with a better rate (Theorem 2): if $f$ is $L$-smooth and $m$-strongly convex, gradient descent with constant stepsize $\alpha = 1/L$ satisfies $f(x^{(K)}) - f(x^\star) \leq \left(1 - \frac{m}{L}\right)^K \left[f(x^{(0)}) - f(x^\star)\right]$ and $\Vert x^{(K)} - x^\star\Vert^2 \leq \left(1 - \frac{m}{L}\right)^K \Vert x^{(0)} - x^\star\Vert^2$.


Before giving the proof of Theorem 2, we can give a simple proof of this result in the case of convex quadratics.

Proof 2 (Proof of Theorem 2, convergence in objective values)

Suppose that $f$ is $L$-smooth and $m$-strongly convex. From the definition of strong convexity, we have

$$
f(z) \geq f(x) + \nabla f(x)^\top (z - x) + \frac{m}{2}\Vert z - x \Vert^2 \quad \text{for any } z, x \in \mathbb{R}^n
$$

Let us minimize both sides of this inequality with respect to $z$. The LHS is clearly minimized for $z = x^\star$, while the RHS is minimized for $z = x - (1/m)\nabla f(x)$. Thus,

$$
f(x^\star) \geq f(x) - \nabla f(x)^\top \frac{1}{m}\nabla f(x) + \frac{m}{2}\left\Vert \frac{1}{m}\nabla f(x)\right\Vert^2 = f(x) - \frac{1}{2m}\Vert \nabla f(x)\Vert^2
$$

Rearranging terms, we obtain the famous Polyak-Lojasiewicz (PL) inequality

$$
\Vert \nabla f(x)\Vert^2 \geq 2m \left[f(x) - f(x^\star)\right] \quad \text{for all } x \in \mathbb{R}^n \tag{PL}
$$

Now, recall that for $L$-smooth functions, we have shown that (see above)

$$
f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{1}{2L}\Vert \nabla f(x^{(k)})\Vert^2
$$

Bounding the gradient norm from below via the PL inequality, we get

$$
f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{2m}{2L} \left[f(x^{(k)}) - f(x^\star)\right]
$$

Subtracting $f(x^\star)$ from both sides gives

$$
f(x^{(k+1)}) - f(x^\star) \leq \left(1 - \frac{m}{L}\right) \left[f(x^{(k)}) - f(x^\star)\right]
$$

showing linear convergence. After $K$ iterations, we get the result

$$
f(x^{(K)}) - f(x^\star) \leq \left(1 - \frac{m}{L}\right)^K \left[f(x^{(0)}) - f(x^\star)\right].
$$
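As a numerical sanity check of both the PL inequality and this linear rate, here is a small sketch on an assumed strongly convex quadratic; the matrix and the constants $m$ and $L$ below are illustrative, not taken from the text.

```python
import numpy as np

# Illustrative m-strongly convex, L-smooth quadratic: f(x) = 0.5 * x^T A x,
# whose Hessian A has eigenvalues between m and L, with minimizer x_star = 0.
A = np.diag([0.5, 2.0, 4.0])
m, L = 0.5, 4.0
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
f_star = 0.0

x0 = np.array([3.0, -1.0, 2.0])

# Check the PL inequality at the starting point.
print(np.linalg.norm(grad_f(x0)) ** 2 >= 2 * m * (f(x0) - f_star))   # expected: True

# Run K steps of gradient descent with alpha = 1/L and check the linear rate.
x, K = x0.copy(), 30
for _ in range(K):
    x = x - (1.0 / L) * grad_f(x)
print(f(x) - f_star <= (1 - m / L) ** K * (f(x0) - f_star))           # expected: True
```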
Proof 3 (Proof of Theorem 2, convergence in iterates)

Let $\alpha = 1/L$. Computing the distance between the current iterate and the optimum, we get

$$
\begin{align*}
\Vert x^{(k+1)} - x^\star\Vert_2^2 &= \Vert x^{(k)} - \alpha\nabla f(x^{(k)}) - x^\star\Vert_2^2 & \text{(GD)}\\
&= \Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 - 2\alpha\nabla f(x^{(k)})^\top (x^{(k)} - x^\star) & \\
&= \Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 + 2\alpha\nabla f(x^{(k)})^\top (x^\star - x^{(k)}) & \\
&\leq \Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 + 2\alpha\left(f(x^\star) - f(x^{(k)})\right) - \alpha m \Vert x^{(k)} - x^\star\Vert^2 & \text{(strong convexity)}\\
&= (1 - \alpha m)\Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 + 2\alpha\left(f(x^\star) - f(x^{(k)})\right) &\\
&\leq (1 - \alpha m)\Vert x^{(k)} - x^\star \Vert^2 + \alpha^2 \Vert \nabla f(x^{(k)})\Vert^2 - \frac{\alpha}{L}\Vert \nabla f(x^{(k)})\Vert^2 & (\text{$L$-smooth property})\\
&= (1 - \alpha m)\Vert x^{(k)} - x^\star \Vert^2 + \alpha\left(\alpha - \frac{1}{L}\right) \Vert \nabla f(x^{(k)})\Vert^2 &\\
&\leq (1 - \alpha m)\Vert x^{(k)} - x^\star \Vert^2 & \left(\alpha\left(\alpha - \tfrac{1}{L}\right) \leq 0 \text{ for } \alpha \leq 1/L\right)
\end{align*}
$$

As a consequence, we have that, for $\alpha = 1/L$,

$$
\Vert x^{(K)} - x^\star\Vert_2^2 \leq \left(1 - \frac{m}{L}\right)^K \Vert x^{(0)} - x^\star\Vert_2^2
$$
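The same illustrative quadratic as above can be used to check this contraction of the iterates:

```python
import numpy as np

# Same illustrative quadratic as before: f(x) = 0.5 * x^T A x, m = 0.5, L = 4.0, x_star = 0.
A = np.diag([0.5, 2.0, 4.0])
m, L = 0.5, 4.0
grad_f = lambda x: A @ x
x_star = np.zeros(3)

x0 = np.array([3.0, -1.0, 2.0])
x, K = x0.copy(), 30
for _ in range(K):
    x = x - (1.0 / L) * grad_f(x)

lhs = np.linalg.norm(x - x_star) ** 2
rhs = (1 - m / L) ** K * np.linalg.norm(x0 - x_star) ** 2
print(lhs <= rhs)  # expected: True, consistent with the bound on the iterates
```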

Refining the rate

It is possible to refine the rate in Theorem 2 using a property of strongly convex functions, known as the coercivity of the gradient.

See for instance Bubeck (2015, Lemma 3.11) or Nesterov (2013, Theorem 2.1.12) for a proof of this property.
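For reference, the property in question is presumably the following standard inequality (see Bubeck, 2015, Lemma 3.11, or Nesterov, 2013, Theorem 2.1.12): for an $m$-strongly convex and $L$-smooth $f$,

$$
\left(\nabla f(x) - \nabla f(y)\right)^\top (x - y) \geq \frac{mL}{m + L}\Vert x - y\Vert^2 + \frac{1}{m + L}\Vert \nabla f(x) - \nabla f(y)\Vert^2 \quad \text{for all } x, y \in \mathbb{R}^n.
$$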


Summary

For gradient descent with constant stepsize, convergence properties depend on assumptions on the function ff.

| assumption on $f$ | stepsize $\alpha$ | rate | convergence of iterates? |
|---|---|---|---|
| smooth | $\alpha = \frac{1}{L}$ | $\Vert\nabla f(x^{(K)})\Vert = \mathcal{O}(1/\sqrt{K})$ | |
| convex, smooth | $\alpha = \frac{1}{L}$ | $f(x^{(K)}) - f(x^\star) = \mathcal{O}(1/K)$ | bounded |
| strongly convex, smooth | $\alpha = \frac{1}{L}$ | $f(x^{(K)}) - f(x^\star) = \mathcal{O}(c^{K})$ | yes |
| strongly convex, smooth | $\alpha = \frac{2}{L+m}$ | $f(x^{(K)}) - f(x^\star) = \mathcal{O}(d^{2K})$ | yes |
References
  1. Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3–4), 231–357.
  2. Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course (Vol. 87). Springer Science & Business Media.
  3. Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.