
In this section, we will consider the following square system of $n$ equations

$$
Qx = p, \quad Q \succ 0
$$

where $x \in \mathbb{R}^n$, $Q \in \mathbb{R}^{n\times n}$ and $p \in \mathbb{R}^n$.

First, we recall that (1) is closely related to the study of least squares problems.
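
Indeed, since $Q \succ 0$, solving (1) is equivalent to minimizing a strictly convex quadratic,

$$
f(x) = \frac{1}{2} x^\top Q x - p^\top x, \qquad \nabla f(x) = Qx - p,
$$

whose unique stationary point is exactly the solution of $Qx = p$. This is the quadratic that we will minimize along search directions throughout this section.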

Preliminaries

Conjugacy property

Conjugate gradient methods rely on the conjugacy property: nonzero vectors $d_1, \ldots, d_\ell$ are said to be conjugate with respect to $Q$ (or $Q$-conjugate) if $d_i^\top Q d_j = 0$ for all $i \neq j$.

Note that if $d_1, \ldots, d_\ell$ are conjugate with respect to $Q$, then they are linearly independent. (Prove it.)
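
As a quick numerical illustration (a minimal sketch; taking the eigenvectors of a randomly generated $Q$ is just one convenient way to obtain a conjugate set), the eigenvectors of a symmetric positive definite matrix are mutually conjugate with respect to it:

import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive definite matrix Q
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)

# Eigenvectors of the symmetric matrix Q are orthogonal; since Q d_j = lambda_j d_j,
# we get d_i^T Q d_j = lambda_j d_i^T d_j = 0 for i != j, i.e. the eigenvectors are Q-conjugate
_, D = np.linalg.eigh(Q)

# Off-diagonal entries of D^T Q D should be numerically zero
C = D.T @ Q @ D
print(np.max(np.abs(C - np.diag(np.diag(C)))))  # ~1e-14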

This property motivates a first algorithm, known as the conjugate direction method.

Conjugate directions method

Computation of the optimal stepsize αk\alpha_k

Let $\phi(\alpha) = \frac{1}{2} t(\alpha)^\top Q\, t(\alpha) - p^\top t(\alpha)$ where $t(\alpha) = x^{(k)}+\alpha d_k$. Hence $\alpha_k = \argmin_{\alpha} \phi(\alpha)$.

The function $\phi$ is a convex quadratic in $\alpha$, so it suffices to set its derivative to zero to obtain $\alpha_k$.

$$
\begin{align*}
\phi'(\alpha) &= d_k^\top Q\left[x^{(k)}+\alpha d_k\right] - p^\top d_k\\
&= \alpha\, d_k^\top Q d_k + d_k^\top\left[Qx^{(k)}-p\right]
\end{align*}
$$

Define $r_k = Qx^{(k)}-p$, the residual at iteration $k$. Then

$$
\alpha_k = -\frac{d_k^\top r_k}{d_k^\top Q d_k}
$$
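
As a sanity check (a small sketch on a randomly generated problem; the direction $d_k$ below is arbitrary, not yet produced by any algorithm), the closed-form stepsize matches the minimizer of $\phi$ found numerically:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)       # symmetric positive definite
p = rng.standard_normal(n)

x_k = rng.standard_normal(n)      # current iterate
d_k = rng.standard_normal(n)      # arbitrary search direction
r_k = Q @ x_k - p                 # residual at x_k

# Closed-form optimal stepsize along d_k
alpha_closed = -(d_k @ r_k) / (d_k @ Q @ d_k)

# Numerical minimization of phi(alpha) = 1/2 t(a)^T Q t(a) - p^T t(a)
phi = lambda a: 0.5 * (x_k + a * d_k) @ Q @ (x_k + a * d_k) - p @ (x_k + a * d_k)
alpha_num = minimize_scalar(phi).x

print(alpha_closed, alpha_num)    # the two values coincide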
Proof

Let us consider $n$ conjugate directions $d_0, \ldots, d_{n-1}$. This implies that the directions are linearly independent and thus span $\mathbb{R}^n$. In particular, there exist $\sigma_0, \ldots, \sigma_{n-1} \in \mathbb{R}$ such that

$$
x^\star - x^{(0)} = \sum_{i=0}^{n-1} \sigma_i d_i.
$$

By using the conjugacy property, we can obtain the value of $\sigma_k$ as

$$
d_k^\top Q\left[x^\star - x^{(0)}\right] = d_k^\top Q\left[\sum_{i=0}^{n-1} \sigma_i d_i\right] = \sigma_k\, d_k^\top Q d_k
$$

hence $\sigma_k = \frac{d_k^\top Q\left[x^\star - x^{(0)}\right]}{d_k^\top Q d_k}$. Next we prove that $\alpha_k = \sigma_k$.

By the recursion formula, we have, for $k>0$, $x^{(k)} = x^{(0)} + \sum_{i=0}^{k-1}\alpha_i d_i$. Then, by conjugacy,

$$
d_k^\top Q x^{(k)} = d_k^\top Q x^{(0)} \quad\Leftrightarrow\quad d_k^\top Q\left[ x^{(k)} - x^{(0)}\right] = 0
$$

and as a result

$$
d_k^\top Q\left[x^\star - x^{(0)}\right] = d_k^\top Q\left[x^\star - x^{(k)}\right] = d_k^\top \left[p - Qx^{(k)}\right] = - d_k^\top r_k
$$

which allows us to conclude that $\sigma_k = -\frac{d_k^\top r_k}{d_k^\top Q d_k} = \alpha_k$.
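
The following is a minimal sketch of the conjugate directions method on a randomly generated problem, taking the eigenvectors of $Q$ as the precomputed conjugate set (any other conjugate set would work equally well); as proved above, the exact solution is reached after $n$ steps:

import numpy as np

rng = np.random.default_rng(2)
n = 8
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)           # symmetric positive definite
p = rng.standard_normal(n)
x_star = np.linalg.solve(Q, p)        # reference solution

# Precomputed conjugate directions: eigenvectors of Q (orthogonal, hence Q-conjugate)
_, D = np.linalg.eigh(Q)

x = np.zeros(n)
for k in range(n):
    d = D[:, k]
    r = Q @ x - p                      # residual at the current iterate
    alpha = -(d @ r) / (d @ Q @ d)     # exact line search stepsize
    x = x + alpha * d

print(np.linalg.norm(x - x_star))      # ~1e-15: exact solution after n steps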

In addition, the conjugate directions method enjoys the following properties.

Proofs can be found in Nocedal & Wright (2006, p. 106), if interested.

The linear conjugate gradient (CG) method

The conjugate directions method of Algorithm 1 requires a set of conjugate directions $\lbrace d_k\rbrace$, which can be determined in advance using, e.g., the eigenvectors of $Q$ (which are mutually conjugate) or a Gram-Schmidt-like conjugation procedure.

Main idea and preliminary CG version

Choose the direction $d_k$ as a linear combination of $-r_k$ (negative residual) and $d_{k-1}$ (previous direction)

$$
d_k = -r_k + \beta_k d_{k-1}
$$

where $\beta_k$ is chosen to enforce conjugacy between $d_k$ and $d_{k-1}$:

$$
d_{k-1}^\top Q d_k = 0 \;\Rightarrow\; \beta_k = \frac{d_{k-1}^\top Q r_{k}}{d_{k-1}^\top Q d_{k-1}}
$$

This gives us a preliminary version of CG. Starting from $x^{(0)}\in \mathbb{R}^n$ and $d_0 = -r_0$, iterate until convergence

$$
\begin{align*}
x^{(k+1)} &= x^{(k)}+\alpha_k d_k, \text{ where } \alpha_k = -\frac{d_k^\top r_k}{d_k^\top Q d_k}\\
d_{k+1} &= -r_{k+1}+\beta_{k+1} d_k, \text{ where } \beta_{k+1} = \frac{d_{k}^\top Q r_{k+1}}{d_{k}^\top Q d_{k}}
\end{align*}
$$
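
Below is a direct transcription of these recursions into a minimal sketch (the function name `cg_preliminary`, the stopping tolerance and the random test problem are for illustration only):

import numpy as np

def cg_preliminary(Q, p, x0, tol=1e-10):
    """Preliminary CG: directions built as d_{k+1} = -r_{k+1} + beta_{k+1} d_k."""
    x = x0.copy()
    r = Q @ x - p                            # r_0
    d = -r                                   # d_0 = -r_0
    for _ in range(len(p)):
        if np.linalg.norm(r) <= tol:
            break
        alpha = -(d @ r) / (d @ Q @ d)       # exact line search stepsize
        x = x + alpha * d
        r = Q @ x - p                        # new residual r_{k+1}
        beta = (d @ (Q @ r)) / (d @ Q @ d)   # enforce d_{k+1}^T Q d_k = 0
        d = -r + beta * d
    return x

# Quick check on a random SPD system
rng = np.random.default_rng(3)
n = 10
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)
p = rng.standard_normal(n)
x = cg_preliminary(Q, p, np.zeros(n))
print(np.linalg.norm(Q @ x - p))             # close to machine precision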

This first version is useful for studying the properties of CG. To do so, we need to introduce the notion of Krylov subspace, which is very useful in numerical linear algebra: the Krylov subspace of degree $k$ generated by $Q$ and $r_0$ is $\mathcal{K}_k(Q; r_0) = \operatorname{span}\lbrace r_0, Qr_0, \ldots, Q^{k-1}r_0\rbrace$.

Proof

See Nocedal & Wright (2006, pp. 109-111).

Here are some important takeaways from the theorem: the residuals generated by CG are mutually orthogonal ($r_k^\top r_i = 0$ for $i < k$); each residual is orthogonal to all previous search directions ($r_k^\top d_i = 0$ for $i < k$); and CG reaches $x^\star$ in at most $n$ iterations.

Practical CG algorithm

Using Property 1 and Theorem 2, we can simplify the computation of $\alpha_k$ and $\beta_{k+1}$ in CG:

$$
\alpha_k = -\frac{d_k^\top r_k}{d_k^\top Q d_k} = -\frac{\left[-r_k + \beta_k d_{k-1}\right]^\top r_k}{d_k^\top Q d_k} = \frac{\Vert r_k\Vert_2^2}{d_k^\top Q d_k}
$$

Moreover, observe that $r_{k+1}-r_k = \alpha_k Q d_k$, so that

$$
\beta_{k+1} = \frac{d_{k}^\top Q r_{k+1}}{d_{k}^\top Q d_{k}} = \frac{d_k^\top Q d_k}{\Vert r_k\Vert_2^2}\,\frac{\left[r_{k+1}-r_k\right]^\top r_{k+1}}{d_k^\top Q d_k} = \frac{\Vert r_{k+1}\Vert_2^2}{\Vert r_{k}\Vert_2^2}
$$

where we used the orthogonality of residuals. This leads to the following practical algorithm.
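
A minimal sketch of this practical version uses a single matrix-vector product $Q d_k$ per iteration and the recursive residual update $r_{k+1} = r_k + \alpha_k Q d_k$ (the function name and stopping tolerance below are illustrative):

import numpy as np

def cg_practical(Q, p, x0, tol=1e-10):
    """Practical CG with alpha_k = ||r_k||^2 / (d_k^T Q d_k) and
    beta_{k+1} = ||r_{k+1}||^2 / ||r_k||^2."""
    x = x0.copy()
    r = Q @ x - p                      # r_0
    d = -r                             # d_0 = -r_0
    rs_old = r @ r
    for _ in range(len(p)):
        if np.sqrt(rs_old) <= tol:
            break
        Qd = Q @ d                     # single matrix-vector product per iteration
        alpha = rs_old / (d @ Qd)
        x = x + alpha * d
        r = r + alpha * Qd             # recursive residual update
        rs_new = r @ r
        d = -r + (rs_new / rs_old) * d
        rs_old = rs_new
    return x

# Quick check against a direct solve
rng = np.random.default_rng(4)
n = 50
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)
p = rng.standard_normal(n)
print(np.linalg.norm(cg_practical(Q, p, np.zeros(n)) - np.linalg.solve(Q, p)))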

Numerical illustration

Theorem 2 tells us that convergence of CG to $x^\star$ is guaranteed after at most $n$ iterations. However, it can be much faster!

In fact, the convergence speed of CG depends on the distribution of the eigenvalues of $Q$: roughly speaking, the more clustered the eigenvalues, the faster the convergence.

The following example shows how CG convergence speed can be affected by the distribution of eigenvalues.

import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse.linalg import cg
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

# Run CG on a system with matrix A and record the residual norm at each iteration
def run_cg(A, b):
    x0 = np.zeros_like(b)
    residuals = []

    # Custom callback function to record residuals at each iteration
    def callback(xk):
        residuals.append(np.linalg.norm(b - A @ xk))

    # Run CG algorithm
    x, _ = cg(A, b, x0=x0, rtol=1e-14, callback=callback, maxiter=x0.size)
    return residuals

# Problem setup
n = 100
np.random.seed(10)
b = np.random.randn(n)

# Case 1: Well-clustered eigenvalues
eigenvalues_clustered = 10 + 10 * np.random.rand(n)
A_clustered = np.diag(eigenvalues_clustered)
residuals_clustered = run_cg(A_clustered, b)

# Case 2: Spread-out eigenvalues
eigenvalues_spread = 100 * np.random.rand(n)
A_spread = np.diag(eigenvalues_spread)
residuals_spread = run_cg(A_spread, b)

# Plot convergence rates
fig, ax = plt.subplots(figsize=(10, 6))
ax.semilogy(residuals_clustered, label="Clustered Eigenvalues")
ax.semilogy(residuals_spread, label="Spread Eigenvalues")
ax.set_xlabel("Iteration")
ax.set_ylabel("Residual Norm (log scale)")
ax.set_title("Convergence of Conjugate Gradient for Different Eigenvalue Distributions")
ax.grid(True)
ax.legend()

# Add inset for eigenvalue distributions
axins = inset_axes(ax, width="35%", height="35%", loc='upper right')
axins.hist(eigenvalues_clustered, bins=20, alpha=0.6, label="Clustered")
axins.hist(eigenvalues_spread, bins=20, alpha=0.6, label="Spread")
#axins.set_title("Eigenvalue Distribution", fontsize=10)
axins.set_xlabel("Eigenvalue", fontsize=8)
axins.set_ylabel("Count", fontsize=8)
axins.tick_params(axis='both', which='major', labelsize=8)
axins.legend(fontsize=8)

plt.show()
[Figure: residual norm (log scale) versus iteration for the clustered and spread eigenvalue cases, with an inset histogram of the two eigenvalue distributions.]

Summary

References
  1. Nocedal, J., & Wright, S. J. (2006). Numerical optimization (Second Edition). Springer.