5.1 Linear conjugate gradient method
In this section, we will consider the following square system of $n$ equations:

$$
Qx = p, \quad Q \succ 0, \tag{1}
$$

where $x \in \mathbb{R}^n$, $Q \in \mathbb{R}^{n\times n}$ and $p \in \mathbb{R}^n$.
We already encountered systems of the form (1) in the study of least squares problems.
Consider the least squares problem
$$
\begin{array}{ll}
\text{minimize} & \Vert y - Ax \Vert_2^2 \\
\text{subject to} & x \in \mathbb{R}^n
\end{array}
\qquad A \in \mathbb{R}^{m\times n},\ y \in \mathbb{R}^m,
$$

where we assume that $\operatorname{rank} A = n$. Writing $f(x) = \Vert y - Ax \Vert_2^2$ for the objective, the normal equations for this optimization problem are

$$
\nabla f(x) = 0 \ \Leftrightarrow\ A^\top A x = A^\top y.
$$

Setting $Q = A^\top A$ and $p = A^\top y$, we obtain the linear system (1). (Recall that $Q \succ 0$ since $\operatorname{rank} A = n$.)

Hence the two problems are equivalent!
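As a quick sanity check of this equivalence, here is a minimal NumPy sketch (not from the original notes); the random matrix $A$, the vector $y$ and the use of `np.linalg.lstsq` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5
A = rng.standard_normal((m, n))      # full column rank with probability one
y = rng.standard_normal(m)

# Normal equations: Q x = p with Q = A^T A (positive definite) and p = A^T y
Q = A.T @ A
p = A.T @ y
x_normal = np.linalg.solve(Q, p)

# Direct least-squares solve, for comparison
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_normal, x_lstsq))   # True: both approaches give the same minimizer
```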
Preliminaries ¶

Conjugacy property ¶

Conjugate gradient methods rely on the conjugacy property.
Let $Q \in \mathbb{R}^{n\times n}$ be a positive definite matrix. A set of non-zero vectors $\lbrace d_0, d_1, \ldots, d_\ell \rbrace$ is said to be conjugate with respect to $Q$ if

$$
d_i^\top Q d_j = 0 \quad \text{for all } i \neq j.
$$

Note that if $d_0, d_1, \ldots, d_\ell$ are conjugate with respect to $Q$, then they are linearly independent. (Prove it.)
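For a concrete example, the eigenvectors of a symmetric positive definite matrix form a conjugate set, since $d_i^\top Q d_j = \lambda_j\, d_i^\top d_j = 0$ for $i \neq j$. The short sketch below (not part of the original notes; the random test matrix is an illustrative choice) checks both conjugacy and linear independence numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)          # symmetric positive definite test matrix

# Eigenvectors of a symmetric Q are mutually orthogonal, hence Q-conjugate
_, D = np.linalg.eigh(Q)             # columns of D are the directions d_0, ..., d_{n-1}

G = D.T @ Q @ D                      # matrix of Q-inner products d_i^T Q d_j
print(np.allclose(G - np.diag(np.diag(G)), 0))    # off-diagonal entries vanish: conjugacy
print(np.linalg.matrix_rank(D) == n)              # conjugate directions are linearly independent
```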
This property motivates a first algorithm, known as the conjugate direction method.
Conjugate directions method ¶

Let $x^{(0)} \in \mathbb{R}^n$ be a starting point and $\lbrace d_0, d_1, \ldots, d_\ell \rbrace$ a set of conjugate directions.
At iteration $k$, generate the new iterate as

$$
x^{(k+1)} = x^{(k)} + \alpha_k d_k,
$$

where $\alpha_k$ is the optimal step size for the quadratic function associated with the system of equations $Qx = p$.
Computation of the optimal stepsize $\alpha_k$

Let $\phi(\alpha) = \frac{1}{2} t(\alpha)^\top Q t(\alpha) - p^\top t(\alpha)$, where $t(\alpha) = x^{(k)} + \alpha d_k$. Then $\alpha_k = \arg\min_{\alpha} \phi(\alpha)$. Since $\phi$ is a convex quadratic function of $\alpha$, it suffices to set its derivative to zero to obtain $\alpha_k$:
$$
\begin{align*}
\phi'(\alpha) &= d_k^\top Q\left[x^{(k)} + \alpha d_k\right] - p^\top d_k \\
&= \alpha\, d_k^\top Q d_k + d_k^\top\left[Q x^{(k)} - p\right].
\end{align*}
$$

Define $r_k = Q x^{(k)} - p$, the residual at iteration $k$. Then

$$
\alpha_k = -\frac{d_k^\top r_k}{d_k^\top Q d_k}.
$$

The sequence $\lbrace x^{(k)} \rbrace$ generated by

$$
x^{(k+1)} = x^{(k)} + \alpha_k d_k, \quad \text{where } \alpha_k = -\frac{d_k^\top r_k}{d_k^\top Q d_k},
$$

where the $\lbrace d_k \rbrace$ are conjugate directions, converges to the solution $x^\star$ of the linear system (1) in at most $n$ steps.
Proof
Let us consider $n$ conjugate directions. This implies that the directions are linearly independent, and thus they must span $\mathbb{R}^n$. In particular, there exist $\sigma_0, \ldots, \sigma_{n-1} \in \mathbb{R}$ such that

$$
x^\star - x^{(0)} = \sum_{i=0}^{n-1} \sigma_i d_i.
$$

By using the conjugacy property, we can obtain the value of $\sigma_k$:

$$
d_k^\top Q\left[x^\star - x^{(0)}\right] = d_k^\top Q\left[\sum_{i=0}^{n-1} \sigma_i d_i\right] = \sigma_k\, d_k^\top Q d_k,
$$

hence $\sigma_k = \dfrac{d_k^\top Q\left[x^\star - x^{(0)}\right]}{d_k^\top Q d_k}$. Next we prove that $\alpha_k = \sigma_k$.
By the recursion formula, we have, for $k > 0$, $x^{(k)} = x^{(0)} + \sum_{i=0}^{k-1} \alpha_i d_i$. Then, by conjugacy,

$$
d_k^\top Q x^{(k)} = d_k^\top Q x^{(0)} \ \Leftrightarrow\ d_k^\top Q\left[x^{(k)} - x^{(0)}\right] = 0,
$$

and as a result

$$
d_k^\top Q\left[x^\star - x^{(0)}\right] = d_k^\top Q\left[x^\star - x^{(k)}\right] = d_k^\top\left[p - Q x^{(k)}\right] = -d_k^\top r_k,
$$

which allows us to conclude that $\alpha_k = \sigma_k$.
In addition, the iterates of the conjugate directions method satisfy further properties; in particular, the residual $r_k$ is orthogonal to all previous search directions, i.e., $r_k^\top d_i = 0$ for $i = 0, \ldots, k-1$ (referred to as Property 1 below), and $x^{(k)}$ minimizes the associated quadratic over the affine subspace $x^{(0)} + \operatorname{Span}\lbrace d_0, \ldots, d_{k-1} \rbrace$ (expanding subspace minimization). Proofs can be found in Nocedal & Wright (2006, p. 106), if interested.
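To make the conjugate directions method concrete, here is a minimal sketch (not part of the original notes) that runs the recursion above using the eigenvectors of $Q$ as conjugate directions; the helper name `conjugate_directions` and the random test problem are illustrative choices.

```python
import numpy as np

def conjugate_directions(Q, p, x0, directions):
    """Conjugate directions method: one exact line search along each direction."""
    x = x0.copy()
    for d in directions:
        r = Q @ x - p                       # residual at the current iterate
        alpha = -(d @ r) / (d @ Q @ d)      # optimal step size along d
        x = x + alpha * d
    return x

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)                 # symmetric positive definite
p = rng.standard_normal(n)

# The eigenvectors of Q form one possible Q-conjugate set of directions
_, V = np.linalg.eigh(Q)
x = conjugate_directions(Q, p, np.zeros(n), V.T)

print(np.allclose(x, np.linalg.solve(Q, p)))   # the solution is reached in n steps
```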
The linear conjugate gradient (CG) method ¶

The conjugate directions method of Algorithm 1 requires a set of conjugate directions $\lbrace d_k \rbrace$, which can be determined in advance using, e.g.,
- the eigenvalue decomposition of $Q$;
- a modification of the Gram-Schmidt process to produce a set of conjugate directions.
This is, however, too costly for large-scale applications. In the conjugate gradient (CG) method, by contrast:

- the directions $d_k$ are computed iteratively;
- each new direction $d_k$ uses only $d_{k-1}$, and it is automatically conjugate to all the previous directions $d_0, d_1, \ldots, d_{k-1}$;
- the computational cost and memory storage are low.
Main idea and preliminary CG version ¶

Choose the direction $d_k$ as a linear combination of $-r_k$ (the negative residual) and $d_{k-1}$ (the previous direction),

$$
d_k = -r_k + \beta_k d_{k-1},
$$

where $\beta_k$ is chosen to enforce conjugacy between $d_k$ and $d_{k-1}$:

$$
d_{k-1}^\top Q d_k = 0 \ \Rightarrow\ \beta_k = \frac{d_{k-1}^\top Q r_k}{d_{k-1}^\top Q d_{k-1}}.
$$

This gives us a preliminary version of CG.
Starting from $x^{(0)} \in \mathbb{R}^n$ and $d_0 = -r_0$, iterate until convergence:

$$
\begin{align*}
x^{(k+1)} &= x^{(k)} + \alpha_k d_k, \quad \text{where } \alpha_k = -\frac{d_k^\top r_k}{d_k^\top Q d_k}, \\
d_{k+1} &= -r_{k+1} + \beta_{k+1} d_k, \quad \text{where } \beta_{k+1} = \frac{d_k^\top Q r_{k+1}}{d_k^\top Q d_k}.
\end{align*}
$$

This first version is convenient for studying the properties of CG. To do so, we need to introduce the notion of Krylov subspace, which is very useful in numerical linear algebra.
Given a matrix $Q$ of size $n \times n$ and a vector $r$ of size $n$, the Krylov subspace of degree $k$ for $r$ is defined as

$$
\mathcal{K}(r; k) = \operatorname{Span}\lbrace r, Qr, Q^2 r, \ldots, Q^k r \rbrace.
$$
Theorem 2. Suppose that the $k$-th iterate generated by the preliminary CG method is not yet the solution $x^\star$. Then:

- $r_k^\top r_i = 0$ for $i = 0, 1, \ldots, k-1$;
- $\operatorname{Span}\lbrace r_0, r_1, \ldots, r_k \rbrace = \operatorname{Span}\lbrace d_0, d_1, \ldots, d_k \rbrace = \mathcal{K}(r_0; k)$;
- $d_k^\top Q d_i = 0$ for $i = 0, 1, \ldots, k-1$;

and consequently the sequence $\lbrace x^{(k)} \rbrace$ converges to $x^\star$ in at most $n$ steps.

Proof
See Nocedal & Wright (2006, pp. 109-111).
Here are some important takeaways from the theorem:
- the residuals $\lbrace r_k \rbrace$ are mutually orthogonal;
- the residuals $r_k$ and search directions $d_k$ all belong to the Krylov subspace of degree $k$ associated with $r_0$;
- the search directions $d_0, d_1, \ldots, d_{n-1}$ are conjugate with respect to $Q$ (see the numerical check below); by the conjugate direction theorem, this implies termination in at most $n$ steps;
- these results all depend on the choice of $d_0$! In fact, if one chooses $d_0 \neq -r_0$ (i.e., different from the steepest descent direction at $x^{(0)}$), then the theorem no longer holds.
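The first three takeaways can be checked numerically. The sketch below (not part of the original notes; the name `preliminary_cg` and the random test problem are illustrative) implements the preliminary CG iteration and verifies the orthogonality of the residuals, the conjugacy of the directions, and termination in at most $n$ steps.

```python
import numpy as np

def preliminary_cg(Q, p, x0, n_iter):
    """Preliminary CG: beta computed from d^T Q r; residuals and directions are stored."""
    x = x0.copy()
    r = Q @ x - p
    d = -r
    residuals, directions = [r], [d]
    for _ in range(n_iter):
        alpha = -(d @ r) / (d @ Q @ d)
        x = x + alpha * d
        r_new = Q @ x - p
        beta = (d @ Q @ r_new) / (d @ Q @ d)
        d = -r_new + beta * d
        r = r_new
        residuals.append(r)
        directions.append(d)
    return x, np.array(residuals), np.array(directions)

rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)
p = rng.standard_normal(n)

x, R, D = preliminary_cg(Q, p, np.zeros(n), n)

S = R[:-1] @ R[:-1].T                 # pairwise inner products of the residuals
G = D[:-1] @ Q @ D[:-1].T             # pairwise Q-inner products of the directions
print(np.allclose(S - np.diag(np.diag(S)), 0))   # residuals mutually orthogonal
print(np.allclose(G - np.diag(np.diag(G)), 0))   # directions conjugate w.r.t. Q
print(np.allclose(x, np.linalg.solve(Q, p)))     # solution reached in at most n steps
```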
Practical CG algorithm ¶

Using Property 1 and Theorem 2, we can improve the computation of $\alpha_k$ and $\beta_{k+1}$ in CG:
$$
\alpha_k = -\frac{d_k^\top r_k}{d_k^\top Q d_k} = -\frac{\left[-r_k + \beta_k d_{k-1}\right]^\top r_k}{d_k^\top Q d_k} = \frac{\Vert r_k \Vert_2^2}{d_k^\top Q d_k},
$$

where we used $d_{k-1}^\top r_k = 0$ (Property 1). Moreover, observe that $r_{k+1} - r_k = \alpha_k Q d_k$, so that

$$
\beta_{k+1} = \frac{d_k^\top Q r_{k+1}}{d_k^\top Q d_k} = \frac{d_k^\top Q d_k}{\Vert r_k \Vert_2^2}\,\frac{\left[r_{k+1} - r_k\right]^\top r_{k+1}}{d_k^\top Q d_k} = \frac{\Vert r_{k+1} \Vert_2^2}{\Vert r_k \Vert_2^2},
$$

where we used the orthogonality of the residuals.
This leads to the following practical algorithm.
Input: $x^{(0)}$
Set $r_0 = Q x^{(0)} - p$ and $d_0 = -r_0$
Set $k = 0$
While $\Vert r_k \Vert_2 \neq 0$ do:
  $\alpha_k = \Vert r_k \Vert_2^2 / (d_k^\top Q d_k)$
  $x^{(k+1)} = x^{(k)} + \alpha_k d_k$
  $r_{k+1} = r_k + \alpha_k Q d_k$
  $\beta_{k+1} = \Vert r_{k+1} \Vert_2^2 / \Vert r_k \Vert_2^2$
  $d_{k+1} = -r_{k+1} + \beta_{k+1} d_k$
  $k = k + 1$
Return $x^{(k)}$
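Below is a minimal NumPy sketch of this practical algorithm (not from the original notes); the function name `linear_cg`, the tolerance and the random test problem are illustrative choices. Note that a single matrix-vector product $Q d_k$ is required per iteration.

```python
import numpy as np

def linear_cg(Q, p, x0, tol=1e-10, max_iter=None):
    """Practical linear CG for Qx = p, with Q symmetric positive definite."""
    x = x0.copy()
    r = Q @ x - p                        # r_0
    d = -r                               # d_0
    max_iter = x0.size if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:     # practical stopping criterion instead of r_k = 0
            break
        Qd = Q @ d                       # the only matrix-vector product of the iteration
        alpha = (r @ r) / (d @ Qd)
        x = x + alpha * d
        r_next = r + alpha * Qd
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        r = r_next
    return x

# Illustrative test problem
rng = np.random.default_rng(4)
n = 50
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)              # symmetric positive definite
p = rng.standard_normal(n)

x = linear_cg(Q, p, np.zeros(n))
print(np.linalg.norm(Q @ x - p))         # residual norm close to zero
```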
Numerical illustration ¶

Theorem 2 tells us that convergence of CG to $x^\star$ is guaranteed after at most $n$ iterations. However, it can be much faster!
Consider solving the system $Qx = p$ with CG when $Q = \lambda I_n$, $\lambda > 0$. The first residual is $r_0 = \lambda x^{(0)} - p$ and the first direction is $d_0 = -r_0$. First iteration:

$$
x^{(1)} = x^{(0)} - \frac{\Vert \lambda x^{(0)} - p \Vert^2}{\lambda \Vert \lambda x^{(0)} - p \Vert^2}\left(\lambda x^{(0)} - p\right) = x^{(0)} - x^{(0)} + \lambda^{-1} p = x^\star.
$$

Then $r_1 = 0$ and $d_1 = 0$: CG stops after exactly one iteration, for any $x^{(0)}$!
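A quick numerical confirmation of this one-step behaviour (not part of the original notes; the values of $\lambda$, $n$ and $x^{(0)}$ are arbitrary):

```python
import numpy as np

lam, n = 3.0, 10
Q = lam * np.eye(n)                      # Q = lambda * I_n
p = np.arange(1.0, n + 1)
x0 = np.random.default_rng(5).standard_normal(n)

r0 = Q @ x0 - p
d0 = -r0
alpha0 = (r0 @ r0) / (d0 @ Q @ d0)       # equals 1 / lambda here
x1 = x0 + alpha0 * d0

print(np.allclose(x1, p / lam))          # True: x1 is already the exact solution
```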
In fact:

- the convergence of CG usually depends on the distribution of the eigenvalues of $Q$;
- it can be shown that if $Q$ has $r$ distinct eigenvalues, then CG converges in at most $r$ iterations (a quick check of this claim is sketched below).
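As a quick check of the second claim (not from the original notes), here is a small sketch assuming a diagonal matrix of size $n = 100$ with only three distinct eigenvalues; CG reaches the solution after three iterations.

```python
import numpy as np

# Diagonal Q of size 100 with only r = 3 distinct eigenvalues (1, 5 and 20)
eigs = np.array([1.0] * 30 + [5.0] * 40 + [20.0] * 30)
Q = np.diag(eigs)
p = np.random.default_rng(6).standard_normal(eigs.size)

x = np.zeros_like(p)
r = Q @ x - p
d = -r
k = 0
while np.linalg.norm(r) > 1e-10:
    alpha = (r @ r) / (d @ Q @ d)
    x = x + alpha * d
    r_next = r + alpha * (Q @ d)
    d = -r_next + ((r_next @ r_next) / (r @ r)) * d
    r = r_next
    k += 1

print(k)   # 3 iterations, even though n = 100
```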
The following example shows how the convergence speed of CG can be affected by the distribution of the eigenvalues.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse.linalg import cg
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
# Function to run CG on a system with matrix A and show convergence rates
def run_cg(A, b):
x0 = np.zeros_like(b)
residuals = []
# Custom callback function to record residuals at each iteration
def callback(xk):
residuals.append(np.linalg.norm(b - A @ xk))
# Run CG algorithm
x, _ = cg(A, b, x0=x0, rtol=1e-14, callback=callback, maxiter=x0.size)
return residuals
# Problem setup
n = 100
np.random.seed(10)
b = np.random.randn(n)
# Case 1: Well-clustered eigenvalues
eigenvalues_clustered = 10 + 10 * np.random.rand(n)
A_clustered = np.diag(eigenvalues_clustered)
residuals_clustered = run_cg(A_clustered, b)
# Case 2: Spread-out eigenvalues
eigenvalues_spread = 100 * np.random.rand(n)
A_spread = np.diag(eigenvalues_spread)
residuals_spread = run_cg(A_spread, b)
# Plot convergence rates
fig, ax = plt.subplots(figsize=(10, 6))
ax.semilogy(residuals_clustered, label="Clustered Eigenvalues")
ax.semilogy(residuals_spread, label="Spread Eigenvalues")
ax.set_xlabel("Iteration")
ax.set_ylabel("Residual Norm (log scale)")
ax.set_title("Convergence of Conjugate Gradient for Different Eigenvalue Distributions")
ax.grid(True)
ax.legend()
# Add inset for eigenvalue distributions
axins = inset_axes(ax, width="35%", height="35%", loc='upper right')
axins.hist(eigenvalues_clustered, bins=20, alpha=0.6, label="Clustered")
axins.hist(eigenvalues_spread, bins=20, alpha=0.6, label="Spread")
#axins.set_title("Eigenvalue Distribution", fontsize=10)
axins.set_xlabel("Eigenvalue", fontsize=8)
axins.set_ylabel("Count", fontsize=8)
axins.tick_params(axis='both', which='major', labelsize=8)
axins.legend(fontsize=8)
plt.show()
```

Summary ¶

For which problems?
- solve "square" linear systems of the form $Qx = p$ with $Q \succ 0$;
- solve strictly convex quadratic problems with objective $\frac{1}{2} x^\top Q x - p^\top x$, $Q \succ 0$ (or, equivalently, overdetermined full column-rank least-squares problems).
Key properties
- simple, low-memory algorithm; very efficient in practice;
- convergence in at most $n$ steps;
- the convergence rate depends on the distribution of the eigenvalues of $Q$: the closer the condition number of $Q$ is to 1, the faster the convergence; preconditioners can be used to improve the condition number;
- in Python, implemented in scipy.sparse.linalg.cg.
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (Second Edition). Springer.