
We deal with the unconstrained optimization problem

$$
\begin{array}{ll}
\text{minimize} &\quad f_0(x)\\
\text{subject to} & \quad x \in \mathbb{R}^n
\end{array}
$$

We assume that $f_0$ is twice continuously differentiable. Newton’s method (or Newton’s algorithm) attempts to solve (1) using the following strategy.

Newton’s method requires the Hessian to be positive definite at the iterates $x^{(k)}$: indeed, it requires computing $\left[\nabla^2 f_0(x^{(k)})\right]^{-1}$, which is only defined when the Hessian is invertible, and positive definiteness guarantees this. When this is the case, the direction $d_k = -\left[\nabla^2 f_0(x^{(k)})\right]^{-1}\nabla f_0(x^{(k)})$ is indeed a descent direction since

$$
d_k^\top \nabla f_0(x^{(k)}) = - \nabla f_0(x^{(k)})^\top \left[\nabla^2 f_0(x^{(k)})\right]^{-1}\nabla f_0(x^{(k)}) < 0,
$$

invoking positive definiteness of the Hessian at $x^{(k)}$ (and assuming $\nabla f_0(x^{(k)}) \neq 0$).
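
To make this concrete, here is a minimal numerical check of the descent inequality above. The objective used below is a toy strictly convex function of our own choosing (it does not appear elsewhere in this section):

import numpy as np

# Toy objective (illustrative choice): f0(x) = exp(x1 + x2) + x1^2 + 2*x2^2,
# whose Hessian is positive definite everywhere.
def grad_f0(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2 * x[0], e + 4 * x[1]])

def hess_f0(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e + 4.0]])

x_k = np.array([1.0, -0.5])
g, H = grad_f0(x_k), hess_f0(x_k)

# Newton direction: solve H d = -g instead of forming the inverse explicitly.
d_k = np.linalg.solve(H, -g)

# The descent condition d_k^T grad f0(x_k) < 0 holds.
print(d_k @ g)  # negative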

Example for quadratic functions

Consider the quadratic function

$$
f_0(x) = \frac{1}{2} x^\top Q x - p^\top x + c,
$$

where $Q \in \mathbb{R}^{n \times n}$ is symmetric positive definite, $p \in \mathbb{R}^n$, and $c \in \mathbb{R}$. The gradient and Hessian are:

$$
\nabla f_0(x) = Qx - p, \qquad \nabla^2 f_0(x) = Q.
$$

Note that since $Q \succ 0$ by assumption, $f_0$ is strictly convex and has a unique minimizer given by $x^\star = Q^{-1} p$.

Applying Newton’s method with constant step size $\alpha_k = 1$ gives

$$
x^{(k+1)} = x^{(k)} - Q^{-1}(Qx^{(k)} - p) = Q^{-1}p = x^\star.
$$

Newton’s method finds the minimizer in a single iteration, regardless of the starting point $x^{(0)}$! On the other hand, if $\alpha_k = \alpha \in (0, 1)$, one has

$$
x^{(k+1)} = x^{(k)} - \alpha Q^{-1}(Qx^{(k)} - p) = (1-\alpha) x^{(k)} + \alpha Q^{-1}p.
$$

Hence, by induction, we get

$$
x^{(k)} = x^\star + (1-\alpha)^k (x^{(0)} - x^\star),
$$

which converges to $x^\star$ as $k \to \infty$, since $0 < 1-\alpha < 1$.
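
Both behaviours are easy to check numerically. In the short sketch below, $Q$, $p$ and the starting point are arbitrary illustrative choices:

import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
p = np.array([1.0, -1.0])
x_star = np.linalg.solve(Q, p)           # unique minimizer Q^{-1} p

x0 = np.array([10.0, -7.0])              # arbitrary starting point

# Full Newton step (alpha = 1): lands exactly on the minimizer.
x1 = x0 - np.linalg.solve(Q, Q @ x0 - p)
print(np.allclose(x1, x_star))           # True

# Damped step alpha in (0, 1): the error contracts by a factor (1 - alpha).
alpha, x = 0.3, x0
for k in range(5):
    x = x - alpha * np.linalg.solve(Q, Q @ x - p)
    print(np.linalg.norm(x - x_star))    # decreases geometrically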

Newton step as minimizer of the local quadratic approximation

Newton’s method can be motivated by considering a local quadratic approximation of the objective function $f_0$. Around a point $x^{(k)}$, the second-order Taylor expansion of $f_0$ is

$$
f_0(x) \approx f_0(x^{(k)}) + \nabla f_0(x^{(k)})^\top (x - x^{(k)}) + \tfrac{1}{2} (x - x^{(k)})^\top \nabla^2 f_0(x^{(k)}) (x - x^{(k)}),
$$

where the linear term captures first-order behavior and the quadratic term incorporates curvature information through the Hessian $\nabla^2 f_0(x^{(k)})$.

Let $p = x - x^{(k)}$ denote the step. Define the local quadratic model:

$$
m_k(p) = f_0(x^{(k)}) + \nabla f_0(x^{(k)})^\top p + \tfrac{1}{2} p^\top \nabla^2 f_0(x^{(k)}) p,
$$

which is a quadratic function of $p$. Assuming the Hessian is positive definite, this function is strictly convex, and thus the (unique) minimizer of $m_k(p)$ satisfies the first-order optimality condition

$$
\nabla m_k(p) = \nabla f_0(x^{(k)}) + \nabla^2 f_0(x^{(k)}) p = 0.
$$

Solving for $p$ gives

$$
p^{(k)} = - \left[\nabla^2 f_0(x^{(k)})\right]^{-1} \nabla f_0(x^{(k)}).
$$

Thus, the update rule is

$$
x^{(k+1)} = x^{(k)} + p^{(k)} = x^{(k)} - \left[\nabla^2 f_0(x^{(k)})\right]^{-1} \nabla f_0(x^{(k)}),
$$

which is exactly the Newton step of Algorithm 1 above, for $\alpha_k = \alpha = 1$.
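
As a sanity check, minimizing the quadratic model $m_k$ numerically recovers the closed-form Newton step. The objective below is an illustrative choice of ours, and SciPy’s general-purpose minimizer is used only to minimize the model:

import numpy as np
from scipy.optimize import minimize

# Illustrative smooth objective (our own choice): f0(x) = x1^4 + x2^4 + x1*x2,
# considered around x_k = (1, 1), where its Hessian is positive definite.
x_k = np.array([1.0, 1.0])
f_k = x_k[0]**4 + x_k[1]**4 + x_k[0] * x_k[1]
g = np.array([4 * x_k[0]**3 + x_k[1], 4 * x_k[1]**3 + x_k[0]])   # gradient at x_k
H = np.array([[12 * x_k[0]**2, 1.0], [1.0, 12 * x_k[1]**2]])     # Hessian at x_k

# Closed-form Newton step: solve H p = -g.
p_newton = np.linalg.solve(H, -g)

# Numerically minimize the quadratic model m_k(p) and compare.
m_k = lambda p: f_k + g @ p + 0.5 * p @ H @ p
p_model = minimize(m_k, x0=np.zeros(2)).x

print(p_newton)  # the two agree up to solver tolerance
print(p_model)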

Implications and discussion

The fact that the Newton step can be interpreted as the minimization of the local quadratic approximation has several important implications: the step is exact whenever $f_0$ is itself a strictly convex quadratic (as in the example above), and near a minimizer with positive definite Hessian the quadratic model becomes increasingly accurate, which is what drives the fast local convergence of the method.

There are two important shortcomings to keep in mind: each iteration requires forming the Hessian and solving an $n \times n$ linear system, which can be costly in high dimensions, and away from a minimizer the Hessian may fail to be positive definite, in which case the Newton direction need not be a descent direction.

These issues are addressed (in part) by quasi-Newton methods, as discussed in the next section.
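
As a preview, the sketch below calls SciPy’s BFGS implementation, a quasi-Newton method that needs only gradient evaluations; the Rosenbrock helpers rosen and rosen_der ship with SciPy and are used here purely for illustration:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# BFGS builds an approximation of the (inverse) Hessian from successive
# gradient differences, so no second derivatives are ever computed.
result = minimize(rosen, x0=np.array([-0.5, 0.1]), jac=rosen_der, method="BFGS")
print(result.x)    # close to the minimizer [1, 1]
print(result.nit)  # number of iterations performed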

Example

We consider the problem of minimizing the Rosenbrock function in two dimensions:

$$
\begin{array}{ll}
\text{minimize} &\quad (1-x_1)^2 + 100(x_2-x_1^2)^2\\
\text{subject to} & \quad x \in \mathbb{R}^2
\end{array}
$$

which has a unique global minimizer at $x^\star = [1, 1]^\top$ with optimal value $p^\star = 0$. The code below compares gradient descent and Newton’s method (with full steps, $\alpha = 1$) on this problem.

import numpy as np
import matplotlib.pyplot as plt

# Rosenbrock function
def f(v):
    x, y = v
    return (1 - x)**2 + 100*(y - x**2)**2

def grad(v):
    x, y = v
    return np.array([
        -2*(1-x) - 400*x*(y - x**2),
        200*(y - x**2)
    ])

def hess(v):
    x, y = v
    return np.array([
        [2 - 400*y + 1200*x**2, -400*x],
        [-400*x, 200]
    ])

# Gradient Descent
def gradient_descent(x0, alpha=0.002, max_iter=5000):
    x = x0.copy()
    xs = [x.copy()]
    for _ in range(max_iter):
        x -= alpha * grad(x)
        xs.append(x.copy())
        if np.linalg.norm(grad(x)) < 1e-6:
            break
    return np.array(xs)

# Newton's Method with full steps (alpha = 1)
def newton_method(x0, max_iter=50):
    x = x0.copy()
    xs = [x.copy()]
    for _ in range(max_iter):
        g = grad(x)
        H = hess(x)

        # Newton step: solve H p = g, then move by -p (avoids forming H^{-1})
        p = np.linalg.solve(H, g)
        x -= p
        xs.append(x.copy())
    return np.array(xs)

# Run both methods
x0 = np.array([-0.5, 0.1])  # common starting point
gd_path = gradient_descent(x0)
newton_path = newton_method(x0)

# Contour plot
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 400), np.linspace(-1, 2, 400))
zz = (1-xx)**2 + 100*(yy-xx**2)**2

plt.figure(figsize=(8,6))
plt.contourf(xx, yy, zz, levels=40, cmap="viridis")

plt.plot(gd_path[:,0], gd_path[:,1], 'o--', label="Gradient Descent ($\\alpha=0.002$)")
plt.plot(newton_path[:,0], newton_path[:,1], 's--', label="Newton's Method ($\\alpha=1$)")

plt.plot(1, 1, 'r*', markersize=12, label="Minimum")
plt.colorbar()
plt.title("Newton vs Gradient Descent on the Rosenbrock Function")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

# Plot convergence in objective values
plt.figure(figsize=(8, 4))
plt.semilogy([f(x) for x in gd_path], label="Gradient Descent ($\\alpha=0.002$)")
plt.semilogy([f(x) for x in newton_path], label="Newton's Method ($\\alpha=1$)")
plt.xlabel("Iteration")
plt.ylabel("Objective value $f(x)$")
plt.title("Convergence of objective values")
plt.legend()
plt.grid(True)
plt.show()

# Zoom on the first 20 iterations
plt.figure(figsize=(8, 4))
plt.semilogy(range(20), [f(x) for x in gd_path[:20]], 'o-', label="Gradient Descent (first 20)")
plt.semilogy(range(20), [f(x) for x in newton_path[:20]], 's-', label="Newton's Method (first 20)")
plt.xlabel("Iteration")
plt.ylabel("Objective value $f(x)$")
plt.title("Zoom: First 20 Iterations")
plt.legend()
plt.grid(True)
plt.show()
(Figures: contour plot of the Rosenbrock function with the iterates of both methods, convergence of the objective values, and a zoom on the first 20 iterations.)

On this example, the difference between gradient descent and Newton’s method is illuminating. While gradient descent requires many small steps to navigate the curved valley of the Rosenbrock function, Newton’s method quickly adjusts its trajectory and reaches the optimum in far fewer iterations, thanks to its use of second-order information. This illustrates the practical advantage of Newton’s method when the Hessian is available and positive definite along the iterates.
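
A common way to make Newton’s method more robust far from the solution is to damp the step with a backtracking line search. The sketch below is our own illustration built on top of the f, grad and hess functions defined in the example above; the constants beta and c are standard but arbitrary choices:

import numpy as np

def damped_newton_step(f, grad, hess, x, beta=0.5, c=1e-4):
    """One Newton step with Armijo backtracking (illustrative sketch)."""
    g, H = grad(x), hess(x)
    p = np.linalg.solve(H, -g)        # Newton direction (assumes H is invertible)
    if p @ g >= 0:                    # not a descent direction: fall back to -grad
        p = -g
    alpha, fx = 1.0, f(x)
    # Shrink alpha until the Armijo sufficient-decrease condition holds.
    while f(x + alpha * p) > fx + c * alpha * (g @ p):
        alpha *= beta
    return x + alpha * p

# Usage with the Rosenbrock f, grad, hess defined above:
x = np.array([-0.5, 0.1])
for _ in range(30):
    x = damped_newton_step(f, grad, hess, x)
print(x)  # should be close to [1, 1]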