4.6 Algorithms for nonlinear least-squares problems
An important special case of (unconstrained) optimization problems is the nonlinear least-squares problem:
$$
\begin{array}{ll}
\minimize &\quad f_0(x):= \dfrac{1}{2} \sum_{j=1}^m r_j(x)^2\\
\st & \quad x \in \mathbb{R}^n
\end{array}
$$

where $r_j: \mathbb{R}^n\to \mathbb{R}$ are residual functions. Equivalently, define
$$
R(x)= \left[r_1(x),r_2(x), \ldots, r_m(x)\right]^\top,
$$

such that the optimization problem (1) can be written as
$$
\begin{array}{ll}
\minimize &\quad f_0(x):= \dfrac{1}{2}\Vert R(x)\Vert_2^2, \quad R: \mathbb{R}^n \to \mathbb{R}^m\\
\st & \quad x \in \mathbb{R}^n
\end{array}
$$

The term nonlinear least squares comes from the fact that $R$ may be an arbitrary, nonlinear function of $x$; if $R$ is linear (or affine), it can be written as $R(x) = Ax - b$ and one recovers a standard (linear) least-squares problem; see Chapter 3.
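As a concrete illustration (not taken from the text), consider fitting an exponential model $y \approx x_1 e^{x_2 t}$ to data points $(t_j, y_j)$: the residuals $r_j(x) = x_1 e^{x_2 t_j} - y_j$ are nonlinear in $x$. A minimal Python sketch, with made-up data and function names:

```python
import numpy as np

# Made-up data: fit y ≈ x1 * exp(x2 * t) to m = 5 observations.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.5, 7.4])

def residuals(x):
    """R(x) in R^m: one residual per data point, nonlinear in x."""
    return x[0] * np.exp(x[1] * t) - y

x = np.array([1.0, 1.0])
print(residuals(x))                      # R(x), a vector of length m = 5
print(0.5 * np.sum(residuals(x) ** 2))   # objective value f_0(x)
```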
Exploiting the structure of the problem

Problems (1) and (3) have a specific structure that can be exploited to devise efficient algorithms.
The Jacobian matrix of $R$, denoted $J(x) \in \mathbb{R}^{m \times n}$ and whose $j$-th row is $\nabla r_j(x)^\top$, allows for a simple expression of the gradient and Hessian of $f_0(x) = \frac{1}{2}\Vert R(x)\Vert^2_2$.

Let us compute the gradient and Hessian of $f_0$:
$$
\begin{align*}
\nabla f_0(x) & = \sum_{j=1}^m r_j(x)\nabla r_j(x) = J(x)^\top R(x)\\
\nabla^2 f_0(x) & = \sum_{j=1}^m \nabla r_j(x)\nabla r_j(x)^\top + \sum_{j=1}^m r_j(x)\nabla^2 r_j(x) \\
& = J(x)^\top J(x) + \sum_{j=1}^m r_j(x)\nabla^2 r_j(x)
\end{align*}
$$

Why are these expressions interesting? Because, in many applications:
- Computing the Jacobian $J(x)$ is inexpensive (or easy to obtain).
- The gradient is directly obtained from $\nabla f_0(x) = J(x)^\top R(x)$.
- Often, the second term in the Hessian can be neglected, especially near the solution where the residuals $r_j(x)$ are small, so that

$$
\nabla^2 f_0(x) \approx J(x)^\top J(x),
$$

meaning the Hessian can be well approximated without second derivatives.

We exploit these properties to construct dedicated nonlinear least-squares algorithms: the Gauss-Newton method and the Levenberg-Marquardt method.
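As an illustration of these formulas, the sketch below forms the gradient $J(x)^\top R(x)$ and the Gauss-Newton Hessian approximation $J(x)^\top J(x)$ for the hypothetical exponential-fitting residuals used above; the analytic Jacobian is specific to that made-up model:

```python
import numpy as np

# Hypothetical exponential-fitting example: r_j(x) = x1 * exp(x2 * t_j) - y_j.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.5, 7.4])

def residuals(x):
    return x[0] * np.exp(x[1] * t) - y

def jacobian(x):
    """m x n Jacobian of R; row j is the gradient of r_j."""
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

x = np.array([1.0, 1.0])
J, R = jacobian(x), residuals(x)

grad = J.T @ R    # exact gradient of f_0
H_gn = J.T @ J    # Gauss-Newton approximation of the Hessian (no second derivatives)
```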
Algorithms for nonlinear least squares

The Gauss-Newton method can be seen as a quasi-Newton method with matrix $B_k = J(x^{(k)})^\top J(x^{(k)})$ and constant step-size $\alpha_k = 1$.
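Concretely, one Gauss-Newton step solves $J(x^{(k)})^\top J(x^{(k)})\, d = -J(x^{(k)})^\top R(x^{(k)})$ for the direction $d$ and sets $x^{(k+1)} = x^{(k)} + d$. A minimal sketch, assuming `residuals` and `jacobian` callables as in the previous snippet:

```python
import numpy as np

def gauss_newton_step(x, residuals, jacobian):
    """One Gauss-Newton step: x+ = x - (J^T J)^{-1} J^T R."""
    J = jacobian(x)
    R = residuals(x)
    # Solve J d = -R in the least-squares sense; this is equivalent to the
    # normal equations (J^T J) d = -J^T R but numerically better behaved.
    d, *_ = np.linalg.lstsq(J, -R, rcond=None)
    return x + d
```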
The Levenberg-Marquardt method can also be interpreted as a quasi-Newton method with constant step-size $\alpha_k = 1$. The method introduces a regularization parameter $\lambda_k$, which is tuned throughout the iterations.
Given an initial point $x^{(0)}$, the $k$-th iteration reads:

1. set $C_k = J(x^{(k)})^\top J(x^{(k)})$;
2. pick $\lambda_k$ (see the discussion below);
3. compute

$$
x^{(k+1)} = x^{(k)} - \left[C_k + \lambda_k \operatorname{diag}(C_k) \right]^{-1} \nabla f_0(x^{(k)}),
$$

where $\operatorname{diag}(C_k)$ denotes the diagonal matrix formed from the main diagonal of $C_k$.

Repeat steps 1-2-3 until a suitable stopping criterion is met.
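A minimal sketch of one such iteration, again assuming user-supplied `residuals` and `jacobian` callables (illustrative, not a robust implementation):

```python
import numpy as np

def lm_step(x, lam, residuals, jacobian):
    """One Levenberg-Marquardt iteration with diagonal scaling."""
    J = jacobian(x)
    R = residuals(x)
    C = J.T @ J                          # C_k = J^T J
    grad = J.T @ R                       # gradient of f_0
    A = C + lam * np.diag(np.diag(C))    # C_k + lambda_k * diag(C_k)
    return x - np.linalg.solve(A, grad)  # x^{(k+1)}
```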
The Levenberg-Marquardt method is one of the standard algorithms for nonlinear least-squares problems. It is implemented in many libraries, such as `scipy.optimize.least_squares` (use the option `method='lm'`).
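For instance, reusing the made-up exponential-fitting residuals from the sketches above:

```python
import numpy as np
from scipy.optimize import least_squares

# Made-up data for the exponential model y ≈ x1 * exp(x2 * t).
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.5, 7.4])

def residuals(x):
    return x[0] * np.exp(x[1] * t) - y

# method='lm' selects the Levenberg-Marquardt solver.
result = least_squares(residuals, x0=np.array([1.0, 1.0]), method='lm')
print(result.x, result.cost)   # estimated parameters and final 0.5 * ||R(x)||^2
```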
Choosing $\lambda_k$ is a critical (and rather broad) question in itself.
Here are some useful guidelines:
- Start safe: initialize with a moderately large $\lambda_0$.
- Adapt dynamically: if the objective decreases, decrease $\lambda_k$ (move toward Gauss-Newton); if the objective increases, increase $\lambda_k$ (move toward gradient descent).
- Balance: a small $\lambda_k$ is faster but less stable; a large $\lambda_k$ is slower but more robust.
- Safeguard bounds: keep $\lambda_{\min} \leq \lambda_k \leq \lambda_{\max}$.

A minimal sketch of such an adaptive scheme is shown below. For more details, one can consult Nocedal & Wright, 2006, Chapters 10-11.
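The sketch illustrates these guidelines with a simple accept/reject rule; the callables, the initial $\lambda_0$, the factor-of-10 updates, and the bounds are illustrative choices, not prescribed values:

```python
import numpy as np

def levenberg_marquardt(x0, residuals, jacobian, lam0=1e-2,
                        lam_min=1e-10, lam_max=1e10, max_iter=100, tol=1e-8):
    """Illustrative LM loop with a simple increase/decrease rule for lambda."""
    x, lam = np.asarray(x0, dtype=float), lam0
    f = 0.5 * np.sum(residuals(x) ** 2)
    for _ in range(max_iter):
        J, R = jacobian(x), residuals(x)
        grad = J.T @ R
        if np.linalg.norm(grad) < tol:            # stopping criterion
            break
        C = J.T @ J
        A = C + lam * np.diag(np.diag(C))
        x_trial = x - np.linalg.solve(A, grad)
        f_trial = 0.5 * np.sum(residuals(x_trial) ** 2)
        if f_trial < f:                           # objective decreased: accept step
            x, f = x_trial, f_trial
            lam = max(lam / 10.0, lam_min)        # relax, move toward Gauss-Newton
        else:                                     # objective increased: reject step
            lam = min(lam * 10.0, lam_max)        # strengthen, move toward gradient descent
    return x
```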
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (Second Edition). Springer.