4.6 Algorithms for nonlinear least-squares problems
An important special case of (unconstrained) optimization problems is the nonlinear least-squares problem:
$$
\begin{array}{ll}
\minimize &\quad f_0(x):= \dfrac{1}{2} \sum_{j=1}^m r_j(x)^2\\
\st & \quad x \in \mathbb{R}^n
\end{array}
$$

where $r_j: \mathbb{R}^n\to \mathbb{R}$ are residual functions. Equivalently, define
$$
R(x)= \left[r_1(x),r_2(x), \ldots, r_m(x)\right]^\top,
$$

such that the optimization problem (1) can be written as
$$
\begin{array}{ll}
\minimize &\quad f_0(x):= \dfrac{1}{2}\Vert R(x)\Vert_2^2, \quad R: \mathbb{R}^n \to \mathbb{R}^m\\
\st & \quad x \in \mathbb{R}^n
\end{array}
$$

The term nonlinear least squares comes from the fact that $R$ may be an arbitrary, nonlinear function of $x$; if $R$ is linear (or affine), it can be written as $R(x) = Ax - b$ and one recovers a standard (linear) least-squares problem; see Chapter 3.
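As a concrete illustration (not taken from the text), consider fitting an exponential model $y \approx x_1 e^{x_2 t}$ to data points $(t_j, y_j)$: the residuals $r_j(x) = x_1 e^{x_2 t_j} - y_j$ are nonlinear in $x$. A minimal Python sketch, with made-up data and function names:

```python
import numpy as np

# Made-up data: fit y ≈ x1 * exp(x2 * t) to m = 5 observations.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.5, 7.4])

def residuals(x):
    """R(x) in R^m: one residual per data point, nonlinear in x."""
    return x[0] * np.exp(x[1] * t) - y

x = np.array([1.0, 1.0])
print(residuals(x))                      # R(x), a vector of length m = 5
print(0.5 * np.sum(residuals(x) ** 2))   # objective value f_0(x)
```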
Exploiting the structure of the problem

Problems (1) and (3) have a specific structure that can be exploited to devise efficient algorithms.
The Jacobian matrix of $R$, denoted $J(x) \in \mathbb{R}^{m \times n}$ and whose $j$-th row is $\nabla r_j(x)^\top$, allows for a simple expression of the gradient and Hessian of $f_0(x) = \frac{1}{2}\Vert R(x)\Vert^2_2$.

Let us compute the gradient and Hessian of $f_0$:
$$
\begin{align*}
\nabla f_0(x) & = \sum_{j=1}^m r_j(x)\nabla r_j(x) = J(x)^\top R(x)\\
\nabla^2 f_0(x) & = \sum_{j=1}^m \nabla r_j(x)\nabla r_j(x)^\top + \sum_{j=1}^m r_j(x)\nabla^2 r_j(x) \\
& = J(x)^\top J(x) + \sum_{j=1}^m r_j(x)\nabla^2 r_j(x)
\end{align*}
$$

Why are these expressions interesting? Because, in many applications:
- Computing the Jacobian $J(x)$ is inexpensive (or easy to obtain).
- The gradient is directly obtained from $\nabla f_0(x) = J(x)^\top R(x)$.
- Often, the second term in the Hessian can be neglected, especially near the solution where the residuals $r_j(x)$ are small, so that

$$
\nabla^2 f_0(x) \approx J(x)^\top J(x),
$$

meaning the Hessian can be well approximated without second derivatives.

We exploit these properties to construct dedicated nonlinear least-squares algorithms: the Gauss-Newton method and the Levenberg-Marquardt method.
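As an illustration of these formulas, the sketch below forms the gradient $J(x)^\top R(x)$ and the Gauss-Newton Hessian approximation $J(x)^\top J(x)$ for the hypothetical exponential-fitting residuals used above; the analytic Jacobian is specific to that made-up model:

```python
import numpy as np

# Hypothetical exponential-fitting example: r_j(x) = x1 * exp(x2 * t_j) - y_j.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.5, 7.4])

def residuals(x):
    return x[0] * np.exp(x[1] * t) - y

def jacobian(x):
    """m x n Jacobian of R; row j is the gradient of r_j."""
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

x = np.array([1.0, 1.0])
J, R = jacobian(x), residuals(x)

grad = J.T @ R    # exact gradient of f_0
H_gn = J.T @ J    # Gauss-Newton approximation of the Hessian (no second derivatives)
```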
Algorithms for nonlinear least squares

The Gauss-Newton method can be seen as a quasi-Newton method with matrix $B_k = J(x^{(k)})^\top J(x^{(k)})$ and constant step-size $\alpha_k = 1$.
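Concretely, one Gauss-Newton step solves $J(x^{(k)})^\top J(x^{(k)})\, d = -J(x^{(k)})^\top R(x^{(k)})$ for the direction $d$ and sets $x^{(k+1)} = x^{(k)} + d$. A minimal sketch, assuming `residuals` and `jacobian` callables as in the previous snippet:

```python
import numpy as np

def gauss_newton_step(x, residuals, jacobian):
    """One Gauss-Newton step: x+ = x - (J^T J)^{-1} J^T R."""
    J = jacobian(x)
    R = residuals(x)
    # Solve J d = -R in the least-squares sense; this is equivalent to the
    # normal equations (J^T J) d = -J^T R but numerically better behaved.
    d, *_ = np.linalg.lstsq(J, -R, rcond=None)
    return x + d
```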
The Levenberg-Marquardt method can also be interpreted as a quasi-Newton method with constant step-size $\alpha_k = 1$. The method introduces a regularization parameter $\lambda_k$, which is tuned throughout the iterations.
Given an initial point $x^{(0)}$, the $k$-th iteration reads:

1. set $C_k = J(x^{(k)})^\top J(x^{(k)})$;
2. pick $\lambda_k$ (see the discussion below);
3. compute

$$
x^{(k+1)} = x^{(k)} - \left[C_k + \lambda_k \operatorname{diag}(C_k) \right]^{-1} \nabla f_0(x^{(k)}),
$$

where $\operatorname{diag}(C_k)$ denotes the diagonal matrix formed from the main diagonal of $C_k$.

Repeat steps 1-2-3 until a suitable stopping criterion is met.
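A minimal sketch of one such iteration, again assuming user-supplied `residuals` and `jacobian` callables (illustrative, not a robust implementation):

```python
import numpy as np

def lm_step(x, lam, residuals, jacobian):
    """One Levenberg-Marquardt iteration with diagonal scaling."""
    J = jacobian(x)
    R = residuals(x)
    C = J.T @ J                          # C_k = J^T J
    grad = J.T @ R                       # gradient of f_0
    A = C + lam * np.diag(np.diag(C))    # C_k + lambda_k * diag(C_k)
    return x - np.linalg.solve(A, grad)  # x^{(k+1)}
```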
The Levenberg-Marquardt method is one of the standard algorithms for nonlinear least-squares problems. It is implemented in many libraries, such as `scipy.optimize.least_squares` (use the option `method='lm'`).
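For instance, reusing the made-up exponential-fitting residuals from the sketches above:

```python
import numpy as np
from scipy.optimize import least_squares

# Made-up data for the exponential model y ≈ x1 * exp(x2 * t).
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.7, 4.5, 7.4])

def residuals(x):
    return x[0] * np.exp(x[1] * t) - y

# method='lm' selects the Levenberg-Marquardt solver.
result = least_squares(residuals, x0=np.array([1.0, 1.0]), method='lm')
print(result.x, result.cost)   # estimated parameters and final 0.5 * ||R(x)||^2
```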
Choosing $\lambda_k$ is a critical (and rather broad) question in itself.
Here are some useful guidelines:
- Start safe: initialize with a moderately large $\lambda_0$.
- Adapt dynamically: if the objective decreases, decrease $\lambda_k$ (move toward Gauss-Newton); if the objective increases, increase $\lambda_k$ (move toward gradient descent).
- Balance: a small $\lambda_k$ is faster but less stable; a large $\lambda_k$ is slower but more robust.
- Safeguard bounds: keep $\lambda_{\min} \leq \lambda_k \leq \lambda_{\max}$.

A minimal sketch of such an adaptive scheme is shown below. For more details, one can consult Nocedal & Wright, 2006, Chapters 10-11.
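The sketch illustrates these guidelines with a simple accept/reject rule; the callables, the initial $\lambda_0$, the factor-of-10 updates, and the bounds are illustrative choices, not prescribed values:

```python
import numpy as np

def levenberg_marquardt(x0, residuals, jacobian, lam0=1e-2,
                        lam_min=1e-10, lam_max=1e10, max_iter=100, tol=1e-8):
    """Illustrative LM loop with a simple increase/decrease rule for lambda."""
    x, lam = np.asarray(x0, dtype=float), lam0
    f = 0.5 * np.sum(residuals(x) ** 2)
    for _ in range(max_iter):
        J, R = jacobian(x), residuals(x)
        grad = J.T @ R
        if np.linalg.norm(grad) < tol:            # stopping criterion
            break
        C = J.T @ J
        A = C + lam * np.diag(np.diag(C))
        x_trial = x - np.linalg.solve(A, grad)
        f_trial = 0.5 * np.sum(residuals(x_trial) ** 2)
        if f_trial < f:                           # objective decreased: accept step
            x, f = x_trial, f_trial
            lam = max(lam / 10.0, lam_min)        # relax, move toward Gauss-Newton
        else:                                     # objective increased: reject step
            lam = min(lam * 10.0, lam_max)        # strengthen, move toward gradient descent
    return x
```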
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (Second Edition). Springer.