2.1 The mathematical toolbox
This section reviews essential mathematical concepts used throughout optimization. We focus on differentiability, gradients, Hessians, and related tools that form the foundation for analyzing and solving optimization problems.
Differentiability

Let $f: U \to \mathbb{R}$ be a function defined on an open subset $U$ of $\mathbb{R}^n$.
If $f$ is differentiable at $a \in U$, the differential $df_a$ can be expressed in terms of the partial derivatives of $f$ at $a$:
$$
df_a(h) = \sum_{i=1}^n h_i \frac{\partial f}{\partial x_i}(a)
$$
and as the limit, for any $h \in \mathbb{R}^n$,
$$
df_a(h) = \lim_{\varepsilon \to 0} \frac{f(a+\varepsilon h) - f(a)}{\varepsilon}.
$$
If $f \in C^1$, we say it is continuously differentiable; if $f \in C^2$, we say it is twice continuously differentiable. In particular, every $C^1$ function is differentiable.
Keep in mind that the converse is not true, as shown by the following example.
The function $f(x_1, x_2) = |x_1 x_2|$ is differentiable at $(0, 0)$, but its partial derivatives do not exist everywhere around the origin, so $f$ is not $C^1$ in any neighborhood of $(0,0)$. (Prove it!)
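For a function that is smooth, the limit definition above can also be checked numerically. Below is a minimal sketch, assuming NumPy and a hypothetical test function $f(x) = x_1^2 + 3x_1x_2$ (an illustrative choice, not from these notes), comparing the difference quotient with the partial-derivative formula for $df_a(h)$.

```python
import numpy as np

# Hypothetical smooth test function (an assumption for illustration):
# f(x) = x1^2 + 3*x1*x2, with df/dx1 = 2*x1 + 3*x2 and df/dx2 = 3*x1.
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]

def partials(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

a = np.array([1.0, -2.0])
h = np.array([0.5, 1.0])

# Difference quotient (f(a + eps*h) - f(a)) / eps for a small eps ...
eps = 1e-6
quotient = (f(a + eps * h) - f(a)) / eps

# ... versus the formula df_a(h) = sum_i h_i * df/dx_i(a)
exact = partials(a) @ h

print(quotient, exact)  # the two values agree up to O(eps)
```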
Gradient and Hessian

Probably two of the most important tools for this lecture!
Assume $f: \mathbb{R}^n \to \mathbb{R}$ is at least of class $C^1$.
For $a \in \mathbb{R}^n$, the vector of partial derivatives
$$
\nabla f(a) = \begin{bmatrix}
\dfrac{\partial f}{\partial x_1}(a)\\
\dfrac{\partial f}{\partial x_2}(a)\\
\vdots\\
\dfrac{\partial f}{\partial x_n}(a)
\end{bmatrix}
\in \mathbb{R}^n
$$
is called the gradient of $f$ at point $a$.
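As an illustration of this definition, here is a minimal sketch, assuming NumPy and a hypothetical quadratic $f(x) = \tfrac{1}{2}x^\top A x + b^\top x$ (for which $\nabla f(x) = Ax + b$ when $A$ is symmetric), comparing the analytic gradient with a central finite-difference approximation.

```python
import numpy as np

# Hypothetical quadratic (an assumption for illustration):
# f(x) = 0.5 * x^T A x + b^T x, whose gradient is A x + b for symmetric A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x + b @ x

def grad_f(x):
    return A @ x + b

def numerical_gradient(f, a, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at a."""
    g = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = eps
        g[i] = (f(a + e) - f(a - e)) / (2 * eps)
    return g

a = np.array([0.5, -1.5])
print(grad_f(a))                 # analytic gradient: A a + b
print(numerical_gradient(f, a))  # finite-difference approximation
```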
Let $f: \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$.
For $a \in \mathbb{R}^n$, the matrix
$$
\nabla^2 f(a) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2}(a) & \frac{\partial^2 f}{\partial x_1\partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}(a) \\
\frac{\partial^2 f}{\partial x_2\partial x_1}(a) & \frac{\partial^2 f}{\partial x_2^2}(a) & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n}(a) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n\partial x_1}(a) & \frac{\partial^2 f}{\partial x_n\partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(a)
\end{bmatrix} \in \mathbb{R}^{n\times n}
$$
is called the Hessian matrix of $f$ at point $a$. Sometimes the notation $\mathsf{Hess}\, f(a)$ is also used.
Since $f$ is of class $C^2$ on $\mathbb{R}^n$, Schwarz’s theorem ensures that for every $a \in \mathbb{R}^n$:
$$
\frac{\partial^2 f}{\partial x_i \partial x_j}(a) = \frac{\partial^2 f}{\partial x_j \partial x_i}(a) \quad \text{for all } (i, j).
$$
Hence the Hessian $\nabla^2 f(a)$ is a symmetric matrix.
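A quick numerical sanity check of this symmetry: the sketch below, assuming NumPy and a hypothetical test function (an illustrative choice, not from the notes), approximates the Hessian entry-wise with finite differences and verifies that the result is numerically symmetric.

```python
import numpy as np

# Hypothetical smooth test function (an assumption for illustration):
# f(x) = x1^2 * x2 + sin(x2)
def f(x):
    return x[0] ** 2 * x[1] + np.sin(x[1])

def numerical_hessian(f, a, eps=1e-4):
    """Finite-difference approximation of the Hessian of f at a."""
    n = a.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(a + ei + ej) - f(a + ei) - f(a + ej) + f(a)) / eps**2
    return H

a = np.array([1.0, 2.0])
H = numerical_hessian(f, a)
print(H)                               # approximately [[2*a2, 2*a1], [2*a1, -sin(a2)]]
print(np.allclose(H, H.T, atol=1e-4))  # symmetric, as Schwarz's theorem predicts
```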
Positive definiteness

Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix ($A^\top = A$).
- $A$ is positive semi-definite (PSD) if for every $x \in \mathbb{R}^n$, $x^\top A x \geq 0$.
- $A$ is positive definite (PD) if for every $x \in \mathbb{R}^n \setminus \lbrace 0 \rbrace$, $x^\top A x > 0$.

We write $A \succeq 0$ when it is PSD, and $A \succ 0$ when it is PD.
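For instance (a small worked example), the symmetric matrix $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$ is positive definite, since for every $x \neq 0$,
$$
x^\top A x = 2x_1^2 + 2x_1 x_2 + 2x_2^2 = (x_1 + x_2)^2 + x_1^2 + x_2^2 > 0,
$$
whereas $A = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$ is only positive semi-definite, because $x^\top A x = (x_1 + x_2)^2$ vanishes for $x = (1, -1)$.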
- $A \succeq 0$ if and only if all its eigenvalues are non-negative.
- $A \succ 0$ if and only if all its eigenvalues are strictly positive.

Proof
Let $A \in \mathbb{R}^{n\times n}$ be symmetric. By the spectral theorem, $A$ is diagonalizable by an orthogonal matrix $Q \in \mathbb{R}^{n\times n}$, i.e. $A = Q D Q^\top$, where $D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ is a real diagonal matrix.
Now let $v \in \mathbb{R}^n$ be arbitrary and denote by $w = Q^\top v$ its coordinate vector in the orthonormal basis defined by $Q$. Compute
$$
v^\top A v = v^\top Q D Q^\top v = w^\top D w = \sum_{i=1}^n \lambda_i w_i^2.
$$
Since $Q$ is invertible, $w$ ranges over all of $\mathbb{R}^n$ as $v$ does. Hence this expression is non-negative for all $v$ if and only if $\lambda_i \geq 0$ for every $i = 1, \ldots, n$, and strictly positive for all $v \neq 0$ if and only if $\lambda_i > 0$ for every $i = 1, \ldots, n$.
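In practice, the eigenvalue characterization above gives a convenient numerical test. Below is a minimal sketch assuming NumPy; the matrices are arbitrary illustrative choices, not taken from the notes.

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """PSD test via the eigenvalue characterization (A assumed symmetric)."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

def is_pd(A, tol=1e-10):
    """PD test via the eigenvalue characterization (A assumed symmetric)."""
    return bool(np.all(np.linalg.eigvalsh(A) > tol))

# Arbitrary illustrative matrices (assumptions, not from the notes)
A_pd  = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3  -> PD
A_psd = np.array([[1.0, 1.0], [1.0, 1.0]])   # eigenvalues 0 and 2  -> PSD, not PD
A_ind = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues -1 and 3 -> indefinite

for A in (A_pd, A_psd, A_ind):
    print(is_psd(A), is_pd(A))
```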
As we shall see, the positive (semi-)definiteness of the Hessian matrix is crucial for characterizing solutions of optimization problems.
Taylor’s theorem

Among the different versions of Taylor’s theorem, the following (see Nocedal & Wright, 2006) will be useful for proofs later on.
Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and let $x, p \in \mathbb{R}^n$. Then, for some $t \in (0, 1)$,
$$
f(x + p) = f(x) + \nabla f(x + t p)^\top p.
$$
Moreover, if $f$ is twice continuously differentiable,
$$
\nabla f(x + p) = \nabla f(x) + \int_0^1 \nabla^2 f(x + t p)\, p \, dt
$$
and
$$
f(x + p) = f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(x + t p)\, p
$$
for some $t \in (0, 1)$.
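As a quick numerical illustration (a sketch assuming NumPy; the quadratic and the points are arbitrary choices, not from the notes), note that for a quadratic function the Hessian is constant, so the second-order expansion above holds with equality for any $t \in (0, 1)$:

```python
import numpy as np

# Hypothetical quadratic (an assumption for illustration):
# f(x) = 0.5 * x^T A x + b^T x, with constant Hessian A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ A @ x + b @ x

def grad_f(x):
    return A @ x + b

x = np.array([1.0, 2.0])
p = np.array([-0.5, 1.5])

lhs = f(x + p)
rhs = f(x) + grad_f(x) @ p + 0.5 * p @ A @ p
print(np.isclose(lhs, rhs))  # True: the second-order expansion is exact for quadratics
```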
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (Second Edition). Springer.