Just like in LW1 and LW2, this notebook relies on standard Python libraries:
NumPy: For efficient numerical computations, matrix operations, and solving least-squares problems using built-in linear algebra functions.
Matplotlib: For visualizing the data and displaying / interpreting results.
We will also need scikit-learn and pandas to be installed (for Part 2 only).
Running the Notebook¶
This notebook can be executed in the following environments:
Google Colab: A convenient, cloud-based option that requires no setup. Simply open the notebook in Colab, and you’re ready to run the code.
Locally: You can run the notebook on your local machine using environments like JupyterLab or Jupyter Notebook. You can also run it directly in Visual Studio Code if you have the Jupyter extension. In all cases, ensure you have Python installed along with the required libraries,
NumPy and Matplotlib, among others.
What is non-negative matrix factorization (NMF)?¶
Non-negative matrix factorization (NMF) is a dimensionality reduction technique that decomposes a non-negative matrix $V \in \mathbb{R}^{m \times n}_+$ into two non-negative matrices:
$$V \approx W H,$$
where:
$W \in \mathbb{R}^{m \times r}_+$ is the basis matrix (or dictionary)
$H \in \mathbb{R}^{r \times n}_+$ is the coefficient matrix (or encoding)
$r \leq \min(m, n)$ is the rank of the factorization
The goal is to find $W$ and $H$ that minimize the reconstruction error, typically measured by the Frobenius norm:
$$\min_{W \geq 0,\, H \geq 0} \ \|V - W H\|_F^2.$$
The non-negativity constraints make NMF particularly useful for applications where the data has a natural non-negative interpretation, such as image processing, text mining, and audio signal processing. Unlike PCA, NMF produces additive, parts-based representations that are often more interpretable.
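As a quick illustration, this reconstruction error can be evaluated in a couple of NumPy lines; the sketch below uses arbitrary toy shapes and random non-negative matrices, just to make the objective concrete.
import numpy as np

rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((6, 5)))   # toy non-negative data matrix (m x n)
W = np.abs(rng.standard_normal((6, 2)))   # candidate basis matrix (m x r)
H = np.abs(rng.standard_normal((2, 5)))   # candidate coefficient matrix (r x n)

# Squared Frobenius norm of the residual V - WH
error = np.linalg.norm(V - W @ H, 'fro') ** 2
print(error)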
Key References
N. Gillis, Nonnegative Matrix Factorization, SIAM, 2020. [Comprehensive textbook covering theory and algorithms]
D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization”, Nature, vol. 401, pp. 788-791, 1999. [Seminal paper that popularized NMF] PDF
D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization”, Advances in Neural Information Processing Systems (NIPS), pp. 556-562, 2001. [Introduced the multiplicative update algorithms] PDF
N. Gillis, “The why and how of nonnegative matrix factorization,” arXiv preprint arXiv:1401.5226, 2014. [A very nice introduction to NMF] PDF
Highlights
Compared to the course content so far, the NMF problem incorporates two main changes:
Matrix-valued variables: the decision variables are matrices, $W$ and $H$. Operations involve matrix products and the Frobenius norm, with elementwise non-negativity constraints, rather than the vector variables and vector norms used previously. In particular, we have to adapt our definition of the gradient accordingly.
Multiple decision variables: we optimize over the pair $(W, H)$, where $W$ and $H$ are called the blocks of variables. To tackle such a problem, a common practice is to use block coordinate descent (alternating minimization), i.e., the $k$-th iteration reads:
Fix $H = H^{(k)}$ and update $W^{(k)} \to W^{(k+1)}$,
Fix $W = W^{(k+1)}$ and update $H^{(k)} \to H^{(k+1)}$,
and repeat until convergence.
NMF is a difficult problem to solve, notably because the objective is non-convex in the variable pair $(W, H)$; however, it is convex in each individual variable $W$ and $H$ (taken separately). This is why block coordinate descent methods are very popular for NMF.
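To fix ideas, the overall structure of such a method can be sketched as follows. This is a minimal skeleton only: update_W and update_H are placeholders for whichever block updates you choose (e.g. an exact minimization or a projected gradient step).
import numpy as np

def alternating_minimization(V, W, H, update_W, update_H, n_iter=100):
    # Generic block coordinate descent: update W with H fixed, then H with
    # the new W fixed, and record the objective value after each iteration.
    history = []
    for _ in range(n_iter):
        W = update_W(V, W, H)
        H = update_H(V, W, H)
        history.append(np.linalg.norm(V - W @ H, 'fro') ** 2)
    return W, H, history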
Preliminary questions
Recall that the Frobenius norm of a matrix $A$ is given by $\|A\|_F^2 = \mathrm{trace}(A^\top A)$, where $\mathrm{trace}(\cdot)$ is the trace operator. Deduce the explicit expression of the NMF objective function using trace operations.
Compute the gradient of the objective $f(W, H) = \|V - WH\|_F^2$ with respect to $W$ ($H$ being fixed) and with respect to $H$ ($W$ being fixed). Check out the matrix cookbook if in doubt! We denote $\nabla_W f$ and $\nabla_H f$ the respective gradients. (A small numerical sanity check is sketched right after these questions.)
Let us drop the non-negativity constraints for this question. Give the solutions of the least squares problems in the individual variables:
$\min_{W} \|V - WH\|_F^2$ ($H$ constant),
$\min_{H} \|V - WH\|_F^2$ ($W$ constant).
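A quick way to validate your gradient expressions numerically, before plugging them into an algorithm, is a finite-difference test. The sketch below assumes you have written an objective function f(V, W, H) and a gradient function grad_W(V, W, H); both names are placeholders for your own implementations.
import numpy as np

def finite_diff_check_W(f, grad_W, V, W, H, eps=1e-6):
    # Compare the analytic gradient with a centered finite difference
    # on a single randomly chosen entry of W.
    i = np.random.randint(W.shape[0])
    j = np.random.randint(W.shape[1])
    E = np.zeros_like(W)
    E[i, j] = eps
    fd = (f(V, W + E, H) - f(V, W - E, H)) / (2 * eps)
    return fd, grad_W(V, W, H)[i, j]
The two returned values should agree up to a small numerical error if your expression is correct.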
Part 1: comparison of several algorithms for NMF¶
In this part, you will implement and compare several algorithms for performing NMF on a simulated dataset such that $V = W H + \text{noise}$, with $W \in \mathbb{R}^{200 \times 25}_+$ and $H \in \mathbb{R}^{25 \times 100}_+$ (see the data-generation code below). This part is inspired by experiments conducted in N. Gillis's paper.
Questions
Construct a function objective(V, W, H) that returns the objective value for the data matrix $V$ and arbitrary $W$, $H$ (a possible starting point for this and some of the other helpers is sketched right after this list).
Implement gradient functions grad_W(V, W, H) and grad_H(V, W, H) using your calculations above.
Implement a projection operator project_nonnegative(X) that maps a matrix X to the non-negative orthant (elementwise $\max(X, 0)$).
Generate some initial points $W_0$ and $H_0$ (e.g. using the np.random.rand function). Compute the unconstrained least-squares solutions for $W$ and $H$ and show explicitly that these updates introduce negative entries and may behave unstably.
Implement a projected ALS scheme: solve the least squares problem for $W$ (with $H$ fixed) and project onto the non-negative orthant; then solve for $H$ (with $W$ fixed) and project onto the non-negative orthant. Iterate and monitor the objective. Display your results. What do you observe?
Implement a projected block coordinate descent (BCD) method: take gradient descent steps on $W$ and $H$ alternately, each followed by a projection. Explore the choice of step sizes (one for each block) and give typical values for which convergence holds.
Implement multiplicative updates (MU) for $W$ and $H$:
$$W \leftarrow W \odot \frac{V H^\top}{W H H^\top}, \qquad H \leftarrow H \odot \frac{W^\top V}{W^\top W H},$$
where $\odot$ is the entry-wise multiplication and the fractions denote entry-wise division. Confirm that the updates preserve non-negativity without explicit projection.
Compare projected ALS, projected BCD, and MU in terms of convergence trajectory, reconstruction error, and sensitivity to initialization.
Analyze and summarize the main differences in behavior among the three methods when applied to the synthetic dataset, including:
the impact of noise (change the value of noise_level in generate_dataset_NMF);
the impact of a misspecification of the rank: the data has been generated with rank 25. What happens if one ‘fits’ the data with a lower or higher rank?
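For reference, here is one possible sketch of a few of these building blocks. The MU step follows the classical Lee and Seung updates; the small constant eps added to the denominators, to avoid divisions by zero, is an implementation choice, not part of the original formulation.
import numpy as np

def objective(V, W, H):
    # Squared Frobenius norm of the residual V - WH
    return np.linalg.norm(V - W @ H, 'fro') ** 2

def project_nonnegative(X):
    # Elementwise projection onto the non-negative orthant
    return np.maximum(X, 0)

def mu_step(V, W, H, eps=1e-10):
    # One multiplicative-update (MU) iteration; eps guards against
    # division by zero and keeps the iterates non-negative.
    W = W * (V @ H.T) / (W @ H @ H.T + eps)
    H = H * (W.T @ V) / (W.T @ W @ H + eps)
    return W, H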
import numpy as np
import matplotlib.pyplot as plt
def generate_dataset_NMF(noise_level=0.1):
    # Parameters
    n_samples = 200
    n_features = 100
    n_components = 25
    random_state = 2026
    # Generate random non-negative data matrix V
    np.random.seed(random_state)
    W = np.abs(np.random.randn(n_samples, n_components))
    H = np.abs(np.random.randn(n_components, n_features))
    V = W @ H + noise_level * np.random.randn(n_samples, n_features)
    V = np.clip(V, a_min=0, a_max=None)  # Ensure non-negativity
    return V, W, H
V, W0, H0 = generate_dataset_NMF(noise_level=0)
# display the original components
plt.figure(figsize=(10, 4))
plt.imshow(V, aspect='auto', cmap='viridis')
plt.title('Data Matrix V')
plt.colorbar()
plt.xlabel('Features')
plt.ylabel('Samples')
# display the original W0 components
plt.figure(figsize=(10, 4))
plt.imshow(W0, aspect='auto', cmap='viridis')
plt.title('Original W0 Components')
plt.colorbar()
plt.xlabel('Components')
plt.ylabel('Samples')
# display the original H0 components
plt.figure(figsize=(10, 4))
plt.imshow(H0, aspect='auto', cmap='viridis')
plt.title('Original H0 Components')
plt.colorbar()
plt.xlabel('Features')
plt.ylabel('Components')



Part 2: using NMF on the MNIST dataset¶
The MNIST dataset is a collection of 70,000 handwritten digit images (0-9), each represented as a 28×28 pixel grayscale image. It is widely used as a benchmark for testing machine learning and image processing algorithms.
Questions:
Load the MNIST data using the function below. What are the dimensions of the data matrix V_mnist? What does each row of the data matrix represent?
Plot, for three rows, the corresponding digit image by reshaping the vector as a 28x28 array. Compare your observations with the corresponding labels, stored in y_mnist (a display sketch is given right after this list).
Apply your best NMF algorithm from Part 1 to the MNIST dataset (V_mnist). Choose an appropriate rank $r$. Display the convergence curve of the algorithm.
Visualize the learned basis matrix $H$ as a collection of images of size 28×28 (one image per row of $H$). What patterns do you observe?
For a given test image, examine its coefficient vector (the corresponding row of $W$). Which components are most active? Can you relate these components to visual features of the digit? Repeat the analysis for a few randomly selected images.
Investigate the effect of the rank $r$ on reconstruction quality and interpretability. What is the trade-off?
Compare the results obtained by NMF with principal component analysis (PCA). To compute PCA, you can adapt the scikit-learn example to the present situation.
What qualitative differences do you see between PCA and NMF components?
Which representation seems more human-interpretable and why?
Does PCA’s lower reconstruction error necessarily imply better interpretability?
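As a starting point, here is a minimal sketch of how one might display a digit and a few learned components. It assumes V_mnist and y_mnist have been loaded with the cell below, and that your NMF run produced matrices W of shape (n_samples, r) and H of shape (r, 784); these last two names refer to your own results, not to predefined variables.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Display one digit (row i of V_mnist) as a 28x28 image
i = 0
plt.figure()
plt.imshow(np.asarray(V_mnist)[i].reshape(28, 28), cmap='gray')
plt.title(f"Digit with label {np.asarray(y_mnist)[i]}")

# Display a few NMF basis images (rows of H reshaped to 28x28)
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for k, ax in enumerate(axes):
    ax.imshow(H[k].reshape(28, 28), cmap='viridis')
    ax.set_title(f"component {k}")
    ax.axis('off')

# PCA baseline for comparison (25 components here, as an example);
# pca.components_ plays a role analogous to H
pca = PCA(n_components=25)
pca.fit(np.asarray(V_mnist))
plt.figure()
plt.imshow(pca.components_[0].reshape(28, 28), cmap='viridis')
plt.title("First PCA component")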
# load MNIST dataset
from sklearn.datasets import fetch_openml
import pandas as pd
def load_mnist_data(truncate=1000):
    mnist = fetch_openml('mnist_784', version=1)
    X = mnist.data / 255.0            # Normalize to [0, 1]; contains images as flattened arrays
    y = mnist.target.astype(int)      # contains labels as integers
    return X[:truncate], y[:truncate]

V_mnist, y_mnist = load_mnist_data(truncate=1000)

Part 3: Hierarchical Alternating Least Squares (HALS) (Bonus)¶
HALS is a powerful and (often) very efficient algorithm for solving NMF. It relies on column-wise updates of the matrices $W$ and $H$. The goal of this section is to implement the HALS algorithm from its description in N. Gillis's paper; see Section 3.1.5.
Implement the algorithm and compare its performance on the synthetic and MNIST datasets against the projected ALS, BCD, and MU algorithms. Comment on your results.
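As a hint, here is one possible sketch of the column-wise update of $W$, which is one interpretation of the description referenced above; the analogous row-wise update of $H$, the sweep ordering, and safeguards for nearly zero columns are left to you.
import numpy as np

def hals_update_W(V, W, H, eps=1e-10):
    # Column-wise HALS update of W, with H and the other columns fixed.
    # Each column is the projection onto the non-negative orthant of the
    # unconstrained least-squares minimizer for that column.
    VHt = V @ H.T        # (m x r), computed once per sweep
    HHt = H @ H.T        # (r x r)
    for l in range(W.shape[1]):
        numer = VHt[:, l] - W @ HHt[:, l] + W[:, l] * HHt[l, l]
        W[:, l] = np.maximum(numer / (HHt[l, l] + eps), 0)
    return W
The update of $H$ is symmetric: work row by row, with $W^\top W$ and $W^\top V$ precomputed once per sweep.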