Dimensionality Reduction#
In this tutorial, we will explore dimensionality reduction using a very common approach, Principal Component Analysis (PCA), and show how it can be applied to spatio-temporal data to discover patterns in climate data.
Principal Component Analysis#
PCA is a linear dimensionality reduction technique. It is used to linearly transform data to a new coordinate system based on the principal components of the data. The principal components capture the directions of greatest variance in the data, in decreasing order. That is, the first principal component explains the greatest variance in the data, and the second principal component explains the next greatest variance after the first principal component has been removed. To illustrate this, we will generate some synthetic data; this follows the implementation in the Dimensionality Reduction notebook from Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
This code creates a 3D data set consisting of points on an oval, distributed unevenly and with a lot of noise:
import numpy as np
from scipy.spatial.transform import Rotation
m = 60
X = np.zeros((m, 3)) # initialize 3D dataset
np.random.seed(42)
angles = (np.random.rand(m) ** 3 + 0.5) * 2 * np.pi # uneven distribution
X[:, 0], X[:, 1] = np.cos(angles), np.sin(angles) * 0.5 # oval
X += 0.28 * np.random.randn(m, 3) # add more noise
X = Rotation.from_rotvec([np.pi / 29, -np.pi / 20, np.pi / 4]).apply(X)
X += [0.2, 0, 0.2] # shift a bit
X.shape
(60, 3)
If we take the first 2 principal components of this data set, we are effectively learning a projection from 3D down to 2D. This figure shows a plot of the original 3D dataset, projected onto a 2D plane using PCA.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X2D = pca.fit_transform(X) # dataset reduced to 2D
X3D_inv = pca.inverse_transform(X2D) # 3D position of the projected samples
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
axes = [-1.4, 1.4, -1.4, 1.4, -1.1, 1.1]
x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 10),
                     np.linspace(axes[2], axes[3], 10))
w1, w2 = np.linalg.solve(Vt[:2, :2], Vt[:2, 2]) # projection plane coefs
z = w1 * (x1 - pca.mean_[0]) + w2 * (x2 - pca.mean_[1]) - pca.mean_[2] # plane
X3D_above = X[X[:, 2] >= X3D_inv[:, 2]] # samples above plane
X3D_below = X[X[:, 2] < X3D_inv[:, 2]] # samples below plane
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection="3d")
# plot samples and projection lines below plane first
ax.plot(X3D_below[:, 0], X3D_below[:, 1], X3D_below[:, 2], "ro", alpha=0.3)
for i in range(m):
    if X[i, 2] < X3D_inv[i, 2]:
        ax.plot([X[i][0], X3D_inv[i][0]],
                [X[i][1], X3D_inv[i][1]],
                [X[i][2], X3D_inv[i][2]], ":", color="#F88")
ax.plot_surface(x1, x2, z, alpha=0.1, color="b")  # projection plane
ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], "b+", label="2D projections")  # projected samples
ax.plot(X3D_inv[:, 0], X3D_inv[:, 1], X3D_inv[:, 2], "b.")
# now plot projection lines and samples above plane
for i in range(m):
    if X[i, 2] >= X3D_inv[i, 2]:
        ax.plot([X[i][0], X3D_inv[i][0]],
                [X[i][1], X3D_inv[i][1]],
                [X[i][2], X3D_inv[i][2]], "r--")
ax.plot(X3D_above[:, 0], X3D_above[:, 1], X3D_above[:, 2], "ro", label="Original 3D points")
def set_xyz_axes(ax, axes):
    ax.xaxis.set_rotate_label(False)
    ax.yaxis.set_rotate_label(False)
    ax.zaxis.set_rotate_label(False)
    ax.set_xlabel("$x_1$", labelpad=8, rotation=0)
    ax.set_ylabel("$x_2$", labelpad=8, rotation=0)
    ax.set_zlabel("$x_3$", labelpad=8, rotation=0)
    ax.set_xlim(axes[0:2])
    ax.set_ylim(axes[2:4])
    ax.set_zlim(axes[4:6])
set_xyz_axes(ax, axes)
ax.set_zticks([-1, -0.5, 0, 0.5, 1])
plt.legend(bbox_to_anchor=(1.1, 0.8), loc='upper left')
plt.show()

Our data is in a data matrix \(\mathbf{X}\), which has the dimensions \(n_{samples}\) by \(n_{features}\).
To perform PCA analysis, you would follow these steps:
Mean-center the data: First, subtract the mean of each feature from the data set. This centers the data around the origin. $$ \mathbf{X}_{c} = \mathbf{X} - \mathbf{X}_{mean} $$
Next, apply Singular Value Decomposition (SVD) to the centered data matrix, which is a matrix factorization approach: $$ \mathbf{X}_{c} = \mathbf{U}\Sigma\mathbf{V^{T}} $$
The columns of \(\mathbf{V}\) (the rows of \(\mathbf{V^{T}}\)) are the principal directions (eigenvectors of the covariance matrix). The singular values in \(\Sigma\) relate to the explained variance of each principal component.
We can implement PCA directly by using the SVD factorization algorithm implemented in the numpy library.
import numpy as np
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt[0]
c2 = Vt[1]
The principal directions for our data are then:
print(c1)
print(c2)
[0.67857588 0.70073508 0.22023881]
[-0.72817329 0.6811147 0.07646185]
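To make the link between the singular values and the explained variance concrete, we can compute the variance along each principal direction directly from s (a quick check; the first two ratios should match the scikit-learn output further below).
# The variance along the i-th principal direction is s[i]**2 / (n_samples - 1),
# so the explained variance ratios follow directly from the singular values.
explained_variance = s ** 2 / (len(X_centered) - 1)
print(explained_variance / explained_variance.sum())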
We can also use scikit-learn to do this instead, which makes PCA very easy to apply. In this case, it is not necessary to subtract the mean, since this is done automatically.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
The principal directions in the data can be found by printing the pca.components_ attribute:
pca.components_
array([[ 0.67857588, 0.70073508, 0.22023881],
[ 0.72817329, -0.6811147 , -0.07646185]])
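Note that the sign of each principal direction is arbitrary, so scikit-learn may return a component flipped relative to the SVD result above (compare the second rows). A quick sanity check that the two sets of directions agree up to sign:
# Principal directions are only defined up to a sign flip,
# so compare the absolute values of the scikit-learn components and the SVD rows.
print(np.allclose(np.abs(pca.components_), np.abs(Vt[:2])))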
The explained variance ratio tells us how much variance in our original 3D data each component explains.
pca.explained_variance_ratio_
array([0.7578477 , 0.15186921])
This means that the first principal component explains about 76% of the variance, while the second explains about 15%.
1 - pca.explained_variance_ratio_.sum()
0.09028309326742046
By projecting from 3D to 2D we have lost ~9% of the variance in our original data set.
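Equivalently, the lost variance can be computed from the reconstruction error, i.e. the squared distances between the original points and their projections onto the plane, relative to the total squared deviation from the mean (a quick check using the arrays computed above):
# The discarded variance equals the squared reconstruction error
# divided by the total squared deviation from the mean.
reconstruction_error = np.sum((X - X3D_inv) ** 2)
total_variation = np.sum((X - X.mean(axis=0)) ** 2)
print(reconstruction_error / total_variation)  # should match 1 - explained_variance_ratio_.sum()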
This next example uses the MNIST dataset to explore how to decide how many dimensions to keep in a data set. MNIST is a data set of small images of labeled handwritten digits (0-9), and is used to benchmark many machine learning models.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', as_frame=False, parser="auto")
X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]
X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]
X_train.shape
(60000, 784)
The data set contains 28x28 pixel images, but here each image is flattened into a single vector of 28*28 = 784 values. To visualize an example, we can reshape the data back to 28x28 pixels and plot the image.
imgexample = X_train[20,:].reshape(28,28)
plt.imshow(imgexample)
<matplotlib.image.AxesImage at 0x134dd8460>

For the sake of this example, we are using MNIST as a high dimensional data set that we might want to apply dimensionality reduction to. We’ll start by applying PCA to the training data set.
pca = PCA()
pca.fit(X_train)
PCA()
If we don’t specify how many components to keep, PCA will keep as many components as our original number of features:
pca.n_components_
784
We can determine how many principal components we need in order to retain 95% of the original variance in the MNIST dataset:
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
This indicates that we need to keep 154 principal components.
d
154
If we use the following lines of code, we can tell PCA to automatically choose the number of components such that at least 95% of the variance in the original data set is retained. We can then transform the data to this new, lower dimensional space.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
pca.n_components_
154
pca.explained_variance_ratio_.sum()  # fraction of variance retained
0.950196019261303
We can look at the cumulative sum of the explained variance as we increase the number of components that we keep:
plt.figure(figsize=(6, 4))
plt.plot(cumsum, linewidth=3)
plt.axis([0, 400, 0, 1])
plt.xlabel("Dimensions")
plt.ylabel("Explained Variance")
plt.plot([d, d], [0, 0.95], "k:")
plt.plot([0, d], [0.95, 0.95], "k:")
plt.plot(d, 0.95, "ko")
plt.annotate("Elbow", xy=(65, 0.85), xytext=(70, 0.7),
arrowprops=dict(arrowstyle="->"))
plt.grid(True)
plt.show()

PCA can be used to compress data, because we can keep most of the information in our original data set with many fewer variables. If we look at X_reduced, it contains 154 features, while our original \(\mathbf{X}\) contains 784 features.
X_reduced.shape
(60000, 154)
We can use the inverse transform to reconstruct the original data set from our reduced feature matrix:
X_recovered = pca.inverse_transform(X_reduced)
X_recovered.shape
(60000, 784)
The lines below show examples of the original images, and the ones that have been reconstructed from the 154 features.
plt.figure(figsize=(7, 4))
for idx, X in enumerate((X_train[::2100], X_recovered[::2100])):
    plt.subplot(1, 2, idx + 1)
    plt.title(["Original", "Compressed"][idx])
    for row in range(5):
        for col in range(5):
            plt.imshow(X[row * 5 + col].reshape(28, 28), cmap="binary",
                       vmin=0, vmax=255, extent=(row, row + 1, col, col + 1))
    plt.axis([0, 5, 0, 5])
    plt.axis("off")

This type of compression is one way that we can do feature engineering for machine learning: if we have an original high dimensional data set, we might first apply PCA (or some other dimensionality reduction method) to our data, and then use the first few principal components as input to our machine learning algorithm. This approach is useful because it focuses on the most significant sources of variance in the high dimensional data set while also reducing the total number of features.
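As a minimal sketch of this idea (the choice of logistic regression here is purely illustrative and not part of the original tutorial), we could chain PCA with a classifier in a scikit-learn pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# PCA reduces the 784 pixel features before the (illustrative) classifier sees them.
# Training on the full MNIST training set may take a few minutes.
clf = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out MNIST test set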
Using PCA to discover the El Niño/Southern Oscillation (ENSO) pattern#
As an example of how PCA can be applied to climate data, we will look at an analysis of the tropical Pacific sea surface temperature (SST) to reveal the signature of the El Niño/Southern Oscillation (ENSO) phenomenon. This follows the analysis described here. ENSO is an important pattern of climate variability on the seasonal to multi-year timescale. It is related to warmer/colder sea surface temperatures in the central and eastern Pacific Ocean, which also leads to changes in the trade winds.
When PCA is applied to spatio-temporal data (as we will do here), it is often referred to as Empirical Orthogonal Function (EOF) analysis. The two methods are effectively equivalent (PCA is the terminology used in statistics and machine learning, while EOF analysis is the conventional terminology used in climate science and meteorology). The spatial patterns discovered by the PCA analysis are referred to as EOFs (Empirical Orthogonal Functions), and these represent the dominant modes of spatial variability found in the data set. The corresponding time series that describe how the amplitude of these spatial patterns changes over time are called principal components (PCs) or scores.
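In terms of the scikit-learn API used above, the mapping between the two vocabularies is straightforward; here is a small schematic example on random stand-in data (the array shapes and names are placeholders, not the SST data analyzed below):
import numpy as np
from sklearn.decomposition import PCA

# Schematic: for a data matrix of shape (n_times, n_grid_points),
# the rows of pca.components_ are the EOFs (spatial patterns) and the
# columns of pca.transform(X) are the PCs (time series of pattern amplitudes).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 6))    # 100 "months" x 6 "grid points" (placeholder data)
pca_demo = PCA(n_components=2).fit(X_demo)
print(pca_demo.components_.shape)         # (2, 6): two EOF maps
print(pca_demo.transform(X_demo).shape)   # (100, 2): two PC time series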
import xarray as xr
import matplotlib.pyplot as plt
import numpy as np
We will use a data set from NOAA of monthly SSTs from 1854 until 2025, which can be accessed by running this code:
sst = xr.open_dataset("http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/noaa.ersst.v5/sst.mnmean.nc")
sst
<xarray.Dataset>
Dimensions:    (lat: 89, lon: 180, time: 2059, nbnds: 2)
Coordinates:
  * lat        (lat) float32 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon        (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time       (time) datetime64[ns] 1854-01-01 1854-02-01 ... 2025-07-01
Dimensions without coordinates: nbnds
Data variables:
    time_bnds  (time, nbnds) float64 ...
    sst        (time, lat, lon) float32 ...
Attributes: (12/39)
    climatology:                     Climatology is based on 1971-2000 SST, X...
    description:                     In situ data: ICOADS2.5 before 2007 and ...
    keywords_vocabulary:             NASA Global Change Master Directory (GCM...
    keywords:                        Earth Science > Oceans > Ocean Temperatu...
    instrument:                      Conventional thermometers
    source_comment:                  SSTs were observed by conventional therm...
    ...                              ...
    comment:                         SSTs were observed by conventional therm...
    summary:                         ERSST.v5 is developed based on v4 after ...
    dataset_title:                   NOAA Extended Reconstructed SST V5
    _NCProperties:                   version=2,netcdf=4.6.3,hdf5=1.10.5
    data_modified:                   2025-08-03
    DODS_EXTRA.Unlimited_Dimension:  time
We’ll plot an example of the SST data for one of the months:
sst['sst'][0].plot()
<matplotlib.collections.QuadMesh at 0x137583df0>

We’ll first calculate the Niño3.4 index, which is a commonly used metric of ENSO variability. To do this, we will:
Compute the area-averaged total SST from (5N-5S, 190E-240E).
Compute the monthly climatology from 1950-1979 for the area-averaged total SST for this region, and subtract the climatology from the area-averaged total SST time series to obtain anomalies.
Smooth the anomalies with a 5-month running mean.
Standardize the smoothed Niño3.4 index by its standard deviation over the climatological period from 1950-1979.
This is the typical metric used for expressing the phase of ENSO. After doing this analysis, we will compare the Niño3.4 index against the time series of principal components that we will learn through PCA analysis.
Compute the area-averaged total SST from (5N - 5S, 190E - 240E)
sst_n34 = sst['sst'].sel(lat=slice(5,-5),lon=slice(190,240)).mean(dim=["lat", "lon"])
fig = plt.figure(figsize=(12,5))
sst_n34.plot()
plt.ylabel("SST (Degrees C)")
Text(0, 0.5, 'SST (Degrees C)')

Compute the monthly climatology from 1950-1979 for the area-averaged total SST for this region, and subtract the climatology from the area-averaged total SST time series to obtain anomalies.
sst_n34_1950_1979 = sst_n34.isel(time = sst_n34.time.dt.year.isin(range(1950, 1980)))
fig = plt.figure(figsize=(12,5))
sst_n34_1950_1979.plot()
plt.ylabel("SST (Degrees C)")
Text(0, 0.5, 'SST (Degrees C)')

We’ll find the mean temperature each month using the groupby function:
sst_n34_clim = sst_n34_1950_1979.groupby('time.month').mean()
print(sst_n34_clim.values)
[26.272535 26.49191 26.975252 27.355328 27.46776 27.296291 26.872158
26.441786 26.319078 26.302315 26.313202 26.282764]
We can then subtract the monthly climatology from the time series of SSTs:
sst_n34_anom = sst_n34.groupby('time.month')-sst_n34_clim
fig = plt.figure(figsize=(12,5))
plt.plot(sst_n34_anom)
plt.ylabel("SST Anom. (Degrees C)")
Text(0, 0.5, 'SST Anom. (Degrees C)')

Smooth the anomalies with a 5-month running mean. We'll do this by applying a centered 5-month rolling mean to the time series of anomalies.
sst_n34_smoothed = sst_n34_anom.rolling(time = 5,center=True).mean()
Standardize the smoothed Niño3.4 index by its standard deviation over the climatological period from 1950-1979.
n34_index = sst_n34_smoothed.groupby('time.month')/sst_n34_1950_1979.groupby('time.month').std()
fig = plt.figure(figsize=(12,5))
n34_index.isel(time=n34_index.time.dt.year.isin(range(1950, 1980))).plot()
plt.ylabel("Niño3.4 index")
Text(0, 0.5, 'Niño3.4 index')

fig = plt.figure(figsize=(12,5))
n34_index.plot()
plt.ylabel("Niño3.4 index")
Text(0, 0.5, 'Niño3.4 index')

Now we will perform a PCA analysis of the tropical Pacific SST, and see if we can recreate the Niño3.4 index time series using the time series of PCs.
We’ll start by selecting a slightly larger region of the original data set (30S-30N, 120E-300E) and restricting the time range to 1950-2013. This is so that we can better understand the spatial modes of variability that are learned by the PCA analysis.
sst_pca = sst['sst'].sel(lat=slice(30,-30),lon=slice(120,300),time=slice('1950','2013'))
sst_pca
<xarray.DataArray 'sst' (time: 768, lat: 31, lon: 91)>
[2166528 values with dtype=float32]
Coordinates:
  * lat      (lat) float32 30.0 28.0 26.0 24.0 22.0 ... -24.0 -26.0 -28.0 -30.0
  * lon      (lon) float32 120.0 122.0 124.0 126.0 ... 294.0 296.0 298.0 300.0
  * time     (time) datetime64[ns] 1950-01-01 1950-02-01 ... 2013-12-01
Attributes:
    long_name:     Monthly Means of Sea Surface Temperature
    units:         degC
    var_desc:      Sea Surface Temperature
    level_desc:    Surface
    statistic:     Mean
    dataset:       NOAA Extended Reconstructed SST V5
    parent_stat:   Individual Values
    actual_range:  [-1.8      42.32636]
    valid_range:   [-1.8 45. ]
    _ChunkSizes:   [  1  89 180]
plt.figure(figsize=(10,5))
sst_pca[0].plot(cmap='RdBu_r')
<matplotlib.collections.QuadMesh at 0x1406e7580>

To apply PCA to the SST data set, we will first fill in missing values, since PCA will not work if we have missing values. Grid points over the continents are NaN in this data set, so we will fill these values with 0.0:
ssta_pca_masked = sst_pca.fillna(0)
Now we will use the scikit-learn PCA function to calculate the principal components of the SST in this region.
First, we will create a numpy array called X that contains the SST in the selected region, where each sample is one time step (1 month) and the features are the SST values at each spatial grid point.
X = ssta_pca_masked.values
X.shape
(768, 31, 91)
In order to use the PCA function in scikit-learn, we have to reshape the array so that it is \(n_{samples}\) by \(n_{features}\):
nt = X.shape[0]
nlat = X.shape[1]
nlon = X.shape[2]
X = X.reshape(nt,nlat*nlon)
X.shape
(768, 2821)
Next, we will standardize the data by subtracting the mean and dividing by the standard deviation of each feature (grid point).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Now we will apply PCA to the scaled and flattened monthly-averaged SST:
pca = PCA()
PCs = pca.fit_transform(X_scaled)
We will also standardize the PCs, so we can compare against the Niño3.4 index.
scaler_PCs = StandardScaler()
PCs_std = scaler_PCs.fit_transform(PCs)
We can look at the fraction of variance explained by each of the EOFs:
fig, ax = plt.subplots(figsize=(5,5))
ax.plot(pca.explained_variance_ratio_[0:10]*100)
ax.plot(pca.explained_variance_ratio_[0:10]*100,'ro')
ax.set_xlabel("EOF")
ax.set_ylabel("% of variance explained")
ax.set_title("% of variance explained", fontsize=14)
ax.grid()

To access the spatial modes of variability that were learned by PCA from the monthly-averaged SST, we can use the .components_ attribute.
EOFs = pca.components_
EOFs.shape
(768, 2821)
We can visualize the spatial patterns of the EOFs directly, but we first have to reshape them back to the original region size.
EOF1 = EOFs[0,:].reshape(nlat,nlon)
plt.figure(figsize=(10,5))
plt.imshow(EOF1, cmap='RdBu_r')
plt.clim(-0.05,0.05)
plt.title('EOF 1')
Text(0.5, 1.0, 'EOF 1')

We can also look at how the amplitude of this spatial pattern changes over time. We'll just look at the first 120 months. This dominant spatial mode corresponds to the seasonal cycle, which we can clearly see emerge when we look at this first decade of the time series.
plt.plot(ssta_pca_masked.time[0:120],PCs_std[0:120,0])
plt.title("EOF1")
plt.xlabel("Time")
plt.ylabel("PC1")
Text(0, 0.5, 'PC1')

Now, we’ll look at the second EOF, which explains the next most important mode of variability in the data set.
EOF2 = EOFs[1,:].reshape(nlat,nlon)
plt.figure(figsize=(10,6))
plt.imshow(EOF2, cmap='RdBu_r')
plt.clim(-0.05,0.05)
plt.title('EOF 2')
Text(0.5, 1.0, 'EOF 2')

This second spatial mode of variability corresponds to ENSO. We can compare its PC time series against the Niño3.4 index that we calculated earlier.
n34index = n34_index.sel(time=slice('1950','2013')).values
fig = plt.figure(figsize=(12,5))
plt.plot(ssta_pca_masked.time,-1*PCs_std[:,1],label="PC2")
plt.plot(ssta_pca_masked.time,n34index,label="Niño3.4 index")
plt.title("EOF2")
plt.xlabel("Time")
plt.ylabel("PC2")
plt.legend()
<matplotlib.legend.Legend at 0x151c0d580>

The Niño3.4 index is smoother (because we applied a 5-month moving average), but overall the PCA analysis has captured the same time scale and mode of variability as the ENSO metric.
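One way to quantify this visual agreement (a quick check, not part of the original analysis) is to compute the correlation between the sign-flipped PC2 and the Niño3.4 index:
# The sign of a PC is arbitrary, so flip PC2 to match the index, as in the plot above.
valid = np.isfinite(n34index)  # guard against any NaNs left by the rolling mean
corr = np.corrcoef(-PCs_std[valid, 1], n34index[valid])[0, 1]
print(f"Correlation between -PC2 and the Niño3.4 index: {corr:.2f}")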