# Assignment 3: Using Unsupervised Machine Learning to Discover Climate Zones

In this tutorial, we will explore some common clustering methods, including K-Means clustering and Gaussian Mixture Models.

We’ll apply these clustering algorithms to the problem of classifying climate zones over the continental United States. The Köppen-Geiger Climate Zones are a climate classification system based on precipitation and temperature (https://www.noaa.gov/jetstream/global/climate-zones/jetstream-max-addition-k-ppen-geiger-climate-subdivisions). We’ll use unsupervised machine learning to discover similar climate zones using the climatological averages for temperature and precipitation.

## Download the data sets

We’ll use climatological data over the continental US, specifically NOAA’s average monthly precipitation and temperature records for the years 1901 to 2000. To download the NetCDF files, you can run the following lines of code in a Jupyter cell (the exclamation point before wget tells the notebook to execute the command in the shell). Alternatively, you can run the same wget commands directly in a Unix terminal without the exclamation mark.

!wget "https://www.ncei.noaa.gov/data/oceans/archive/arc0196/0248762/1.1/data/0-data/tavg-1901_2000-monthly-normals-v1.0.nc"
!wget "https://www.ncei.noaa.gov/data/oceans/archive/arc0196/0248762/1.1/data/0-data/prcp-1901_2000-monthly-normals-v1.0.nc"

## Part 1: Load and prepare the climatological data

1. Use xarray to open the NetCDF files for the climatological temperature and precipitation datasets for the continental US.
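A minimal sketch of this step, assuming the filenames produced by the wget commands above:

```python
import xarray as xr

# Open the temperature and precipitation monthly normals
ds_temp = xr.open_dataset("tavg-1901_2000-monthly-normals-v1.0.nc")
ds_prec = xr.open_dataset("prcp-1901_2000-monthly-normals-v1.0.nc")

# Print the datasets to inspect their variables, dimensions, and coordinates
print(ds_temp)
print(ds_prec)
```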

2. Make a plot of the monthly average precipitation in January over the continental US.

3. Make a plot of the monthly average temperature in August over the continental US.
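For the two plots, xarray’s built-in plotting is the quickest route. In the sketch below, the precipitation variable name ("mlyprcp_norm") and the monthly dimension name ("time") are assumptions; check the printed datasets above for the actual names.

```python
import matplotlib.pyplot as plt

# January is the first monthly slice (index 0); the dimension name "time"
# and the variable name "mlyprcp_norm" are assumptions -- verify them first
ds_prec["mlyprcp_norm"].isel(time=0).plot()
plt.title("Average January precipitation, 1901-2000")
plt.show()

# August is the eighth monthly slice (index 7)
ds_temp["mlytavg_norm"].isel(time=7).plot()
plt.title("Average August temperature, 1901-2000")
plt.show()
```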

To identify climatological zones across the continental US, we will use the 12 monthly average temperature values and the 12 monthly average precipitation values as input features for clustering algorithms. Each latitude and longitude point will be treated as a single sample with 24 features (12 for temperature and 12 for precipitation).

4. First, extract the values of the monthly normals variable from each file (“mlytavg_norm” in the temperature file, and the correspondingly named variable in the precipitation file) and store them in NumPy arrays named avgtemp and avgprec, respectively.
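For example (the precipitation variable name is again an assumption):

```python
import numpy as np

# Extract the monthly normals as plain NumPy arrays; the expected shape
# is (12, n_lat, n_lon), with the month as the leading dimension
avgtemp = ds_temp["mlytavg_norm"].values
avgprec = ds_prec["mlyprcp_norm"].values  # assumed name -- check ds_prec.data_vars
```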

5. Put the latitude and longitude values into NumPy arrays, and use the np.meshgrid function to create 2D arrays that give the latitude and longitude values at each point on the map:

```python
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
```

We’ll use these arrays later to create labels for the latitude and longitude points associated with each sample.

6. Use the np.isnan function to create a mask that has a value of 1 (True) where there is data and a value of 0 (False) where there is no data in the 2D precipitation and temperature maps over the continental US. The mask should be 2D, with dimensions latitude by longitude. Hint: np.isnan returns True where a value is NaN, so you will need to invert its output.
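A sketch of the mask, assuming the NaN pattern is identical across months and across the two variables (worth verifying):

```python
# True (1) where there is data, False (0) where the value is NaN.
# np.isnan flags the NaNs, so invert it with ~ to flag the valid points.
mask = ~np.isnan(avgprec[0])  # 2D: (n_lat, n_lon)
```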

7. Use the mask to index the NumPy arrays avgtemp, avgprec, lat_grid, and lon_grid, creating arrays called avgtemp_masked, avgprec_masked, lat_masked, and lon_masked. These arrays should no longer contain any NaNs.
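Boolean indexing with a 2D mask flattens the two spatial dimensions into one, for example:

```python
# Keep only the valid points; the lat/lon grids become 1D sample labels
avgtemp_masked = avgtemp[:, mask]  # (12, n_sample)
avgprec_masked = avgprec[:, mask]  # (12, n_sample)
lat_masked = lat_grid[mask]        # (n_sample,)
lon_masked = lon_grid[mask]        # (n_sample,)
```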

8. scikit-learn functions assume that the sample number (\(n_{sample}\)) is the first dimension of an array and that the features associated with each sample are in the second dimension. Transpose avgprec_masked and avgtemp_masked to get the correct ordering of dimensions, then check that the shape of each array is now \(n_{sample}\) x 12.
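For example:

```python
# Move samples to the first axis, as scikit-learn expects
avgtemp_masked = avgtemp_masked.T
avgprec_masked = avgprec_masked.T
print(avgtemp_masked.shape, avgprec_masked.shape)  # both (n_sample, 12)
```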

## Part 2: Pre-process the data

9. Scale the precipitation and temperature arrays between -1 and 1, so that all 12 months are scaled relative to the same minimum and maximum values for precipitation or temperature, respectively. You can use the MinMaxScaler from scikit-learn (you will have to reshape the arrays to do this) or write your own function to do the scaling.
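A sketch using MinMaxScaler; the reshape to a single column is what forces all 12 months to share one minimum and maximum:

```python
from sklearn.preprocessing import MinMaxScaler

def scale_to_unit_range(arr):
    """Scale an (n_sample, 12) array to [-1, 1] using a single min/max
    computed over all months rather than one min/max per month."""
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaled = scaler.fit_transform(arr.reshape(-1, 1))  # one shared column
    return scaled.reshape(arr.shape)

avgtemp_scaled = scale_to_unit_range(avgtemp_masked)
avgprec_scaled = scale_to_unit_range(avgprec_masked)
```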

10. Create a feature array X that is \(n_{samples}\) by \(n_{features}\), where \(n_{features}\) = 24 (i.e., it combines the two arrays containing the scaled monthly temperature and precipitation averages for each sample).
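For example:

```python
# Stack the scaled temperature and precipitation features side by side
X = np.concatenate([avgtemp_scaled, avgprec_scaled], axis=1)
print(X.shape)  # (n_samples, 24)
```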

## Part 3: Use K-Means Clustering to Label Climate Zones

11. Use the KMeans method from sklearn.cluster to fit 8 clusters to the X feature matrix.

12. Make a scatter plot of lat_masked vs. lon_masked and color each point by its K-Means cluster label. Set the size of the scatter points to 0.1 and use a discrete (rather than continuous) colormap.
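A sketch covering steps 11 and 12; the fixed random_state (an optional choice here) makes the cluster labels reproducible:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit 8 clusters to the 24-feature climate samples
kmeans = KMeans(n_clusters=8, random_state=0).fit(X)

# Color each point by its cluster label; "tab10" is one discrete colormap choice
plt.scatter(lon_masked, lat_masked, c=kmeans.labels_, s=0.1, cmap="tab10")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("K-Means climate zones (8 clusters)")
plt.show()
```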

You can compare the climate zones discovered by the K-Means clustering approach with a published Köppen-Geiger map, such as the one at the NOAA page linked above.

13. Repeat steps 11 and 12, but choose a different number of clusters when fitting KMeans.

## Part 4: Use a Gaussian Mixture Model to Label Climate Zones

14. Use the GaussianMixture method from sklearn.mixture to fit 8 components to the X feature matrix.

15. Make a scatter plot of lat_masked vs. lon_masked and color each point by the component learned by the Gaussian Mixture Model. Set the size of the scatter points to 0.1 and use a discrete (rather than continuous) colormap.
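A sketch covering steps 14 and 15, mirroring the K-Means example above:

```python
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Fit an 8-component mixture, then assign each sample to its most likely component
gmm = GaussianMixture(n_components=8, random_state=0).fit(X)
labels = gmm.predict(X)

plt.scatter(lon_masked, lat_masked, c=labels, s=0.1, cmap="tab10")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Gaussian Mixture Model climate zones (8 components)")
plt.show()
```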