# Assignment 3: Using Unsupervised Machine Learning to Discover Climate Zones
In this tutorial, we will explore some common clustering methods, including K-Means clustering and Gaussian Mixture Models.
We’ll apply these clustering algorithms to the problem of classifying climate zones over the continental United States. The Köppen-Geiger Climate Zones are a climate classification system based on precipitation and temperature (https://www.noaa.gov/jetstream/global/climate-zones/jetstream-max-addition-k-ppen-geiger-climate-subdivisions). We’ll use unsupervised machine learning to discover similar climate zones using the climatological averages for temperature and precipitation.
## Download the data sets
We’ll use climatological data over the continental US, specifically average monthly precipitation and temperature records for the years 1901 to 2000 from NOAA that can be found here. To download the NetCDF files, you can run the following lines of code in a Jupyter cell (the exclamation point before `wget` tells the notebook to execute the command in the shell). Alternatively, you can run the same `wget` commands directly in a Unix terminal without the exclamation mark.
```
!wget "https://www.ncei.noaa.gov/data/oceans/archive/arc0196/0248762/1.1/data/0-data/tavg-1901_2000-monthly-normals-v1.0.nc"
!wget "https://www.ncei.noaa.gov/data/oceans/archive/arc0196/0248762/1.1/data/0-data/prcp-1901_2000-monthly-normals-v1.0.nc"
```
## Part 1: Load and prepare the climatological data
Use `xarray` to open the NetCDF files for the climatological temperature and precipitation datasets for the continental US.
Make a plot of the monthly average precipitation in January over the Continental US.
Make a plot of the monthly average temperature in August over the Continental US.
To identify climatological zones across the continental US, we will use the 12 monthly average temperature values and the 12 monthly average precipitation values as input features for clustering algorithms. Each latitude and longitude point will be treated as a single sample with 24 features (12 for temperature and 12 for precipitation).
First, extract the values for “mlytavg_norm” from the temperature and precipitation NetCDF files and store them in NumPy arrays named `avgtemp` and `avgprec`, respectively.
Put the latitude and longitude values into NumPy arrays, and use the `np.meshgrid` function to create 2D arrays that give the latitude and longitude values at each point on the map:
```python
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
```
We’ll use these arrays later to create labels for the latitude and longitude points associated with each sample.
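A minimal sketch of this step, using small hypothetical coordinate arrays (the real values come from the NetCDF files):

```python
import numpy as np

# Hypothetical 1-D coordinate arrays standing in for the real lats/lons
lats = np.array([25.0, 30.0, 35.0])
lons = np.array([-120.0, -110.0, -100.0, -90.0])

# indexing="ij" keeps the (lat, lon) ordering, so both grids are (n_lat, n_lon)
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")

print(lat_grid.shape)  # (3, 4): lat_grid[i, j] is the latitude of grid point (i, j)
```

With `indexing="ij"`, `lat_grid[i, j]` and `lon_grid[i, j]` give the coordinates of the grid point at row `i`, column `j`, matching the (lat, lon) axis order of the data arrays.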
Use the `np.isnan` function to create a mask that has a value of 1 where there is data and a value of 0 where there is no data in the 2D precipitation and temperature maps over the continental US. This mask should have dimensions latitude by longitude. Hint: `np.isnan` returns 1 (True) where the value is `NaN`, so you will need to invert its output.
Use the mask to index the NumPy arrays `avgtemp`, `avgprec`, `lat_grid`, and `lon_grid` to create arrays called `avgtemp_masked`, `avgprec_masked`, `lat_masked`, and `lon_masked`. These arrays should no longer contain any NaNs.
`scikit-learn` functions assume that the sample number (\(n_{sample}\)) is the first dimension of an array and that the features associated with each sample are in the second dimension. Transpose `avgprec_masked` and `avgtemp_masked` to get the correct ordering of dimensions, and check that the shapes of the arrays are now \(n_{sample}\) x 12.
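The masking and transposing steps above can be sketched with synthetic stand-in arrays (the real ones come from the NetCDF files; the shapes here are hypothetical):

```python
import numpy as np

# Synthetic stand-ins for the climatology arrays: 12 months x 3 lat x 4 lon
rng = np.random.default_rng(0)
avgtemp = rng.normal(size=(12, 3, 4))
avgprec = rng.normal(size=(12, 3, 4))
avgtemp[:, 0, 0] = np.nan  # pretend one grid point has no data (e.g. ocean)
avgprec[:, 0, 0] = np.nan

lats = np.array([25.0, 30.0, 35.0])
lons = np.array([-120.0, -110.0, -100.0, -90.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")

# Mask is True (1) where there IS data: invert np.isnan on one month's map
mask = ~np.isnan(avgtemp[0])          # shape (n_lat, n_lon)

# Boolean indexing over the spatial axes keeps only the valid grid points
avgtemp_masked = avgtemp[:, mask]     # (12, n_valid)
avgprec_masked = avgprec[:, mask]
lat_masked = lat_grid[mask]           # (n_valid,)
lon_masked = lon_grid[mask]

# scikit-learn expects samples first: transpose to (n_valid, 12)
avgtemp_masked = avgtemp_masked.T
avgprec_masked = avgprec_masked.T
print(avgtemp_masked.shape)  # (11, 12)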
## Part 2: Pre-process the data
Scale the precipitation and temperature arrays between -1 and 1. We do this so that all 12 months are scaled relative to the same minimum and maximum values for precipitation or temperature, respectively. You can use the `MinMaxScaler` from `scikit-learn` (you will have to reshape the arrays to do this), or alternatively write your own function to do this scaling.
Create a feature array `X` that is \(n_{samples}\) by \(n_{features}\), where \(n_{features}\)=24 (i.e. it combines the two arrays that contain the scaled temperature and precipitation monthly averages associated with each sample).
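One way to sketch the scaling and feature-stacking steps, using random stand-in arrays (the real inputs are the masked, transposed climatology arrays). Reshaping to a single column before fitting makes `MinMaxScaler` use one global min/max shared across all 12 months:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (n_samples, 12) arrays standing in for the real masked data
rng = np.random.default_rng(1)
avgtemp_masked = rng.uniform(-10, 35, size=(100, 12))
avgprec_masked = rng.uniform(0, 300, size=(100, 12))

def scale_shared(arr):
    """Scale to [-1, 1] using one min/max shared across all 12 months."""
    scaler = MinMaxScaler(feature_range=(-1, 1))
    # Flatten to a single column so the scaler sees one global min and max
    flat = scaler.fit_transform(arr.reshape(-1, 1))
    return flat.reshape(arr.shape)

temp_scaled = scale_shared(avgtemp_masked)
prec_scaled = scale_shared(avgprec_masked)

# Stack temperature and precipitation features side by side: (n_samples, 24)
X = np.hstack([temp_scaled, prec_scaled])
print(X.shape)  # (100, 24)
```

If you instead fit the scaler directly on the `(n_samples, 12)` array, each month would be scaled by its own min/max, which is not what we want here.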
## Part 3: Use K-Means Clustering to Label Climate Zones
Use the `KMeans` method from `sklearn.cluster` to fit 8 clusters to the `X` feature matrix.
Make a scatter plot of `lat_masked` vs. `lon_masked` and color each point by its K-Means label. Set the size of the scatter points to 0.1 and use a colormap that is discrete (rather than continuous).
You can compare the climate zones discovered by the K-Means clustering approach with the map here.
Repeat the previous two steps, but choose a different number of clusters for the K-Means fit.
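The K-Means fit can be sketched as follows, with a random stand-in for the real feature matrix `X`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix standing in for the real (n_samples, 24) X
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(500, 24))

# Fit 8 clusters; n_init and random_state pinned for reproducibility
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # one integer label in 0..7 per sample
print(labels.shape)  # (500,)
```

For the map, something like `plt.scatter(lon_masked, lat_masked, c=labels, s=0.1, cmap="tab10")` colors each point by its label with a discrete colormap.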
## Part 4: Use a Gaussian Mixture Model to Label Climate Zones
Use the `GaussianMixture` method from `sklearn.mixture` to fit 8 components to the `X` feature matrix.
Make a scatter plot of `lat_masked` vs. `lon_masked` and color each point by the component learned by the Gaussian Mixture Model. Set the size of the scatter points to 0.1 and use a colormap that is discrete (rather than continuous).
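A minimal sketch of the Gaussian Mixture step, again with a random stand-in for `X` (a diagonal covariance is used here only to keep the toy fit cheap; the default full covariance also works):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical feature matrix standing in for the real (n_samples, 24) X
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(500, 24))

# Fit 8 Gaussian components, then hard-assign each sample to a component
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(X)
components = gmm.predict(X)  # integer component index in 0..7 per sample
print(components.shape)  # (500,)
```

Unlike K-Means, the GMM also gives soft assignments via `gmm.predict_proba(X)`, which can be useful for seeing where climate zones blend into each other.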