Assignment 1: Data Pre-Processing of Historical Tropical Cyclone Records
In this assignment, you will practice data exploration and pre-processing on a .csv file containing a global collection of tropical cyclone records, the International Best Track Archive for Climate Stewardship (IBTrACS). The column variable descriptions are given here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load and aggregate the data set
Using the following code, load in the data set (this takes a few seconds to run):
url = 'https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.ALL.list.v04r00.csv'
df = pd.read_csv(url, parse_dates=['ISO_TIME'], usecols=range(12),
                 skiprows=[1], na_values=[' ', 'NOT_NAMED'],
                 keep_default_na=False, dtype={'NAME': str})
This data set includes cyclone tracks (so it has multiple entries per named cyclone). We’ll use the code below to create an aggregated data set for only the named cyclones, which has one entry per cyclone. You will use the data set dfnamed for the rest of the assignment.
dfnamed = df.groupby("NAME").agg(MAX_WIND=('WMO_WIND','max'),
                                 MIN_PRES=('WMO_PRES','min'),
                                 MEAN_LAT=('LAT','mean'),
                                 MEAN_LON=('LON','mean'),
                                 BASIN=('BASIN','first'),
                                 SUBBASIN=('SUBBASIN','first'),
                                 NATURE=('NATURE','first'),
                                 SEASON=('SEASON','first')).reset_index()
dfnamed.head()
# these lines of code remove the initial dataframe (we won't need it anymore)
import gc
del df
gc.collect()
0
Part 1: Data Exploration
How many named cyclones are there?
len(dfnamed)
1884
Use the pandas hist method to plot the marginal distributions of the variables in the dataframe dfnamed.
Use the seaborn PairGrid class to create pairwise scatter plots of all of the variables.
Using matplotlib, create a scatter plot of the minimum pressure vs. the maximum wind speed and color by the year the cyclone took place.
Part 2: Handle missing data
How many non-null values does each variable have?
Create a new dataframe called dfdrop where you have discarded the rows with NaN values in dfnamed.
Create a new dataframe called dfimputed where you have imputed the missing values in dfnamed with 0.0.
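A minimal sketch of these three steps on a small made-up frame is shown below. The frame toy and the names dfdrop_toy and dfimputed_toy are hypothetical, chosen so they do not clash with the assignment's own variables.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for dfnamed, with two missing values.
toy = pd.DataFrame({
    "NAME": ["ALPHA", "BETA", "GAMMA", "DELTA"],
    "MAX_WIND": [65.0, np.nan, 120.0, 80.0],
    "MIN_PRES": [990.0, 1002.0, np.nan, 975.0],
})

# Number of non-null values in each column.
non_null = toy.count()
print(non_null)

# Discard every row that contains at least one NaN.
dfdrop_toy = toy.dropna()

# Replace every NaN with 0.0 instead of discarding the row.
dfimputed_toy = toy.fillna(0.0)

print(dfdrop_toy.shape, dfimputed_toy.shape)
```

Both dropna and fillna return new dataframes by default, so the original frame keeps its missing values.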
Part 3: Feature scaling
Scale the MAX_WIND variable using the StandardScaler and put it into an array called y, since it's the target we want to predict.
Create a copy of your dfdrop dataframe called dffeatures, and drop the NAME and MAX_WIND columns from your dffeatures dataframe, since the MAX_WIND variable is going to be your target variable and we won’t need the cyclone names any longer.
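One way these two steps might look is sketched below on a made-up frame. It assumes the target is taken from the dropped-NaN frame so that its rows line up with the features (the text does not say which frame to scale); dfdrop_toy, dffeatures_toy, and scaler_y are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for dfdrop (no missing values).
dfdrop_toy = pd.DataFrame({
    "NAME": ["ALPHA", "BETA", "GAMMA", "DELTA"],
    "MAX_WIND": [40.0, 60.0, 80.0, 100.0],
    "MIN_PRES": [1000.0, 990.0, 970.0, 950.0],
    "BASIN": ["NA", "WP", "WP", "SI"],
})

# StandardScaler expects 2-D input, hence the double brackets.
scaler_y = StandardScaler()
y = scaler_y.fit_transform(dfdrop_toy[["MAX_WIND"]])
print(y.shape, y.mean(), y.std())

# Copy first so that dropping columns leaves dfdrop_toy untouched.
dffeatures_toy = dfdrop_toy.copy().drop(columns=["NAME", "MAX_WIND"])
print(dffeatures_toy.columns.tolist())
```

Keeping the fitted scaler_y around makes it possible to call inverse_transform later and report predictions in the original wind speed units.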
Part 4: Encode Categorical Variables
Check which unique categories each of the variables BASIN, SUBBASIN, and NATURE take, using the dffeatures dataframe.
Print out the number of cyclones in each basin and subbasin. Also print out how many cyclones of each storm type there are.
Encode the BASIN, SUBBASIN, and NATURE variables in dffeatures using One Hot Encoding, and standardize MIN_PRES, MEAN_LAT, MEAN_LON, and SEASON using the StandardScaler.
Hint: you can do this in one step using the scikit-learn ColumnTransformer. Create an array called X that contains the encoded categorical variables and the scaled numerical variables.
Print out the feature names associated with the columns in your X array.
Part 5: Train, Validation, & Test Split
Split your data set into a training set and a combined test/validation set using the train_test_split function, with 80% of your original data for training and 20% for the test and validation sets.
Split your X_test_val and y_test_val again into separate validation and test data sets that are each 10% of the original data set. Double-check that the sizes of your final training, validation, and test data sets are correct by printing out the shape of each array.
Save the training, validation, and test data sets and labels as NumPy arrays using np.save().
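A minimal sketch of the two-stage split and the save step is given below, using synthetic arrays in place of the real X and y. The random_state value, the temporary output directory, and the file names are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real X and y (100 samples, 2 features).
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100, dtype=float).reshape(-1, 1)

# Stage 1: 80% training, 20% held out for validation and testing.
X_train, X_test_val, y_train, y_test_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Stage 2: split the held-out 20% in half, giving 10% + 10% of the original.
X_val, X_test, y_val, y_test = train_test_split(
    X_test_val, y_test_val, test_size=0.5, random_state=42)

splits = {
    "X_train": X_train, "y_train": y_train,
    "X_val": X_val, "y_val": y_val,
    "X_test": X_test, "y_test": y_test,
}

# Save each array as <name>.npy in a temporary directory (hypothetical location).
out_dir = tempfile.mkdtemp()
for name, arr in splits.items():
    print(name, arr.shape)
    np.save(os.path.join(out_dir, name + ".npy"), arr)
```

The second call uses test_size=0.5 because it is applied to the held-out 20%, not to the original data set.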
Part 6: Create your Assignments Repository
To turn in this and other homework assignments for this course, you will create an assignments repository on GitHub.
Create a new directory called ml4climate2025 in your home directory.
Create a Readme.md markdown file that contains your name.
Initialize a new git repository.
Add the file and make your first commit.
Create a new private repository on GitHub called ml4climate2025. (Call it exactly like that. Do not vary the spelling, capitalization, or punctuation.)
Push your ml4climate2025 repository to GitHub.
On GitHub, go to “Settings” -> “Collaborators”, and add kdlamb and ChhaviDixit.
Push new commits to this repository whenever you are ready to hand in your assignments.