Assignment 1: Data Pre-Processing of Historical Tropical Cyclone Records
In this assignment, you will practice data exploration and pre-processing on a .csv file containing a global collection of tropical cyclone records, the International Best Track Archive for Climate Stewardship (IBTrACS). Descriptions of the column variables are given here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load and aggregate the data set
Using the following code, load in the data set (this takes a few seconds to run):
url = 'https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.ALL.list.v04r00.csv'
df = pd.read_csv(url, parse_dates=['ISO_TIME'], usecols=range(12),
                 skiprows=[1], na_values=[' ', 'NOT_NAMED'],
                 keep_default_na=False, dtype={'NAME': str})
This data set includes cyclone tracks (so it has multiple entries per named cyclone). We’ll use the code below to create an aggregated data set for only the named cyclones, which has one entry per cyclone. You will use the data set dfnamed for the rest of the assignment.
dfnamed = df.groupby("NAME").agg(MAX_WIND=('WMO_WIND','max'),
                                 MIN_PRES=('WMO_PRES','min'),
                                 MEAN_LAT=('LAT','mean'),
                                 MEAN_LON=('LON','mean'),
                                 BASIN=('BASIN','first'),
                                 SUBBASIN=('SUBBASIN','first'),
                                 NATURE=('NATURE','first'),
                                 SEASON=('SEASON','first')).reset_index()
dfnamed.head()
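To see what the named aggregation above does, here is a minimal sketch on a made-up table with several rows per storm. The values and the names toy and toy_named are illustrative assumptions, not IBTrACS data.

```python
import pandas as pd

# Hypothetical track records: two storms with several observations each
toy = pd.DataFrame({
    "NAME": ["ALPHA", "ALPHA", "ALPHA", "BETA", "BETA"],
    "WMO_WIND": [30.0, 55.0, 40.0, 25.0, 35.0],
    "WMO_PRES": [1002.0, 985.0, 995.0, 1005.0, 1000.0],
    "BASIN": ["NA", "NA", "NA", "WP", "WP"],
})

# Named aggregation: output_column=(input_column, aggregation function)
toy_named = toy.groupby("NAME").agg(
    MAX_WIND=("WMO_WIND", "max"),
    MIN_PRES=("WMO_PRES", "min"),
    BASIN=("BASIN", "first"),
).reset_index()

print(toy_named)
```

Each storm collapses to a single row, and reset_index() turns NAME back into an ordinary column.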
# these lines of code remove the initial dataframe (we won't need it anymore)
import gc
del df
gc.collect()
0
Part 1: Data Exploration
How many named cyclones are there?
len(dfnamed)
1884
Use the pandas hist method to plot the marginal distributions of the variables in the dataframe dfnamed.
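As a rough illustration of the hist method, the sketch below uses randomly generated numbers in place of the cyclone data. The table toy and its values are assumptions made only for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric table standing in for the aggregated cyclone data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "MAX_WIND": rng.normal(60, 20, size=200),
    "MIN_PRES": rng.normal(980, 15, size=200),
    "MEAN_LAT": rng.uniform(-40, 40, size=200),
})

# DataFrame.hist draws one histogram per numeric column and
# returns an array of matplotlib Axes
axes = toy.hist(bins=20, figsize=(9, 6))
plt.tight_layout()
```

Non-numeric columns such as names or category labels are skipped, so only the numeric variables appear.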
Use the seaborn PairGrid class to create a grid of scatter plots for all pairs of variables.
Using matplotlib, create a scatter plot of the minimum pressure vs. the maximum wind speed and color the points by the year the cyclone took place.
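Coloring points by a third variable can be sketched as follows. The arrays year, wind, and pres are synthetic stand-ins, and the colormap is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: random years and a loosely anti-correlated
# wind/pressure relationship, for illustration only
rng = np.random.default_rng(2)
year = rng.integers(1950, 2021, size=150)
wind = rng.uniform(20, 150, size=150)
pres = 1010 - 0.6 * wind + rng.normal(0, 5, size=150)

fig, ax = plt.subplots()
# c= maps each point's year onto the colormap
points = ax.scatter(pres, wind, c=year, cmap="viridis", s=15)
fig.colorbar(points, ax=ax, label="Year")
ax.set_xlabel("Minimum pressure")
ax.set_ylabel("Maximum wind speed")
```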
Part 2: Handle missing data
How many non-null values does each variable have?
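A small sketch of counting non-null values per column, using a made-up table toy with a few NaN entries:

```python
import numpy as np
import pandas as pd

# Hypothetical table with some missing measurements
toy = pd.DataFrame({
    "NAME": ["ALPHA", "BETA", "GAMMA", "DELTA"],
    "MAX_WIND": [55.0, np.nan, 80.0, np.nan],
    "MIN_PRES": [985.0, 1000.0, np.nan, 990.0],
})

non_null = toy.count()      # number of non-null values per column
missing = toy.isna().sum()  # complementary count of NaN values
print(non_null)
print(missing)
```

The info() method reports the same non-null counts alongside each column's dtype.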
Create a new dataframe called dfdrop where you have discarded the rows with NaN values in dfnamed.
Create a new dataframe called dfimputed where you have imputed the missing values in dfnamed with 0.0.
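The two strategies above can be compared on a toy table. The names toy, toy_drop, and toy_imputed are hypothetical and only mirror the naming used in the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values in two different rows
toy = pd.DataFrame({
    "NAME": ["ALPHA", "BETA", "GAMMA"],
    "MAX_WIND": [55.0, np.nan, 80.0],
    "MIN_PRES": [985.0, 1000.0, np.nan],
})

toy_drop = toy.dropna()        # keeps only rows with no NaN in any column
toy_imputed = toy.fillna(0.0)  # keeps every row, replacing NaN with 0.0

print(len(toy), len(toy_drop), len(toy_imputed))
```

Both methods return new dataframes and leave the original untouched. Dropping loses rows, while imputing with 0.0 keeps them at the cost of inserting values that may be physically implausible.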
Part 3: Feature scaling
Scale the MAX_WIND variable using the StandardScaler and put it into an array called y, since it is the target we want to predict.
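A minimal sketch of standardizing a single variable with StandardScaler, assuming a made-up array wind of wind speeds:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

wind = np.array([30.0, 55.0, 80.0, 105.0])  # hypothetical wind speeds

# StandardScaler expects a 2D array of shape (n_samples, n_features),
# so the single column is reshaped before scaling and flattened afterwards
scaler = StandardScaler()
y_toy = scaler.fit_transform(wind.reshape(-1, 1)).ravel()

print(y_toy.mean(), y_toy.std())
```

When starting from a dataframe, selecting the column with double brackets (for example df[["MAX_WIND"]]) yields the 2D shape directly. Whether to flatten the result is a design choice that depends on what the downstream model expects.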
Create a copy of your dfdrop dataframe called dffeatures, and drop the NAME and MAX_WIND columns from your dffeatures dataframe, since the MAX_WIND variable is going to be your target variable and we won’t need the cyclone names any longer.
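Copying a dataframe and dropping columns might look like the sketch below. The table toy_drop is a hypothetical stand-in for a dataframe with no missing values.

```python
import pandas as pd

# Hypothetical cleaned table
toy_drop = pd.DataFrame({
    "NAME": ["ALPHA", "GAMMA"],
    "MAX_WIND": [55.0, 80.0],
    "MIN_PRES": [985.0, 960.0],
    "BASIN": ["NA", "WP"],
})

# Work on a copy so the original dataframe keeps all of its columns
toy_features = toy_drop.copy()
toy_features = toy_features.drop(columns=["NAME", "MAX_WIND"])

print(toy_features.columns.tolist())
```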
Part 4: Encode Categorical Variables
Check which unique categories each of the variables BASIN, SUBBASIN, and NATURE take, using the dffeatures dataframe.
Print out the number of cyclones in each basin and subbasin. Also print out how many storms of each type there are.
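Listing categories and counting their occurrences can be sketched with unique and value_counts. The category labels in toy are made up for the example.

```python
import pandas as pd

# Hypothetical categorical columns
toy = pd.DataFrame({
    "BASIN": ["NA", "NA", "WP", "EP", "WP", "NA"],
    "NATURE": ["TS", "TS", "TS", "ET", "TS", "DS"],
})

for column in ["BASIN", "NATURE"]:
    print(column, toy[column].unique())  # distinct categories
    print(toy[column].value_counts())    # rows per category

basin_counts = toy["BASIN"].value_counts()
```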
Encode the BASIN, SUBBASIN, and NATURE variables in dffeatures using One Hot Encoding, and standardize MIN_PRES, MEAN_LAT, MEAN_LON, and SEASON using the StandardScaler.
Hint: you can do this in one step using the scikit-learn ColumnTransformer. Create an array called X that contains the encoded categorical variables and the scaled numerical variables.
Print out the feature names associated with the columns in your X array.
Part 5: Train, Validation, & Test Split
Split your data set into a training set and a combined test/validation set using the train_test_split function, with 80% of your original data for training and 20% for the combined testing and validation set.
Split your X_test_val and y_test_val again into separate validation and test data sets that are 10% each of the original data set. Double check that the sizes of your final training, validation, and test data sets are correct by printing out the shape of each array.
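The two-stage split described above can be sketched on placeholder arrays. X_toy and y_toy are synthetic, and the random_state value is an arbitrary assumption used only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 2 features each
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# First split: 80% training, 20% held out for validation and testing
X_train, X_test_val, y_train, y_test_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Second split: halve the held-out 20% into 10% validation and 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_test_val, y_test_val, test_size=0.5, random_state=42)

for name, arr in [("train", X_train), ("val", X_val), ("test", X_test)]:
    print(name, arr.shape)
```

Note that the second call uses test_size=0.5, because the fraction is relative to the held-out subset rather than to the original data set.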
Save the training, validation, and test data sets and labels as NumPy arrays using np.save().
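A short sketch of a save-and-reload round trip with np.save and np.load. The file name X_train.npy and the temporary directory are assumptions for this example, and in practice you would choose your own paths.

```python
import os
import tempfile
import numpy as np

X_train_toy = np.arange(12, dtype=float).reshape(6, 2)  # hypothetical array

# A temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "X_train.npy")
    np.save(path, X_train_toy)  # writes a binary .npy file
    restored = np.load(path)    # reading it back recovers the array

print(restored.shape)
```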
Part 6: Create your Assignments Repository
To turn in this and other homework assignments for this course, you will create an assignments GitHub repository.
Create a new directory called ml4climate2025 in your home directory.
Create a Readme.md markdown file that contains your name.
Initialize a new git repository.
Add the file and make your first commit.
Create a new private repository on GitHub called ml4climate2025. (Call it exactly like that. Do not vary the spelling, capitalization, or punctuation.)
Push your ml4climate2025 repository to GitHub.
On GitHub, go to “Settings” -> “Collaborators”, and add kdlamb and ChhaviDixit.
Push new commits to this repository whenever you are ready to hand in your assignments.