Assignment 1: Data Pre-Processing of Historical Tropical Cyclone Records

In this assignment, you will practice data exploration and pre-processing using a .csv file containing a global collection of tropical cyclone records, the International Best Track Archive for Climate Stewardship (IBTrACS). The column variable descriptions are given here.

import numpy as np  # used for np.save() in Part 5
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load and aggregate the data set

Using the following code, load in the data set (this takes a few seconds to run):

url = 'https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.ALL.list.v04r00.csv'
# Read only the first 12 columns, skip the units row (row 1), parse the
# timestamps, and treat blanks and 'NOT_NAMED' as missing values
df = pd.read_csv(url, parse_dates=['ISO_TIME'], usecols=range(12),
                 skiprows=[1], na_values=[' ', 'NOT_NAMED'],
                 keep_default_na=False, dtype={'NAME': str})

This data set includes cyclone tracks, so it has multiple entries per named cyclone. We’ll use the code below to create an aggregated data set for only the named cyclones, with one entry per cyclone. You will use the data set dfnamed for the rest of the assignment.

dfnamed = df.groupby("NAME").agg(MAX_WIND=('WMO_WIND', 'max'),
                                 MIN_PRES=('WMO_PRES', 'min'),
                                 MEAN_LAT=('LAT', 'mean'),
                                 MEAN_LON=('LON', 'mean'),
                                 BASIN=('BASIN', 'first'),
                                 SUBBASIN=('SUBBASIN', 'first'),
                                 NATURE=('NATURE', 'first'),
                                 SEASON=('SEASON', 'first')).reset_index()
dfnamed.head()

# these lines of code remove the initial dataframe (we won't need it anymore)
import gc
del df
gc.collect()

Part 1: Data Exploration

How many named cyclones are there?

len(dfnamed)
1884
  1. Use the pandas hist method to plot the marginal distributions of the variables in the dataframe dfnamed.

  2. Use the seaborn PairGrid class to create a grid of scatterplots of all of the variables.

  3. Using matplotlib, create a scatter plot of the minimum pressure vs. the maximum wind speed, colored by the year (SEASON) in which the cyclone took place. (A sketch of all three plots follows this list.)
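For reference, a minimal sketch of these three plots, assuming the column names produced by the aggregation above and that SEASON was parsed as a numeric year (figure sizes and marker choices are arbitrary):

# 1. Marginal distributions of every numeric variable
dfnamed.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# 2. Grid of pairwise scatterplots (PairGrid uses the numeric columns by default)
g = sns.PairGrid(dfnamed)
g.map(plt.scatter, s=5)
plt.show()

# 3. Minimum pressure vs. maximum wind speed, colored by season (year)
plt.scatter(dfnamed['MAX_WIND'], dfnamed['MIN_PRES'], c=dfnamed['SEASON'], s=5)
plt.colorbar(label='SEASON')
plt.xlabel('MAX_WIND')
plt.ylabel('MIN_PRES')
plt.show()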

Part 2: Handle missing data

  1. How many non-null values does each variable have?

  2. Create a new dataframe called dfdrop where you have discarded the rows of dfnamed that contain NaN values.

  3. Create a new dataframe called dfimputed where you have imputed the missing values in dfnamed with 0.0. (A sketch of these steps follows below.)
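A minimal sketch of these three steps using the standard pandas methods count, dropna, and fillna:

# Count the non-null values in each column
print(dfnamed.count())

# Drop every row that contains at least one NaN
dfdrop = dfnamed.dropna()

# Impute the missing values with 0.0 instead of dropping them
dfimputed = dfnamed.fillna(0.0)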

Part 3: Feature scaling

  1. Scale MAX_WIND using the StandardScaler and put it into an array called y, since it's the target we want to predict.

  2. Create a copy of your dfdrop dataframe called dffeatures, and drop the NAME and MAX_WIND columns from it, since MAX_WIND is going to be your target variable and we won't need the cyclone names any longer. (See the sketch after this list.)
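A minimal sketch of these two steps, assuming the target is taken from the dfdrop dataframe of Part 2:

from sklearn.preprocessing import StandardScaler

# StandardScaler expects a 2-D array, hence the double brackets
y = StandardScaler().fit_transform(dfdrop[['MAX_WIND']])

# Copy the dataframe and remove the name and target columns
dffeatures = dfdrop.copy()
dffeatures = dffeatures.drop(columns=['NAME', 'MAX_WIND'])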

Part 4: Encode Categorical Variables

  1. Check which unique categories each of the variables BASIN, SUBBASIN, and NATURE takes, using the dffeatures dataframe.

  2. Print out the number of cyclones in each basin and subbasin. Also print out how many of each storm type there are.

  3. Encode the BASIN, SUBBASIN, and NATURE variables in dffeatures using One Hot Encoding, and standardize MIN_PRES, MEAN_LAT, MEAN_LON, and SEASON using the StandardScaler.

Hint: you can do this in one step using the scikit-learn ColumnTransformer. Create an array called X that contains the encoded categorical variables and the scaled numerical variables.

  4. Print out the feature names associated with the columns in your X array. (A sketch of this whole part follows below.)
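One possible sketch, assuming the dffeatures dataframe from Part 3 (the sparse_output argument requires scikit-learn 1.2 or newer; older versions call it sparse):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Inspect the categories and their counts
for col in ['BASIN', 'SUBBASIN', 'NATURE']:
    print(dffeatures[col].unique())
    print(dffeatures[col].value_counts())

# One-hot encode the categorical columns and standardize the numeric
# columns in a single step
ct = ColumnTransformer(
    [('cat', OneHotEncoder(sparse_output=False), ['BASIN', 'SUBBASIN', 'NATURE']),
     ('num', StandardScaler(), ['MIN_PRES', 'MEAN_LAT', 'MEAN_LON', 'SEASON'])])
X = ct.fit_transform(dffeatures)

# Feature names for the columns of X
print(ct.get_feature_names_out())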

Part 5: Train, Validation, & Test Split

  1. Split your data set into a training set and a test/validation set using the train_test_split function, with 80% of your original data for training and 20% for the testing and validation data sets.

  2. Split your X_test_val and y_test_val again into separate validation and test data sets that are each 10% of the original data set. Double-check that the sizes of your final training, validation, and test data sets are correct by printing out the shape of each array.

  3. Save the training, validation, and test data sets and labels as numpy arrays using np.save(). (A sketch of the two-stage split follows below.)
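A minimal sketch of the two-stage split, assuming the X and y arrays from the previous parts; the random_state values and output filenames are arbitrary choices:

import numpy as np
from sklearn.model_selection import train_test_split

# First split: 80% train, 20% held out for validation + test
X_train, X_test_val, y_train, y_test_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: divide the held-out 20% in half (10% validation, 10% test)
X_val, X_test, y_val, y_test = train_test_split(
    X_test_val, y_test_val, test_size=0.5, random_state=0)

# Sanity-check the shapes
for name, arr in [('train', X_train), ('val', X_val), ('test', X_test)]:
    print(name, arr.shape)

# Save each array to disk
np.save('X_train.npy', X_train)
np.save('y_train.npy', y_train)
np.save('X_val.npy', X_val)
np.save('y_val.npy', y_val)
np.save('X_test.npy', X_test)
np.save('y_test.npy', y_test)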

Part 6: Create your Assignments Repository

To turn in this and other homework assignments for this course, you will create an assignments GitHub repository. (A sketch of the corresponding terminal commands follows the list below.)

  • Create a new directory called ml4climate2025 in your home directory.

  • Create a Readme.md markdown file that contains your name.

  • Initialize a new git repository

  • Add the file and make your first commit

  • Create a new private repository on GitHub called ml4climate2025. (Use exactly that name; do not vary the spelling, capitalization, or punctuation.)

  • Push your ml4climate2025 repository to GitHub.

  • On GitHub, go to "Settings" -> "Collaborators", and add kdlamb and ChhaviDixit

  • Push new commits to this repository whenever you are ready to hand in your assignments.
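A sketch of these steps as terminal commands, assuming a Unix-like shell; YOUR_USERNAME, the commit message, and the branch name main are placeholders that depend on your own setup:

# Create the directory and the Readme in your home directory
cd ~
mkdir ml4climate2025
cd ml4climate2025
echo "Your Name" > Readme.md

# Initialize the repository and make the first commit
git init
git add Readme.md
git commit -m "Initial commit"

# After creating the empty private ml4climate2025 repository on GitHub:
git remote add origin git@github.com:YOUR_USERNAME/ml4climate2025.git
git push -u origin main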