Tutorial on Data Preprocessing (US Wildfires)#

In this lesson, we will go through an example of how to do initial exploratory data analysis and data preprocessing for machine learning. To do this, we will use a data set of US wildfires from 1990 - 2016. This data set includes the location and time of roughly 55,000 recent wildfires, as well as information about the type of vegetation and co-located meteorological data from the time when the fires occurred. It is a subset of a much larger data set of 1.8 million US wildfires.

The data set is available on Kaggle: US Wildfires and other attributes

We’ll start by downloading the data set, and then we will explore the variables in the data set and preprocess the data for use with machine learning using the pandas and scikit-learn Python packages.

Download the data set#

# To facilitate downloading data from Kaggle, we can install this python package
!pip install kagglehub
Requirement already satisfied: kagglehub in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (0.2.9)
Requirement already satisfied: packaging in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from kagglehub) (24.1)
Requirement already satisfied: requests in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from kagglehub) (2.32.3)
Requirement already satisfied: tqdm in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from kagglehub) (4.66.5)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from requests->kagglehub) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from requests->kagglehub) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from requests->kagglehub) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from requests->kagglehub) (2024.8.30)
import kagglehub
import os
datapath = kagglehub.dataset_download("capcloudcoder/us-wildfire-data-plus-other-attributes")

print("Path to dataset files:", datapath)
Warning: Looks like you're using an outdated `kagglehub` version, please consider updating (latest version: 0.3.12)
Path to dataset files: /Users/karalamb/.cache/kagglehub/datasets/capcloudcoder/us-wildfire-data-plus-other-attributes/versions/4
os.listdir(datapath)
['FW_Veg_Rem_Combined.csv', 'Wildfire_att_description.txt']

The data set that we downloaded contains two files. The first is a CSV file that contains the variables and data, and the second is a text file that provides metadata describing the variables contained in the CSV file. We will start by loading and printing the metadata from the text file.

with open(os.path.join(datapath,'Wildfire_att_description.txt'), 'r') as file:
    content = file.read()
print(content)
fire_name            ,Name of Fire
fire_size           ,Size of Fire 
fire_size_class      ,Class of Fire Size (A-G)
stat_cause_descr     ,Cause of Fire
latitude            ,Latitude of Fire
longitude           ,Longitude of Fire
state                ,State of Fire
discovery_month      ,Month in which Fire was discovered
putout_time         ,time it took to putout the fire
disc_pre_year         ,year in which the fire was discovered

Vegetation            ,Dominant vegetation in the areas (1:Tropical Evergreen Broadleaf Forest,2:Tropical Deciduous Broadleaf Forest,3:Temperate Evergreen Broadleaf Forest ,
4:Temperate Evergreen Needleleaf Forest TmpENF,5:Temperate Deciduous Broadleaf Forest,6:Boreal Evergreen Needleleaf Forest,7:Boreal Deciduous Needleleaf Forest, 8:Savanna , 9:C3 Grassland/Steppe, 10:C4 Grassland/Steppe, 11:Dense Shrubland 12:Open Shrubland, 13:Tundra Tundra, 14:Desert,15:Polar Desert/Rock/Ice, 16:Secondary Tropical Evergreen Broadleaf Forest, 17:Secondary Tropical Deciduous Broadleaf Forest, 18:Secondary Temperate Evergreen Broadleaf Forest, 19:Secondary Temperate Evergreen Needleleaf Forest
20:Secondary Temperate Deciduous Broadleaf Forest, 21:Secondary Boreal Evergreen Needleleaf Forest, 22:Secondary Boreal Deciduous Needleleaf Forest, 23:Water/Rivers Water
24:C3 Cropland, 25:C4 Cropland, 26:C3 Pastureland, 27:C4 Pastureland, 28:Urban land)

fire_mag              ,magnitude of fire intensity (scaled version of fire_size)
Temp_pre_30         ,temperature in deg C at the location of fire upto 30 days prior
Temp_pre_15         ,temperature in deg C at the location of fire upto 15 days prior
Temp_pre_7          ,temperature in deg C at the location of fire upto 7 days prior
Temp_cont           ,temperature in deg C at the location of fire upto day the fire was contained
Wind_pre_30         ,wind in m/s at the location of fire upto 30 days prior
Wind_pre_15         ,wind in m/s at the location of fire upto 15 days prior
Wind_pre_7          ,wind in m/s at the location of fire upto 7 days prior
Wind_cont           ,wind in m/s at the location of fire upto day the fire was contained
Hum_pre_30          ,humidity in % at the location of fire upto 30 days prior
Hum_pre_15          ,humidity in % at the location of fire upto 15 days prior
Hum_pre_7           ,humidity in % at the location of fire upto 7 days prior
Hum_cont            ,humidity in % at the location of fire upto day the fire was contained
Prec_pre_30         ,precipitation in mm at the location of fire upto 30 days prior
Prec_pre_15         ,precipitation in mm at the location of fire upto 15 days prior
Prec_pre_7          ,precipitation in mm at the location of fire upto 7 days prior
Prec_cont           ,precipitation in mm at the location of fire upto day the fire was contained
remoteness          ,non-dimensional distance to closest city

We will use the pandas package to read in the csv file.

import pandas as pd

wildfiresdb = pd.read_csv(os.path.join(datapath,'FW_Veg_Rem_Combined.csv'))

Explore the .csv file#

wildfiresdb.head()
Unnamed: 0.1 Unnamed: 0 fire_name fire_size fire_size_class stat_cause_descr latitude longitude state disc_clean_date ... Wind_cont Hum_pre_30 Hum_pre_15 Hum_pre_7 Hum_cont Prec_pre_30 Prec_pre_15 Prec_pre_7 Prec_cont remoteness
0 0 0 NaN 10.0 C Missing/Undefined 18.105072 -66.753044 PR 2/11/2007 ... 3.250413 78.216590 76.793750 76.381579 78.724370 0.0 0.0 0.0 0.0 0.017923
1 1 1 NaN 3.0 B Arson 35.038330 -87.610000 TN 12/11/2006 ... 2.122320 70.840000 65.858911 55.505882 81.682678 59.8 8.4 0.0 86.8 0.184355
2 2 2 NaN 60.0 C Arson 34.947800 -88.722500 MS 2/29/2004 ... 3.369050 75.531629 75.868613 76.812834 65.063800 168.8 42.2 18.1 124.5 0.194544
3 3 3 WNA 1 1.0 B Debris Burning 39.641400 -119.308300 NV 6/6/2005 ... 0.000000 44.778429 37.140811 35.353846 0.000000 10.4 7.2 0.0 0.0 0.487447
4 4 4 NaN 2.0 B Miscellaneous 30.700600 -90.591400 LA 9/22/1999 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.0 -1.0 -1.0 -1.0 0.214633

5 rows × 43 columns
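
The first two columns (Unnamed: 0.1 and Unnamed: 0) look like leftover row indices that were saved along with the csv. We won’t use them in this lesson, since we select a subset of columns below, but if you wanted a tidier table you could drop them; a minimal sketch (wildfiresdb_tidy is just an illustrative name):

# Optional: make a copy of the table without the leftover index columns
wildfiresdb_tidy = wildfiresdb.drop(columns=["Unnamed: 0.1", "Unnamed: 0"])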

wildfiresdb.columns
Index(['Unnamed: 0.1', 'Unnamed: 0', 'fire_name', 'fire_size',
       'fire_size_class', 'stat_cause_descr', 'latitude', 'longitude', 'state',
       'disc_clean_date', 'cont_clean_date', 'discovery_month',
       'disc_date_final', 'cont_date_final', 'putout_time', 'disc_date_pre',
       'disc_pre_year', 'disc_pre_month', 'wstation_usaf', 'dstation_m',
       'wstation_wban', 'wstation_byear', 'wstation_eyear', 'Vegetation',
       'fire_mag', 'weather_file', 'Temp_pre_30', 'Temp_pre_15', 'Temp_pre_7',
       'Temp_cont', 'Wind_pre_30', 'Wind_pre_15', 'Wind_pre_7', 'Wind_cont',
       'Hum_pre_30', 'Hum_pre_15', 'Hum_pre_7', 'Hum_cont', 'Prec_pre_30',
       'Prec_pre_15', 'Prec_pre_7', 'Prec_cont', 'remoteness'],
      dtype='object')
wildfiresdb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55367 entries, 0 to 55366
Data columns (total 43 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0.1      55367 non-null  int64  
 1   Unnamed: 0        55367 non-null  int64  
 2   fire_name         25913 non-null  object 
 3   fire_size         55367 non-null  float64
 4   fire_size_class   55367 non-null  object 
 5   stat_cause_descr  55367 non-null  object 
 6   latitude          55367 non-null  float64
 7   longitude         55367 non-null  float64
 8   state             55367 non-null  object 
 9   disc_clean_date   55367 non-null  object 
 10  cont_clean_date   27477 non-null  object 
 11  discovery_month   55367 non-null  object 
 12  disc_date_final   28708 non-null  object 
 13  cont_date_final   25632 non-null  object 
 14  putout_time       27477 non-null  object 
 15  disc_date_pre     55367 non-null  object 
 16  disc_pre_year     55367 non-null  int64  
 17  disc_pre_month    55367 non-null  object 
 18  wstation_usaf     55367 non-null  object 
 19  dstation_m        55367 non-null  float64
 20  wstation_wban     55367 non-null  int64  
 21  wstation_byear    55367 non-null  int64  
 22  wstation_eyear    55367 non-null  int64  
 23  Vegetation        55367 non-null  int64  
 24  fire_mag          55367 non-null  float64
 25  weather_file      55367 non-null  object 
 26  Temp_pre_30       55367 non-null  float64
 27  Temp_pre_15       55367 non-null  float64
 28  Temp_pre_7        55367 non-null  float64
 29  Temp_cont         55367 non-null  float64
 30  Wind_pre_30       55367 non-null  float64
 31  Wind_pre_15       55367 non-null  float64
 32  Wind_pre_7        55367 non-null  float64
 33  Wind_cont         55367 non-null  float64
 34  Hum_pre_30        55367 non-null  float64
 35  Hum_pre_15        55367 non-null  float64
 36  Hum_pre_7         55367 non-null  float64
 37  Hum_cont          55367 non-null  float64
 38  Prec_pre_30       55367 non-null  float64
 39  Prec_pre_15       55367 non-null  float64
 40  Prec_pre_7        55367 non-null  float64
 41  Prec_cont         55367 non-null  float64
 42  remoteness        55367 non-null  float64
dtypes: float64(22), int64(7), object(14)
memory usage: 18.2+ MB
wildfiresdb.describe()
Unnamed: 0.1 Unnamed: 0 fire_size latitude longitude disc_pre_year dstation_m wstation_wban wstation_byear wstation_eyear ... Wind_cont Hum_pre_30 Hum_pre_15 Hum_pre_7 Hum_cont Prec_pre_30 Prec_pre_15 Prec_pre_7 Prec_cont remoteness
count 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 ... 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000
mean 27683.000000 27683.000000 2104.645161 36.172866 -94.757971 2003.765474 40256.474678 61029.607311 1979.341900 2015.480990 ... 1.132284 40.781796 38.453935 37.001865 25.056738 26.277046 11.654253 4.689920 15.590440 0.236799
std 15983.220514 15983.220514 14777.005364 6.724348 15.878194 6.584889 25272.081410 40830.393541 23.372803 6.767851 ... 2.030611 31.086856 31.042541 30.827885 31.187638 112.050198 56.920510 31.205327 59.757113 0.144865
min 0.000000 0.000000 0.510000 17.956533 -165.936000 1991.000000 6.166452 100.000000 1931.000000 1993.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.000000
25% 13841.500000 13841.500000 1.200000 32.265960 -102.541513 1999.000000 21373.361515 13927.000000 1973.000000 2010.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.137800
50% 27683.000000 27683.000000 4.000000 34.600000 -91.212359 2005.000000 35621.334820 73803.000000 1978.000000 2020.000000 ... 0.000000 55.657480 51.753846 48.230769 0.000000 0.000000 0.000000 0.000000 0.000000 0.202114
75% 41524.500000 41524.500000 20.000000 38.975235 -82.847500 2009.000000 53985.904315 99999.000000 2004.000000 2020.000000 ... 2.848603 67.384352 65.911469 64.645296 60.193606 18.900000 3.600000 0.000000 0.000000 0.284782
max 55366.000000 55366.000000 606945.000000 69.849500 -65.285833 2015.000000 224153.661800 99999.000000 2014.000000 2020.000000 ... 24.200000 96.000000 94.000000 96.000000 94.000000 13560.800000 2527.000000 1638.000000 2126.000000 1.000000

8 rows × 29 columns

Explore the distributions of the variables in our data set#

Since we have 43 variables, we will focus on a subset of these variables for now. We want to understand how the variables are distributed. We’ll focus on the size of the fire, its location, how long it took to put out, the year it was discovered, the type of vegetation, and historical meteorological variables (temperature, wind, humidity, and precipitation) averaged over the time it took to put out the fire.

variables = ["fire_size","fire_size_class","latitude","longitude","putout_time","disc_pre_year","Vegetation","Temp_cont","Wind_cont","Hum_cont","Prec_cont"]

First we will put this subset of the data into a DataFrame object by selecting the variable names we have chosen.

df = wildfiresdb[variables].copy()
df.head()
fire_size fire_size_class latitude longitude putout_time disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont
0 10.0 C 18.105072 -66.753044 NaN 2007 12 24.527961 3.250413 78.724370 0.0
1 3.0 B 35.038330 -87.610000 NaN 2006 15 10.448298 2.122320 81.682678 86.8
2 60.0 C 34.947800 -88.722500 NaN 2004 16 13.696600 3.369050 65.063800 124.5
3 1.0 B 39.641400 -119.308300 0 days 00:00:00.000000000 2005 0 0.000000 0.000000 0.000000 0.0
4 2.0 B 30.700600 -90.591400 NaN 1999 12 -1.000000 -1.000000 -1.000000 -1.0

You can explore the data in various ways.

First, how many samples are in our data set?

print("Number of rows: " + str(len(df)))
Number of rows: 55367

We can use matplotlib to visualize the marginal distributions of our variables using the hist function:

import matplotlib.pyplot as plt

# extra code – the next 5 lines define the default font sizes
plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

df.hist(bins=50, figsize=(12, 8))
#save_fig("attribute_histogram_plots")  # extra code
plt.show()
[Figure: histograms of each selected variable]

Another useful Python package that we can use is seaborn.

Warning! Seaborn can be quite slow, especially if you have a large data set or many variables. One simple way to deal with this is to randomly subsample the data and visualize only a fraction of the data set, as in the sketch below.
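
A minimal sketch of that idea (the 10% fraction here is arbitrary):

# Randomly subsample 10% of the rows to speed up plotting;
# random_state makes the subsample reproducible.
df_sample = df.sample(frac=0.1, random_state=42)

The seaborn calls below could then be given df_sample in place of df.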

import seaborn as sns

We can look at how the variables are correlated with one another using the PairGrid function.

We need both of the lines below to create the plot. The first line constructs a blank grid of subplots, with each row and column corresponding to a numeric variable in the data set. The second line draws a bivariate plot on every axis.

g = sns.PairGrid(df)
g.map(sns.scatterplot)
<seaborn.axisgrid.PairGrid at 0x1514523d0>
[Figure: PairGrid scatter-plot matrix of the selected variables]

We can also look for correlations between the variables in our data set, using the pandas DataFrame corr method. The numeric_only argument needs to be set to True to avoid an error. This makes sure it only uses the variables in our dataframe that have numeric values.

corr_matrix = df.corr(numeric_only=True)
# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6)) # Adjust figure size as needed
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
[Figure: correlation matrix heatmap]

Visualize the data spatially#

Since we have latitude and longitude, we can also visualize the spatial distribution of the points in our data set. Looking at fire size, it’s clear that the largest fires occur for the most part in the Western US.

import numpy as np

df.plot(kind='scatter',x='longitude',y='latitude',grid=True,alpha=0.2,
                s = np.log(df["fire_size"]),label = "Fire Size",
                c = np.log(df["fire_size"]),cmap="jet",colorbar = True, legend = True, sharex=False,figsize=(10,7))
plt.show()
/opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages/matplotlib/collections.py:963: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor
[Figure: scatter map of fire locations, sized and colored by log(fire size)]
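
As an aside, the RuntimeWarning above comes from the marker sizes: np.log(fire_size) is negative for fires smaller than 1 (the minimum fire_size in this data set is 0.51), and matplotlib takes the square root of the sizes. If desired, one way to avoid the warning is to use a shifted log for the sizes; a sketch:

# np.log1p computes log(1 + x), which is non-negative for every fire size here,
# so the marker sizes passed to matplotlib are never negative.
df.plot(kind='scatter', x='longitude', y='latitude', grid=True, alpha=0.2,
        s=np.log1p(df["fire_size"]), label="Fire Size",
        c=np.log1p(df["fire_size"]), cmap="jet", colorbar=True, legend=True,
        sharex=False, figsize=(10, 7))
plt.show()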

Handling missing/irregular data#

The column putout_time contains many NaN values, because data on how long it took to put out the fire is not available for every fire.
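
A quick way to quantify this is to look at the fraction of NaNs in each of our selected columns; a minimal sketch:

# Fraction of missing values in each column, largest first
print(df.isna().mean().sort_values(ascending=False))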

First, we can try to understand why data might be missing.

no_putout_time = pd.isna(df['putout_time'])

Let’s compare the mean size of the fires which have no data for putout_time to the mean size of the fires which do have data:

print("Mean size of fires with NaN for putout_time:")
(df['fire_size'][no_putout_time]).mean()
Mean size of fires with NaN for putout_time:
415.81547412190747
print("Mean size of fires with values for putout_time:")
(df['fire_size'][~no_putout_time]).mean()
Mean size of fires with values for putout_time:
3818.859229537432

This suggests that smaller fires typically don’t have data about putout_time associated with them.

putout_time is currently stored as strings rather than floating point numbers, and unfortunately the strings are not all in the same format. We can see this by printing out the unique values from this column.

unique = df['putout_time'].unique()
unique
array([nan, '0 days 00:00:00.000000000', '1 days 00:00:00.000000000',
       '2 days 00:00:00.000000000', '3 days 00:00:00.000000000',
       '26 days 00:00:00.000000000', '9 days 00:00:00.000000000',
       '4 days 00:00:00.000000000', '18 days 00:00:00.000000000',
       '5 days 00:00:00.000000000', '16 days 00:00:00.000000000',
       '7 days 00:00:00.000000000', '19 days 00:00:00.000000000',
       '12 days 00:00:00.000000000', '6 days 00:00:00.000000000',
       '167 days 00:00:00.000000000', '8 days 00:00:00.000000000',
       '69 days 00:00:00.000000000', '31 days 00:00:00.000000000',
       '36 days 00:00:00.000000000', '29 days 00:00:00.000000000',
       '15 days 00:00:00.000000000', '92 days 00:00:00.000000000',
       '91 days 00:00:00.000000000', '10 days 00:00:00.000000000',
       '11 days 00:00:00.000000000', '63 days 00:00:00.000000000',
       '118 days 00:00:00.000000000', '13 days 00:00:00.000000000',
       '59 days 00:00:00.000000000', '93 days 00:00:00.000000000',
       '28 days 00:00:00.000000000', '127 days 00:00:00.000000000',
       '50 days 00:00:00.000000000', '30 days 00:00:00.000000000',
       '152 days 00:00:00.000000000', '54 days 00:00:00.000000000',
       '25 days 00:00:00.000000000', '47 days 00:00:00.000000000',
       '55 days 00:00:00.000000000', '108 days 00:00:00.000000000',
       '39 days 00:00:00.000000000', '104 days 00:00:00.000000000',
       '17 days 00:00:00.000000000', '80 days 00:00:00.000000000',
       '22 days 00:00:00.000000000', '20 days 00:00:00.000000000',
       '97 days 00:00:00.000000000', '84 days 00:00:00.000000000',
       '81 days 00:00:00.000000000', '67 days 00:00:00.000000000',
       '14 days 00:00:00.000000000', '60 days 00:00:00.000000000',
       '62 days 00:00:00.000000000', '103 days 00:00:00.000000000',
       '24 days 00:00:00.000000000', '193 days 00:00:00.000000000',
       '117 days 00:00:00.000000000', '116 days 00:00:00.000000000',
       '41 days 00:00:00.000000000', '61 days 00:00:00.000000000',
       '145 days 00:00:00.000000000', '106 days 00:00:00.000000000',
       '94 days 00:00:00.000000000', '70 days 00:00:00.000000000',
       '43 days 00:00:00.000000000', '90 days 00:00:00.000000000',
       '35 days 00:00:00.000000000', '40 days 00:00:00.000000000',
       '21 days 00:00:00.000000000', '65 days 00:00:00.000000000',
       '23 days 00:00:00.000000000', '89 days 00:00:00.000000000',
       '114 days 00:00:00.000000000', '57 days 00:00:00.000000000',
       '27 days 00:00:00.000000000', '46 days 00:00:00.000000000',
       '87 days 00:00:00.000000000', '74 days 00:00:00.000000000',
       '37 days 00:00:00.000000000', '42 days 00:00:00.000000000',
       '73 days 00:00:00.000000000', '148 days 00:00:00.000000000',
       '141 days 00:00:00.000000000', '122 days 00:00:00.000000000',
       '66 days 00:00:00.000000000', '44 days 00:00:00.000000000',
       '88 days 00:00:00.000000000', '161 days 00:00:00.000000000',
       '34 days 00:00:00.000000000', '132 days 00:00:00.000000000',
       '140 days 00:00:00.000000000', '45 days 00:00:00.000000000',
       '102 days 00:00:00.000000000', '121 days 00:00:00.000000000',
       '51 days 00:00:00.000000000', '261 days 00:00:00.000000000',
       '77 days 00:00:00.000000000', '48 days 00:00:00.000000000',
       '53 days 00:00:00.000000000', '71 days 00:00:00.000000000',
       '56 days 00:00:00.000000000', '79 days 00:00:00.000000000',
       '85 days 00:00:00.000000000', '222 days 00:00:00.000000000',
       '76 days 00:00:00.000000000', '58 days 00:00:00.000000000',
       '72 days 00:00:00.000000000', '83 days 00:00:00.000000000',
       '32 days 00:00:00.000000000', '131 days 00:00:00.000000000',
       '52 days 00:00:00.000000000', '99 days 00:00:00.000000000',
       '112 days 00:00:00.000000000', '213 days 00:00:00.000000000',
       '135 days 00:00:00.000000000', '168 days 00:00:00.000000000',
       '68 days 00:00:00.000000000', '33 days 00:00:00.000000000',
       '101 days 00:00:00.000000000', '64 days 00:00:00.000000000',
       '38 days 00:00:00.000000000', '115 days 00:00:00.000000000',
       '165 days 00:00:00.000000000', '119 days 00:00:00.000000000',
       '146 days 00:00:00.000000000', '136 days 00:00:00.000000000',
       '110 days 00:00:00.000000000', '255 days 00:00:00.000000000',
       '86 days 00:00:00.000000000', '49 days 00:00:00.000000000',
       '184 days 00:00:00.000000000', '78 days 00:00:00.000000000',
       '96 days 00:00:00.000000000', '3287 days 00:00:00.000000000',
       '211 days 00:00:00.000000000', '178 days 00:00:00.000000000', '21',
       '11', '77', '75', '87', '67', '66', '22', '10', '2', '9', '1', '3',
       '6', '18', '12', '5', '13', '27', '8', '7', '4', '0', '82', '32',
       '80', '49', '60', '99', '91', '84', '78', '54', '71', '51', '30',
       '14', '94', '45', '42', '56', '58', '55', '33', '53', '17', '31',
       '23', '29', '50', '92', '79', '47', '20', '15', '115', '28', '46',
       '73', '72', '83', '90', '88', '89', '64', '81', '19', '37', '39',
       '43', '38', '101', '24', '112', '86', '74', '26', '110', '108',
       '35', '44', '36', '59', '68', '70', '96', '34', '48', '116', '63',
       '40', '98', '85', '16', '102', '131', '142', '128', '76', '25',
       '52', '93', '118', '107', '97', '104', '65', '129', '111', '106',
       '103', '114', '61', '62', '122', '121', '41', '144', '155', '117',
       '57', '105', '119', '113', '138', '100', '312', '125', '95', '132',
       '126', '164', '120', '163', '184', '177', '145', '134', '109',
       '127', '147', '124', '141', '139', '150', '123', '69', '137',
       '148', '136', '151', '140', '133', '146', '204', '178', '143',
       '237', '153', '371', '170'], dtype=object)

Since the information that we are interested in is the first “word” in each string, we can use the string split method to get the number of days it took to put out a fire, and put this into a new column in our data frame. We also want the variable to be a floating point number (not a string) so that we can use it in our (future) machine learning model.

df['putout_time_float'] = df['putout_time'].str.split().str[0].astype(float)
df['putout_time_float']
0         NaN
1         NaN
2         NaN
3         0.0
4         NaN
         ... 
55362     NaN
55363    22.0
55364     NaN
55365    43.0
55366     NaN
Name: putout_time_float, Length: 55367, dtype: float64
df['putout_time_float'].describe()
count    27477.000000
mean         6.033592
std         27.803757
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max       3287.000000
Name: putout_time_float, dtype: float64
df
fire_size fire_size_class latitude longitude putout_time disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float
0 10.0 C 18.105072 -66.753044 NaN 2007 12 24.527961 3.250413 78.724370 0.0 NaN
1 3.0 B 35.038330 -87.610000 NaN 2006 15 10.448298 2.122320 81.682678 86.8 NaN
2 60.0 C 34.947800 -88.722500 NaN 2004 16 13.696600 3.369050 65.063800 124.5 NaN
3 1.0 B 39.641400 -119.308300 0 days 00:00:00.000000000 2005 0 0.000000 0.000000 0.000000 0.0 0.0
4 2.0 B 30.700600 -90.591400 NaN 1999 12 -1.000000 -1.000000 -1.000000 -1.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
55362 6289.0 G 39.180000 -96.784167 NaN 2015 0 13.242324 3.804803 55.042092 249.0 NaN
55363 70868.0 G 38.342719 -120.695967 22 2015 0 -1.000000 -1.000000 -1.000000 -1.0 22.0
55364 5702.0 G 37.262607 -119.511139 NaN 2015 0 27.646067 2.529158 35.924406 0.0 NaN
55365 3261.0 F 40.604300 -123.080450 43 2015 15 -1.000000 -1.000000 -1.000000 -1.0 43.0
55366 76067.0 G 38.843988 -122.759707 NaN 2015 0 19.016883 1.208073 58.063802 18.8 NaN

55367 rows × 12 columns

We’ll drop the original column now, since we have fixed the irregular data.

df = df.drop(['putout_time'],axis=1)
df.head()
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float
0 10.0 C 18.105072 -66.753044 2007 12 24.527961 3.250413 78.724370 0.0 NaN
1 3.0 B 35.038330 -87.610000 2006 15 10.448298 2.122320 81.682678 86.8 NaN
2 60.0 C 34.947800 -88.722500 2004 16 13.696600 3.369050 65.063800 124.5 NaN
3 1.0 B 39.641400 -119.308300 2005 0 0.000000 0.000000 0.000000 0.0 0.0
4 2.0 B 30.700600 -90.591400 1999 12 -1.000000 -1.000000 -1.000000 -1.0 NaN

If we look at the meteorological variables (Prec_cont, Wind_cont, Temp_cont, Hum_cont), we also notice that the distributions look strange. This is because they use 0 and -1 as placeholders for missing values. We can guess this because it is not physically reasonable for these values to be negative, or for Wind_cont, Temp_cont, and Hum_cont to be exactly 0.0000, especially during fire season.

fig, axs = plt.subplots(1, 4, figsize=(12, 3), sharey=True)
axs[0].hist(df["Prec_cont"], bins=50)
axs[1].hist(df["Wind_cont"], bins=50)
axs[2].hist(df["Temp_cont"], bins=50)
axs[3].hist(df["Hum_cont"], bins=50)
axs[0].set_xlabel("Prec_cont")
axs[1].set_xlabel("Wind_cont")
axs[2].set_xlabel("Temp_cont")
axs[3].set_xlabel("Hum_cont")
axs[0].set_ylabel("Number of Fires")

plt.show()
[Figure: histograms of Prec_cont, Wind_cont, Temp_cont, and Hum_cont before replacing the placeholder values]
df["Temp_cont"]
0        24.527961
1        10.448298
2        13.696600
3         0.000000
4        -1.000000
           ...    
55362    13.242324
55363    -1.000000
55364    27.646067
55365    -1.000000
55366    19.016883
Name: Temp_cont, Length: 55367, dtype: float64

We can replace these suspicious values with NaN’s using the replace method.

df["Temp_cont"] = df["Temp_cont"].replace([0.0000,-1.0000], np.nan)
df["Hum_cont"] = df["Hum_cont"].replace([0.0000,-1.0000], np.nan)
df["Wind_cont"] = df["Wind_cont"].replace([0.0000,-1.0000], np.nan)
df["Prec_cont"] = df["Prec_cont"].replace([-1.0000], np.nan)
fig, axs = plt.subplots(1, 4, figsize=(12, 3), sharey=True)
axs[0].hist(df["Prec_cont"], bins=50)
axs[1].hist(df["Wind_cont"], bins=50)
axs[2].hist(df["Temp_cont"], bins=50)
axs[3].hist(df["Hum_cont"], bins=50)
axs[0].set_xlabel("Prec_cont")
axs[1].set_xlabel("Wind_cont")
axs[2].set_xlabel("Temp_cont")
axs[3].set_xlabel("Hum_cont")
axs[0].set_ylabel("Number of Fires")

plt.show()
[Figure: histograms of Prec_cont, Wind_cont, Temp_cont, and Hum_cont after replacing the placeholder values with NaN]

Now that we have fixed the irregular data, we need to figure out what to do with the NaN values.

One very simple way to deal with these missing data is to just drop all of the rows where there are no values for putout_time, using the dropna() method:

df_cleaned = df.dropna().copy()

A disadvantage of dropping rows is that we end up with less data, so other approaches such as imputation can be preferable if our data set is small. After removing all of the rows with missing values, we are down to roughly 10% of our original data set!

print("Number of rows (before removing NaN's): " + str(len(df)))
print("Number of rows (after removing NaN's): " + str(len(df_cleaned)))
Number of rows (before removing NaN's): 55367
Number of rows (after removing NaN's): 5892

We can also remove the very highest outliers for putout_time_float (values above the 99.995th percentile):

q_hi  = df_cleaned['putout_time_float'].quantile(0.99995)
print(q_hi)

df_filtered = df_cleaned[(df_cleaned['putout_time_float'] < q_hi)].dropna().copy()
338.5995000000421
print("Number of rows (before removing outliers): " + str(len(df_cleaned)))
print("Number of rows (after removing outliers): " + str(len(df_filtered)))
Number of rows (before removing outliers): 5892
Number of rows (after removing outliers): 5891

We can also fill the NaN’s with a value instead, using the fillna() method:

df_filled = df.fillna(0).copy()
df_filled 
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float
0 10.0 C 18.105072 -66.753044 2007 12 24.527961 3.250413 78.724370 0.0 0.0
1 3.0 B 35.038330 -87.610000 2006 15 10.448298 2.122320 81.682678 86.8 0.0
2 60.0 C 34.947800 -88.722500 2004 16 13.696600 3.369050 65.063800 124.5 0.0
3 1.0 B 39.641400 -119.308300 2005 0 0.000000 0.000000 0.000000 0.0 0.0
4 2.0 B 30.700600 -90.591400 1999 12 0.000000 0.000000 0.000000 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
55362 6289.0 G 39.180000 -96.784167 2015 0 13.242324 3.804803 55.042092 249.0 0.0
55363 70868.0 G 38.342719 -120.695967 2015 0 0.000000 0.000000 0.000000 0.0 22.0
55364 5702.0 G 37.262607 -119.511139 2015 0 27.646067 2.529158 35.924406 0.0 0.0
55365 3261.0 F 40.604300 -123.080450 2015 15 0.000000 0.000000 0.000000 0.0 43.0
55366 76067.0 G 38.843988 -122.759707 2015 0 19.016883 1.208073 58.063802 18.8 0.0

55367 rows × 11 columns

Rather than filling with a specific value, it’s also possible to fill with the median or mean of the column.

df_fill_mean = df.copy()

meantime = df['putout_time_float'].mean()
# Assign the result back rather than using inplace=True on a column selection,
# which newer versions of pandas discourage.
df_fill_mean['putout_time_float'] = df_fill_mean['putout_time_float'].fillna(meantime)
df_fill_mean
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float
0 10.0 C 18.105072 -66.753044 2007 12 24.527961 3.250413 78.724370 0.0 6.033592
1 3.0 B 35.038330 -87.610000 2006 15 10.448298 2.122320 81.682678 86.8 6.033592
2 60.0 C 34.947800 -88.722500 2004 16 13.696600 3.369050 65.063800 124.5 6.033592
3 1.0 B 39.641400 -119.308300 2005 0 NaN NaN NaN 0.0 0.000000
4 2.0 B 30.700600 -90.591400 1999 12 NaN NaN NaN NaN 6.033592
... ... ... ... ... ... ... ... ... ... ... ...
55362 6289.0 G 39.180000 -96.784167 2015 0 13.242324 3.804803 55.042092 249.0 6.033592
55363 70868.0 G 38.342719 -120.695967 2015 0 NaN NaN NaN NaN 22.000000
55364 5702.0 G 37.262607 -119.511139 2015 0 27.646067 2.529158 35.924406 0.0 6.033592
55365 3261.0 F 40.604300 -123.080450 2015 15 NaN NaN NaN NaN 43.000000
55366 76067.0 G 38.843988 -122.759707 2015 0 19.016883 1.208073 58.063802 18.8 6.033592

55367 rows × 11 columns
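
scikit-learn offers the same idea through its SimpleImputer transformer, which is convenient when imputation needs to be part of a preprocessing pipeline (like the one we build later in this lesson). A minimal sketch, filling the meteorological columns with their column means (df_imputed and met_cols are just illustrative names):

from sklearn.impute import SimpleImputer

# Fill NaNs in the meteorological columns with each column's mean;
# strategy="median" is another common choice.
imputer = SimpleImputer(strategy="mean")
met_cols = ["Temp_cont", "Wind_cont", "Hum_cont", "Prec_cont"]
df_imputed = df.copy()
df_imputed[met_cols] = imputer.fit_transform(df_imputed[met_cols])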

Finally, we can interpolate, using the interpolate() method. Interpolation fills in missing values by estimating them from the surrounding non-NaN data in a column.

df_interp = df.interpolate().copy()

Interpolation doesn’t make a lot of sense for this data set, since the fires are independent of one another. Interpolation makes more sense in the context of a time series, where we are missing specific values in a sequence. To illustrate this, we will look at the effect of using the interpolate() method on a sine wave that has 20% of its data missing.

This code just generates the time series and puts it into a dataframe.

np.random.seed(42)                      
n = 100
t = np.linspace(0, 2 * np.pi, n)         # 0 → 2π
y = np.sin(t)
y_noisy = y + np.random.normal(0, 0.1, n)

# knock out ~20% of points
mask = np.random.choice(n, size=int(0.20 * n), replace=False)
y_noisy[mask] = np.nan

dfsine = pd.DataFrame({"t": t, "sine": y,"noisy_sine": y_noisy})

We can apply the interpolate function to the sine wave to estimate the missing values.

dfsine_interp = dfsine.interpolate().copy()
plt.scatter(dfsine['t'],dfsine['noisy_sine'],marker='o',facecolors='r',edgecolors='r',label="original")
plt.scatter(dfsine_interp['t'],dfsine_interp['noisy_sine'],marker='o',facecolors='none',edgecolors='k',label="interpolated")
plt.xlabel("t")
plt.ylabel("y")
plt.legend()
<matplotlib.legend.Legend at 0x157482640>
[Figure: original noisy sine wave and interpolated values]

Using interpolation or filling missing values can potentially bias our data sets, so it’s important to keep in mind what problem we are interested in solving when choosing a method to deal with missing data.

Feature Scaling#

The scikit-learn package has a number of built-in transformers for scaling variables.

MinMaxScaler scales values to lie between a minimum and a maximum. We can choose the range we want to end up with.

For a feature X, the MinMaxScaler transformation is calculated as

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
latitude_min_max_scaled = min_max_scaler.fit_transform(df[["latitude"]])

The StandardScaler removes the mean and scales to unit variance. For a sample x, the standardized value is calculated as

z = (x - u) / s

where u is the mean of the feature and s is its standard deviation.
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
latitude_std_scaled = std_scaler.fit_transform(df[["latitude"]])
fig, axs = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
axs[0].hist(df["latitude"], bins=50)
axs[1].hist(latitude_min_max_scaled, bins=50)
axs[2].hist(latitude_std_scaled, bins=50)
axs[0].set_xlabel("Latitude")
axs[1].set_xlabel("Latitude (Min_Max_Scaled)")
axs[2].set_xlabel("Latitude (Std_Scaled)")
axs[0].set_ylabel("Number of Fires")

plt.show()
[Figure: histograms of latitude, min-max-scaled latitude, and standardized latitude]

We can also do custom transformations using scikit-learn’s FunctionTransformer. For example, fire_size is not normally distributed, but rather strongly skewed towards smaller fires. We can log-scale this variable using the FunctionTransformer. (The log transform requires strictly positive values; the minimum fire_size in our data is 0.51, so this is safe here.)

from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_fire_size = log_transformer.transform(df_filtered[["fire_size"]])
fig, axs = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axs[0].hist(df_filtered["fire_size"], bins=50)
axs[1].hist(log_fire_size, bins=50)
axs[0].set_xlabel("Fire Size")
axs[1].set_xlabel("Log(Fire Size)")
axs[0].set_ylabel("Number of Fires")

plt.show()
[Figure: histograms of fire size and log(fire size)]
df_filtered["log_fire_size"] = log_fire_size

Encoding categorical variables#

The fire_size_class is a categorical variable (a letter from A to G).

fire_size_class = df[["fire_size_class"]]
fire_size_class.head(8)
fire_size_class
0 C
1 B
2 C
3 B
4 B
5 B
6 B
7 B

We can also check whether our data set is balanced by printing out the number of samples in each class.

fire_size_class_counts = df["fire_size_class"].value_counts()
print(fire_size_class_counts)
fire_size_class
B    36522
C    10811
G     3972
F     1968
D     1394
E      700
Name: count, dtype: int64

There are several different ways that we can encode this categorical variable for machine learning models. One is using the OrdinalEncoder from scikit-learn, which transforms each category (B, C, D, E, …) to an integer label (0, 1, 2, 3, …).

We will end up with a single column per original feature (or variable) in our data set.

OrdinalEncoder can be useful if our categories have a particular ordering to them (like fire size class does). However, it might not be a good choice for a feature like the state in which a fire occurred, since we don’t have a particular ordering to the values in our data set in that case.

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
fire_size_class_encoded = ordinal_encoder.fit_transform(fire_size_class)
fire_size_class
fire_size_class
0 C
1 B
2 C
3 B
4 B
... ...
55362 G
55363 G
55364 G
55365 F
55366 G

55367 rows × 1 columns

fire_size_class_encoded[:8]
array([[1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]])
ordinal_encoder.categories_
[array(['B', 'C', 'D', 'E', 'F', 'G'], dtype=object)]
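
Here the default alphabetical ordering of the categories happens to match the natural ordering of the fire size classes (B through G). If the labels did not sort into the right order alphabetically, we could pass the ordering explicitly; a sketch (explicit_encoder is just an illustrative name):

# Pass the desired category order so the integer codes always reflect
# increasing fire size, regardless of alphabetical order.
explicit_encoder = OrdinalEncoder(categories=[["B", "C", "D", "E", "F", "G"]])
fire_size_class_encoded = explicit_encoder.fit_transform(fire_size_class)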

An alternative way to encode a categorical variable is using the OneHotEncoder from scikit-learn, which converts each category into a separate binary column. For example, a sample of class B would be one-hot encoded as

[1, 0, 0, 0, 0, 0]

or for a sample that is a fire of class E, it would be encoded as

[0, 0, 0, 1, 0, 0]

We will end up with as many columns as we have classes in our data set. For the fire_size_class variable, this would add 6 columns, since there are 6 possible classes represented in our data set.

One-hot encoding is best when our categorical variables are nominal (unordered), and it suits many of the types of models that assume numerical, unstructured inputs (linear models, neural networks), which we will discuss later in the course.

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
fire_size_class_hot = cat_encoder.fit_transform(fire_size_class)
fire_size_class_hot
<55367x6 sparse matrix of type '<class 'numpy.float64'>'
	with 55367 stored elements in Compressed Sparse Row format>
fire_size_class_hot.toarray()
array([[0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1.]])
cat_encoder.categories_
[array(['B', 'C', 'D', 'E', 'F', 'G'], dtype=object)]
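
If we prefer a labeled DataFrame to the sparse matrix, we can combine the dense array with the encoder’s generated column names; for example:

# Wrap the one-hot encoded array in a DataFrame with readable column names
fire_size_class_1hot_df = pd.DataFrame(
    fire_size_class_hot.toarray(),
    columns=cat_encoder.get_feature_names_out(),
    index=fire_size_class.index)
fire_size_class_1hot_df.head()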

Creating a pipeline to preprocess our features#

df_filtered
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float log_fire_size
36 1420.0 F 33.241800 -104.912200 1994 16 36.241176 4.217647 17.058824 0.0 1.0 7.258412
56 15.3 C 36.586500 -96.323220 2015 0 5.258333 3.558333 71.500000 0.0 1.0 2.727853
57 270.0 D 35.473333 -84.450000 1999 15 16.431034 0.982759 67.050000 0.0 1.0 5.598422
59 57.0 C 36.816667 -82.651667 1995 0 13.002083 1.950000 37.854167 0.0 2.0 4.043051
78 450.0 E 34.330000 -117.513056 2006 16 33.900000 2.237500 39.375000 0.0 1.0 6.109248
... ... ... ... ... ... ... ... ... ... ... ... ...
55347 3200.0 F 46.679234 -93.270841 2015 15 10.666667 4.813889 51.583333 0.0 1.0 8.070906
55349 17831.0 G 30.005000 -100.391167 2015 12 28.250580 2.360093 58.243619 0.0 6.0 9.788694
55350 3452.0 F 30.400883 -101.331933 2015 12 28.793750 3.082979 39.510638 0.0 2.0 8.146709
55351 8300.0 G 45.922100 -111.422500 2015 15 11.032231 3.367769 48.628099 0.0 5.0 9.024011
55352 4704.0 F 48.937900 -119.309100 2015 15 21.204970 4.537914 36.257773 4.4 140.0 8.456168

5891 rows × 12 columns

Let’s say that we are interested in predicting the time it takes to put out a fire, given the other variables in our data set. First, we will scale putout_time_float using the standard scaler and put it into an array called y, since it’s our target variable.

std_scaler = StandardScaler()
y = std_scaler.fit_transform(df_filtered[["putout_time_float"]])

Next we will create a feature matrix X, which will be the input to our model. First let’s remove our target variable from the dataframe.

features = df_filtered.drop(["putout_time_float"], axis=1)
features
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont log_fire_size
36 1420.0 F 33.241800 -104.912200 1994 16 36.241176 4.217647 17.058824 0.0 7.258412
56 15.3 C 36.586500 -96.323220 2015 0 5.258333 3.558333 71.500000 0.0 2.727853
57 270.0 D 35.473333 -84.450000 1999 15 16.431034 0.982759 67.050000 0.0 5.598422
59 57.0 C 36.816667 -82.651667 1995 0 13.002083 1.950000 37.854167 0.0 4.043051
78 450.0 E 34.330000 -117.513056 2006 16 33.900000 2.237500 39.375000 0.0 6.109248
... ... ... ... ... ... ... ... ... ... ... ...
55347 3200.0 F 46.679234 -93.270841 2015 15 10.666667 4.813889 51.583333 0.0 8.070906
55349 17831.0 G 30.005000 -100.391167 2015 12 28.250580 2.360093 58.243619 0.0 9.788694
55350 3452.0 F 30.400883 -101.331933 2015 12 28.793750 3.082979 39.510638 0.0 8.146709
55351 8300.0 G 45.922100 -111.422500 2015 15 11.032231 3.367769 48.628099 0.0 9.024011
55352 4704.0 F 48.937900 -119.309100 2015 15 21.204970 4.537914 36.257773 4.4 8.456168

5891 rows × 11 columns

Now we can apply a pipeline of transformations to our data set. scikit-learn allows us to create separate pipelines for categorical and numerical variables and then run them on our entire dataframe simultaneously.

from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

First, though, we need to separate our variables based on how we want to transform them.

categorical_cols = ["fire_size_class"]
numerical_cols = ["fire_size","latitude","longitude","disc_pre_year","Vegetation","Temp_cont","Wind_cont","Hum_cont","Prec_cont","log_fire_size"]
cat_pipeline = make_pipeline(OrdinalEncoder(),StandardScaler())
num_pipeline = make_pipeline(MinMaxScaler())
preprocessor = ColumnTransformer([
    ("num",num_pipeline,numerical_cols),
    ("cat",cat_pipeline,categorical_cols)])
X = preprocessor.fit_transform(features)
X.shape
(5891, 11)
preprocessor.get_feature_names_out()
array(['num__fire_size', 'num__latitude', 'num__longitude',
       'num__disc_pre_year', 'num__Vegetation', 'num__Temp_cont',
       'num__Wind_cont', 'num__Hum_cont', 'num__Prec_cont',
       'num__log_fire_size', 'cat__fire_size_class'], dtype=object)
g = sns.PairGrid(pd.DataFrame(X,columns = preprocessor.get_feature_names_out()))
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
<seaborn.axisgrid.PairGrid at 0x161139880>
[Figure: PairGrid of the preprocessed features]
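
If the numeric columns still contained NaNs at this stage (for example, if we had filled rather than dropped missing rows earlier), an imputation step could be added to the numeric pipeline; a sketch under that assumption:

from sklearn.impute import SimpleImputer

# Numeric pipeline that imputes missing values before scaling
num_pipeline_with_impute = make_pipeline(SimpleImputer(strategy="median"),
                                         MinMaxScaler())
preprocessor_with_impute = ColumnTransformer([
    ("num", num_pipeline_with_impute, numerical_cols),
    ("cat", cat_pipeline, categorical_cols)])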

Creating a training, validation, and test data set#

Scikit-learn has several built-in functions to facilitate creating separate training and test data sets. The simplest one to use is train_test_split, which returns the features and targets split into train and test data sets. If we also want to create a validation data set using this function, we will need to split the data set twice.

from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X,y,test_size=0.5, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test_val,y_test_val,test_size=0.5, random_state=42)
print(X_train.shape,X_val.shape,X_test.shape)
(2945, 11) (1473, 11) (1473, 11)
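
As a quick sanity check, we can confirm the resulting split fractions (here 50% / 25% / 25% of the filtered data set):

# Confirm the relative sizes of the train, validation, and test sets
n_total = len(X)
print(f"Train: {len(X_train)/n_total:.0%}, "
      f"Validation: {len(X_val)/n_total:.0%}, "
      f"Test: {len(X_test)/n_total:.0%}")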