Tutorial on Data Preprocessing (US Wildfires)#

In this lesson, we will go through an example of how to do initial exploratory data analysis and data preprocessing for machine learning. To do this, we will use a data set of US wildfires from 1990 - 2016. This data set includes the location and time of roughly 55,000 recent wildfires, as well as information about the type of vegetation and co-located meteorological data from the time when the fires occurred. It is a subset of a much larger data set of 1.8 million US wildfires.

The data set is available on Kaggle: US Wildfires and other attributes

We’ll start by downloading the data set, and then we will explore the variables in the data set and preprocess the data for use with machine learning using the pandas and scikit-learn Python packages.

Download the data set#

# To facilitate downloading data from Kaggle, we can install this python package
!pip install kagglehub
Requirement already satisfied: kagglehub in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (0.2.9)
Requirement already satisfied: packaging in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from kagglehub) (24.1)
Requirement already satisfied: requests in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from kagglehub) (2.32.3)
Requirement already satisfied: tqdm in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from kagglehub) (4.66.5)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from requests->kagglehub) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from requests->kagglehub) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from requests->kagglehub) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages (from requests->kagglehub) (2024.8.30)
import kagglehub
import os
datapath = kagglehub.dataset_download("capcloudcoder/us-wildfire-data-plus-other-attributes")

print("Path to dataset files:", datapath)
Warning: Looks like you're using an outdated `kagglehub` version, please consider updating (latest version: 0.3.12)
Path to dataset files: /Users/karalamb/.cache/kagglehub/datasets/capcloudcoder/us-wildfire-data-plus-other-attributes/versions/4
os.listdir(datapath)
['FW_Veg_Rem_Combined.csv', 'Wildfire_att_description.txt']

The data set that we downloaded contains two files. The first is a CSV file that contains the variables and data, and the second is a text file that provides metadata describing the variables contained in the CSV file. We will start by loading and printing the metadata from the text file.

with open(os.path.join(datapath,'Wildfire_att_description.txt'), 'r') as file:
    content = file.read()
print(content)
fire_name            ,Name of Fire
fire_size           ,Size of Fire 
fire_size_class      ,Class of Fire Size (A-G)
stat_cause_descr     ,Cause of Fire
latitude            ,Latitude of Fire
longitude           ,Longitude of Fire
state                ,State of Fire
discovery_month      ,Month in which Fire was discovered
putout_time         ,time it took to putout the fire
disc_pre_year         ,year in which the fire was discovered

Vegetation            ,Dominant vegetation in the areas (1:Tropical Evergreen Broadleaf Forest,2:Tropical Deciduous Broadleaf Forest,3:Temperate Evergreen Broadleaf Forest ,
4:Temperate Evergreen Needleleaf Forest TmpENF,5:Temperate Deciduous Broadleaf Forest,6:Boreal Evergreen Needleleaf Forest,7:Boreal Deciduous Needleleaf Forest, 8:Savanna , 9:C3 Grassland/Steppe, 10:C4 Grassland/Steppe, 11:Dense Shrubland 12:Open Shrubland, 13:Tundra Tundra, 14:Desert,15:Polar Desert/Rock/Ice, 16:Secondary Tropical Evergreen Broadleaf Forest, 17:Secondary Tropical Deciduous Broadleaf Forest, 18:Secondary Temperate Evergreen Broadleaf Forest, 19:Secondary Temperate Evergreen Needleleaf Forest
20:Secondary Temperate Deciduous Broadleaf Forest, 21:Secondary Boreal Evergreen Needleleaf Forest, 22:Secondary Boreal Deciduous Needleleaf Forest, 23:Water/Rivers Water
24:C3 Cropland, 25:C4 Cropland, 26:C3 Pastureland, 27:C4 Pastureland, 28:Urban land)

fire_mag              ,magnitude of fire intensity (scaled version of fire_size)
Temp_pre_30         ,temperature in deg C at the location of fire upto 30 days prior
Temp_pre_15         ,temperature in deg C at the location of fire upto 15 days prior
Temp_pre_7          ,temperature in deg C at the location of fire upto 7 days prior
Temp_cont           ,temperature in deg C at the location of fire upto day the fire was contained
Wind_pre_30         ,wind in m/s at the location of fire upto 30 days prior
Wind_pre_15         ,wind in m/s at the location of fire upto 15 days prior
Wind_pre_7          ,wind in m/s at the location of fire upto 7 days prior
Wind_cont           ,wind in m/s at the location of fire upto day the fire was contained
Hum_pre_30          ,humidity in % at the location of fire upto 30 days prior
Hum_pre_15          ,humidity in % at the location of fire upto 15 days prior
Hum_pre_7           ,humidity in % at the location of fire upto 7 days prior
Hum_cont            ,humidity in % at the location of fire upto day the fire was contained
Prec_pre_30         ,precipitation in mm at the location of fire upto 30 days prior
Prec_pre_15         ,precipitation in mm at the location of fire upto 15 days prior
Prec_pre_7          ,precipitation in mm at the location of fire upto 7 days prior
Prec_cont           ,precipitation in mm at the location of fire upto day the fire was contained
remoteness          ,non-dimensional distance to closest city

We will use the pandas package to read in the csv file.

import pandas as pd

wildfiresdb = pd.read_csv(os.path.join(datapath,'FW_Veg_Rem_Combined.csv'))

Explore the .csv file#

wildfiresdb.head()
Unnamed: 0.1 Unnamed: 0 fire_name fire_size fire_size_class stat_cause_descr latitude longitude state disc_clean_date ... Wind_cont Hum_pre_30 Hum_pre_15 Hum_pre_7 Hum_cont Prec_pre_30 Prec_pre_15 Prec_pre_7 Prec_cont remoteness
0 0 0 NaN 10.0 C Missing/Undefined 18.105072 -66.753044 PR 2/11/2007 ... 3.250413 78.216590 76.793750 76.381579 78.724370 0.0 0.0 0.0 0.0 0.017923
1 1 1 NaN 3.0 B Arson 35.038330 -87.610000 TN 12/11/2006 ... 2.122320 70.840000 65.858911 55.505882 81.682678 59.8 8.4 0.0 86.8 0.184355
2 2 2 NaN 60.0 C Arson 34.947800 -88.722500 MS 2/29/2004 ... 3.369050 75.531629 75.868613 76.812834 65.063800 168.8 42.2 18.1 124.5 0.194544
3 3 3 WNA 1 1.0 B Debris Burning 39.641400 -119.308300 NV 6/6/2005 ... 0.000000 44.778429 37.140811 35.353846 0.000000 10.4 7.2 0.0 0.0 0.487447
4 4 4 NaN 2.0 B Miscellaneous 30.700600 -90.591400 LA 9/22/1999 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.0 -1.0 -1.0 -1.0 0.214633

5 rows × 43 columns
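
The first two columns (Unnamed: 0.1 and Unnamed: 0) look like leftover row indices that were saved along with the csv. We won’t use them in this lesson, since we select a subset of columns below, but if you wanted a tidier table you could drop them; a minimal sketch (wildfiresdb_tidy is just an illustrative name):

# Optional: make a copy of the table without the leftover index columns
wildfiresdb_tidy = wildfiresdb.drop(columns=["Unnamed: 0.1", "Unnamed: 0"])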

wildfiresdb.columns
Index(['Unnamed: 0.1', 'Unnamed: 0', 'fire_name', 'fire_size',
       'fire_size_class', 'stat_cause_descr', 'latitude', 'longitude', 'state',
       'disc_clean_date', 'cont_clean_date', 'discovery_month',
       'disc_date_final', 'cont_date_final', 'putout_time', 'disc_date_pre',
       'disc_pre_year', 'disc_pre_month', 'wstation_usaf', 'dstation_m',
       'wstation_wban', 'wstation_byear', 'wstation_eyear', 'Vegetation',
       'fire_mag', 'weather_file', 'Temp_pre_30', 'Temp_pre_15', 'Temp_pre_7',
       'Temp_cont', 'Wind_pre_30', 'Wind_pre_15', 'Wind_pre_7', 'Wind_cont',
       'Hum_pre_30', 'Hum_pre_15', 'Hum_pre_7', 'Hum_cont', 'Prec_pre_30',
       'Prec_pre_15', 'Prec_pre_7', 'Prec_cont', 'remoteness'],
      dtype='object')
wildfiresdb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55367 entries, 0 to 55366
Data columns (total 43 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0.1      55367 non-null  int64  
 1   Unnamed: 0        55367 non-null  int64  
 2   fire_name         25913 non-null  object 
 3   fire_size         55367 non-null  float64
 4   fire_size_class   55367 non-null  object 
 5   stat_cause_descr  55367 non-null  object 
 6   latitude          55367 non-null  float64
 7   longitude         55367 non-null  float64
 8   state             55367 non-null  object 
 9   disc_clean_date   55367 non-null  object 
 10  cont_clean_date   27477 non-null  object 
 11  discovery_month   55367 non-null  object 
 12  disc_date_final   28708 non-null  object 
 13  cont_date_final   25632 non-null  object 
 14  putout_time       27477 non-null  object 
 15  disc_date_pre     55367 non-null  object 
 16  disc_pre_year     55367 non-null  int64  
 17  disc_pre_month    55367 non-null  object 
 18  wstation_usaf     55367 non-null  object 
 19  dstation_m        55367 non-null  float64
 20  wstation_wban     55367 non-null  int64  
 21  wstation_byear    55367 non-null  int64  
 22  wstation_eyear    55367 non-null  int64  
 23  Vegetation        55367 non-null  int64  
 24  fire_mag          55367 non-null  float64
 25  weather_file      55367 non-null  object 
 26  Temp_pre_30       55367 non-null  float64
 27  Temp_pre_15       55367 non-null  float64
 28  Temp_pre_7        55367 non-null  float64
 29  Temp_cont         55367 non-null  float64
 30  Wind_pre_30       55367 non-null  float64
 31  Wind_pre_15       55367 non-null  float64
 32  Wind_pre_7        55367 non-null  float64
 33  Wind_cont         55367 non-null  float64
 34  Hum_pre_30        55367 non-null  float64
 35  Hum_pre_15        55367 non-null  float64
 36  Hum_pre_7         55367 non-null  float64
 37  Hum_cont          55367 non-null  float64
 38  Prec_pre_30       55367 non-null  float64
 39  Prec_pre_15       55367 non-null  float64
 40  Prec_pre_7        55367 non-null  float64
 41  Prec_cont         55367 non-null  float64
 42  remoteness        55367 non-null  float64
dtypes: float64(22), int64(7), object(14)
memory usage: 18.2+ MB
wildfiresdb.describe()
Unnamed: 0.1 Unnamed: 0 fire_size latitude longitude disc_pre_year dstation_m wstation_wban wstation_byear wstation_eyear ... Wind_cont Hum_pre_30 Hum_pre_15 Hum_pre_7 Hum_cont Prec_pre_30 Prec_pre_15 Prec_pre_7 Prec_cont remoteness
count 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 ... 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000 55367.000000
mean 27683.000000 27683.000000 2104.645161 36.172866 -94.757971 2003.765474 40256.474678 61029.607311 1979.341900 2015.480990 ... 1.132284 40.781796 38.453935 37.001865 25.056738 26.277046 11.654253 4.689920 15.590440 0.236799
std 15983.220514 15983.220514 14777.005364 6.724348 15.878194 6.584889 25272.081410 40830.393541 23.372803 6.767851 ... 2.030611 31.086856 31.042541 30.827885 31.187638 112.050198 56.920510 31.205327 59.757113 0.144865
min 0.000000 0.000000 0.510000 17.956533 -165.936000 1991.000000 6.166452 100.000000 1931.000000 1993.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.000000
25% 13841.500000 13841.500000 1.200000 32.265960 -102.541513 1999.000000 21373.361515 13927.000000 1973.000000 2010.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.137800
50% 27683.000000 27683.000000 4.000000 34.600000 -91.212359 2005.000000 35621.334820 73803.000000 1978.000000 2020.000000 ... 0.000000 55.657480 51.753846 48.230769 0.000000 0.000000 0.000000 0.000000 0.000000 0.202114
75% 41524.500000 41524.500000 20.000000 38.975235 -82.847500 2009.000000 53985.904315 99999.000000 2004.000000 2020.000000 ... 2.848603 67.384352 65.911469 64.645296 60.193606 18.900000 3.600000 0.000000 0.000000 0.284782
max 55366.000000 55366.000000 606945.000000 69.849500 -65.285833 2015.000000 224153.661800 99999.000000 2014.000000 2020.000000 ... 24.200000 96.000000 94.000000 96.000000 94.000000 13560.800000 2527.000000 1638.000000 2126.000000 1.000000

8 rows × 29 columns

Explore the distributions of the variables in our data set#

Since we have 43 variables, we will focus on a subset of these variables for now. We want to understand how the variables are distributed. We’ll focus on the size of the fire, its location, how long it took to put out, the year it was discovered, the type of vegetation, and historical meteorological variables (temperature, wind, humidity, and precipitation) averaged over the time it took to put out the fire.

variables = ["fire_size","fire_size_class","latitude","longitude","putout_time","disc_pre_year","Vegetation","Temp_cont","Wind_cont","Hum_cont","Prec_cont"]

First we will put this subset of the data into a DataFrame object by selecting the variable names we have chosen.

df = wildfiresdb[variables].copy()
df.head()
fire_size fire_size_class latitude longitude putout_time disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont
0 10.0 C 18.105072 -66.753044 NaN 2007 12 24.527961 3.250413 78.724370 0.0
1 3.0 B 35.038330 -87.610000 NaN 2006 15 10.448298 2.122320 81.682678 86.8
2 60.0 C 34.947800 -88.722500 NaN 2004 16 13.696600 3.369050 65.063800 124.5
3 1.0 B 39.641400 -119.308300 0 days 00:00:00.000000000 2005 0 0.000000 0.000000 0.000000 0.0
4 2.0 B 30.700600 -90.591400 NaN 1999 12 -1.000000 -1.000000 -1.000000 -1.0

You can explore the data in various ways.

First, how many samples are in our data set?

print("Number of rows: " + str(len(df)))
Number of rows: 55367

We can use matplotlib to visualize the marginal distributions of our variables using the hist function:

import matplotlib.pyplot as plt

# extra code – the next 5 lines define the default font sizes
plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

df.hist(bins=50, figsize=(12, 8))
#save_fig("attribute_histogram_plots")  # extra code
plt.show()
[Figure: histograms of each selected variable]

Another useful Python package that we can use is seaborn.

Warning! Seaborn can be quite slow, especially if you have a large data set or many variables. One simple way to deal with this is to randomly subsample the data and visualize only a fraction of the data set, as in the sketch below.
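
A minimal sketch of that idea (the 10% fraction here is arbitrary):

# Randomly subsample 10% of the rows to speed up plotting;
# random_state makes the subsample reproducible.
df_sample = df.sample(frac=0.1, random_state=42)

The seaborn calls below could then be given df_sample in place of df.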

import seaborn as sns

We can look at how the variables are correlated with one another using the PairGrid function.

We need both of the lines below to create the plot. The first line constructs a blank grid of subplots, with each row and column corresponding to a numeric variable in the data set. The second line draws a bivariate plot on every axis.

g = sns.PairGrid(df)
g.map(sns.scatterplot)
<seaborn.axisgrid.PairGrid at 0x1514523d0>
[Figure: PairGrid scatter-plot matrix of the selected variables]

We can also look for correlations between the variables in our data set, using the pandas DataFrame corr method. The numeric_only argument needs to be set to True to avoid an error. This makes sure it only uses the variables in our dataframe that have numeric values.

corr_matrix = df.corr(numeric_only=True)
# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6)) # Adjust figure size as needed
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
[Figure: correlation matrix heatmap]

Visualize the data spatially#

Since we have latitude and longitude, we can also visualize the spatial distribution of the points in our data set. Looking at fire size, it’s clear that the largest fires occur for the most part in the Western US.

import numpy as np

df.plot(kind='scatter',x='longitude',y='latitude',grid=True,alpha=0.2,
                s = np.log(df["fire_size"]),label = "Fire Size",
                c = np.log(df["fire_size"]),cmap="jet",colorbar = True, legend = True, sharex=False,figsize=(10,7))
plt.show()
/opt/anaconda3/envs/ML4Climate2025/lib/python3.8/site-packages/matplotlib/collections.py:963: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor
[Figure: scatter map of fire locations, sized and colored by log(fire size)]
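
As an aside, the RuntimeWarning above comes from the marker sizes: np.log(fire_size) is negative for fires smaller than 1 (the minimum fire_size in this data set is 0.51), and matplotlib takes the square root of the sizes. If desired, one way to avoid the warning is to use a shifted log for the sizes; a sketch:

# np.log1p computes log(1 + x), which is non-negative for every fire size here,
# so the marker sizes passed to matplotlib are never negative.
df.plot(kind='scatter', x='longitude', y='latitude', grid=True, alpha=0.2,
        s=np.log1p(df["fire_size"]), label="Fire Size",
        c=np.log1p(df["fire_size"]), cmap="jet", colorbar=True, legend=True,
        sharex=False, figsize=(10, 7))
plt.show()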

Handling missing/irregular data#

The column putout_time contains many NaN values, because data on how long it took to put out the fire is not available for every fire.
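
A quick way to quantify this is to look at the fraction of NaNs in each of our selected columns; a minimal sketch:

# Fraction of missing values in each column, largest first
print(df.isna().mean().sort_values(ascending=False))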

First, we can try to understand why data might be missing.

no_putout_time = pd.isna(df['putout_time'])

Let’s compare the mean size of the fires which have no data for putout_time to the mean size of the fires which do have data:

print("Mean size of fires with NaN for putout_time:")
(df['fire_size'][no_putout_time]).mean()
Mean size of fires with NaN for putout_time:
415.81547412190747
print("Mean size of fires with values for putout_time:")
(df['fire_size'][~no_putout_time]).mean()
Mean size of fires with values for putout_time:
3818.859229537432

This suggests that smaller fires typically don’t have data about putout_time associated with them.

putout_time is currently stored as strings rather than floating point numbers, and unfortunately the strings are not all in the same format. We can see this by printing out the unique values from this column.

unique = df['putout_time'].unique()
unique
array([nan, '0 days 00:00:00.000000000', '1 days 00:00:00.000000000',
       '2 days 00:00:00.000000000', '3 days 00:00:00.000000000',
       '26 days 00:00:00.000000000', '9 days 00:00:00.000000000',
       '4 days 00:00:00.000000000', '18 days 00:00:00.000000000',
       '5 days 00:00:00.000000000', '16 days 00:00:00.000000000',
       '7 days 00:00:00.000000000', '19 days 00:00:00.000000000',
       '12 days 00:00:00.000000000', '6 days 00:00:00.000000000',
       '167 days 00:00:00.000000000', '8 days 00:00:00.000000000',
       '69 days 00:00:00.000000000', '31 days 00:00:00.000000000',
       '36 days 00:00:00.000000000', '29 days 00:00:00.000000000',
       '15 days 00:00:00.000000000', '92 days 00:00:00.000000000',
       '91 days 00:00:00.000000000', '10 days 00:00:00.000000000',
       '11 days 00:00:00.000000000', '63 days 00:00:00.000000000',
       '118 days 00:00:00.000000000', '13 days 00:00:00.000000000',
       '59 days 00:00:00.000000000', '93 days 00:00:00.000000000',
       '28 days 00:00:00.000000000', '127 days 00:00:00.000000000',
       '50 days 00:00:00.000000000', '30 days 00:00:00.000000000',
       '152 days 00:00:00.000000000', '54 days 00:00:00.000000000',
       '25 days 00:00:00.000000000', '47 days 00:00:00.000000000',
       '55 days 00:00:00.000000000', '108 days 00:00:00.000000000',
       '39 days 00:00:00.000000000', '104 days 00:00:00.000000000',
       '17 days 00:00:00.000000000', '80 days 00:00:00.000000000',
       '22 days 00:00:00.000000000', '20 days 00:00:00.000000000',
       '97 days 00:00:00.000000000', '84 days 00:00:00.000000000',
       '81 days 00:00:00.000000000', '67 days 00:00:00.000000000',
       '14 days 00:00:00.000000000', '60 days 00:00:00.000000000',
       '62 days 00:00:00.000000000', '103 days 00:00:00.000000000',
       '24 days 00:00:00.000000000', '193 days 00:00:00.000000000',
       '117 days 00:00:00.000000000', '116 days 00:00:00.000000000',
       '41 days 00:00:00.000000000', '61 days 00:00:00.000000000',
       '145 days 00:00:00.000000000', '106 days 00:00:00.000000000',
       '94 days 00:00:00.000000000', '70 days 00:00:00.000000000',
       '43 days 00:00:00.000000000', '90 days 00:00:00.000000000',
       '35 days 00:00:00.000000000', '40 days 00:00:00.000000000',
       '21 days 00:00:00.000000000', '65 days 00:00:00.000000000',
       '23 days 00:00:00.000000000', '89 days 00:00:00.000000000',
       '114 days 00:00:00.000000000', '57 days 00:00:00.000000000',
       '27 days 00:00:00.000000000', '46 days 00:00:00.000000000',
       '87 days 00:00:00.000000000', '74 days 00:00:00.000000000',
       '37 days 00:00:00.000000000', '42 days 00:00:00.000000000',
       '73 days 00:00:00.000000000', '148 days 00:00:00.000000000',
       '141 days 00:00:00.000000000', '122 days 00:00:00.000000000',
       '66 days 00:00:00.000000000', '44 days 00:00:00.000000000',
       '88 days 00:00:00.000000000', '161 days 00:00:00.000000000',
       '34 days 00:00:00.000000000', '132 days 00:00:00.000000000',
       '140 days 00:00:00.000000000', '45 days 00:00:00.000000000',
       '102 days 00:00:00.000000000', '121 days 00:00:00.000000000',
       '51 days 00:00:00.000000000', '261 days 00:00:00.000000000',
       '77 days 00:00:00.000000000', '48 days 00:00:00.000000000',
       '53 days 00:00:00.000000000', '71 days 00:00:00.000000000',
       '56 days 00:00:00.000000000', '79 days 00:00:00.000000000',
       '85 days 00:00:00.000000000', '222 days 00:00:00.000000000',
       '76 days 00:00:00.000000000', '58 days 00:00:00.000000000',
       '72 days 00:00:00.000000000', '83 days 00:00:00.000000000',
       '32 days 00:00:00.000000000', '131 days 00:00:00.000000000',
       '52 days 00:00:00.000000000', '99 days 00:00:00.000000000',
       '112 days 00:00:00.000000000', '213 days 00:00:00.000000000',
       '135 days 00:00:00.000000000', '168 days 00:00:00.000000000',
       '68 days 00:00:00.000000000', '33 days 00:00:00.000000000',
       '101 days 00:00:00.000000000', '64 days 00:00:00.000000000',
       '38 days 00:00:00.000000000', '115 days 00:00:00.000000000',
       '165 days 00:00:00.000000000', '119 days 00:00:00.000000000',
       '146 days 00:00:00.000000000', '136 days 00:00:00.000000000',
       '110 days 00:00:00.000000000', '255 days 00:00:00.000000000',
       '86 days 00:00:00.000000000', '49 days 00:00:00.000000000',
       '184 days 00:00:00.000000000', '78 days 00:00:00.000000000',
       '96 days 00:00:00.000000000', '3287 days 00:00:00.000000000',
       '211 days 00:00:00.000000000', '178 days 00:00:00.000000000', '21',
       '11', '77', '75', '87', '67', '66', '22', '10', '2', '9', '1', '3',
       '6', '18', '12', '5', '13', '27', '8', '7', '4', '0', '82', '32',
       '80', '49', '60', '99', '91', '84', '78', '54', '71', '51', '30',
       '14', '94', '45', '42', '56', '58', '55', '33', '53', '17', '31',
       '23', '29', '50', '92', '79', '47', '20', '15', '115', '28', '46',
       '73', '72', '83', '90', '88', '89', '64', '81', '19', '37', '39',
       '43', '38', '101', '24', '112', '86', '74', '26', '110', '108',
       '35', '44', '36', '59', '68', '70', '96', '34', '48', '116', '63',
       '40', '98', '85', '16', '102', '131', '142', '128', '76', '25',
       '52', '93', '118', '107', '97', '104', '65', '129', '111', '106',
       '103', '114', '61', '62', '122', '121', '41', '144', '155', '117',
       '57', '105', '119', '113', '138', '100', '312', '125', '95', '132',
       '126', '164', '120', '163', '184', '177', '145', '134', '109',
       '127', '147', '124', '141', '139', '150', '123', '69', '137',
       '148', '136', '151', '140', '133', '146', '204', '178', '143',
       '237', '153', '371', '170'], dtype=object)

Since the information that we are interested in is the first “word” in each string, we can use the string split method to get the number of days it took to put out a fire, and put this into a new column in our data frame. We also want the variable to be a floating point number (not a string) so that we can use it in our (future) machine learning model.

df['putout_time_float'] = df['putout_time'].str.split().str[0].astype(float)
df['putout_time_float']
0         NaN
1         NaN
2         NaN
3         0.0
4         NaN
         ... 
55362     NaN
55363    22.0
55364     NaN
55365    43.0
55366     NaN
Name: putout_time_float, Length: 55367, dtype: float64
df['putout_time_float'].describe()
count    27477.000000
mean         6.033592
std         27.803757
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max       3287.000000
Name: putout_time_float, dtype: float64
df
fire_size fire_size_class latitude longitude putout_time disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float
0 10.0 C 18.105072 -66.753044 NaN 2007 12 24.527961 3.250413 78.724370 0.0 NaN
1 3.0 B 35.038330 -87.610000 NaN 2006 15 10.448298 2.122320 81.682678 86.8 NaN
2 60.0 C 34.947800 -88.722500 NaN 2004 16 13.696600 3.369050 65.063800 124.5 NaN
3 1.0 B 39.641400 -119.308300 0 days 00:00:00.000000000 2005 0 0.000000 0.000000 0.000000 0.0 0.0
4 2.0 B 30.700600 -90.591400 NaN 1999 12 -1.000000 -1.000000 -1.000000 -1.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
55362 6289.0 G 39.180000 -96.784167 NaN 2015 0 13.242324 3.804803 55.042092 249.0 NaN
55363 70868.0 G 38.342719 -120.695967 22 2015 0 -1.000000 -1.000000 -1.000000 -1.0 22.0
55364 5702.0 G 37.262607 -119.511139 NaN 2015 0 27.646067 2.529158 35.924406 0.0 NaN
55365 3261.0 F 40.604300 -123.080450 43 2015 15 -1.000000 -1.000000 -1.000000 -1.0 43.0
55366 76067.0 G 38.843988 -122.759707 NaN 2015 0 19.016883 1.208073 58.063802 18.8 NaN

55367 rows × 12 columns

We’ll drop the original column now, since we have fixed the irregular data.

df = df.drop(['putout_time'],axis=1)
df.head()
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float
0 10.0 C 18.105072 -66.753044 2007 12 24.527961 3.250413 78.724370 0.0 NaN
1 3.0 B 35.038330 -87.610000 2006 15 10.448298 2.122320 81.682678 86.8 NaN
2 60.0 C 34.947800 -88.722500 2004 16 13.696600 3.369050 65.063800 124.5 NaN
3 1.0 B 39.641400 -119.308300 2005 0 0.000000 0.000000 0.000000 0.0 0.0
4 2.0 B 30.700600 -90.591400 1999 12 -1.000000 -1.000000 -1.000000 -1.0 NaN

If we look at the meteorological variables (Prec_cont, Wind_cont, Temp_cont, Hum_cont), we also notice that the distributions look strange. This is because they use 0 and -1 as placeholders for missing values. We can guess this because it is not physically reasonable for these values to be negative, or for Wind_cont, Temp_cont, and Hum_cont to be exactly 0.0000, especially during fire season.

fig, axs = plt.subplots(1, 4, figsize=(12, 3), sharey=True)
axs[0].hist(df["Prec_cont"], bins=50)
axs[1].hist(df["Wind_cont"], bins=50)
axs[2].hist(df["Temp_cont"], bins=50)
axs[3].hist(df["Hum_cont"], bins=50)
axs[0].set_xlabel("Prec_cont")
axs[1].set_xlabel("Wind_cont")
axs[2].set_xlabel("Temp_cont")
axs[3].set_xlabel("Hum_cont")
axs[0].set_ylabel("Number of Fires")

plt.show()
[Figure: histograms of Prec_cont, Wind_cont, Temp_cont, and Hum_cont before replacing the placeholder values]
df["Temp_cont"]
0        24.527961
1        10.448298
2        13.696600
3         0.000000
4        -1.000000
           ...    
55362    13.242324
55363    -1.000000
55364    27.646067
55365    -1.000000
55366    19.016883
Name: Temp_cont, Length: 55367, dtype: float64

We can replace these suspicious values with NaN’s using the replace method.

df["Temp_cont"] = df["Temp_cont"].replace([0.0000,-1.0000], np.nan)
df["Hum_cont"] = df["Hum_cont"].replace([0.0000,-1.0000], np.nan)
df["Wind_cont"] = df["Wind_cont"].replace([0.0000,-1.0000], np.nan)
df["Prec_cont"] = df["Prec_cont"].replace([-1.0000], np.nan)
fig, axs = plt.subplots(1, 4, figsize=(12, 3), sharey=True)
axs[0].hist(df["Prec_cont"], bins=50)
axs[1].hist(df["Wind_cont"], bins=50)
axs[2].hist(df["Temp_cont"], bins=50)
axs[3].hist(df["Hum_cont"], bins=50)
axs[0].set_xlabel("Prec_cont")
axs[1].set_xlabel("Wind_cont")
axs[2].set_xlabel("Temp_cont")
axs[3].set_xlabel("Hum_cont")
axs[0].set_ylabel("Number of Fires")

plt.show()
[Figure: histograms of Prec_cont, Wind_cont, Temp_cont, and Hum_cont after replacing the placeholder values with NaN]

Now that we have fixed the irregular data, we need to figure out what to do with the NaN values.

One very simple way to deal with these missing data is to just drop all of the rows where there are no values for putout_time, using the dropna() method:

df_cleaned = df.dropna().copy()

A disadvantage of dropping rows is that we end up with less data, so other approaches such as imputation can be preferable if our data set is small. After removing all of the rows with missing values, we are down to roughly 10% of our original data set!

print("Number of rows (before removing NaN's): " + str(len(df)))
print("Number of rows (after removing NaN's): " + str(len(df_cleaned)))
Number of rows (before removing NaN's): 55367
Number of rows (after removing NaN's): 5892

We can also remove the very highest outliers for putout_time_float (values above the 99.995th percentile):

q_hi  = df_cleaned['putout_time_float'].quantile(0.99995)
print(q_hi)

df_filtered = df_cleaned[(df_cleaned['putout_time_float'] < q_hi)].dropna().copy()
338.5995000000421
print("Number of rows (before removing outliers): " + str(len(df_cleaned)))
print("Number of rows (after removing outliers): " + str(len(df_filtered)))
Number of rows (before removing outliers): 5892
Number of rows (after removing outliers): 5891

We can also fill the NaN’s with a value instead, using the fillna() method:

df_filled = df.fillna(0).copy()
df_filled 
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float
0 10.0 C 18.105072 -66.753044 2007 12 24.527961 3.250413 78.724370 0.0 0.0
1 3.0 B 35.038330 -87.610000 2006 15 10.448298 2.122320 81.682678 86.8 0.0
2 60.0 C 34.947800 -88.722500 2004 16 13.696600 3.369050 65.063800 124.5 0.0
3 1.0 B 39.641400 -119.308300 2005 0 0.000000 0.000000 0.000000 0.0 0.0
4 2.0 B 30.700600 -90.591400 1999 12 0.000000 0.000000 0.000000 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
55362 6289.0 G 39.180000 -96.784167 2015 0 13.242324 3.804803 55.042092 249.0 0.0
55363 70868.0 G 38.342719 -120.695967 2015 0 0.000000 0.000000 0.000000 0.0 22.0
55364 5702.0 G 37.262607 -119.511139 2015 0 27.646067 2.529158 35.924406 0.0 0.0
55365 3261.0 F 40.604300 -123.080450 2015 15 0.000000 0.000000 0.000000 0.0 43.0
55366 76067.0 G 38.843988 -122.759707 2015 0 19.016883 1.208073 58.063802 18.8 0.0

55367 rows × 11 columns

Rather than filling with a specific value, it’s also possible to fill with the median or mean of the column.

df_fill_mean = df.copy()

meantime = df['putout_time_float'].mean()
# Assign the result back rather than using inplace=True on a column selection,
# which newer versions of pandas discourage.
df_fill_mean['putout_time_float'] = df_fill_mean['putout_time_float'].fillna(meantime)
df_fill_mean
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float
0 10.0 C 18.105072 -66.753044 2007 12 24.527961 3.250413 78.724370 0.0 6.033592
1 3.0 B 35.038330 -87.610000 2006 15 10.448298 2.122320 81.682678 86.8 6.033592
2 60.0 C 34.947800 -88.722500 2004 16 13.696600 3.369050 65.063800 124.5 6.033592
3 1.0 B 39.641400 -119.308300 2005 0 NaN NaN NaN 0.0 0.000000
4 2.0 B 30.700600 -90.591400 1999 12 NaN NaN NaN NaN 6.033592
... ... ... ... ... ... ... ... ... ... ... ...
55362 6289.0 G 39.180000 -96.784167 2015 0 13.242324 3.804803 55.042092 249.0 6.033592
55363 70868.0 G 38.342719 -120.695967 2015 0 NaN NaN NaN NaN 22.000000
55364 5702.0 G 37.262607 -119.511139 2015 0 27.646067 2.529158 35.924406 0.0 6.033592
55365 3261.0 F 40.604300 -123.080450 2015 15 NaN NaN NaN NaN 43.000000
55366 76067.0 G 38.843988 -122.759707 2015 0 19.016883 1.208073 58.063802 18.8 6.033592

55367 rows × 11 columns
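
scikit-learn offers the same idea through its SimpleImputer transformer, which is convenient when imputation needs to be part of a preprocessing pipeline (like the one we build later in this lesson). A minimal sketch, filling the meteorological columns with their column means (df_imputed and met_cols are just illustrative names):

from sklearn.impute import SimpleImputer

# Fill NaNs in the meteorological columns with each column's mean;
# strategy="median" is another common choice.
imputer = SimpleImputer(strategy="mean")
met_cols = ["Temp_cont", "Wind_cont", "Hum_cont", "Prec_cont"]
df_imputed = df.copy()
df_imputed[met_cols] = imputer.fit_transform(df_imputed[met_cols])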

Finally, we can interpolate, using the interpolate() method. Interpolation fills in missing values by estimating them from the surrounding non-NaN data in a column.

df_interp = df.interpolate().copy()

Interpolation doesn’t make a lot of sense for this data set, since the fires are independent of one another. Interpolation makes more sense in the context of a time series, where we are missing specific values in a sequence. To illustrate this, we will look at the effect of using the interpolate() method on a sine wave that has 20% of its data missing.

This code just generates the time series and puts it into a dataframe.

np.random.seed(42)                      
n = 100
t = np.linspace(0, 2 * np.pi, n)         # 0 → 2π
y = np.sin(t)
y_noisy = y + np.random.normal(0, 0.1, n)

# knock out ~20% of points
mask = np.random.choice(n, size=int(0.20 * n), replace=False)
y_noisy[mask] = np.nan

dfsine = pd.DataFrame({"t": t, "sine": y,"noisy_sine": y_noisy})

We can apply the interpolate function to the sine wave to estimate the missing values.

dfsine_interp = dfsine.interpolate().copy()
plt.scatter(dfsine['t'],dfsine['noisy_sine'],marker='o',facecolors='r',edgecolors='r',label="original")
plt.scatter(dfsine_interp['t'],dfsine_interp['noisy_sine'],marker='o',facecolors='none',edgecolors='k',label="interpolated")
plt.xlabel("t")
plt.ylabel("y")
plt.legend()
<matplotlib.legend.Legend at 0x157482640>
[Figure: original noisy sine wave and interpolated values]

Using interpolation or filling missing values can potentially bias our data sets, so it’s important to keep in mind what problem we are interested in solving when choosing a method to deal with missing data.

Feature Scaling#

The scikit-learn package has a number of built-in transformers for scaling variables.

MinMaxScaler scales values to lie between a minimum and a maximum. We can choose the range we want to end up with.

For a feature X, the MinMaxScaler transformation is calculated as

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
latitude_min_max_scaled = min_max_scaler.fit_transform(df[["latitude"]])

The StandardScaler removes the mean and scales to unit variance. For a sample x, the standardized value is calculated as

z = (x - u) / s

where u is the mean of the feature and s is its standard deviation.
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
latitude_std_scaled = std_scaler.fit_transform(df[["latitude"]])
fig, axs = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
axs[0].hist(df["latitude"], bins=50)
axs[1].hist(latitude_min_max_scaled, bins=50)
axs[2].hist(latitude_std_scaled, bins=50)
axs[0].set_xlabel("Latitude")
axs[1].set_xlabel("Latitude (Min_Max_Scaled)")
axs[2].set_xlabel("Latitude (Std_Scaled)")
axs[0].set_ylabel("Number of Fires")

plt.show()
[Figure: histograms of latitude, min-max-scaled latitude, and standardized latitude]

We can also do custom transformations using scikit-learn’s FunctionTransformer. For example, fire_size is not normally distributed, but rather strongly skewed towards smaller fires. We can log-scale this variable using the FunctionTransformer. (The log transform requires strictly positive values; the minimum fire_size in our data is 0.51, so this is safe here.)

from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_fire_size = log_transformer.transform(df_filtered[["fire_size"]])
fig, axs = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axs[0].hist(df_filtered["fire_size"], bins=50)
axs[1].hist(log_fire_size, bins=50)
axs[0].set_xlabel("Fire Size")
axs[1].set_xlabel("Log(Fire Size)")
axs[0].set_ylabel("Number of Fires")

plt.show()
[Figure: histograms of fire size and log(fire size)]
df_filtered["log_fire_size"] = log_fire_size

Encoding categorical variables#

The fire_size_class is a categorical variable (a letter from A to G).

fire_size_class = df[["fire_size_class"]]
fire_size_class.head(8)
fire_size_class
0 C
1 B
2 C
3 B
4 B
5 B
6 B
7 B

We can also check whether our data set is balanced by printing out the number of samples in each class.

fire_size_class_counts = df["fire_size_class"].value_counts()
print(fire_size_class_counts)
fire_size_class
B    36522
C    10811
G     3972
F     1968
D     1394
E      700
Name: count, dtype: int64

There are several different ways that we can encode this categorical variable for machine learning models. One is using the OrdinalEncoder from scikit-learn, which transforms each category (B, C, D, E, …) to an integer label (0, 1, 2, 3, …).

We will end up with a single column per original feature (or variable) in our data set.

OrdinalEncoder can be useful if our categories have a particular ordering to them (like fire size class does). However, it might not be a good choice for a feature like the state in which a fire occurred, since we don’t have a particular ordering to the values in our data set in that case.

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
fire_size_class_encoded = ordinal_encoder.fit_transform(fire_size_class)
fire_size_class
fire_size_class
0 C
1 B
2 C
3 B
4 B
... ...
55362 G
55363 G
55364 G
55365 F
55366 G

55367 rows × 1 columns

fire_size_class_encoded[:8]
array([[1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]])
ordinal_encoder.categories_
[array(['B', 'C', 'D', 'E', 'F', 'G'], dtype=object)]
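
Here the default alphabetical ordering of the categories happens to match the natural ordering of the fire size classes (B through G). If the labels did not sort into the right order alphabetically, we could pass the ordering explicitly; a sketch (explicit_encoder is just an illustrative name):

# Pass the desired category order so the integer codes always reflect
# increasing fire size, regardless of alphabetical order.
explicit_encoder = OrdinalEncoder(categories=[["B", "C", "D", "E", "F", "G"]])
fire_size_class_encoded = explicit_encoder.fit_transform(fire_size_class)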

An alternative way to encode a categorical variable is using the OneHotEncoder from scikit-learn, which converts each category into a separate binary column. For example, a sample of class B would be one-hot encoded as

[1, 0, 0, 0, 0, 0]

or for a sample that is a fire of class E, it would be encoded as

[0, 0, 0, 1, 0, 0]

We will end up with as many columns as we have classes in our data set. For the fire_size_class variable, this would add 6 columns, since there are 6 possible classes represented in our data set.

One-hot encoding is best when our categorical variables are nominal (unordered), and it suits many of the types of models that assume numerical, unstructured inputs (linear models, neural networks), which we will discuss later in the course.

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
fire_size_class_hot = cat_encoder.fit_transform(fire_size_class)
fire_size_class_hot
<55367x6 sparse matrix of type '<class 'numpy.float64'>'
	with 55367 stored elements in Compressed Sparse Row format>
fire_size_class_hot.toarray()
array([[0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1.]])
cat_encoder.categories_
[array(['B', 'C', 'D', 'E', 'F', 'G'], dtype=object)]
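
If we prefer a labeled DataFrame to the sparse matrix, we can combine the dense array with the encoder’s generated column names; for example:

# Wrap the one-hot encoded array in a DataFrame with readable column names
fire_size_class_1hot_df = pd.DataFrame(
    fire_size_class_hot.toarray(),
    columns=cat_encoder.get_feature_names_out(),
    index=fire_size_class.index)
fire_size_class_1hot_df.head()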

Creating a pipeline to preprocess our features#

df_filtered
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont putout_time_float log_fire_size
36 1420.0 F 33.241800 -104.912200 1994 16 36.241176 4.217647 17.058824 0.0 1.0 7.258412
56 15.3 C 36.586500 -96.323220 2015 0 5.258333 3.558333 71.500000 0.0 1.0 2.727853
57 270.0 D 35.473333 -84.450000 1999 15 16.431034 0.982759 67.050000 0.0 1.0 5.598422
59 57.0 C 36.816667 -82.651667 1995 0 13.002083 1.950000 37.854167 0.0 2.0 4.043051
78 450.0 E 34.330000 -117.513056 2006 16 33.900000 2.237500 39.375000 0.0 1.0 6.109248
... ... ... ... ... ... ... ... ... ... ... ... ...
55347 3200.0 F 46.679234 -93.270841 2015 15 10.666667 4.813889 51.583333 0.0 1.0 8.070906
55349 17831.0 G 30.005000 -100.391167 2015 12 28.250580 2.360093 58.243619 0.0 6.0 9.788694
55350 3452.0 F 30.400883 -101.331933 2015 12 28.793750 3.082979 39.510638 0.0 2.0 8.146709
55351 8300.0 G 45.922100 -111.422500 2015 15 11.032231 3.367769 48.628099 0.0 5.0 9.024011
55352 4704.0 F 48.937900 -119.309100 2015 15 21.204970 4.537914 36.257773 4.4 140.0 8.456168

5891 rows × 12 columns

Let’s say that we are interested in predicting the time it takes to put out a fire, given the other variables in our data set. First, we will scale putout_time_float using the standard scaler and put it into an array called y, since it’s our target variable.

std_scaler = StandardScaler()
y = std_scaler.fit_transform(df_filtered[["putout_time_float"]])

Next we will create a feature matrix X, which will be the input to our model. First let’s remove our target variable from the dataframe.

features = df_filtered.drop(["putout_time_float"], axis=1)
features
fire_size fire_size_class latitude longitude disc_pre_year Vegetation Temp_cont Wind_cont Hum_cont Prec_cont log_fire_size
36 1420.0 F 33.241800 -104.912200 1994 16 36.241176 4.217647 17.058824 0.0 7.258412
56 15.3 C 36.586500 -96.323220 2015 0 5.258333 3.558333 71.500000 0.0 2.727853
57 270.0 D 35.473333 -84.450000 1999 15 16.431034 0.982759 67.050000 0.0 5.598422
59 57.0 C 36.816667 -82.651667 1995 0 13.002083 1.950000 37.854167 0.0 4.043051
78 450.0 E 34.330000 -117.513056 2006 16 33.900000 2.237500 39.375000 0.0 6.109248
... ... ... ... ... ... ... ... ... ... ... ...
55347 3200.0 F 46.679234 -93.270841 2015 15 10.666667 4.813889 51.583333 0.0 8.070906
55349 17831.0 G 30.005000 -100.391167 2015 12 28.250580 2.360093 58.243619 0.0 9.788694
55350 3452.0 F 30.400883 -101.331933 2015 12 28.793750 3.082979 39.510638 0.0 8.146709
55351 8300.0 G 45.922100 -111.422500 2015 15 11.032231 3.367769 48.628099 0.0 9.024011
55352 4704.0 F 48.937900 -119.309100 2015 15 21.204970 4.537914 36.257773 4.4 8.456168

5891 rows × 11 columns

Now we can apply a pipeline of transformations to our data set. scikit-learn allows us to create separate pipelines for categorical and numerical variables and then run them on our entire dataframe simultaneously.

from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

First, though, we need to separate our variables based on how we want to transform them.

categorical_cols = ["fire_size_class"]
numerical_cols = ["fire_size","latitude","longitude","disc_pre_year","Vegetation","Temp_cont","Wind_cont","Hum_cont","Prec_cont","log_fire_size"]
cat_pipeline = make_pipeline(OrdinalEncoder(),StandardScaler())
num_pipeline = make_pipeline(MinMaxScaler())
preprocessor = ColumnTransformer([
    ("num",num_pipeline,numerical_cols),
    ("cat",cat_pipeline,categorical_cols)])
X = preprocessor.fit_transform(features)
X.shape
(5891, 11)
preprocessor.get_feature_names_out()
array(['num__fire_size', 'num__latitude', 'num__longitude',
       'num__disc_pre_year', 'num__Vegetation', 'num__Temp_cont',
       'num__Wind_cont', 'num__Hum_cont', 'num__Prec_cont',
       'num__log_fire_size', 'cat__fire_size_class'], dtype=object)
g = sns.PairGrid(pd.DataFrame(X,columns = preprocessor.get_feature_names_out()))
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
<seaborn.axisgrid.PairGrid at 0x161139880>
[Figure: PairGrid of the preprocessed features]
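
If the numeric columns still contained NaNs at this stage (for example, if we had filled rather than dropped missing rows earlier), an imputation step could be added to the numeric pipeline; a sketch under that assumption:

from sklearn.impute import SimpleImputer

# Numeric pipeline that imputes missing values before scaling
num_pipeline_with_impute = make_pipeline(SimpleImputer(strategy="median"),
                                         MinMaxScaler())
preprocessor_with_impute = ColumnTransformer([
    ("num", num_pipeline_with_impute, numerical_cols),
    ("cat", cat_pipeline, categorical_cols)])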

Creating a training, validation, and test data set#

Scikit-learn has several built-in functions to facilitate creating separate training and test data sets. The simplest one to use is train_test_split, which returns the features and targets split into train and test data sets. If we also want to create a validation data set using this function, we will need to split the data set twice.

from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X,y,test_size=0.5, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test_val,y_test_val,test_size=0.5, random_state=42)
print(X_train.shape,X_val.shape,X_test.shape)
(2945, 11) (1473, 11) (1473, 11)
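
As a quick sanity check, we can confirm the resulting split fractions (here 50% / 25% / 25% of the filtered data set):

# Confirm the relative sizes of the train, validation, and test sets
n_total = len(X)
print(f"Train: {len(X_train)/n_total:.0%}, "
      f"Validation: {len(X_val)/n_total:.0%}, "
      f"Test: {len(X_test)/n_total:.0%}")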