Assignment 2: Predicting Health Impacts from Air Quality Factors#
In this assignment, we will use a (synthetic) data set looking at how health impacts are related to air quality factors. This data set is available at https://www.kaggle.com/datasets/rabieelkharoua/air-quality-and-health-impact-dataset. The data includes public health outcomes and how they are related to air quality and meteorological factors.
import pandas as pd
import matplotlib.pyplot as plt
import os
Download the data from Kaggle#
# To facilitate downloading data from Kaggle, we can install this python package
!pip install kagglehub
import kagglehub
# Download latest version
path = kagglehub.dataset_download("rabieelkharoua/air-quality-and-health-impact-dataset")
print("Path to dataset files:", path)
Path to dataset files: /Users/karalamb/.cache/kagglehub/datasets/rabieelkharoua/air-quality-and-health-impact-dataset/versions/1
os.listdir(path)
['air_quality_health_impact_data.csv']
Part 1: Data exploration#
Load in the CSV file as a dataframe using pandas.
Check whether there are any NaNs in the dataframe.
Make a histogram of the different numerical variables in the dataframe.
There are two possible targets in the dataframe. One is a categorical variable, HealthImpactClass, and the other is a numerical variable, HealthImpactScore. Create numpy arrays, one named y_classification, containing the HealthImpactClass, and one named y_regression, containing the HealthImpactScore.
Check how balanced the 5 classes are in HealthImpactClass.
Create a numpy array called features that includes the following 9 variables: AQI, PM10, PM2_5, NO2, SO2, O3, Temperature, Humidity, WindSpeed.
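The exploration steps above can be sketched as follows. This is only an illustration under stated assumptions: the dataframe here is filled with random numbers as a stand-in for the real file, and it assumes the CSV columns carry exactly the names used in the task list.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for air_quality_health_impact_data.csv: random values
# in an assumed column layout, so the snippet runs without the Kaggle download.
# With the real data, df would instead come from pd.read_csv on the file in `path`.
rng = np.random.default_rng(0)
n_rows = 200
feature_columns = ["AQI", "PM10", "PM2_5", "NO2", "SO2", "O3",
                   "Temperature", "Humidity", "WindSpeed"]
df = pd.DataFrame(rng.uniform(0, 100, size=(n_rows, len(feature_columns))),
                  columns=feature_columns)
df["HealthImpactScore"] = rng.uniform(0, 100, size=n_rows)
df["HealthImpactClass"] = rng.integers(0, 5, size=n_rows)

# Count missing values in each column.
nan_counts = df.isna().sum()
print(nan_counts)

# Draw one histogram per numerical column.
axes = df.hist(bins=20, figsize=(12, 8))

# Targets and features as numpy arrays.
y_classification = df["HealthImpactClass"].to_numpy()
y_regression = df["HealthImpactScore"].to_numpy()
features = df[feature_columns].to_numpy()

# Class balance: number of rows in each class, ordered by class index.
class_counts = df["HealthImpactClass"].value_counts().sort_index()
print(class_counts)
```

Because the stand-in labels are drawn uniformly, the class counts printed here are roughly equal; the real data set will look different.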
Part 2: Preprocessing#
Create two python lists, one including the class names, and the other including the feature names. The classification of health impact is derived from the health impact score, using the following thresholds:
0: ‘Very High’ (HealthImpactScore >= 80)
1: ‘High’ (60 <= HealthImpactScore < 80)
2: ‘Moderate’ (40 <= HealthImpactScore < 60)
3: ‘Low’ (20 <= HealthImpactScore < 40)
4: ‘Very Low’ (HealthImpactScore < 20)
Use StandardScaler to scale the numerical variables in the features array, and save the result as a numpy array X.
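As a rough sketch of this part, the thresholds listed above can be written as a small mapping function, followed by the scaling step. The helper score_to_class and the list name featurenames are illustrative names that do not appear in the assignment, and the feature matrix is random stand-in data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Position i in classnames is the label of class i, following the thresholds above.
classnames = ["Very High", "High", "Moderate", "Low", "Very Low"]
# "featurenames" is an assumed name for the list of feature names.
featurenames = ["AQI", "PM10", "PM2_5", "NO2", "SO2", "O3",
                "Temperature", "Humidity", "WindSpeed"]

def score_to_class(scores):
    """Hypothetical helper: map HealthImpactScore values to class indices 0-4."""
    scores = np.asarray(scores, dtype=float)
    # np.digitize gives 0 for score < 20, 1 for [20, 40), ..., 4 for score >= 80.
    bins = np.digitize(scores, [20, 40, 60, 80])
    # Class 0 is the highest band, so reverse the bin order.
    return 4 - bins

example_scores = np.array([95.0, 80.0, 79.9, 45.0, 20.0, 5.0])
example_classes = score_to_class(example_scores)
print([classnames[c] for c in example_classes])

# Random stand-in for the features array built in Part 1.
rng = np.random.default_rng(0)
features = rng.uniform(0, 100, size=(200, len(featurenames)))

# Standardise each column to zero mean and unit variance.
scaler = StandardScaler()
X = scaler.fit_transform(features)
print(X.mean(axis=0).round(6), X.std(axis=0).round(6))
```

The mapping function is only there to make the threshold rule concrete; the data set already provides HealthImpactClass, so it is not needed to complete the task.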
Part 3: Training, validation, and test split#
Split the data into training, validation, and test data sets, with 80% of the data used for training, and 10% each for validation and testing. Create separate regression and classification targets, using y_classification and y_regression.
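One possible way to get an 80/10/10 split is to call scikit-learn's train_test_split twice: first to hold out 20% of the rows, then to halve the held-out part. The sketch below uses random stand-in arrays, and apart from X_train and y_classification_train (which are used later in this assignment) the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for X, y_classification and y_regression from Parts 1 and 2.
rng = np.random.default_rng(0)
n_rows = 1000
X = rng.normal(size=(n_rows, 9))
y_classification = rng.integers(0, 5, size=n_rows)
y_regression = rng.uniform(0, 100, size=n_rows)

# First split: 80% training, 20% held out. Passing all three arrays in one
# call keeps features and both targets aligned row by row.
(X_train, X_hold,
 y_classification_train, y_classification_hold,
 y_regression_train, y_regression_hold) = train_test_split(
    X, y_classification, y_regression, test_size=0.2, random_state=42)

# Second split: divide the held-out 20% equally into validation and test sets.
(X_val, X_test,
 y_classification_val, y_classification_test,
 y_regression_val, y_regression_test) = train_test_split(
    X_hold, y_classification_hold, y_regression_hold,
    test_size=0.5, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)
```

Fixing random_state makes the split reproducible; the value 42 is arbitrary.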
Part 4: Train a Random Forest Classifier#
Train a RandomForestClassifier with 120 estimators and with a maximum depth of 10. Set the class_weight to “balanced”, since the classes are imbalanced. You can use the default values for other hyperparameters.
Create a heatmap plot for the confusion matrix showing the performance of the trained classifier on the validation data set. Label the heatmap with the classnames list on the x and y axes.
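The two steps above might look roughly like the sketch below. The data is random and only mimics an imbalanced label distribution (the class probabilities are made up), and the plotting uses plain matplotlib; other heatmap tools would work equally well.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

classnames = ["Very High", "High", "Moderate", "Low", "Very Low"]

# Random stand-in data with made-up, imbalanced class probabilities.
rng = np.random.default_rng(0)
class_probabilities = [0.6, 0.2, 0.1, 0.06, 0.04]
X_train = rng.normal(size=(500, 9))
X_val = rng.normal(size=(100, 9))
y_classification_train = rng.choice(5, size=500, p=class_probabilities)
y_classification_val = rng.choice(5, size=100, p=class_probabilities)

# Hyperparameters as specified in the task; random_state is added for repeatability.
clf = RandomForestClassifier(n_estimators=120, max_depth=10,
                             class_weight="balanced", random_state=0)
clf.fit(X_train, y_classification_train)
y_pred = clf.predict(X_val)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_classification_val, y_pred,
                      labels=np.arange(len(classnames)))

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(cm, cmap="Blues")
ax.set_xticks(np.arange(len(classnames)))
ax.set_yticks(np.arange(len(classnames)))
ax.set_xticklabels(classnames, rotation=45, ha="right")
ax.set_yticklabels(classnames)
ax.set_xlabel("Predicted class")
ax.set_ylabel("True class")
# Write the count into each cell of the heatmap.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")
fig.colorbar(im, ax=ax)
fig.tight_layout()
```

Passing labels explicitly keeps the matrix 5 by 5 even if a rare class happens to be absent from the validation set.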
Because the classes are imbalanced, the confusion matrix shows that the classifier does not perform that well on the classes that are not well-represented in the data. One way to improve this is to use over-sampling to augment the data set. Using the imbalanced-learn library, we can use the SMOTE algorithm (https://arxiv.org/pdf/1106.1813) to oversample the data set, using the lines of code below.
!pip install imbalanced-learn
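from imblearn.over_sampling import SMOTE
# X_train and y_classification_train must already exist (the training split from Part 3);
# running this cell before they are defined raises a NameError.
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_classification_train)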
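To give some intuition for what SMOTE does, the sketch below generates synthetic minority-class points by interpolating between a sample and one of its nearest minority-class neighbours. It is a simplified numpy illustration of the core idea only, not the imbalanced-learn implementation, and the function name smote_like_samples is hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified illustration of the SMOTE idea; needs at least 2 minority rows.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment between the two.
    gap = rng.uniform(0, 1, size=(n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Random stand-in for the rows of one under-represented class.
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(10, 2))
X_new = smote_like_samples(X_minority, n_new=30)
print(X_new.shape)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new points stay inside the region the minority class already occupies.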
Train a new random forest classifier using X_resampled and y_resampled. Use the same hyperparameters as your original random forest.
Create a new heatmap plot for the confusion matrix showing the performance of the classifier that was trained on the oversampled data set, evaluated on the validation data set. Label the heatmap with the classnames list on the x and y axes.
Part 5: Train a Random Forest Regressor#
Train a RandomForestRegressor on the training data set, using the regression target. Set the number of estimators to 120 and the maximum tree depth to 10. You can use the other default hyperparameters.
Using your trained RandomForestRegressor, predict the target values for the validation data set and calculate the coefficient of determination between the true targets and the values predicted by the RandomForestRegressor.
Make a barplot of the feature importance in your trained random forest. Label the x-axis with the feature names.
What are the 4 most important features in terms of determining the health impact score? Print them out.
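The regression steps above could be sketched as follows. The target here is a made-up function of the first three columns (the hypothetical helper fake_score), so the resulting importance ranking is constructed by design and says nothing about the real data; featurenames is likewise an assumed name.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

featurenames = ["AQI", "PM10", "PM2_5", "NO2", "SO2", "O3",
                "Temperature", "Humidity", "WindSpeed"]

# Random stand-in features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 9))
X_val = rng.normal(size=(100, 9))

def fake_score(X):
    """Hypothetical target that depends only on the first three columns."""
    return 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2]

y_regression_train = fake_score(X_train) + rng.normal(scale=0.1, size=500)
y_regression_val = fake_score(X_val) + rng.normal(scale=0.1, size=100)

# Hyperparameters as specified in the task; random_state is added for repeatability.
reg = RandomForestRegressor(n_estimators=120, max_depth=10, random_state=0)
reg.fit(X_train, y_regression_train)

# Coefficient of determination (R^2) on the validation set.
r2 = r2_score(y_regression_val, reg.predict(X_val))
print("Validation R^2:", round(r2, 3))

# Impurity-based feature importances, one value per feature, summing to 1.
importances = reg.feature_importances_
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(featurenames, importances)
ax.set_xlabel("Feature")
ax.set_ylabel("Importance")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()

# Names of the 4 features with the largest importance, most important first.
top4 = [featurenames[i] for i in np.argsort(importances)[::-1][:4]]
print(top4)
```

The same R^2 value is also returned by reg.score(X_val, y_regression_val), which can serve as a quick cross-check.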