Assignment 2: Predicting Health Impacts from Air Quality Factors#

In this assignment, we will use a (synthetic) data set relating public health outcomes to air quality and meteorological factors. The data set is available at https://www.kaggle.com/datasets/rabieelkharoua/air-quality-and-health-impact-dataset.

import pandas as pd
import matplotlib.pyplot as plt
import os

Download the data from Kaggle#

# To facilitate downloading data from Kaggle, we can install the kagglehub Python package
!pip install kagglehub
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rabieelkharoua/air-quality-and-health-impact-dataset")

print("Path to dataset files:", path)
Path to dataset files: /Users/karalamb/.cache/kagglehub/datasets/rabieelkharoua/air-quality-and-health-impact-dataset/versions/1
os.listdir(path)
['air_quality_health_impact_data.csv']

Part 1: Data exploration#

  1. Load in the csv file as a dataframe using pandas.

  2. Check whether there are any NaNs in the dataframe.

  3. Make histograms of the numerical variables in the dataframe.

  4. There are two possible targets in the dataframe: a categorical variable, HealthImpactClass, and a numerical variable, HealthImpactScore. Create two numpy arrays, one named y_classification containing HealthImpactClass, and one named y_regression containing HealthImpactScore.

  5. Check how balanced the 5 classes are in HealthImpactClass.

  6. Create a numpy array called features that includes the following 9 variables:

    • AQI
    • PM10
    • PM2_5
    • NO2
    • SO2
    • O3
    • Temperature
    • Humidity
    • WindSpeed
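
If it helps, here is a minimal sketch of one way to work through these steps. It assumes the CSV file name printed above and that the dataframe columns match the variable names listed in this assignment; check both against your own dataframe.

# 1. Load the csv file into a dataframe
df = pd.read_csv(os.path.join(path, "air_quality_health_impact_data.csv"))

# 2. Check for missing values in each column
print(df.isna().sum())

# 3. Histograms of the numerical variables
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()

# 4. Target arrays for classification and regression
y_classification = df["HealthImpactClass"].values
y_regression = df["HealthImpactScore"].values

# 5. Class balance of the 5 classes
print(df["HealthImpactClass"].value_counts())

# 6. Feature array with the 9 listed variables
features = df[["AQI", "PM10", "PM2_5", "NO2", "SO2", "O3",
               "Temperature", "Humidity", "WindSpeed"]].values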

Part 2: Preprocessing#

  1. Create two Python lists, one containing the class names and the other containing the feature names. The classification of health impact is derived from the health impact score, using the following thresholds:

  • 0: ‘Very High’ (HealthImpactScore >= 80)
  • 1: ‘High’ (60 <= HealthImpactScore < 80)
  • 2: ‘Moderate’ (40 <= HealthImpactScore < 60)
  • 3: ‘Low’ (20 <= HealthImpactScore < 40)
  • 4: ‘Very Low’ (HealthImpactScore < 20)

  2. Use the StandardScaler method to scale the numerical variables in the feature array, and save the result as a numpy array named X.
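
A minimal sketch of this preprocessing is below. It assumes scikit-learn’s StandardScaler; the list names classnames and feature_names are just suggestions, but the later sketches reuse them.

# Class names, in the order of the class labels 0-4
classnames = ["Very High", "High", "Moderate", "Low", "Very Low"]

# Feature names, in the same order as the columns of the feature array
feature_names = ["AQI", "PM10", "PM2_5", "NO2", "SO2", "O3",
                 "Temperature", "Humidity", "WindSpeed"]

from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(features)

Note that fitting the scaler on the full feature array before splitting, as asked here, is a simplification; in practice you would fit the scaler on the training split only, to avoid information leaking into the validation and test sets.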

Part 3: Training, validation, and test split#

  1. Split the data into training, validation, and test data sets, with 80% of the data used for training and 10% each for validation and testing. Create corresponding splits of both targets, y_classification and y_regression.
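
One possible way to do this is to call scikit-learn’s train_test_split twice, passing all three arrays so that the features and both targets stay aligned. The variable names below (X_train, X_val, X_test, and so on) are suggestions; the SMOTE example in Part 4 assumes X_train and y_classification_train.

from sklearn.model_selection import train_test_split

# First hold out 20% of the data, then split that 20% evenly into validation and test
(X_train, X_temp,
 y_classification_train, y_classification_temp,
 y_regression_train, y_regression_temp) = train_test_split(
    X, y_classification, y_regression, test_size=0.2, random_state=42)

(X_val, X_test,
 y_classification_val, y_classification_test,
 y_regression_val, y_regression_test) = train_test_split(
    X_temp, y_classification_temp, y_regression_temp, test_size=0.5, random_state=42)

Passing stratify=y_classification in the first call (and the corresponding temporary labels in the second) is an option if you want each split to preserve the class proportions.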

Part 4: Train a Random Forest Classifier#

  1. Train a RandomForestClassifier with 120 estimators and a maximum depth of 10. Set class_weight to “balanced”, since the classes are imbalanced. You can use the default values for the other hyperparameters.

  2. Create a heatmap plot of the confusion matrix showing the performance of the trained classifier on the validation data set. Label the x and y axes of the heatmap with the class names.
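
A minimal sketch of these two steps, assuming the splits and the classnames list from the earlier sketches, and using scikit-learn’s ConfusionMatrixDisplay for the heatmap (a seaborn heatmap would work equally well):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay

# Random forest with the requested hyperparameters
clf = RandomForestClassifier(n_estimators=120, max_depth=10,
                             class_weight="balanced", random_state=42)
clf.fit(X_train, y_classification_train)

# Confusion matrix on the validation split, labelled with the class names
ConfusionMatrixDisplay.from_estimator(
    clf, X_val, y_classification_val,
    labels=[0, 1, 2, 3, 4], display_labels=classnames,
    cmap="Blues", xticks_rotation=45)
plt.show()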

Because the classes are imbalanced, the confusion matrix shows that the classifier performs poorly on the classes that are under-represented in the data. One way to improve this is to augment the data set by over-sampling. Using the imbalanced-learn library, we can apply the SMOTE algorithm (https://arxiv.org/pdf/1106.1813) to oversample the training data, as in the lines of code below.

!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# X_train and y_classification_train are the training splits created in Part 3
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_classification_train)
  3. Train a new random forest classifier using X_resampled and y_resampled. Use the same hyperparameters as your original random forest.

  4. Create a new heatmap plot of the confusion matrix showing the performance of the classifier trained on the oversampled data set, evaluated on the validation data set. Label the x and y axes of the heatmap with the class names list.
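
Continuing the sketch above, retraining on the oversampled data and plotting the new confusion matrix might look like this (the variable names are the same hypothetical ones used earlier):

# Same hyperparameters as before, now fit on the SMOTE-resampled training data
clf_smote = RandomForestClassifier(n_estimators=120, max_depth=10,
                                   class_weight="balanced", random_state=42)
clf_smote.fit(X_resampled, y_resampled)

# Evaluate on the (untouched) validation split
ConfusionMatrixDisplay.from_estimator(
    clf_smote, X_val, y_classification_val,
    labels=[0, 1, 2, 3, 4], display_labels=classnames,
    cmap="Blues", xticks_rotation=45)
plt.show()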

Part 5: Train a Random Forest Regressor#

  1. Train a RandomForestRegressor on the training data set, using the regression target. Set the number of estimators to 120 and the maximum tree depth to 10. You can use the defaults for the other hyperparameters.

  2. Using your trained RandomForestRegressor, predict the target values for the validation data set and calculate the coefficient of determination between the true targets and the predicted values.

  3. Make a bar plot of the feature importances in your trained random forest. Label the x-axis with the feature names.

  4. What are the 4 most important features for determining the health impact score? Print them out.
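
A minimal sketch of Part 5, again assuming the split variables and the feature_names list from the earlier sketches:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Random forest regressor with the requested hyperparameters
reg = RandomForestRegressor(n_estimators=120, max_depth=10, random_state=42)
reg.fit(X_train, y_regression_train)

# Coefficient of determination (R^2) on the validation split
y_pred = reg.predict(X_val)
print("Validation R^2:", r2_score(y_regression_val, y_pred))

# Bar plot of the feature importances, labelled with the feature names
importances = reg.feature_importances_
plt.bar(feature_names, importances)
plt.xticks(rotation=45)
plt.ylabel("Feature importance")
plt.tight_layout()
plt.show()

# The 4 most important features, in decreasing order of importance
top4 = np.argsort(importances)[::-1][:4]
print([feature_names[i] for i in top4])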