Application of Machine Learning Algorithms for Estimating Daily PM2.5 Concentrations

Huo, Runing (Spring 2023)


Background: The detrimental impact of PM2.5 air pollution is widespread, as it has been linked to premature mortality and a diverse range of health concerns such as cardiovascular and respiratory illnesses. Machine learning approaches offer several advantages for predicting PM2.5 levels at locations without monitoring data. These include the ability to handle complex and large datasets, detect nonlinear associations, and provide accurate and adaptable solutions.

Objectives: To compare the predictive performance of four machine learning algorithms under three types of cross-validation experiments, using 2018 data from California.

Methods: Four machine learning algorithms were applied in this analysis: random forest, Bayesian additive regression trees (BART), gradient boosting, and soft Bayesian additive regression trees (SoftBART). We performed three types of 10-fold cross-validation (ordinary, spatial, and temporal) and evaluated predictive performance using R-squared, mean absolute error (MAE), and root-mean-square error (RMSE). We also obtained average predictions of PM2.5 concentrations at 1 km spatial resolution for January, April, July, and October 2018.
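The three cross-validation designs differ mainly in what is held out together. The sketch below is one possible reading, not the thesis's implementation: it assumes that ordinary cross-validation splits individual observations at random, spatial cross-validation keeps all observations from a monitoring site in the same fold, and temporal cross-validation keeps all observations from a day in the same fold. The names `assign_folds` and `evaluate`, the `(site_id, day)` record layout, and the use of the coefficient of determination for R-squared are assumptions made for illustration.

```python
import math
import random


def assign_folds(records, scheme, k=10, seed=0):
    """Return a fold index (0..k-1) for each record.

    records: list of (site_id, day) tuples, one per daily observation
             (hypothetical layout, not taken from the thesis).
    scheme:  "ordinary" splits individual observations at random,
             "spatial" holds out whole sites, "temporal" holds out whole days.
    """
    if scheme == "ordinary":
        keys = list(range(len(records)))
    elif scheme == "spatial":
        keys = [site for site, _ in records]
    elif scheme == "temporal":
        keys = [day for _, day in records]
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    # Shuffle the distinct groups, then deal them into k folds in turn.
    groups = sorted(set(keys))
    random.Random(seed).shuffle(groups)
    fold_of_group = {g: i % k for i, g in enumerate(groups)}
    return [fold_of_group[key] for key in keys]


def evaluate(observed, predicted):
    """Compute R-squared, MAE, and RMSE for held-out predictions.

    R-squared is taken here as 1 - SS_res / SS_tot (an assumption; a
    squared correlation is another common choice).
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }
```

In use, a model would be refit once per fold on the records whose fold index differs from the held-out one, and `evaluate` would be applied to the pooled held-out predictions. Grouping by site or by day is what makes the spatial and temporal experiments test prediction at unmonitored locations and unobserved days, respectively.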

Results: In the cross-validation analysis, random forest performed best, with the highest R-squared and the smallest RMSE and MAE values. Random forest was also the least computationally intensive approach. Gradient boosting and BART with a larger number of trees were the second-best models. With a small number of trees, SoftBART behaved similarly to BART.

Conclusions: In this study, we demonstrated the superior predictive performance of random forest, which is a commonly used method for predicting daily PM2.5 concentrations. 

Table of Contents

1. Introduction

2. Materials and Methods

2.1 Motivating Datasets

2.2 Statistical Analysis

2.2.1 Machine Learning Models

2.2.3 Prediction Performance Comparison

3. Results

3.1 Results of Cross-Validation Experiments

3.1.1 Traditional 10-fold Cross-Validation

3.1.2 Spatial 10-fold Cross-Validation

3.1.3 Temporal 10-fold Cross-Validation

3.2 Results of Predictions

4. Discussion

References

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English