Application of Machine Learning Algorithms for Estimating Daily PM2.5 Concentrations Open Access

Huo, Runing (Spring 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/mw22v676s?locale=en

Published

Abstract

Background: The detrimental impact of PM2.5 air pollution is widespread, as it has been linked to premature mortality and a diverse range of health concerns such as cardiovascular and respiratory illnesses. Machine learning approaches offer several advantages for predicting PM2.5 levels at locations without monitoring data. These include the ability to handle complex and large datasets, detect nonlinear associations, and provide accurate and adaptable solutions.

Objectives: To compare the predictive ability of four machine learning algorithms under three types of cross-validation experiments, using 2018 data from California.

Methods: Four machine learning algorithms were applied in this analysis: random forest, Bayesian additive regression trees (BART), gradient boosting, and soft Bayesian additive regression trees (SoftBART). We performed three types of 10-fold cross-validation (ordinary, spatial, and temporal) and assessed performance using R-squared, mean absolute error (MAE), and root-mean-square error (RMSE). We also obtained average predictions of PM2.5 concentrations at 1 km spatial resolution for January, April, July, and October 2018.
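
The three cross-validation schemes and the three metrics named in the Methods can be illustrated with a minimal Python sketch. It is not the thesis code. It assumes that spatial folds keep all observations from one monitoring site together, that temporal folds keep all observations from one day together, and that R-squared is computed as 1 - SSE/SST; the abstract does not state these details. The names assign_folds, make_folds, and metrics are hypothetical.

```python
import math
import random


def assign_folds(groups, k=10, seed=0):
    """Give each row a fold index so that rows sharing a group label share a fold."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Deal the shuffled groups out to the k folds in round-robin order.
    fold_of_group = {g: i % k for i, g in enumerate(unique)}
    return [fold_of_group[g] for g in groups]


def make_folds(sites, days, scheme, k=10, seed=0):
    """Build fold labels for one scheme (grouping rules are assumptions, see lead-in)."""
    if scheme == "ordinary":
        # Every observation is its own group, so rows are split at random.
        return assign_folds(list(range(len(sites))), k, seed)
    if scheme == "spatial":
        # Assumed: hold out whole monitoring sites.
        return assign_folds(sites, k, seed)
    if scheme == "temporal":
        # Assumed: hold out whole days.
        return assign_folds(days, k, seed)
    raise ValueError("scheme must be 'ordinary', 'spatial', or 'temporal'")


def metrics(observed, predicted):
    """Return R-squared (1 - SSE/SST), MAE, and RMSE for paired values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "r2": 1.0 - sse / sst,
        "mae": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
        "rmse": math.sqrt(sse / n),
    }
```

For each fold index f, a model would be fit on rows whose label differs from f and scored with metrics on the rows labelled f. Under the spatial scheme as assumed here, the held-out rows come from sites the model never saw during fitting.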

Results: In the cross-validation analysis, random forest performed best, with the highest R-squared and the smallest RMSE and MAE values. Random forest was also the least computationally intensive approach. Gradient boosting and BART with a larger number of trees were the second-best models. With a small number of trees, SoftBART behaved similarly to BART.

Conclusions: In this study, we demonstrated the superior predictive performance of random forest, which is a commonly used method for predicting daily PM2.5 concentrations.

Table of Contents

1. Introduction

2. Materials and Methods

2.1 Motivating Datasets

2.2 Statistical Analysis

2.2.1 Machine Learning Models

2.2.3 Prediction Performance Comparison

3. Results

3.1 Results of Cross-Validation Experiments

3.1.1 Traditional 10-fold Cross-Validation

3.1.2 Spatial 10-fold Cross-Validation

3.1.3 Temporal 10-fold Cross-Validation

3.2 Results of Predictions

4. Discussion

References

About this Master's Thesis

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Rollins School of Public Health
Department	Biostatistics
Subfield / Discipline	Biostatistics - MPH & MSPH
Degree	M.S.P.H.
Submission	Master's Thesis
Language	English
Research Field	Environmental Health Statistics
Keyword	Machine Learning
Committee Chair / Thesis Advisor	Chang, Howard, Emory University
Committee Members	Liu, Yang, Emory University

Primary PDF

Title	Application of Machine Learning Algorithms for Estimating Daily PM2.5 Concentrations
Date Uploaded	2023-04-13 23:43:09 -0400
