An Application of Bayesian Additive Regression Trees (BART) to Estimate Daily Concentrations of PM2.5 Components in California

Zhang, Tianyu (Spring 2020)

Background: Fine particulate matter (PM2.5), defined as particles with an aerodynamic diameter of less than 2.5 micrometers, is a complex mixture of solids and liquids small enough to pass through the upper respiratory system and penetrate deep into the lungs. Various studies have found associations between adverse health outcomes and specific PM2.5 species, such as sulfate, nitrate, and carbon-containing species. Hence, it is important to accurately measure the concentrations of PM2.5 and its components to support additional epidemiological studies and health impact analyses.

Methods: In this work, we examine the use of Bayesian Additive Regression Trees (BART) for predicting concentrations of four major components of PM2.5: elemental carbon (EC), organic carbon (OC), nitrate (NO3), and sulfate (SO4). BART employs a sum-of-trees model, and the prediction is based on the average of a set of trees, where each decision tree contributes a small proportion of the prediction. Meteorological variables, population size, land use variables, numerical model simulations (CMAQ), and satellite-derived fractional aerosol optical depth (AOD) in California during the period 2005 to 2014 were used as predictors for PM2.5 species concentrations. We evaluated the importance of PM2.5, numerical model simulations, and AOD parameters by fitting models with and without them.
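For readers unfamiliar with the sum-of-trees idea, the standard BART formulation can be written as follows. This is the generic textbook form, not necessarily the exact specification used in the thesis; the symbols $m$, $T_j$, $M_j$, $g$, and $\sigma^2$ are notation introduced here for illustration and do not appear in the abstract.

```latex
% Generic BART model (assumed notation):
%   y_i  : observed concentration of one PM2.5 component
%   x_i  : predictor vector (meteorology, land use, CMAQ, AOD, ...)
%   T_j  : structure of the j-th regression tree
%   M_j  : terminal-node values of the j-th tree
%   g    : function returning the terminal-node value that x_i falls into
\[
  y_i \;=\; \sum_{j=1}^{m} g\!\left(x_i;\, T_j,\, M_j\right) \;+\; \varepsilon_i,
  \qquad
  \varepsilon_i \sim \mathcal{N}\!\left(0,\, \sigma^{2}\right).
\]
```

Priors on the trees are typically chosen so that each tree explains only a small part of the response, and posterior draws of the sum yield both a point prediction and a prediction interval.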

Results: After tuning model parameters to achieve a prediction coverage probability of about 95%, our models consistently achieved an R2 between 0.64 and 0.83 in 5-fold ordinary and spatial leave-one-monitor-out cross-validation (CV) experiments for the four species of interest when PM2.5 itself was a predictor. When PM2.5 was not a predictor, the models achieved a lower R2, from 0.52 to 0.72. In spatial CV experiments, including AOD parameters or CMAQ simulations improved R2, especially when total PM2.5 mass was not included as a predictor. The relative importance of different AOD parameters varies across PM2.5 components. AOD3 and AOD2 are most important for NO3 and OC, respectively. For SO4, many AOD parameters show moderate importance. For EC, none of the AOD parameters shows high importance.
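The evaluation quantities mentioned above (spatial CV folds, R2, and interval coverage) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: it assumes that spatial CV means all observations from a given monitor are held out together, that R2 is computed as 1 - SSE/SST, and that coverage is the share of held-out observations falling inside their prediction interval. The function names `spatial_folds`, `r_squared`, and `coverage` are hypothetical.

```python
import random


def spatial_folds(monitor_ids, k=5, seed=0):
    """Assign each observation to a fold so that a monitor never spans two folds.

    Assumption: "spatial" CV holds out whole monitors, not individual days.
    """
    monitors = sorted(set(monitor_ids))
    random.Random(seed).shuffle(monitors)
    # Deal the shuffled monitors round-robin into k folds.
    fold_of_monitor = {m: i % k for i, m in enumerate(monitors)}
    return [fold_of_monitor[m] for m in monitor_ids]


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SSE / SST."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def coverage(observed, lower, upper):
    """Share of held-out observations that fall inside their prediction interval."""
    hits = sum(lo <= o <= hi for o, lo, hi in zip(observed, lower, upper))
    return hits / len(observed)
```

In use, one would fit the model on all folds but one, predict the held-out fold with interval bounds (for example, the 2.5% and 97.5% posterior quantiles), and pass the pooled held-out values to `r_squared` and `coverage`. A coverage near 0.95 would correspond to the tuning target described above.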

Conclusions: Collocated PM2.5, fractional AOD, and CMAQ simulations are important predictors for daily concentrations of the PM2.5 components EC, OC, NO3, and SO4. The ensemble learning method BART provides good prediction accuracy, as well as uncertainty measures that can be utilized in subsequent analyses.

Table of Contents


2. Methods

2.1 Data

2.2 Modeling

3. Results

4. Discussion

5. Bibliography

Appendix A. Data and Data Processing

Appendix B. Supplementary Tables and Figures

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English