Assess Improvement of Balancing Covariates by Propensity Score approach using Generalized Boosted Model (GBM) and Application Based on National Cancer Database

Open Access

Song, Haocan (Spring 2018)


Background: Observational studies are among the most commonly used study designs in medical research, but they are vulnerable to selection bias, which limits their ability to support valid causal inference. Propensity score (PS) matching and weighting are popular methods for reducing this bias and estimating causal effects in observational studies. In this work, we focused on the Generalized Boosted Model (GBM), a tree-based approach that yields more accurate PS estimates without specifying the form of the prediction function, and we compared its covariate-balancing performance with that of the conventional model-based approach, logistic regression.
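
As a reading aid, the two families of PS models contrasted above can be written out as follows. This is a sketch in standard notation rather than the thesis's own: the symbols T (treatment indicator), X (covariate vector), \beta_j, g_0, h_m, M and the shrinkage factor \nu are all assumptions introduced here for illustration.

```latex
% Propensity score: probability of receiving treatment given covariates
e(x) = \Pr(T = 1 \mid X = x)

% Main-effect logistic regression (LOGREG): log-odds linear in the covariates
\log\frac{e(x)}{1 - e(x)} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j

% GBM: log-odds built up as a sum of M small regression trees h_m,
% each scaled by a shrinkage factor \nu
\log\frac{e(x)}{1 - e(x)} = g_0 + \sum_{m=1}^{M} \nu \, h_m(x)
```

In the logistic form, interactions and polynomial terms have to be added by hand, whereas the sum of trees can capture nonlinearities and interactions without the functional form being specified in advance.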

Method and Study Design: In this study, we tested three alternative methods for propensity score (PS) estimation: a main-effect logistic regression model (model 1: LOGREG), a comprehensive logistic regression model with all two-way interactions and polynomial terms (model 2: LOGREG(INT)), and GBM (model 3). We implemented these algorithms in an application to prostate cancer data from the NCDB, where we aimed to compare overall survival between proton radiation therapy and conventional X-ray-based radiation therapy. Propensity score matching (PSM) with a caliper and matching ratios up to 1:5 was performed to eliminate confounding effects. Balance was evaluated before and after matching by the standardized difference. A proportional hazards model was fitted to estimate the hazard ratio of proton therapy, with a 95% confidence interval, in the matched
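
The matching and balance-checking steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `greedy_caliper_match` and `standardized_difference`, the caliper of 0.05, and all propensity scores and ages below are hypothetical values chosen only to show the mechanics of greedy 1:N caliper matching without replacement and of the standardized difference for a continuous covariate.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(treated, control):
    """Absolute standardized mean difference for a continuous covariate,
    using the average of the two sample variances as the pooled variance."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return abs(mean(treated) - mean(control)) / pooled_sd


def greedy_caliper_match(ps_treated, ps_control, caliper, ratio=1):
    """Greedy nearest-neighbour matching without replacement.

    Each treated unit, in input order, takes up to `ratio` of the closest
    still-available controls whose propensity score lies within `caliper`.
    Returns a dict: treated index -> list of matched control indices.
    Treated units with no control inside the caliper are left unmatched.
    """
    available = set(range(len(ps_control)))
    matches = {}
    for i, p in enumerate(ps_treated):
        candidates = sorted(
            (j for j in available if abs(ps_control[j] - p) <= caliper),
            key=lambda j: (abs(ps_control[j] - p), j),  # nearest first, ties by index
        )
        chosen = candidates[:ratio]
        if chosen:
            matches[i] = chosen
            available.difference_update(chosen)
    return matches


# Hypothetical toy data: estimated propensity scores and one covariate (age).
ps_treated = [0.62, 0.55, 0.80, 0.95]
ps_control = [0.60, 0.30, 0.56, 0.20, 0.79, 0.10]
age_treated = [70, 64, 72, 75]
age_control = [69, 50, 65, 48, 71, 45]

matches = greedy_caliper_match(ps_treated, ps_control, caliper=0.05, ratio=1)

# Covariate values restricted to the matched sample.
matched_treated_age = [age_treated[i] for i in matches]
matched_control_age = [age_control[j] for js in matches.values() for j in js]

smd_before = standardized_difference(age_treated, age_control)
smd_after = standardized_difference(matched_treated_age, matched_control_age)

print("matches:", matches)
print("standardized difference before matching:", round(smd_before, 3))
print("standardized difference after matching:", round(smd_after, 3))
```

Setting `ratio` to a value between 2 and 5 gives the 1:N variants. In this toy example the fourth treated unit has no control within the caliper and is dropped, which is the usual trade-off of caliper matching: better balance at the cost of a smaller matched sample.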


Conclusion: The study shows that covariate balance can be improved by a more accurate PS estimation model, obtained through GBM or comprehensive logistic regression, and both approaches should be encouraged in practice. In the case study, we also found that proton radiation therapy holds an improved clinical benefit for prostate cancer patients for long-term



Table of Contents



1.1 Observational Study

1.2 Propensity Score

1.3 Variable Selection for the Propensity Score Model

1.4 Propensity Score Calculation

  1.4.1 Main-effect Logistic Regression Model (LOGREG)

  1.4.2 Comprehensive Logistic Regression Model with all Two-way Interactions and Polynomial Terms (LOGREG(INT))

  1.4.3 Generalized Boosted Models (GBM)

1.5 Propensity Score Matching

  1.5.1 Greedy Matching

  1.5.2 1-1 to 1-N Caliper Matching

1.6 Treatment Effect

  1.6.1 Average Treatment Effect (ATE)

  1.6.2 Average Treatment Effect Among the Treated (ATT)

1.7 Checking balance on the covariates before and after matching



2.1 Study Objective

2.2 NCDB database

2.3 Define study population

2.4 Select the covariates

2.5 Statistical methods



3.1 Patients characteristics

3.2 Estimating propensity scores

3.3 PS Matching

3.4 Checking balance on the covariates before and after matching

  3.4.1 Greedy Matching

  3.4.2 1-1 to 1-N Caliper Matching








About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English