Robust Statistical Methods for Handling Missing Data

Watson, Domonique (2016)


Since the dawn of data collection, researchers have faced the problem of missing data. Data may be missing for many reasons. Dealing with missing data appropriately requires a careful examination of the data to identify the source, the pattern, and the missing data mechanism. It is well known that a naive analysis without adequate handling of missing data reduces statistical power, results in loss of efficiency, and potentially biases parameter estimates, which can ultimately lead to invalid conclusions. Multiple imputation (MI) is one of the most widely used methods for handling missing data. The key idea of MI is to replace each missing value with a set of plausible values drawn from their predictive distributions conditional on the observed data. Multiple imputed data sets are generated to account for the uncertainty in imputing missing values. We review the terminology and current literature on missing data in Chapter 1.
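As a minimal sketch of how the multiple imputed data sets are typically combined, the snippet below pools per-data-set estimates with the standard combining rules for MI (Rubin's rules). The abstract does not spell these rules out, so their use here is an assumption, and the function name `pool_rubin` is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances from m imputed data sets.

    Illustrative helper (not from the dissertation): applies the standard
    MI combining rules, T = W + (1 + 1/m) * B.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)           # pooled point estimate
    within = mean(variances)          # average within-imputation variance W
    between = variance(estimates)     # between-imputation variance B (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Example: three imputed data sets, each giving an estimate and its variance.
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the extra uncertainty due to imputing the missing values; with a single imputation it would be ignored.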

In Chapter 2, we aim to develop MI methods to handle missing data in the presence of high-dimensional data where the missing data mechanism is assumed to be ignorable. Existing MI methods implemented in most statistical software are not applicable or do not perform well in the high-dimensional setting, where the number of predictors is large relative to the sample size. To remedy this issue, we develop an MI approach that uses dimension reduction techniques. Specifically, our approach uses sure independence screening (SIS) followed by either sparse principal component analysis (sPCA) or sufficient dimension reduction regression in constructing imputation models in the presence of high-dimensional data. Our extensive simulation studies demonstrate that, in the presence of high-dimensional data, using SIS followed by sPCA to perform MI achieves better performance than the other imputation methods considered, including several existing imputation approaches. We further illustrate our approach using gene expression data from a prostate cancer study.
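The screening step can be illustrated with a rough sketch, assuming SIS is carried out in its usual form: rank candidate predictors by the absolute marginal correlation with the variable to be imputed and keep the top d. The subsequent sPCA or sufficient dimension reduction step is not shown, and the names `pearson` and `sis_screen` are hypothetical rather than taken from the dissertation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def sis_screen(columns, y, d):
    """Return the names of the d predictors with largest |marginal correlation|.

    columns: dict mapping predictor name -> list of values (complete cases only).
    y: observed values of the variable that will be imputed.
    """
    scores = {name: abs(pearson(col, y)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:d]


# Toy data: 'a' and 'c' are strongly related to y, 'b' is not.
y_obs = [1, 2, 3, 4, 5]
predictors = {
    "a": [2, 4, 6, 8, 10],
    "b": [1, -1, 1, -1, 1],
    "c": [5, 4, 3, 2, 2],
}
kept = sis_screen(predictors, y_obs, d=2)
```

Only the retained predictors would then be passed on to the dimension reduction step that builds the imputation model.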

In Chapter 3, we develop nonparametric imputation methods to handle non-ignorable missing data. Most imputation techniques are designed for ignorable missing data, since non-ignorability is a more challenging assumption to handle. Under non-ignorable missingness, one assumes the nonresponse mechanism depends on unobserved values, and the outcome model for the variable with missing values and the nonresponse model must be modeled jointly. Consequently, joint modeling can produce results that are sensitive to the misspecification of the outcome and nonresponse models. We propose a nonparametric method for handling non-ignorable missingness via bootstrap imputation and multiple imputation. The key idea underlying our proposed approach is to formulate two working models, one for the outcome and one for nonresponse. Using the two working models, we derive predictive scores, which achieve dimension reduction, and use the resulting scores coupled with a nearest neighbor hot deck to multiply impute missing values. Our approach allows users to incorporate prior knowledge on the working models through the use of weights. Compared with the existing MI methods, our approach is more robust to misspecification of the two models and allows for a natural sensitivity analysis. The proposed bootstrap imputation approach is shown to outperform several existing multiple imputation methods for non-ignorable missing data in simulations. In addition, the method is illustrated using data from the Georgia Coverdell Acute Stroke Registry.
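A nearest neighbor hot deck driven by two predictive scores might look roughly like the sketch below. It assumes the two scores have already been computed and put on a common scale, and it uses a weighted squared Euclidean distance with weights summing to one; the exact distance, the weighting scheme, and the name `hot_deck_draw` are illustrative assumptions, not the dissertation's definitions.

```python
import random


def hot_deck_draw(target, donors, w_outcome=0.5, k=3, rng=None):
    """Draw one imputed value for a nonrespondent from its k nearest donors.

    target: (outcome_score, nonresponse_score) for the subject with a missing value.
    donors: list of (outcome_score, nonresponse_score, observed_value) for respondents.
    w_outcome: weight on the outcome score; the nonresponse score gets 1 - w_outcome.
    """
    rng = rng or random.Random()
    w_resp = 1.0 - w_outcome

    def distance(donor):
        return (w_outcome * (donor[0] - target[0]) ** 2
                + w_resp * (donor[1] - target[1]) ** 2)

    nearest = sorted(donors, key=distance)[:k]   # the donor pool
    return rng.choice(nearest)[2]                # random draw from the pool


# Toy example: two close donors and one distant donor.
donor_pool = [(0.0, 0.0, 10.0), (0.1, 0.1, 11.0), (5.0, 5.0, 99.0)]
seeded = random.Random(0)
draws = [hot_deck_draw((0.05, 0.05), donor_pool, k=2, rng=seeded) for _ in range(20)]
```

Repeating the draw yields multiple imputations for the same missing value, and shifting `w_outcome` toward 0 or 1 expresses more trust in one working model than in the other.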

In Chapter 4, we aim to evaluate diagnostic methods for imputation models assuming a non-ignorable missing data mechanism. Most of the existing diagnostic approaches have been developed assuming the missing data mechanism is ignorable. As a consequence, they are not directly applicable to our nonparametric imputation methods for non-ignorable missing data. To address this issue, we adapt posterior predictive checking, with the posterior predictive p-value as the summary measure, to assess the performance of imputation models under non-ignorable missingness. In simulations, we correctly and incorrectly specify the imputation models and determine whether posterior predictive checking is useful in detecting discrepancies in the misspecified imputation model. Our extensive simulations suggest that, in the settings we evaluated, posterior predictive p-values can be useful in diagnosing deficiencies in non-ignorable imputation models. We also illustrate this approach using the Georgia Coverdell Acute Stroke Registry.

In Chapter 5, we present potential future work to extend our methodologies to handle additional problems that arise from missing data.
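The posterior predictive p-value used as the summary measure in Chapter 4 can be sketched as follows, assuming its usual definition: the proportion of draws in which a discrepancy statistic computed on replicated data is at least as large as the same statistic computed on the corresponding completed data. The function name and the toy inputs are hypothetical.

```python
def posterior_predictive_p_value(t_completed, t_replicated):
    """Share of paired draws where the replicated statistic >= the completed one.

    t_completed: discrepancy statistic on each completed (observed + imputed) data set.
    t_replicated: the same statistic on the matching replicated data set.
    """
    if not t_completed or len(t_completed) != len(t_replicated):
        raise ValueError("inputs must be non-empty and of equal length")
    exceed = sum(1 for c, r in zip(t_completed, t_replicated) if r >= c)
    return exceed / len(t_completed)


# Toy example with four paired draws.
p_value = posterior_predictive_p_value([1.0, 2.0, 3.0, 4.0], [1.5, 1.0, 3.5, 5.0])
```

Values close to 0 or 1 suggest that the imputation model fails to reproduce the feature of the data captured by the chosen statistic.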

Table of Contents

1 Introduction

1.1 Missing Data Problem

1.2 Missing Data Nomenclature

1.3 Methods to Handle Missing Data

1.3.1 Missing Completely at Random Methods

1.3.2 Missing at Random Methods

Techniques for low-dimensional data with missing values assuming MAR

Techniques for high-dimensional data with missing values assuming MAR

1.3.3 Missing Not at Random Methods

Parametric Imputation Techniques for MNAR data

Nonparametric Imputation Methods for MNAR data

1.4 Diagnostic Methods for Missing Data Models

1.5 Motivating examples

1.6 Outline

2 Multiple Imputation using Dimension Reduction Techniques for High-Dimensional Data

2.1 Introduction

2.1.1 Dimension Reduction Techniques for High-Dimensional Data

2.2 Methodology

2.3 Simulation studies

2.3.1 Simulation setup

2.3.2 Results

2.4 Data example

2.5 Discussion

3 Nonparametric Imputation for Nonignorable Missing Data

3.1 Introduction

3.2 Methodology

3.2.1 Nonparametric Bootstrap Imputation

3.2.2 Nonparametric Multiple Imputation

3.3 Simulation studies

3.3.1 Setup

3.3.2 Results

3.4 Motivating Example

3.5 Discussion

4 Evaluating Posterior Predictive Checking For Imputation Models Under the Missing Not at Random Assumption

4.1 Introduction

4.2 Missing Not at Random Imputation Methods

4.2.1 Random Indicator Method

4.2.2 Proxy Pattern Mixture Hotdeck

4.3 Diagnostic Methods

4.3.1 Posterior Predictive Checking

4.3.2 Discrepancy functions targeted to substantive inferences

4.4 Simulations

4.4.1 Goal

4.4.2 Simulation setting

4.4.3 Simulation Results

Effect of imputation model and analysis models

Effect of imputation method and proportion of missingness

Effect of mean and median posterior predictive p-values

4.5 Applications to Stroke Registry Data

4.6 Discussion

5 Future Work

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
  • English