Statistical Methods for Incomplete Big Data

Open Access

Deng, Yi (Fall 2017)

Advances in technology have led to the generation of enormous amounts of data, commonly known as "big data". This explosion of data in turn brings daunting challenges for data analysis and for drawing meaningful findings from big data. One major challenge is the occurrence of missing data: the insights drawn from the data may be compromised if missing values are handled inadequately. In this dissertation, we develop and investigate methods for handling missing data in the big data setting.

In Chapter 1, we first review the terminology of missing data and existing methods for handling incomplete data or big data. We then present distributed analyses of big data that are stored across multiple sources.

In Chapter 2, we develop two approaches that use regularized regression to impute missing values in the presence of high-dimensional big data. The approaches can accommodate incomplete data of mixed types and handle general missing data patterns. We compare our approaches with several existing imputation methods in simulation studies. The simulation results demonstrate that the proposed multiple imputation approach based on an indirect use of regularized regression outperforms the other imputation methods considered.

In addition to traditional types of data with missing values, this dissertation also investigates the handling of distributed incomplete data, with the purpose of protecting privacy. For example, in the case of medical patient data, institutions such as the Veterans Health Administration have policies that restrict their data to internal facilities. Under such circumstances, distributed analyses are necessary but challenging when data are subject to missing values. In Chapter 3, we propose privacy-preserving methods to handle missing data in distributed analyses for horizontally partitioned data. The methods target, in particular, data that are missing at random (MAR) and missing not at random (MNAR). In Chapter 4, we present privacy-preserving methods for vertically partitioned data with missing values.

Table of Contents

1 Introduction

           1.1  Missing-Data Patterns and Mechanisms

           1.2  Commonly-Used Missing Data Methods

                       1.2.1  Complete-Case Analysis

                       1.2.2  Single and Multiple Imputation

                       1.2.3  Inverse Probability Weighting

           1.3  Existing Methods for Handling Incomplete Big Data

                       1.3.1  Nonparametric Methods for Incomplete Big Data

                       1.3.2  Parametric Methods for Incomplete Big Data

           1.4  Distributed Analysis of Big Data

                       1.4.1  Distributed Analysis for Horizontally Partitioned Data

                       1.4.2  Distributed Analysis for Vertically Partitioned Data

                       1.4.3  Missing Data in Distributed Analysis

2 Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data

           2.1  Introduction

           2.2  Methodology

                       2.2.1  Multiple imputation by chained equations

                       2.2.2  Direct use of regularized regression for multiple imputation

                       2.2.3  Indirect use of regularized regression for multiple imputation

           2.3  Simulation Studies

           2.4  Data Examples

                       2.4.1  Georgia stroke registry data

                       2.4.2  Prostate cancer data

           2.5  Discussion

3 Privacy-Preserving Methods for Horizontally Partitioned Incomplete Data

           3.1  Introduction

           3.2  Methodology

                       3.2.1  Privacy-preserving inverse probability weighting for horizontally partitioned data

                       3.2.2  Privacy-preserving multiple imputation for horizontally partitioned data of univariate missing data patterns

                       3.2.3  Privacy-preserving multiple imputation for general missing data patterns

                       3.2.4  Sensitivity Analysis under MNAR assumption

           3.3  Simulation Studies

                       3.3.1  Simulation Study when Data are MAR

                       3.3.2  Simulation Study when Data are MNAR

           3.4  Data Example

           3.5  Discussion

4 Privacy-Preserving Methods for Vertically Partitioned Incomplete Data

           4.1  Introduction

           4.2  Methodology

                       4.2.1  Privacy-preserving inverse probability weighting for vertically partitioned data

                       4.2.2  Privacy-preserving multiple imputations for vertically partitioned data

           4.3  Simulation Studies

           4.4  Data Example

           4.5  Discussion

5 Summary and Future Work

           5.1  Summary

           5.2  Future Work

Appendix for Chapter 2

A.1 Details of MICE-DURR for three types of data

A.2 Details of MICE-IURR for three types of data


About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
  • English