Statistical Learning Methods for Big Biomedical Data

Li, Ziyi (Spring 2018)

Permanent URL: https://etd.library.emory.edu/concern/etds/m326m179b?locale=fr
Published

Abstract

The rapid advancement of biological and clinical technologies has generated several distinct types of big biomedical data, including -omics data and electronic health record data. Such data and their distinct features have created challenges in obtaining meaningful and applicable research findings. In this dissertation, we develop three statistical learning methods for the analysis of big biomedical data.

Principal component analysis (PCA) is a popular tool for dimensionality reduction, data mining, and visualization of high-dimensional data. It has been recognized that complex biological mechanisms occur through concerted relationships of multiple genes working in networks that are often represented by graphs. Recent work has shown that incorporating such biological information improves feature selection and prediction performance in regression analysis, but there has been limited work on extending this approach to PCA. In the first project, we propose two new sparse PCA methods, called Fused and Grouped sparse PCA, that enable incorporation of prior biological information in variable selection. This leads to improved feature selection and more interpretable principal component loadings, and potentially provides insight into the molecular underpinnings of complex diseases. Our simulation studies suggest that, compared to existing sparse PCA methods, the proposed methods achieve higher sensitivity and specificity when the graph structure is correctly specified, and are fairly robust to misspecified graph structures. Application to a glioblastoma gene expression dataset identified pathways that are suggested in the literature to be related to glioblastoma.
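
To make the idea of sparse loadings concrete, the following is a minimal sketch of a generic L1-penalized rank-one sparse PCA, computed by alternating power-iteration updates with soft-thresholding. It is not the Fused or Grouped sparse PCA proposed in the dissertation and uses no graph or group information; the function names `soft_threshold` and `sparse_pc1` and the penalty parameter `lam` are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry toward zero by lam; entries within [-lam, lam] become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pc1(X, lam, n_iter=200, tol=1e-8):
    """First sparse loading vector of X (rows = samples, columns = variables).

    Generic illustration only: alternates a score update u = Xv / ||Xv||
    with a soft-thresholded loading update v = S(X'u, lam) / ||S(X'u, lam)||.
    """
    X = X - X.mean(axis=0)                      # center each variable
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]                                   # start from the ordinary first loading
    for _ in range(n_iter):
        u = X @ v
        u_norm = np.linalg.norm(u)
        if u_norm == 0.0:
            break
        u = u / u_norm
        v_new = soft_threshold(X.T @ u, lam)
        v_norm = np.linalg.norm(v_new)
        if v_norm == 0.0:                       # penalty too large: all loadings removed
            return v_new
        v_new = v_new / v_norm
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    return v
```

Larger values of `lam` set more loadings exactly to zero. Structured variants such as those described above would replace the plain soft-thresholding step with a penalty that reflects the gene network.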

Electronic health record (EHR) data provide a promising opportunity to explore personalized treatment regimes and to make clinical predictions. Compared with genomics data, EHR data are known for their irregularity and complexity. In addition, analyzing EHR data involves privacy issues, and sharing such data among multiple research sites may not be feasible due to privacy concerns and regulatory hurdles. Recent work uses contextual embedding models and successfully builds a single predictive model for the analysis of EHR data from multiple sites for more than seventy common diagnoses. Although the existing model can achieve relatively high predictive accuracy, it cannot build global models without sharing data among sites. In the second project, we propose three novel contextual embedding methods for building predictive models, called Naive updates, Dropout updates, and Distributed Noise Contrastive Estimation (Distributed NCE). We also propose Distributed NCE with DP, an updated version of Distributed NCE, to obtain reliable privacy protection. Our simulation study with a real dataset demonstrates that the proposed methods not only can build predictive models in a distributed manner with privacy protection, but also preserve the model structure well and achieve prediction accuracy comparable to that of the hidden-truth model built with all the data.
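
The abstract does not spell out the update rules of the proposed methods, so the sketch below shows only the general pattern of training one model across sites without pooling records: each site computes a gradient on its own data, optionally clips it and adds Gaussian noise, and a coordinator averages what it receives. The logistic-regression model, the function names, and the parameters `clip` and `noise_scale` are all assumptions chosen for illustration; this is not Naive updates, Dropout updates, or Distributed NCE, and no formal privacy guarantee is claimed for these settings.

```python
import numpy as np

def local_gradient(w, X, y):
    """Mean logistic-loss gradient computed on a single site's data."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

def privatize(g, clip, noise_scale, rng):
    """Clip the gradient's L2 norm to `clip`, then add Gaussian noise
    with standard deviation noise_scale * clip (no noise if noise_scale is 0)."""
    norm = np.linalg.norm(g)
    if norm > clip:
        g = g * (clip / norm)
    if noise_scale > 0:
        g = g + rng.normal(0.0, noise_scale * clip, size=g.shape)
    return g

def distributed_step(w, sites, lr=0.1, clip=1.0, noise_scale=0.0, rng=None):
    """One round: every site sends a (possibly noised) gradient, never raw records;
    the coordinator averages the gradients and takes a gradient-descent step."""
    rng = rng if rng is not None else np.random.default_rng()
    grads = [privatize(local_gradient(w, X, y), clip, noise_scale, rng)
             for X, y in sites]
    return w - lr * np.mean(grads, axis=0)
```

With equally sized sites, no noise, and a clipping threshold large enough to be inactive, one such step coincides with a gradient step on the pooled data, which is the sense in which a global model can be fitted without moving the data.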

Biclustering techniques can identify local patterns in a data matrix by clustering rows and columns at the same time. Various biclustering methods have been proposed and successfully applied to analyze gene expression data. While existing biclustering methods have many desirable features, most of them are developed for continuous data, and none of them can handle genomic data of various types, for example, binomial data as in Single Nucleotide Polymorphism (SNP) data or negative binomial data as in RNA-seq data. In addition, none of the existing methods can utilize biological information such as that from functional genomics or proteomics. Recent work has shown that incorporating biological information can improve variable selection and prediction performance in analyses such as linear regression and multivariate analysis. In the third project, we propose a novel Bayesian biclustering method that can handle multiple data types, including Gaussian, binomial, negative binomial, and Poisson data. In addition, our method uses a Bayesian adaptive structured shrinkage prior that enables feature selection guided by biological information such as that from functional genomics. Our simulation studies and application to multiple genomics datasets demonstrate robust and superior performance of the proposed method compared to other existing biclustering methods.
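
As a concrete example of what a "local pattern" means, the sketch below computes the mean squared residue used by the classical Cheng and Church (CC) algorithm, one of the greedy methods listed in section 1.4.1 of the table of contents. It scores how closely a chosen set of rows and columns follows an additive row-plus-column pattern. This is background illustration only, not the Bayesian method proposed in the dissertation; the function name `mean_squared_residue` is an assumption.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Cheng-Church score of the submatrix A[rows, cols].

    The residue of an entry is a_ij - (row mean) - (column mean) + (overall mean),
    all taken within the submatrix. A score of 0 means the submatrix is
    perfectly additive (a_ij = r_i + c_j).
    """
    sub = A[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)
    col_mean = sub.mean(axis=0, keepdims=True)
    residue = sub - row_mean - col_mean + sub.mean()
    return float(np.mean(residue ** 2))
```

A greedy biclustering method in this family searches for row and column subsets whose score falls below a chosen threshold; model-based methods instead describe the bicluster structure through distribution parameters.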

For future work, we can continue in the direction of the first topic and explore extensions of sparse PCA that incorporate neural networks; continue in the direction of the second topic and replace Word2Vec with more recently proposed embedding approaches; or continue in the direction of the third topic and incorporate subject-level phenotype information into the biclustering process.

Table of Contents

1 Introduction

1.1 Overview of Big Biomedical Data

1.2 Principal Component Analysis (PCA): A multivariate analysis method

1.2.1 Sparse PCA

1.2.2 Sparse PCA with structural information

1.3 Predictive Model Construction using EHR data

1.3.1 Analyzing EHR Data with NDL Methods

1.3.2 Deep Learning Methods

1.3.3 Analyzing EHR data using DL Methods

1.4 Biclustering

1.4.1 Greedy algorithms: CC, xMotifs, and ISA

1.4.2 Distribution parameter identification algorithms: Plaid and FABIA

1.5 Motivation Examples

1.6 Outlines

2 Incorporating Biological Information in Sparse Principal Component Analysis with Application to Genomic Data

2.1 Introduction

2.2 Methods

2.2.1 Standard and Sparse Principal Component Analysis

2.2.2 Grouped sparse PCA

2.2.3 Fused sparse PCA

2.2.4 Algorithms

2.3 Simulation

2.3.1 Simulation Settings

2.3.2 Simulation Results

2.4 Application to the Glioblastoma Data

2.5 Discussion

3 Distributed learning from multiple EHR databases: Contextual embedding models for medical events

3.1 Introduction

3.2 Preliminaries

3.2.1 Skip-gram Model

3.2.2 Patient-Diagnosis Projection Similarity Model

3.2.3 Distributed Noise Contrastive Estimation

3.2.4 Distributed Noise Contrastive Estimation with Privacy Protection

3.3 Two alternative solutions

3.3.1 Naive updates

3.3.2 Dropout updates

3.4 Numerical study with real data

3.4.1 Data and Data preprocess

3.4.2 Settings

3.4.3 Tuning parameters

3.4.4 Results

3.5 Conclusion and Discussion

4 Bayesian Generalized Biclustering Analysis via Adaptive Structured Shrinkage

4.1 Introduction

4.2 Methodology

4.2.1 Prior Specification

4.2.2 Computation

4.3 Simulation

4.3.1 Settings

4.3.2 Methods

4.3.3 Evaluation Criteria

4.3.4 Results

4.4 Real data applications

4.4.1 Gene expression datasets

4.4.2 Proteomics dataset

4.4.3 RNAseq dataset

4.4.4 Integrative dataset

4.5 Conclusion

5 Future work

A Appendix for Chapter 2

B Appendix for Chapter 3

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified
