Statistical Methods for Biomedical Network Data

Cai, Qingpo (Summer 2018)

Permanent URL: https://etd.library.emory.edu/concern/etds/cf95jb516?locale=it
Published

Abstract

A biological system contains tens of thousands of units, and network representations are widely used to describe the interactions between them. Studying biological networks is key to understanding complex biological activities. In this dissertation, we develop statistical methods for analyzing biological network data, aiming to find subnetworks or network markers strongly associated with the clinical outcome of interest.

Selecting informative nodes over large-scale networks is becoming increasingly important in many research areas. Most existing methods focus on the local network structure and incur heavy computational costs for large-scale problems. In the first project, we propose a novel prior model for Bayesian network marker selection in the generalized linear model (GLM) framework: the Thresholded Graph Laplacian Gaussian (TGLG) prior, which adopts the graph Laplacian matrix to characterize the conditional dependence between neighboring markers while accounting for the global network structure. Under mild conditions, we show that the proposed model enjoys posterior consistency with a diverging number of edges and nodes in the network. We also develop a Metropolis-adjusted Langevin algorithm (MALA) for efficient posterior computation, which is scalable to large-scale networks. We illustrate the advantages of the proposed method over existing alternatives via extensive simulation studies and an analysis of the breast cancer gene expression dataset in The Cancer Genome Atlas (TCGA).
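As a rough illustration of how a graph Laplacian can encode conditional dependence between neighboring markers, the following sketch builds L = D - A for a toy four-node path network and forms a precision matrix L + eps*I. This is not the TGLG prior itself: the function names, the toy edge list, and the ridge term eps (added here only to make the matrix positive definite) are assumptions made for illustration. The point it shows is a general fact about Gaussian models: a zero off-diagonal entry in the precision matrix means the two nodes are conditionally independent given the rest, so only network neighbors are directly coupled.

```python
def graph_laplacian(n_nodes, edges):
    """Return L = D - A (list of lists) for an undirected, unweighted graph.

    Assumes each undirected edge appears once in `edges`.
    """
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        if i == j:
            continue  # ignore self-loops
        lap[i][j] -= 1.0  # off-diagonal: minus the adjacency entry
        lap[j][i] -= 1.0
        lap[i][i] += 1.0  # diagonal: node degree
        lap[j][j] += 1.0
    return lap


def regularized_precision(lap, eps=0.1):
    """Return L + eps * I; eps is a hypothetical ridge term for this sketch."""
    n = len(lap)
    return [[lap[i][j] + (eps if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Toy path network 0 - 1 - 2 - 3 (hypothetical, not from the dissertation).
edges = [(0, 1), (1, 2), (2, 3)]
L = graph_laplacian(4, edges)
Q = regularized_precision(L, eps=0.1)

# Nodes 0 and 2 are not neighbors, so their precision entry is zero.
non_neighbor_entry = Q[0][2]
# Nodes 0 and 1 are neighbors, so their precision entry is nonzero.
neighbor_entry = Q[0][1]
```

In this toy example, each row of L sums to zero, and the only nonzero off-diagonal entries of Q sit on the edges of the network.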

Untargeted metabolomics using high-resolution liquid chromatography-mass spectrometry (LC-MS) is becoming one of the major areas of high-throughput biology. Functional analysis, i.e., analyzing the data based on metabolic pathways or the genome-scale metabolic network, is critical in feature selection and interpretation of metabolomics data. One of the main challenges in functional analysis is the lack of feature identity in the LC-MS data itself. By matching the mass-to-charge ratio (m/z) values of the features to theoretical values derived from known metabolites, some features can be matched to one or more known metabolites. When a feature matches multiple metabolites, in most cases only one of the matches can be true. At the same time, some known metabolites are missing from the measurements. Current network/pathway analysis methods ignore the uncertainty in metabolite identification and the missing observations, which could lead to errors in the selection of significant subnetworks/pathways. In the second project, we propose a flexible network feature selection framework that combines metabolomics data with the genome-scale metabolic network. The method adopts a sequential feature screening procedure and machine learning-based criteria to select important subnetworks and identify the optimal feature matching simultaneously. Simulation studies show that the proposed method has a much higher sensitivity than the commonly used maximal matching approach. For demonstration, we apply the method to a cohort of healthy subjects to detect subnetworks associated with the Body Mass Index (BMI). The method identifies several subnetworks that are supported by the current literature, and also detects some subnetworks with plausible new functional implications.
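The matching ambiguity described above can be made concrete with a small sketch. It matches measured m/z values to theoretical values within a parts-per-million tolerance. The function name, the 10 ppm default, and all identifiers and numeric values below are made up for illustration; they are not taken from the dissertation or from any metabolite database, and this is not the proposed selection method, only the matching step that precedes it.

```python
def match_features(feature_mz, metabolite_mz, tol_ppm=10.0):
    """Map each feature ID to the sorted list of metabolite IDs whose
    theoretical m/z lies within tol_ppm parts per million of the feature."""
    matches = {}
    for f_id, mz in feature_mz.items():
        hits = [m_id for m_id, theo in metabolite_mz.items()
                if abs(mz - theo) / theo * 1e6 <= tol_ppm]
        matches[f_id] = sorted(hits)
    return matches


# Hypothetical theoretical m/z values for four known metabolites.
metabolite_mz = {
    "met_A": 180.0634,
    "met_B": 180.0640,
    "met_C": 250.1000,
    "met_D": 300.2000,
}

# Hypothetical measured features.
feature_mz = {"f1": 180.0637, "f2": 250.1010, "f3": 410.5000}

matches = match_features(feature_mz, metabolite_mz)

# Metabolites that no feature matched (missing from the measurements).
matched = {m for hits in matches.values() for m in hits}
unmatched = set(metabolite_mz) - matched
```

With these toy numbers, feature f1 falls within tolerance of both met_A and met_B (a multiple match, of which at most one can be true), f3 matches nothing, and met_D is never observed. These are the two sources of uncertainty the paragraph says current methods ignore.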

Mediation analysis is a modeling framework for studying the relationship between an independent variable (exposure) and a dependent variable (outcome) by including a mediator variable. Traditionally, mediation analysis has been developed under regression and causal inference frameworks, which focus on measuring or testing the mediation effect. As an alternative to the existing framework, in the third project we propose a new mediation analysis framework focusing on predictive modeling. We propose new definitions for the predictive exposure, predictive mediator, and predictive network mediator. An estimation procedure is proposed to identify the predictive exposure and predictive mediator, and simulation studies are conducted to illustrate its performance. Two greedy algorithms are proposed to identify network mediators for single and multiple exposure variables, and they are applied to a dataset from the Emory-Georgia Tech Predictive Health Initiative Cohort of the Center for Health Discovery and Well Being.

 

Table of Contents

1 Introduction
1.1 Overview
1.2 Variable selection methods for genomic network data
1.3 Metabolomic network data
1.4 Mediation analysis
1.5 Outline

2 Bayesian network marker selection via the thresholded graph Laplacian Gaussian prior
2.1 Introduction
2.2 The model
2.3 Theoretical Properties
2.4 Posterior Computation
2.5 Numerical Studies
2.5.1 Simulation studies
2.5.2 Scale-free network
2.5.3 Application to breast cancer data from the Cancer Genome Atlas
2.6 Discussion

3 Network Marker Selection for Untargeted LC-MS Metabolomics Data
3.1 Introduction
3.2 Method
3.2.1 The setup of the problem
3.2.2 Metabolic ego networks
3.2.3 Optimal matching
3.3 Simulation
3.4 Application
3.4.1 Dataset
3.4.2 Results
3.5 Discussion and Conclusion

4 A new framework for predictive network mediator analysis
4.1 Introduction
4.2 A predictive mediation analysis framework
4.2.1 The area under the curve (AUC)
4.2.2 Predictive mediation analysis framework
4.2.2.1 Predictive exposure
4.2.2.2 Predictive mediator
4.2.2.3 Building network
4.2.2.4 Predictive network mediator for single exposure
4.2.3 Estimation procedure and algorithm
4.2.3.1 Estimation for predictive exposure
4.2.3.2 Estimation for predictive mediator
4.2.3.3 Greedy algorithms for predictive network mediator
4.3 Simulation
4.4 Real data application

Future work

A Appendix for Chapter 2
A.0.1 Regularity conditions
A.0.2 Lemmas
A.0.3 Proof for Theorem 1
A.0.4 Proof for Theorem 2

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified
