Network-Based Machine Learning Methods for Omics Data Open Access

Kong, Yunchuan (Spring 2020)

Permanent URL: https://etd.library.emory.edu/concern/etds/w3763788r?locale=en

Published

Abstract

In the field of bioinformatics, large-scale biological networks play an essential role in studying transcriptomic data. Because networks carry useful relational information, tasks involving biological networks range from systems biology and statistical modeling to machine learning. In this dissertation, focusing on different roles of biological networks, we explore both the construction of networks and their integration with statistical methods. On the one hand, we have found hypergraphs, an extension of traditional networks, to be an excellent tool for representing higher-order interactive relationships among biological units and for analyzing complex systems; on the other hand, we have discovered that incorporating known biological networks and constructing biological feature networks can help improve certain supervised machine learning algorithms.

The first topic of this dissertation concerns the highly dynamic biological regulatory system. It has been shown that correlations between certain functionally related genes change across different biological conditions, and these conditions are often unobserved in the data. At the gene level, the dynamic correlations result in three-way gene interactions involving a pair of genes whose correlation changes and a third gene that reflects the underlying cellular conditions. This type of ternary relation can be quantified by the Liquid Association statistic. Studying these three-way interactions at the gene triplet level has revealed important regulatory mechanisms in the biological system. Currently, because of the extremely large number of possible triplets within a high-throughput gene expression dataset, no method is available to examine the ternary relationship at the biological system level. Hence, in Chapter 2, we propose a new method, Hypergraph for Dynamic Correlation (HDC), to construct module-level three-way interaction networks. The method presents integrative uniform hypergraphs that reflect the global dynamic correlation pattern in the biological system, providing guidance for downstream gene triplet-level analyses. To validate the method, we conducted two real data experiments using a melanoma RNA-seq dataset from The Cancer Genome Atlas (TCGA) and a yeast cell cycle dataset. The resulting hypergraphs are biologically plausible and suggest novel relations relevant to the biological conditions in the data. We believe the new approach provides a valuable alternative for analyzing omics data, one that can extract higher-order structures.
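As an illustration of the ternary relation described above, the following is a minimal sketch of how a Liquid Association score might be estimated for one gene triplet. It assumes the commonly used simplification in which, for standardized profiles and an approximately normal third gene, the statistic reduces to the mean of the triple product; the function names and the toy data are hypothetical and are not taken from the dissertation.

```python
import math
import random


def standardize(values):
    """Center to mean 0 and scale to unit variance (population SD)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def liquid_association(x, y, z):
    """Estimate LA(X, Y | Z) as the mean triple product of standardized profiles.

    Assumption: z is roughly normal, so the mean of x*y*z approximates the
    average change in the x-y co-expression as z increases.
    """
    xs, ys, zs = standardize(x), standardize(y), standardize(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)


# Toy data (hypothetical): z plays the role of the hidden cellular condition.
rng = random.Random(0)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
# y is positively correlated with x when z > 0 and negatively otherwise.
y = [(xi if zi > 0 else -xi) + 0.5 * rng.gauss(0, 1) for xi, zi in zip(x, z)]
# w is correlated with x in the same way regardless of z.
w = [xi + 0.5 * rng.gauss(0, 1) for xi in x]

la_dynamic = liquid_association(x, y, z)  # correlation of (x, y) depends on z
la_static = liquid_association(x, w, z)   # correlation of (x, w) does not
```

In this toy setting `la_dynamic` should be clearly positive and `la_static` close to zero, even though both pairs involve strongly related genes. Scoring every triplet this way is what becomes infeasible at genome scale, which is the motivation for a module-level approach.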

In the second topic of this dissertation, we aim to solve a unique challenge in predictive modeling for gene expression data, which usually have small sample sizes $(n)$ compared to the huge number of features $(p)$. This ``$n\ll p$'' property has hampered the application of deep learning techniques to disease outcome classification. Recent literature shows that sparse learning that incorporates external gene network information could be a potential solution to this issue. To build a robust classification model, we propose the Graph-Embedded Deep Feedforward Network (GEDFN) in Chapter 3, which integrates external relational information about the features into the deep neural network architecture. The method achieves sparse connections between network layers to prevent overfitting. To validate the method, we conducted both simulation experiments and real data analysis using a Breast Invasive Carcinoma (BRCA) RNA-seq dataset and a Kidney Renal Clear Cell Carcinoma (KIRC) RNA-seq dataset from The Cancer Genome Atlas (TCGA). The resulting high classification accuracy and easily interpretable feature selection results suggest that the method is a useful addition to current graph-guided classification models and feature selection procedures.
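One way to picture the sparse connection described above is a layer whose weight matrix is masked elementwise by the adjacency matrix of the feature graph, so that a feature only feeds hidden units corresponding to itself and its graph neighbors. The sketch below assumes that design (a hidden layer with one unit per feature, self-loops included, ReLU activation); the function name and the four-gene graph are hypothetical, and the dissertation's actual architecture may differ in detail.

```python
import random


def graph_embedded_layer(x, weights, adjacency, bias):
    """Compute ReLU(x @ (W * A) + b) for a single sample.

    Feature i contributes to hidden unit j only if adjacency[i][j] == 1,
    so the number of effective weights equals the number of nonzero
    entries in the adjacency matrix rather than p * p.
    """
    p = len(x)
    out = []
    for j in range(p):
        total = bias[j]
        for i in range(p):
            total += x[i] * weights[i][j] * adjacency[i][j]
        out.append(max(0.0, total))
    return out


# Hypothetical feature graph on 4 genes: edges 0-1 and 2-3, plus self-loops.
p = 4
edges = [(0, 1), (2, 3)]
adjacency = [[1 if i == j else 0 for j in range(p)] for i in range(p)]
for i, j in edges:
    adjacency[i][j] = adjacency[j][i] = 1

rng = random.Random(1)
weights = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(p)]
bias = [0.0] * p

# Two samples that differ only in genes 2 and 3.
h1 = graph_embedded_layer([1.0, 2.0, 3.0, 4.0], weights, adjacency, bias)
h2 = graph_embedded_layer([1.0, 2.0, -5.0, 9.0], weights, adjacency, bias)

# Number of weights that can be nonzero under the mask.
n_active = sum(sum(row) for row in adjacency)
```

Because genes 2 and 3 are not connected to genes 0 and 1, hidden units 0 and 1 are identical for the two samples. Here only 8 of the 16 possible weights are active; with a sparse gene network over thousands of features the reduction would be far larger, which is the sense in which the graph limits model capacity when $n\ll p$.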

The third topic of this dissertation is an extension of the second topic. Faced with the ``$n\ll p$'' challenge in predictive modeling, the GEDFN model, which performs sparse learning by incorporating known functional relations between the biological units, was shown in Chapter 3 to be a solution to this issue. However, such methods require an existing feature graph, and potential mis-specification of the feature graph can be harmful to classification and feature selection. To address this limitation and develop a robust classification model that does not rely on external knowledge, we propose a \underline{for}est \underline{g}raph-\underline{e}mbedded deep feedforward \underline{net}work (forgeNet) model in Chapter 4, which integrates the GEDFN architecture with a forest feature graph extractor, so that the feature graph can be learned in a supervised manner and constructed specifically for a given prediction task. As in Chapter 3, to validate the method, we tested the forgeNet model on both synthetic and real datasets. The resulting high classification accuracy suggests that the method is a valuable addition to sparse deep learning models for omics data.
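To make the idea of a forest feature graph extractor more concrete, the sketch below derives a directed feature graph from a set of decision trees by linking each split feature to the split features of its child nodes. This is an assumed construction for illustration only: the nested-tuple tree encoding `(feature, left, right)`, the function names, and the gene labels are hypothetical, and in practice the trees would come from a forest fitted to the training labels.

```python
def tree_edges(node, edges=None):
    """Collect directed (parent_feature, child_feature) pairs from one tree.

    A node is either None (a leaf) or a tuple (feature, left, right).
    """
    if edges is None:
        edges = set()
    if node is None:
        return edges
    feature, left, right = node
    for child in (left, right):
        if child is not None:
            edges.add((feature, child[0]))
            tree_edges(child, edges)
    return edges


def forest_feature_graph(forest):
    """Union the parent-child split edges over all trees in the forest."""
    graph = set()
    for tree in forest:
        graph |= tree_edges(tree)
    return graph


# Hypothetical forest of two small trees over genes g1..g4.
forest = [
    ("g1", ("g2", None, None), ("g3", ("g4", None, None), None)),
    ("g2", ("g1", None, None), None),
]

graph = forest_feature_graph(forest)
# Features that appear in at least one split, i.e. the vertices of the graph.
selected = sorted({feature for edge in graph for feature in edge})
```

For this toy forest the graph contains the edges g1→g2, g1→g3, g3→g4, and g2→g1. Because the trees are grown against the outcome, the resulting graph is specific to the prediction task, and features that never appear in a split are dropped, which gives a graph that could then be embedded in a network layer in place of an external gene network.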

In future work, possible directions include further exploring the integration of biological networks and statistical modeling. Some research areas in this direction are already established, such as the Graph Convolution Network (GCN). In addition, following our construction of hypergraphs in the first topic, it is tempting to study further applications beyond the scientific findings themselves.

Table of Contents

1 Introduction

2 HDC: Hypergraph for Dynamic Correlation

3 GEDFN: Graph-Embedded Deep Feedforward Network

4 forgeNet: Forest Graph-Embedded Deep Feedforward Network

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Biostatistics
Degree	Ph.D.
Submission	Dissertation
Language	English
Research Field	Biology, Bioinformatics; Statistics; Computer Science
Keyword	Deep Learning; Machine Learning; Network Analysis; Computational Statistics
Committee Chair / Thesis Advisor	Tianwei Yu, Emory University
Committee Members	Zhaohui Qin, Emory University; Glen Satten, CDC; Hao Wu, Emory University

Primary PDF

Title	Date Uploaded
Network-Based Machine Learning Methods for Omics Data	2020-02-27 15:37:57 -0500
