The Applications of Big Data Analytics and Machine Learning in Managing, Processing and Analyzing Big Omics Data

Sun, Xiaobo (Summer 2018)

Permanent URL: https://etd.library.emory.edu/concern/etds/1n79h434j?locale=fr
Published

Abstract

The development of high-throughput genomics technologies has resulted in massive quantities of diverse omics data. However, existing dataset search tools rely almost exclusively on metadata. To overcome this limitation, we have developed Omicseq, a novel web-based platform that facilitates the interrogation of omics datasets beyond their metadata. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that uses the numerical content of each dataset to determine its relevance to the query entity. The Omicseq system is supported by a scalable NoSQL database that hosts a large collection of processed omics datasets, and it provides a web-based user interface for search and querying.

In addition, operations on big omics data can be a challenge for traditional single-machine methods. For example, sorted merging of a large number of Variant Call Format (VCF) files is frequently encountered in large-scale whole-genome sequencing projects. We custom-design optimized schemas for Hadoop (MapReduce), HBase and Spark to perform sorted merging of massive genome-wide data. These schemas all adopt a divide-and-conquer strategy, splitting the work into tasks that are processed in an ordered, parallel and bottleneck-free way. Our experiments on merging VCF files suggest that all three schemas either deliver a significant improvement in efficiency or show much better strong and weak scalability than traditional methods such as VCFtools, thus providing generalized, scalable schemas for sorted merging of genetics and genomics data on these Apache distributed systems.
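
As a rough illustration of the divide-and-conquer idea behind sorted merging, the following single-machine Python sketch partitions each input by chromosome and then merges every partition independently. It is not the Hadoop, HBase or Spark schema developed in the dissertation. The record layout, the `CHROM_ORDER` table and the function names are assumptions made only for this example.

```python
import heapq
from collections import defaultdict

# Hypothetical record layout: (chromosome, position, sample_id, genotype).
# Each input list stands in for one VCF file already sorted by position
# within each chromosome.
CHROM_ORDER = {"1": 0, "2": 1, "X": 2}  # illustrative subset of chromosomes


def partition_by_chrom(records):
    """Divide step: group one file's records by chromosome, keeping order."""
    parts = defaultdict(list)
    for rec in records:
        parts[rec[0]].append(rec)
    return parts


def sorted_merge(files):
    """Merge several position-sorted inputs into one sorted list."""
    partitions = [partition_by_chrom(f) for f in files]
    chroms = sorted({c for p in partitions for c in p},
                    key=CHROM_ORDER.__getitem__)
    merged = []
    for chrom in chroms:
        # Conquer step: chromosomes are independent of each other, so each
        # iteration could be handed to a separate worker.
        streams = [p[chrom] for p in partitions if chrom in p]
        merged.extend(heapq.merge(*streams, key=lambda rec: rec[1]))
    return merged


file_a = [("1", 100, "s1", "0/1"), ("1", 300, "s1", "1/1"), ("X", 50, "s1", "0/0")]
file_b = [("1", 200, "s2", "0/0"), ("2", 10, "s2", "0/1"), ("X", 40, "s2", "1/1")]
result = sorted_merge([file_a, file_b])
```

Because each chromosome partition is merged on its own and the partitions are concatenated in a fixed order, the output stays globally sorted without a single final sort over all records.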

Methylation level changes at CpG sites are associated with specific diseases such as Alzheimer’s disease. However, quantifying these changes across the whole genome remains a challenge, especially at CpG sites not covered by array-based technologies. In this study, we develop an ensemble feature selection and classification model that identifies the features most relevant to CpG methylation level changes in a specific disease from a comprehensive collection of genome-wide precomputed epigenomic profiles, and that predicts methylation level changes at CpGs beyond the array. The model therefore provides insight into the mechanism behind CpG methylation level changes in a specific disease, as well as an approach to evaluating an individual’s risk for that disease.

Table of Contents

1  Introduction
    1.1  The Big Omics Data
    1.2  Outline

2  OmicSeq: A web-based search engine for exploring omics datasets
    2.1  Introduction
    2.2  Methods
        2.2.1  Processing Different Data Types
        2.2.2  TrackRank Algorithm
    2.3  Results
        2.3.1  The Current Release of Omicseq Search Engine
        2.3.2  System Architecture
        2.3.3  Omicseq Web Server
        2.3.4  Database
        2.3.5  User Cases
    2.4  Discussion

3  Optimized Distributed Systems Achieve Significant Performance Improvement on Sorted Merging of Massive VCF Files
    3.1  Introduction
    3.2  Methods
        3.2.1  Overview
        3.2.2  Data Formats and Operations
        3.2.3  MapReduce (Hadoop) Schema
        3.2.4  HBase Schema
        3.2.5  Spark Schema
        3.2.6  Parallel Multiway-Merge and MPI-based High Performance Computing Implementations
        3.2.7  Strong and Weak Scalabilities
    3.3  Results
        3.3.1  Overall Performance Analysis of Clustered-based Schemas
        3.3.2  Strong and Weak Scalabilities of Apache Cluster-based Schemas and Traditional Parallel Methods
        3.3.3  The Anatomic Performances Analysis of Apache Cluster-based Schemas
        3.3.4  Execution Speed Comparisons Among Traditional Methods and Apache Cluster-based Schemas
    3.4  Discussion

4  Ensemble Learning of Changes of CpG Methylation Levels
    4.1  Introduction
    4.2  Methods
        4.2.1  Overview
        4.2.2  Datasets and Prediction Features
        4.2.3  Experimental Dataset Construction
        4.2.4  Feature Selection
        4.2.5  Feature Engineering
        4.2.6  Data Visualization
        4.2.7  Model Selection and Hyperparameter Tuning
        4.2.8  Prediction and Model Evaluation
    4.3  Results
        4.3.1  Selected Features
        4.3.2  Predictions on CpG Methylation Level Changes on Alzheimer’s Disease
        4.3.3  Predictions on CpG Methylation Level Changes on Placenta with Arsenic Exposure
    4.4  Discussion

5  Summary

List of Figures

2.1  Illustration of ranking genomic datasets of different types.
2.2  The architecture of Omicseq web system.
2.3  Search interface and result pages for KLK3, ERBB2 and PTEN.
3.1  Converting VCF files to TPED.
3.2  The workflow chart of MapReduce Schema.
3.3  The workflow chart of HBase schema.
3.4  The workflow chart of Spark schema.
3.5  The execution plan of HPC-based implementation.
3.6  The scalability of Apache cluster-based schemas on input data size.
3.7  Comparing the strong scalability between traditional parallel/distributed methods and Apache cluster-based schemas.
3.8  Comparing the weak scalability between traditional parallel/distributed methods and Apache cluster-based schemas.
3.9  The performance anatomy of cluster-based schemas on increasing input data size.
3.10  Execution speed comparison among Apache cluster-based schemas and Traditional methods.
3.11  The MapReduce Schema.
3.12  The HBase Schema.
3.13  The Spark Schema.
4.1  Illustration of the ensemble learning workflow.
4.2  ROC and precision-recall curves of the ensemble model and its component models in evaluations of predictions on the dataset of Alzheimer’s disease.
4.3  ROC and precision-recall curves of the ensemble model and its component models in evaluations of predictions on the RICHS dataset.

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified
