Application of the DIKW Model in Malaria Systems Biology: From NGS Data to Disease Progression Insight

Chien, Jung-Ting (2017)

Permanent URL: https://etd.library.emory.edu/concern/etds/hx11xg01f?locale=zh
Published

Abstract

The data, information, knowledge and wisdom (DIKW) model is widely used in data science to build a comprehensive view of a domain. It represents the understanding of domain knowledge as a hierarchy. In systems biology, the DIKW model can reveal insights by integrating different types of omics data into a comprehensive understanding.

The foundation of systems biology is mining genomics data with machine learning. As the use of high-throughput next-generation sequencing (NGS) applications grows, genomics research has entered the big data era. NGS applications fall into two major categories, short-read and long-read techniques, which are distinguished by fundamental differences in how they generate reads. A read is the fundamental element of genomic information. Short-read applications have been widely applied in several fields of genomics research, whereas long-read applications came to market only in 2011. Long-read applications have shown the potential to address several kinds of genomic questions. However, obtaining a well-defined genome still poses a number of challenges in malaria systems biology research, and these challenges prevent researchers from understanding the mechanisms of malaria disease progression.

To tackle these challenges, we built a novel long-read NGS pipeline from third-party modules, which we modified to solve complicated Plasmodium genome assembly problems. These techniques provided a solution where traditional short-read technologies could not, because of the highly repetitive nature of the Plasmodium genome. We also implemented infrastructure to address data management difficulties and developed several novel and robust pipelines to process and analyze the data. We host this pipeline alongside third-party applications for data quality control, generic data visualization and data management. The pipeline is also scalable and flexible enough to combine different technologies (long reads and short reads) to assemble the Plasmodium genome and conduct downstream annotation.

This dissertation gives an overview of omics research in the big data era and shows how DIKW models can be applied by mining genomics data. It also provides a detailed discussion of how to apply our platform to specific problems, including multiple Plasmodium genome assemblies and annotations, and an initial discussion of applying machine learning approaches to host-pathogen transcriptome analysis and its data mining applications.

Table of Contents

Abstract

Acknowledgements

Table of Contents

Chapter 1, Introduction

1.1 Motivation

1.2 Key Questions

1.3 Dissertation Hypotheses

1.5 Contributions

1.6 References

Chapter 2, Systems Biology in Big Data Era

2.1 Introduction to the Fundamental Elements of Systems Biology

2.2 Introduction to Current Genomics

2.3 Introduction to Current Proteomics

2.4 Introduction to Current Metabolomics

2.5 Malaria Genomics

2.6 Malaria Transcriptomics

2.7 Transcriptional Regulation: Epigenetics

2.8 The DIKW Model in Malaria Systems Biology

2.9 Malaria Systems Biology: Data Science from Conceptualization to Action

2.10 References

Chapter 3, Data Management Framework

3.1 Introduction to the Infrastructure of Long-Reads Genomic Data Collection

3.2 The Information Scheme in UML

3.3 Workflow Design

3.4 Discussion

3.5 References

Chapter 4, Large Genome Assembly & Annotation - 3rd Generation Sequencing-Based Workflow

4.1 The Challenges of Assembling a Plasmodium Genome

4.2 Assembly Procedure

4.3 Assembly Evaluation - P. coatneyi

4.4 Annotation Strategy

4.4.1 Genome Annotation Workflow

4.5 Repetitive Gene Families Analysis

4.6 Mitochondria and Apicoplast Genome

4.7 Case Studies of Robustness - Applying the Assembly & Annotation Pipeline to Other Species of Plasmodium

4.8 Combining Hi-C Assembly with the PacBio Assembly Pipeline

4.9 Discussion

4.10 References

Chapter 5, Using Data Mining Approaches to obtain the Insights from Genomics & Transcriptome Data

5.1 Machine Learning in Systems Biology

5.2 Data Mining Applications in Malaria Host-Pathogen Transcriptome Analysis

5.3 The Initial Results of Malaria Host-Pathogen Transcriptome Analysis

5.3.1 Pathogen Transcriptome Analysis

5.3.2 Host Transcriptome Analysis

5.3.3 Acute phase DEGs Analysis

5.3.4 Chronic phase DEGs Analysis

5.4 Discussion

5.5 Conclusion

5.6 Supplementary

5.7 References

Appendix

Appendix A

Appendix B

Appendix C

Appendix D

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research field
Keywords
Committee Chair / Thesis Advisor
Committee Members
Last modified
