Disease Risk annotation of Genomic and Epigenomic Variants using Machine Learning Approaches Open Access

Huang, Yanting (Summer 2022)

Permanent URL: https://etd.library.emory.edu/concern/etds/70795903r?locale=en


High-throughput genomics technologies now produce omics data in tremendous quantities. Understanding the impact of genomic variations and epigenomic modifications is important for uncovering the mechanisms of complex diseases. Over the last two decades, thousands of genome-wide association studies (GWASs) and epigenome-wide association studies (EWASs) have identified tens of thousands of disease-susceptibility loci. Beyond these association studies, recent progress in machine learning and deep learning has created opportunities to integrate omics data and to uncover complex relationships among features drawn from different regulatory factors, supporting disease risk annotation of genomic and epigenomic variants. Using comprehensive omics data from the Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Mapping Consortium (REMC) projects, I proposed several machine learning predictive models, each with a different focus in annotating genomic and epigenomic variants: (1) EWASplus, an ensemble learning-based framework for predicting the risk of DNA methylation loci associated with Alzheimer’s Disease; (2) CASAVA (Disease Category-specific Annotation of Variants), a disease category-specific risk annotation for genome-wide SNPs (single nucleotide polymorphisms); and (3) DRAFT (Disease Risk Annotation with Few shoTs learning), an end-to-end deep learning-based approach that incorporates contrastive learning to address the scarcity of known risk variants, which hinders the application of traditional deep learning models in this field.

Table of Contents

Chapter 1 Introduction

1.1 Background

1.2 Outline of the dissertation

Chapter 2 EWASplus: An ensemble learning approach for the risk prediction of Alzheimer’s Disease associated CpGs

2.1 Introduction

2.2 Background

2.3 Results

2.3.1 EWASplus overview

2.3.2 EWASplus performance compared to methylation array

2.3.3 EWASplus performance for off-array CpGs

2.3.4 Comparison with a competing method

2.3.5 Experimental validation of EWASplus predictions

2.3.6 EWASplus performance on multiple cohorts

2.3.7 Biological insights into AD

2.4 Methods

2.4.1 Cohorts

2.4.2 Sample preparation and differential DNAm CpGs identification

2.4.3 Training sets selection

2.4.4 Base classifiers

2.4.5 Feature selection

2.4.6 Hyperparameter tuning and ensemble model

2.4.7 Performance evaluation metrics

2.4.8 Binomial test for enrichment of protein kinases

2.4.9 Log-scale rank score (LRS) for prioritizing AD-associated loci

2.4.10 Loci selection for targeted bisulfite sequencing

2.4.11 Adaption of Zhang et al. for comparison with EWASplus

2.4.12 Targeted bisulfite sequencing

2.4.13 Protein-protein interaction and pathway analyses

2.5 Discussion

Chapter 3 CASAVA: A disease category-specific annotation of variants using an ensemble learning framework

3.1 Introduction

3.2 Methods

3.2.1 Risk variants for diseases and disease categories

3.2.2 Constructing control sets of benign variants

3.2.3 Processing sequencing features

3.2.4 Ensemble learning for class imbalance problem

3.2.5 Genomic properties of CASAVA score

3.2.6 Applying CASAVA to disease-specific risk prediction

3.2.7 Applying transfer learning to disease-specific risk prediction

3.2.8 Comparison with commonly used scoring methods

3.2.9 Performance evaluation

3.2.10 Case study for immune system diseases

3.2.11 Exploring informative features in CASAVA

3.3 Results

3.3.1 Overview of CASAVA

3.3.2 Disease categories

3.3.3 Predicting disease category-specific risk variants

3.3.4 Disease category-specificity in CASAVA scores

3.3.5 Benefits of using various ensemble learning techniques

3.3.6 Contributions from different group of features

3.3.7 Genome-wide pattern of CASAVA scores

3.3.8 Results on testing sets

3.3.9 Utility of CASAVA scores on disease-specific risk prediction

3.3.10 Applying transfer learning to improve disease-specific risk prediction

3.3.11 Case study: MHC2TA and IKZF1 for immune system diseases

3.3.12 Informative features in CASAVA

3.4 Discussion

Chapter 4 DRAFT: Disease Risk Annotation with Few shoTs learning

4.1 Introduction

4.2 Methods

4.2.1 Data collection and preprocessing

4.2.2 Triplet Loss and Lifted Structured Loss

4.2.3 Implementation details

4.3 Results

4.3.1 Evaluation and performance comparison

4.3.2 Conclusion

Chapter 5 Future Works

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English