Statistical and Machine Learning Methods in the Studies of Epigenetics Regulation
Open Access

Xu, Tianlei (Spring 2018)

Rapid development of next-generation sequencing technologies has produced a plethora of large-scale epigenome profiling data. Despite the quantity of available epigenome datasets, obtaining a clear and comprehensive picture of the underlying regulatory network remains a challenge. Cell-type heterogeneity and temporal changes in the epigenome make it impossible to assay all epigenomic events in every cell type. Computational models are well suited to capturing intrinsic correlations among epigenetic features and to adaptively predicting epigenomic marks in a dynamic scenario. Current progress in machine learning provides opportunities to uncover higher-level patterns of epigenome interactions and to integrate regulatory signals from different resources. My work aims to use public data resources to characterize, predict, and understand epigenome-wide regulatory relationships. The first part of my work is a novel computational model that predicts in vivo transcription factor (TF) binding from base-pair-resolution methylation data. The model combines cell-type-specific methylation patterns with static genomic features, and accurately predicts binding sites of a variety of TFs across diverse cell types. The second part is a computational framework that integrates sequence, gene expression, and epigenome data for genome-wide TF binding prediction. This extended supervised framework integrates motif features, context-specific gene expression, and chromatin accessibility profiles across multiple cell types, and scales up the TF prediction task beyond candidate sites restricted to a limited set of known motifs. The third part is a novel computational strategy for functional annotation of non-coding genomic regions. It takes advantage of newly emerged genome-wide, tissue-specific expression quantitative trait loci (eQTL) information to annotate a set of genomic intervals in terms of transcriptional regulation. This method builds a bridge connecting genomic intervals with biological pathways and pre-defined, biologically meaningful gene sets. Tissue specificity analysis provides additional evidence of the distinct roles of different tissues in disease mechanisms.

Table of Contents

Chapter 1 Epigenomic feature prediction from high-throughput data

1.1 Introduction

1.1.1 Epigenomic research and high-throughput data

1.1.2 Epigenomic features in gene regulation

1.1.3 Prediction methods

1.2 Prediction of protein binding

1.2.1 Rationale for transcription factor binding prediction using epigenetic profiles

1.2.2 Prediction methods using histone modifications

1.2.3 Prediction methods using chromatin accessibility

1.2.4 Other methods for TFBS prediction

1.3 Prediction of enhancer

1.3.1 Diversity of definition of Enhancer

1.3.2 Challenges of enhancer prediction

1.3.3 Tools for enhancer prediction

1.4 Prediction of DNA methylation

1.5 Prediction of spatial chromatin structure

1.6 Prediction of gene expression

1.7 Discussion

Chapter 2 Base-resolution methylation patterns accurately predict transcription factor bindings in vivo

2.1 Introduction

2.2 Material and Methods

2.2.1 Description of the Methylphet method

2.2.2 Methylation models

2.2.3 Other genomic features

2.2.4 Prediction

2.2.5 Data and processing

2.2.6 Data Access

2.3 Result

2.3.1 TF binding prediction results

2.3.2 Cross-sample TF binding prediction results

2.3.3 Cross-TF prediction results

2.3.4 Experimental validation in mouse dentate gyrus (DG) cells

2.3.5 Contribution of different features in Methylphet

2.3.6 Comparison with other predicting tools and other machine learning methods

2.3.7 Description of the software

2.4 Discussion

Chapter 3 Multi-layer Ensemble Learning Model Accurately Predict Transcript Factor Binding Sites Using DNase-seq and RNA-seq Data

3.1 Introduction

3.2 Methods

3.2.1 Feature Engineering of models

3.2.2 Multiple Layer bagging random forest

3.3 Discussion

Chapter 4 Regulatory annotation of genomic intervals based on tissue-specific expression QTLs

4.1 Introduction

4.2 Result

4.2.1 Overview of loci2path

4.2.2 GTEx eQTL data from 44 tissues

4.2.3 MSigDB Pathways

4.2.4 Query regions from immunoBase

4.2.5 Tissue specificity captures distinct modules of pathogenesis in Psoriasis

4.2.6 Shared risk pathways among 12 core Immune Disease

4.2.7 Software availability

4.3 Discussion

4.4 Method

4.4.1 Enrichment measurement

4.4.2 Assessment of tissue specificity

4.4.3 Tissue Specificity measured by average tissue number

4.4.4 Output

4.4.5 Tissue Enrichment test of query regions

4.4.6 Multiple-test correction using adjusted p-value

4.4.7 Datasets


About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
  • English