Deep Models for Gene Regulation
Open Access

Denas, Olgert (2014)

Permanent URL: https://etd.library.emory.edu/concern/etds/tt44pn11b?locale=en
Published

Abstract

The recent increase in the production pace of functional genomics data has created new opportunities in understanding regulation. Advances range from the identification of new regulatory elements to gene expression prediction from genomic and epigenomic features. At the same time, this data-rich environment has raised challenges in retrieving and interpreting information contained therein.

Based on recent algorithmic developments, deep artificial neural networks (ANNs) have been used to build representations of the input that preserve only the information needed for the task at hand. Prediction models based on these representations have achieved excellent results in machine learning competitions. The deep learning paradigm describes how to build these representations and train the prediction models in a single learning exercise.

In this work, we propose ANNs as tools for modeling gene regulation, together with a novel technique for interpreting what the model has learned.

We implement software for the design of ANNs and for training practices over functional genomics data. As a proof of concept, we use our software to model differential gene expression during cell differentiation. To show the versatility of ANNs, we train a regression model on measurements of protein-DNA interaction to predict gene expression levels.

Typically, input feature extraction from a trained ANN is formulated as an optimization problem whose solution is slow to obtain and not unique. We propose a new, efficient feature extraction technique for classification problems that provides guarantees on the class probability of the features and their norm. We apply this technique to identify features associated with differential gene expression that agree with previous empirical studies.
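
To make the conventional formulation concrete, the following is a minimal sketch of optimization-based feature extraction: projected gradient ascent on the input so that the class probability rises while the input norm stays bounded. It is not the technique proposed in the dissertation. The one-layer logistic model, the weights `w` and `b`, and the helper names (`class_probability`, `project_to_ball`, `extract_feature`) are all hypothetical stand-ins for a trained ANN. For this simple model the optimum happens to be unique; the slowness and non-uniqueness noted above arise with deep, non-convex networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def class_probability(x, w, b):
    """P(class = 1 | x) for a one-layer logistic model (hypothetical stand-in for a trained ANN)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def project_to_ball(x, max_norm):
    """Rescale x so that its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= max_norm or norm == 0.0:
        return x
    return [xi * max_norm / norm for xi in x]

def extract_feature(w, b, x0, max_norm=1.0, step=0.5, n_iter=200):
    """Projected gradient ascent on log P(class = 1 | x) with respect to the input x."""
    x = project_to_ball(list(x0), max_norm)
    for _ in range(n_iter):
        p = class_probability(x, w, b)
        # For the logistic model, d/dx log p = (1 - p) * w.
        grad = [(1.0 - p) * wi for wi in w]
        x = project_to_ball([xi + step * gi for xi, gi in zip(x, grad)], max_norm)
    return x

# Hypothetical weights; a real model would be trained on genomics data.
w = [2.0, -1.0, 0.5]
b = -0.25
feature_a = extract_feature(w, b, x0=[0.0, 0.0, 0.0])
feature_b = extract_feature(w, b, x0=[0.3, 0.3, 0.3])
```

Each call needs many passes over the model, which hints at why this route becomes expensive when the classifier is a deep network and several starting points must be tried.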

Finally, we propose building representations of functional features from protein-DNA interaction measurements using a deep stack of nonlinear transformations. We show that these reduced representations are informative and can be used to label parts of the gene, regulatory elements, and quiescent regions.
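
As a rough illustration of what a stack of nonlinear transformations producing a reduced representation looks like, the sketch below passes a vector of signal values for one genomic region through two sigmoid layers, shrinking it from six values to two. Everything here is assumed for illustration: the layer sizes, the untrained random weights, the input values, and the helper names (`make_layer`, `apply_layer`, `encode`) are hypothetical and are not taken from the dissertation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in, n_out, rng):
    """Create one layer with random (untrained) weights and zero biases."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def apply_layer(x, layer):
    """Affine map followed by an elementwise sigmoid nonlinearity."""
    weights, biases = layer
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def encode(x, layers):
    """Apply the layers in order; the last output is the reduced representation."""
    for layer in layers:
        x = apply_layer(x, layer)
    return x

rng = random.Random(0)
layer_sizes = [6, 4, 2]  # hypothetical: 6 input signals reduced to a 2-value code
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# Hypothetical protein-DNA interaction signal values for one genomic region.
region_signal = [0.0, 1.5, 0.2, 3.1, 0.0, 0.7]
code = encode(region_signal, layers)
```

In practice the weights would be learned from data rather than drawn at random, and the resulting codes would then be compared across regions to assign labels.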

While widely successful, deep ANNs are considered to be hard to use and interpret. We hope that this work will help increase the adoption of such models in the genomics community.

Table of Contents

1 Introduction
1.1 Introduction
1.2 Summary of remaining chapters
1.3 Contributions of this thesis
2 Background
2.1 Importance of regulation
2.2 Mechanisms of gene regulation
2.2.1 Transcriptional regulation
2.2.2 Post-transcriptional regulation
2.2.3 Epigenetic regulation
2.3 Computational methods for TF binding analysis
2.3.1 From HGP to ENCODE
2.3.2 Next Generation Sequencing
2.3.3 From read counts to signal
2.4 Artificial neural networks
2.4.1 Introduction
2.4.2 Feed forward networks
2.4.3 Convolutional Neural Networks
2.4.4 Modern ANNs and Representation learning
3 Feature extraction
3.1 Introduction
3.2 Relevant feature extraction from ANNs
3.3 Convex optimization based method
3.4 Conclusion
4 Deep models for regulation
4.1 Introduction
4.2 Differential gene expression modeling
4.2.1 The G1E biological model and data
4.2.2 Feature extraction
4.3 Gene expression prediction from TFos
4.3.1 Data
4.3.2 Regression model
4.4 Conclusion
5 Unsupervised modeling of functional genomics data
5.1 Introduction
5.2 Deep representations of the genome
5.2.1 Experimental setting
5.2.2 Data and Model
5.3 TF composition analysis of timing replication domains
5.3.1 Introduction
5.3.2 Data and Model
5.3.3 Results
5.4 Conclusion
Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English