Statistical and informatics methods for analyzing next generation sequencing data

Open Access

Chen, Li (2017)

Permanent URL: https://etd.library.emory.edu/concern/etds/cj82k799m?locale=en

Published

Abstract

In the era of genomic big data, there is a pressing need for statistical and informatics methods to analyze large datasets. The integrative analysis of datasets generated from different sources or under different biological conditions is of particular interest. First, we develop ChIPComp, a statistical method for quantitative comparison of multiple ChIP-seq datasets across biological conditions. ChIPComp detects genomic regions showing differential protein binding or histone modification by accounting for data from control experiments, signal-to-noise ratios, biological variation, and multiple-factor experimental designs in a linear model framework. Simulations and real data analyses demonstrate that ChIPComp provides more accurate and robust results than existing methods. Second, drawing on tens of thousands of cataloged trait-associated GWAS SNPs, we present traseR, a computational tool that explores the collection of trait-associated SNPs to indicate whether a given genomic interval or set of intervals is likely to be functionally connected with certain phenotypes or diseases. Real data results indicate that traseR offers a turnkey solution for enrichment analysis of trait-associated SNPs. Finally, beyond analyzing datasets from a single source (GWAS or epigenomics), we perform a joint analysis of multiple data sources by annotating GWAS SNPs with thousands of genomic and epigenomic datasets and building DIVAN, a data-driven machine learning approach that aims to identify disease-specific non-coding risk variants on a genome-wide scale. DIVAN helps clarify the cryptic link between non-coding sequence variants and the pathophysiology of complex diseases and phenotypes. Because it is disease-specific, DIVAN proves more powerful than competing methods at identifying disease-specific non-coding risk variants.

Table of Contents

1. Introduction

1.1 Background

1.2 Outline of the dissertation

2. ChIPComp: A novel statistical method for quantitative comparison of multiple ChIP-seq datasets

2.1 Introduction

2.2 Methods

2.2.1 The data model

2.2.2 Estimate the background signal from control data

2.2.3 Model the IP-background relationship

2.2.4 The final model

2.2.5 The procedures for quantitative comparison

2.3 Results

2.3.1 Data description

2.3.2 Simulation

2.3.3 Implementation

2.3.4 Real data results

2.4 Discussion

2.5 Appendix

3. DIVAN: accurate identification of non-coding disease-specific risk variants using multi-omics profiles

3.1 Introduction

3.2 Methods

3.2.1 Software and data package availability

3.2.2 Data sources

3.2.3 Feature selection-based ensemble-learning framework

3.2.4 Competing methods

3.3 Results

3.3.1 Overview of the DIVAN approach

3.3.2 Characteristics of epigenomic profiles across risk variants

3.3.3 Disease-specific variant prioritization evaluation using cross-validation

3.3.4 Disease-class variant prioritization

3.3.5 Applying DIVAN to disease-specific variants in the GRASP database

3.3.6 Applying DIVAN to regulatory variants in the HGMD database

3.3.7 Applying DIVAN on synonymous mutations

3.3.8 Exploration and interpretation of features

3.3.9 Additional tests on more settings of DIVAN

3.4 Discussion

3.5 Appendix

3.5.1 Availability of data and material

3.5.2 Supplementary figures

3.5.3 Supplementary tables

4. traseR: an R package for performing trait-associated SNP enrichment analysis in genomic intervals

4.1 Introduction

4.2 Method

4.2.1 Background SNPs

4.2.2 Enrichment tests

4.2.3 Linkage disequilibrium

4.3 Results

4.3.1 SNP collection

4.3.2 Real data analyses

4.3.3 Computational time

5. Conclusion and future work

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Computer Science and Informatics
Degree	PhD
Submission	Dissertation
Language	English
Research Field	Computer Science; Biology, Biostatistics; Biology, Bioinformatics
Keyword	variants; software; ChIP-seq; statistics; machine learning; bioinformatics
Committee Chair / Thesis Advisor	Qin, Zhaohui, Emory University; Wu, Hao, Emory University
Committee Members	Jin, Peng, Emory University; Cooper, Lee, Emory University

Last modified

Primary PDF

Title	Date Uploaded
Statistical and informatics methods for analyzing next generation sequencing data	2018-08-28 15:32:47 -0400
