Novel Statistical Methods for Analyzing Next Generation Sequencing Data

Open Access

Liao, Peizhou (2017)

Permanent URL: https://etd.library.emory.edu/concern/etds/8p58pd847?locale=en
Published

Abstract

The recent advancement of next-generation sequencing (NGS) technologies and the rapid reduction of sequencing costs have led to extensive use of sequencing data in disease association studies and population genetic studies. NGS data pose new challenges for statistical analysis, including genotype calling, inference of population structure, and the design of sequencing studies. In this dissertation, we propose novel statistical methods for analyzing NGS data that address these challenges.

A fundamental challenge in analyzing NGS data is to determine an individual's genotype correctly, as the accuracy of the inferred genotype is essential to downstream analyses. To improve the accuracy of called genotypes, in the first project, we propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The approach, which we call PhredEM, uses the expectation-maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. It also includes a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that the EM algorithm needs to be applied only to the remaining loci. PhredEM can also be combined with a linkage-disequilibrium-based method such as Beagle as a refinement step to further improve genotype calling. We demonstrate the advantages of PhredEM over existing methods using both simulated data and real sequencing data from the UK10K project and the 1000 Genomes project.

Inferring population structure is important for both population genetics and genetic epidemiology. Principal components analysis (PCA) has been effective in ascertaining population structure with array genotype data but can yield biased conclusions when used with NGS data whose sequencing properties differ systematically across groups of samples. To allow robust inference on population structure using PCA, in the second project, we provide an approach that uses sequencing reads directly without calling genotypes. Our approach is to adjust the data from different sequencing groups to have the same read depth and error rate so that PCA does not generate spurious components representing sequencing quality. To accomplish this, we have developed a subsampling procedure to match the depth distributions in different sequencing groups, and a read-flipping procedure to match the error rates. We average over subsamples and read flips to minimize loss of information. We demonstrate the utility of our approach using two datasets from 1000 Genomes, and further evaluate it using simulation studies.

We have recently developed TASER, an association test of rare variants with NGS data that allows systematic differences in sequencing quality (e.g., depth and sequencing error rate) between cases and controls. However, the optimal design of a case-control study that must trade off the number of samples against the depth of coverage is unknown. In the third project, we conducted simulation studies to evaluate how the sequencing effort should best be allocated between sample size and depth based on factors including ancestry, sequencing error rate, and disease risk model. We found that the best power was generally achieved by sequencing as many samples as possible (while decreasing depth if necessary). We noted, however, that when the sequencing platform had a very high error rate (e.g., 0.5%) and rarer variants incurred higher risks, the best power was achieved with a medium depth (e.g., 10x).
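The likelihood-based calling idea from the first project can be illustrated with a small sketch. This is not PhredEM itself: it assumes a biallelic locus, a symmetric error model, and per-base error rates taken directly from the standard phred definition (Q = -10 log10 p) instead of being estimated through a logistic regression. The function names are hypothetical, and the screening and LD-refinement steps are omitted.

```python
import math

def phred_to_error(q):
    """Nominal error probability from a phred score: Q = -10 * log10(p)."""
    return 10.0 ** (-q / 10.0)

def genotype_log_likelihoods(is_alt, quals):
    """Log-likelihood of one sample's reads for genotypes with 0, 1, 2 alt alleles.

    is_alt[i] is True if read i shows the alternate allele; quals[i] is its
    phred score (assumed > 0 so that the error probability is below 1).
    """
    logliks = []
    for g in (0, 1, 2):
        ll = 0.0
        for alt, q in zip(is_alt, quals):
            e = phred_to_error(q)
            # Assumed symmetric biallelic error model:
            # P(read shows alt | g) = (g/2)(1 - e) + (1 - g/2) e
            p_alt = (g / 2.0) * (1.0 - e) + (1.0 - g / 2.0) * e
            ll += math.log(p_alt if alt else 1.0 - p_alt)
        logliks.append(ll)
    return logliks

def em_genotype_frequencies(sample_logliks, n_iter=100):
    """EM over genotype frequencies at one locus, with error rates held fixed."""
    freqs = [1.0 / 3.0] * 3
    post = []
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities for each sample
        post = []
        for ll in sample_logliks:
            m = max(ll)  # subtract the max for numerical stability
            w = [f * math.exp(l - m) for f, l in zip(freqs, ll)]
            s = sum(w)
            post.append([x / s for x in w])
        # M-step: frequencies are the average posterior probabilities
        n = len(post)
        freqs = [sum(p[g] for p in post) / n for g in range(3)]
    return freqs, post

def call_genotypes(samples, n_iter=100):
    """Call the genotype with the highest posterior for each sample.

    samples is a list of (is_alt, quals) pairs, one per individual.
    """
    logliks = [genotype_log_likelihoods(a, q) for a, q in samples]
    freqs, post = em_genotype_frequencies(logliks, n_iter)
    calls = [max(range(3), key=lambda g: p[g]) for p in post]
    return freqs, calls
```

In this sketch the population genotype frequencies act as a prior that is re-estimated from all samples, so individuals with few reads borrow information from the rest of the sample. Replacing the fixed phred-derived error rates with rates estimated jointly in the M-step is the part of the approach that this illustration leaves out.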

Table of Contents

1 Introduction

1.1 Overview

1.2 Research Topics

1.2.1 Genotype calling

1.2.2 Inference of population structure

1.2.3 Sequencing design for rare variant association studies

1.3 Literature Review

1.3.1 Methods for calling genotypes

1.3.2 Methods for inferring population structure

1.3.3 Design of NGS studies for testing rare variant associations

1.4 Outline

2 PhredEM: A Phred-Score-Informed Genotype-Calling Approach for Next-Generation Sequencing Studies

2.1 Methods

2.1.1 PhredEM

2.1.2 Screening algorithm

2.1.3 PhredEM with LD refinement

2.2 Simulation Study

2.3 Application to the UK10K SCOOP Data

2.4 Application to the 1000 Genomes CEU Data

2.5 Discussion

2.6 Appendix

2.6.1 EM algorithm

2.6.2 Proof of concavity of pl*()

2.7 Supplemental Materials

3 Robust Inference of Population Structure from Next-Generation Sequencing Data with Systematic Differences in Sequencing

3.1 Methods

3.1.1 Estimating the per-base error rate

3.1.2 Pruning SNPs and picking ancestry informative markers

3.1.3 Handling systematic differences in sequencing

3.1.4 Application to stratified and admixed populations from 1000 Genomes

3.1.5 Simulation design

3.2 Results

3.2.1 Inference on a stratified population from 1000 Genomes

3.2.2 Inference on an admixed population from 1000 Genomes

3.2.3 Simulation studies

3.3 Discussion

3.4 Appendix

3.4.1 Matching the read distributions in different sequencing groups when sample sizes of the sequencing groups differ

3.4.2 Choosing the read-flipping probability

3.4.3 Sampling MAFs for three populations

3.4.4 Simulating read count data

3.4.5 A simulation study assuming three groups with differential sequencing qualities

3.5 Supplemental Materials

4 Optimal Design of Next-Generation Sequencing Studies for Testing Rare Variant Associations

4.1 Methods

4.1.1 Omnibus TASER

4.1.2 Simulation design

4.1.2.1 Generating European and African haplotypes

4.1.2.2 Generating individual genotypes and phenotypes

4.1.2.3 Generating sequencing read count data

4.1.2.4 Sequencing designs

4.2 Results

4.3 Discussion

4.4 Appendix

4.4.1 Simulating read count data

4.5 Supplemental Materials

5 Summary and Future Work

5.1 Summary

5.2 Future Work

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English