Applying Diploid Method to Improve Read-mapping and Analysis Based on NGS Data

Open Access

Yuan, Shuai (2014)

Permanent URL: https://etd.library.emory.edu/concern/etds/pk02c984x?locale=en
Published

Abstract

Next generation sequencing (NGS) technologies have been applied extensively in genetics and genomics research. A fundamental problem in analyzing NGS data is accurately mapping short sequencing reads back to the reference genome. This issue affects the interpretation and downstream analysis of NGS experiments. Although many read mapping algorithms and software packages have been developed, most of them use the universal reference genome as a scaffold and do not automatically account for genetic variants. Ignoring genetic variant information leaves a proportion of reads unmapped or incorrectly mapped, which affects calculation, interpretation, and analysis in many studies. The resulting problems include significant bias when detecting allele-specific expression (ASE) from RNA sequencing data, low genotype calling accuracy, and a low single nucleotide polymorphism (SNP) discovery rate. Given that genetic variants are ubiquitous, it would be highly desirable to factor them into the read mapping procedure.

In our study, we developed a method that produces a personalized diploid reference genome for an individual from all of that individual's known genetic variants. We show that using such a personalized diploid reference genome with existing mapping software can improve mapping accuracy and significantly reduce the bias toward the reference allele in ASE analysis.
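The idea of a personalized diploid reference can be illustrated with a minimal sketch. It is not the dissertation's implementation (which is written in C++); it assumes phased SNPs only, ignores indels and the coordinate conversion they would require, and every name in it (`PhasedSNP`, `build_diploid_reference`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class PhasedSNP:
    """A phased single-nucleotide variant (hypothetical record layout)."""
    pos: int         # 0-based position on the reference sequence
    ref: str         # base expected in the universal reference
    alt: str         # alternate base
    hap1_alt: bool   # True if haplotype 1 carries the alternate allele
    hap2_alt: bool   # True if haplotype 2 carries the alternate allele


def build_diploid_reference(reference: str,
                            snps: Iterable[PhasedSNP]) -> Tuple[str, str]:
    """Return two haplotype sequences with the individual's SNPs applied.

    SNP-only sketch: both haplotypes keep the length of the reference,
    so no coordinate mapping is needed afterwards.
    """
    hap1 = list(reference)
    hap2 = list(reference)
    for snp in snps:
        # Guard against variants that do not match the reference build.
        if reference[snp.pos] != snp.ref:
            raise ValueError(
                f"reference base mismatch at position {snp.pos}: "
                f"expected {snp.ref}, found {reference[snp.pos]}"
            )
        if snp.hap1_alt:
            hap1[snp.pos] = snp.alt
        if snp.hap2_alt:
            hap2[snp.pos] = snp.alt
    return "".join(hap1), "".join(hap2)


# Toy example: one heterozygous SNP and one homozygous-alternate SNP.
toy_reference = "ACGTACGT"
toy_snps = [
    PhasedSNP(pos=2, ref="G", alt="A", hap1_alt=True, hap2_alt=False),
    PhasedSNP(pos=5, ref="C", alt="T", hap1_alt=True, hap2_alt=True),
]
haplotype1, haplotype2 = build_diploid_reference(toy_reference, toy_snps)
```

In this toy case a read carrying the alternate base at the heterozygous site matches `haplotype1` exactly, rather than aligning to the universal reference with a mismatch, which is the intuition behind the reduced reference-allele bias described above.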

Our studies on real data indicate that combining genotype imputation with reference genome personalization further improves the read mapping rate, genotype calling, and SNP discovery. Because many whole genome sequencing (WGS) studies are conducted on cohorts that have previously been genotyped using array-based genotyping platforms, we believe the strategy introduced here will be of high practical value to investigators working on WGS.

Our open source software is implemented as standalone C++ code and has been integrated into Galaxy, a data-intensive biomedical research platform, for pipeline visualization and better usability.

Table of Contents

1 Introduction

1.1 Next Generation Sequencing

1.2 Read mapping tools

1.3 Motivation

2 Read-mapping using personalized diploid reference genome for RNA sequencing data reduces ASE bias

2.1 Introduction

2.2 Methods

2.2.1 Constructing personalized, diploid reference genome

2.2.2 Reads mapping

2.2.3 An alternative method for reducing ASE bias

2.2.4 Simulation studies

2.2.5 Real data studies

2.3 Results

2.3.1 ASE bias in simulation studies

2.3.2 Mapping accuracy

2.3.3 Real data analysis on ASE bias and mapped reads

2.3.4 More real data results

2.4 Discussion

3 Using personalized diploid reference genome to improve read mapping and genotype calling in DNA sequencing studies

3.1 Introduction

3.2 Material and Methods

3.2.1 RefEdit+ Pipeline

3.2.2 Competing read mapping strategies

3.3 Results

3.3.1 An example

3.3.2 Performance comparison study design

3.3.3 Study samples

3.3.4 Genotypes from genotyping arrays

3.3.5 Genotype summary from genotyping array and imputation

3.3.6 Read mapping rate

3.3.7 Genotype calling consistency

3.3.8 Mendelian inconsistency

3.3.9 SNP identification

3.4 Discussion

3.5 Web Resources

4 RefEditor-Galaxy, a Galaxy tool for enhancing read mapping as part of bioinformatics workflows

4.1 Introduction

4.1.1 RefEditor

4.1.2 Galaxy

4.2 RefEditor-Galaxy

4.2.1 toolconf.xml

4.2.2 vcf2genotypes.*

4.2.3 DiploidConstructor.*

4.2.4 MappingConverter.*

4.2.5 test-data

4.3 Installation

4.4 Use Case

4.5 Discussion

5 Conclusion

5.1 Future Work

5.1.1 Dynamic reference genome

5.1.2 More formats support

5.1.3 More studies

5.2 Summary

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English