Macro-scale genomic studies of bacterial pathogens Open Access
Petit, Robert (Summer 2018)
Abstract
The low cost of genome sequencing has led to a significant increase in publicly available datasets of bacterial pathogens. Taking advantage of these data requires new strategies for using computational resources and bioinformatics, as well as applying traditional organism-specific knowledge. With this understanding, I used public datasets to investigate two important bacterial pathogens: Bacillus anthracis and Staphylococcus aureus.
In my first research project, I focused on Bacillus anthracis, the etiologic agent of anthrax, which shares over 99% average nucleotide identity with other Bacillus cereus Group (BCerG) bacteria. This similarity, coupled with sequencing error rates, can cause B. cereus to be misidentified as B. anthracis. To address this issue, I developed a typing scheme for fine-scale differentiation of these two species. I identified one set of 31-mers specific to B. anthracis and another set specific to all BCerG, including B. anthracis. I determined the limits of detection of these k-mers on synthetic data and developed a model to predict the presence of true B. anthracis sequences. I then reanalyzed a New York subway metagenome dataset in which evidence of B. anthracis had been falsely reported. I found no evidence of anthrax; instead, the data indicated the presence of unsampled close relatives of B. anthracis.
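The idea of a species-specific k-mer set can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names (`canonical`, `kmers`, `specific_kmers`) are hypothetical, and the approach shown (keep k-mers conserved in every target genome, then discard any seen in a background genome) is an assumption about how such a set could be built. Genome-scale data would need a dedicated k-mer counter rather than in-memory Python sets.

```python
from typing import Iterable, Set

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq: str, k: int) -> Set[str]:
    """Collect canonical k-mers from a sequence, skipping windows with ambiguous bases."""
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            found.add(canonical(window))
    return found


def specific_kmers(targets: Iterable[str], background: Iterable[str], k: int = 31) -> Set[str]:
    """K-mers present in every target sequence and absent from every background sequence."""
    target_sets = [kmers(seq, k) for seq in targets]
    if not target_sets:
        return set()
    conserved = set.intersection(*target_sets)
    for seq in background:
        conserved -= kmers(seq, k)
    return conserved


if __name__ == "__main__":
    # Toy sequences with k=4 so the result can be checked by hand.
    print(specific_kmers(["AAACCC", "TAAACCG"], ["GGTTAA"], k=4))
```

In the toy call, AAAC and AACC are conserved in both targets, but AACC is removed because the background contains its reverse complement (GGTT). Using canonical k-mers keeps the comparison independent of strand, which matters for shotgun reads.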
My second project concerned Staphylococcus aureus, a major antibiotic-resistant pathogen responsible for a wide spectrum of hospital and community-associated infections. S. aureus was well represented in genome sequencing studies submitted to public repositories, but no tools were available to make use of these data. To fill this gap, I developed Staphopia, an analysis pipeline, database, and application programming interface focused on S. aureus, and processed over 44,000 publicly available S. aureus genomes. I found patterns in antibiotic resistance among S. aureus sequence types and a bias towards sequencing clinically relevant methicillin-resistant S. aureus strains.
I conclude with a discussion of future macro-scale comparative genomic studies consisting of tens of thousands of genomes, and I comment on the expected rewards and challenges associated with such studies. Overall, this body of work illustrates the importance of public datasets for bacterial pathogens and of integrating organism-specific knowledge into bacterial sequence analyses.
Table of Contents
Chapter 1: Introduction
Bacterial sequence analysis step by step
Sequence Quality Control
Genome assembly
Genome annotation
Genotyping bacteria based on genome sequence
Identifying Variation
Antimicrobial Resistance and Virulence Factors
Comparative genomic analyses
Phylogenetics
Pan-genome
Genome wide association studies
A deluge of bacterial sequences
A brief history of DNA sequencing technologies
Affordable high-throughput sequencing
New opportunities in existing data
Outline for this dissertation
Appendix
Chapter 2: Fine-scale differentiation between Bacillus anthracis and Bacillus cereus group signatures in metagenome shotgun data
Abstract
Introduction
Methods
Metagenome data and reference genome sequences
Mapping metagenome data to B. anthracis plasmids and chromosomes
Custom 31-mer assay for B. anthracis and Bacillus cereus Group
Finding the limits for lethal factor-based detection of B. anthracis
Assessing Quality of B. anthracis and B. cereus Group specific 31-mers
Prediction of low coverage B. anthracis chromosome in shotgun sequencing datasets
Results
NY subway metagenome sequences map to core regions of B. anthracis and B. cereus chromosome and plasmids but not to lethal factor gene
B. anthracis genome coverage below 0.18x is a “gray area” for detection, where lethal toxin genes may not be sampled
Conserved and specific 31-mer sets for B. anthracis and BCerG chromosomes
High background levels of B. cereus strains produce false positive B. anthracis specific k-mers due to random sequence errors
A “specialist” model to interpret patterns of B. anthracis genetic signatures in metagenome samples
Discussion
Conclusions
Acknowledgements
Funding
Appendix
Chapter 3: Staphylococcus aureus viewed from the perspective of 40,000+ genomes
Abstract
Introduction
Materials & Methods
Staphopia Analysis Pipeline
Web Application, Relational Database and Application Programming Interface
Processing Public Data
Metadata Collection
Creating non-redundant S. aureus diversity set
Results
Design of the Staphopia Analysis Pipeline and processing 43,000+ genomes
Sequence and assembly quality trends
Genetic diversity measured by MLST
Antibiotic resistance genes
Publication, metadata and strain geographic distribution
A non-redundant S. aureus diversity set
Discussion
Conclusions
Links
Appendix
Chapter 4: The influence of horizontal gene transfer barriers on Staphylococcus aureus and the potential of gene transfer networks to identify novel barriers.
Abstract
Introduction
MRSA - a case where human action can break down barriers to HGT
VRSA - a case where barriers to HGT can have great public health consequences
Using high-throughput DNA sequencing to build gene transfer networks
Using gene transfer networks to predict and monitor future spread of antibiotic resistance
Conclusions
Appendix
Chapter 5: Summary and Future Directions
Summary
Future Directions: Macro-scale bacterial genomics
Rewards of macro-scale genomics
Statistical power
A better overview of a species
Rational sampling
Challenges of macro-scale genomics
Imperfect data
Evolving sequencing technologies
Data management and distribution
Scalability
Emerging macro-scale genomic projects
Final remarks
Appendix: Other Published Work
Bibliography
Primary PDF
Title | Date Uploaded |
---|---|
Macro-scale genomic studies of bacterial pathogens | 2018-07-09 17:23:57 -0400 |