Macro-scale genomic studies of bacterial pathogens

Open Access

Petit, Robert (Summer 2018)

The low cost of genome sequencing has led to a significant increase in publicly available datasets of bacterial pathogens. Taking advantage of this data requires new strategies for using computational resources and bioinformatics, as well as applying traditional organism-specific knowledge. With this understanding, I used public datasets to investigate two important bacterial pathogens: Bacillus anthracis and Staphylococcus aureus.

In my first research project, I focused on Bacillus anthracis, the etiologic agent of anthrax, which shares over 99% average nucleotide identity with Bacillus cereus Group (BCerG) bacteria. This closeness, coupled with sequencing error rates, can cause B. cereus to be falsely identified as B. anthracis. To address this issue, I developed a typing schema for fine-scale differentiation of these two species. I identified a set of 31-mers specific to B. anthracis and another set specific to all BCerG, including B. anthracis. I determined the limits of detection of these k-mers on synthetic data and developed a model to predict the presence of true B. anthracis sequences. I then reanalyzed a New York subway metagenome dataset in which evidence for B. anthracis had been falsely identified. I found no evidence for B. anthracis, but instead found evidence for the presence of unsampled close relatives of B. anthracis.
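The idea of a species-specific k-mer set can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the function names (`canonical_kmers`, `specific_kmers`, `count_hits`) are hypothetical, and the sketch assumes that "specific" means conserved in every target genome and absent from every background genome, with k-mers stored in canonical (strand-independent) form.

```python
from typing import Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int = 31) -> Set[str]:
    """Collect canonical k-mers: the lexicographically smaller of each
    k-mer and its reverse complement, so both strands map to one key."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def specific_kmers(targets: Iterable[str],
                   background: Iterable[str],
                   k: int = 31) -> Set[str]:
    """k-mers present in every target sequence and in no background
    sequence (an assumed definition of 'specific' for this sketch)."""
    target_sets = [canonical_kmers(s, k) for s in targets]
    if not target_sets:
        return set()
    conserved = set.intersection(*target_sets)
    seen_elsewhere: Set[str] = set()
    for s in background:
        seen_elsewhere |= canonical_kmers(s, k)
    return conserved - seen_elsewhere


def count_hits(read: str, signature: Set[str], k: int = 31) -> int:
    """Number of distinct signature k-mers found in a single read."""
    return len(canonical_kmers(read, k) & signature)


if __name__ == "__main__":
    # Toy sequences with k=3 purely for illustration; real use would be k=31.
    sig = specific_kmers(["ACGTTG", "TTACGTTG"], ["ACGAAA"], k=3)
    print(sorted(sig), count_hits("GTTG", sig, k=3))
```

In this framing, the B. anthracis-specific set would use B. anthracis genomes as targets and other BCerG genomes as background, while the BCerG-specific set would use all BCerG genomes as targets and non-BCerG genomes as background. Because a single sequencing error in a background read can create a spurious match, hit counts on real data would still need a detection model of the kind the abstract describes.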

My second project concerned Staphylococcus aureus, a major antibiotic-resistant pathogen responsible for a wide spectrum of hospital- and community-associated infections. S. aureus was well represented in genome sequencing studies submitted to public repositories, but there were no tools available to make use of this data. To fill this void, I developed Staphopia, an analysis pipeline, database, and application programming interface focused on S. aureus, and processed over 44,000 publicly available S. aureus genomes. I found patterns in antibiotic resistance across S. aureus sequence types and a bias towards sequencing clinically relevant methicillin-resistant S. aureus strains.

I conclude with a discussion of future macro-scale comparative genomic studies consisting of tens of thousands of genomes, and I comment on the expected rewards and challenges associated with such studies. Overall, this body of work illustrates the importance of public datasets for bacterial pathogens and of integrating organism-specific knowledge into bacterial sequence analyses.

Table of Contents

Chapter 1: Introduction
  Bacterial sequence analysis step by step
    Sequence Quality Control
    Genome assembly
    Genome annotation
    Genotyping bacteria based on genome sequence
    Identifying Variation
    Antimicrobial Resistance and Virulence Factors
  Comparative genomic analyses
    Phylogenetics
    Pan-genome
    Genome wide association studies
  A deluge of bacterial sequences
    A brief history of DNA sequencing technologies
    Affordable high-throughput sequencing
    New opportunities in existing data
  Outline for this dissertation
  Appendix

Chapter 2: Fine-scale differentiation between Bacillus anthracis and Bacillus cereus group signatures in metagenome shotgun data
  Abstract
  Introduction
  Methods
    Metagenome data and reference genome sequences
    Mapping metagenome data to B. anthracis plasmids and chromosomes
    Custom 31-mer assay for B. anthracis and Bacillus cereus Group
    Finding the limits for lethal factor-based detection of B. anthracis
    Assessing Quality of B. anthracis and B. cereus Group specific 31-mers
    Prediction of low coverage B. anthracis chromosome in shotgun sequencing datasets
  Results
    NY subway metagenome sequences map to core regions of B. anthracis and B. cereus chromosome and plasmids but not to lethal factor gene
    B. anthracis genome coverage below 0.18x is a “gray area” for detection, where lethal toxin genes may not be sampled
    Conserved and specific 31-mer sets for B. anthracis and BCerG chromosomes
    High background levels of B. cereus strains produce false positive B. anthracis specific k-mers due to random sequence errors
    A “specialist” model to interpret patterns of B. anthracis genetic signatures in metagenome samples
  Discussion
  Conclusions
  Acknowledgements
  Funding
  Appendix

Chapter 3: Staphylococcus aureus viewed from the perspective of 40,000+ genomes
  Abstract
  Introduction
  Materials & Methods
    Staphopia Analysis Pipeline
    Web Application, Relational Database and Application Programming Interface
    Processing Public Data
    Metadata Collection
    Creating non-redundant S. aureus diversity set
  Results
    Design of the Staphopia Analysis Pipeline and processing 43,000+ genomes
    Sequence and assembly quality trends
    Genetic diversity measured by MLST
    Antibiotic resistance genes
    Publication, metadata and strain geographic distribution
    A non-redundant S. aureus diversity set
  Discussion
  Conclusions
  Links
  Appendix

Chapter 4: The influence of horizontal gene transfer barriers on Staphylococcus aureus and the potential of gene transfer networks to identify novel barriers.
  Abstract
  Introduction
  MRSA - a case where human action can break down barriers to HGT
  VRSA - a case where barriers to HGT can have great public health consequences
  Using high-throughput DNA sequencing to build gene transfer networks
  Using gene transfer networks to predict and monitor future spread of antibiotic resistance
  Conclusions
  Appendix

Chapter 5: Summary and Future Directions
  Summary
  Future Directions: Macro-scale bacterial genomics
  Rewards of macro-scale genomics
    Statistical power
    A better overview of a species
    Rational sampling
  Challenges of macro-scale genomics
    Imperfect data
    Evolving sequencing technologies
    Data management and distribution
    Scalability
  Emerging macro-scale genomic projects
  Final remarks

Appendix: Other Published Work

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English
