Inferring Strains in Mixed Samples using Computational Methods

Restricted; Files Only

Vestal, Gary (Fall 2025)

Permanent URL: https://etd.library.emory.edu/concern/etds/g445cf76s?locale=en
Published

Abstract

Strain disambiguation, the inference of individual strains in multi-strain genomic samples, offers a promising but underutilized methodology for analyzing complex infection data. This dissertation investigates whether strain disambiguation can be adapted into a template-free framework that scales with the number of strains and addresses common challenges in infectious disease epidemiology. Building on the recently developed StrainRecon algorithm, originally designed for 24-SNP malaria genotyping, we develop the Strain Inference Methods (SIM): a generalizable framework for analyzing mixed-strain samples without relying on prior strain panels.

We adapt the SIM framework to three epidemiologically relevant use cases in public health:

  Population-level analysis, where we apply SIM to malaria survey data from western Kenya and introduce novel metrics that capture changes in transmission intensity and in the prevalence of transmission types.

  Sample-to-sample comparison, where we extend SIM to support molecular correction in drug efficacy trials by developing two new methods (StrainMatch-T and StrainMatch-33-T) for distinguishing recrudescence from reinfection in multi-strain infections.

  Analysis of bacterial data, where we adapt SIM for use with highly multiplexed amplicon sequencing (HMAS) and evaluate its performance on synthetic Salmonella samples.

Across these settings, we find that strain disambiguation improves analytical resolution, revealing patterns and relationships that are obscured by conventional methods. Our findings suggest that strain disambiguation shows promise for improving the analysis of complex, mixed-strain datasets in a wide range of epidemiological workflows and public health surveillance applications.

Table of Contents

The Need for Epidemiology  1

Limitations with Current Approaches  2

An Ideal Lens  3

Biological Techniques for Characterizing Strains in Infections  6

StrainRecon and STIM  7

3.1 Introduction  11

3.2 Methods  17

3.2.1 Ethics statement  17

3.2.2 Study area and sample collection  18

3.2.3 Laboratory testing  19

3.2.4 Definitions  20

3.2.5 Processing 24-SNP data  21

3.2.6 Data analysis  22

3.2.6.1 StrainRecon for constituent strains and STIM for MOI estimation  22

3.2.6.2 Analysis of MOI  23

3.2.6.3 Relatedness: FST and IBD  24

3.2.6.4 Other population genetics metrics  25

3.3 Results  25

3.3.1 Temporal trends of MOI  25

3.3.2 Relatedness: FST and IBD  31

3.3.2.1 Distribution of different number of SNPs in strains within each year  31

3.3.2.2 FST Population Relatedness  32

3.3.2.3 IBD strain relatedness within years and between years  33

3.3.2.4 IBD Strain Relatedness Within Subjects  35

3.3.3 Hs and Ne  37

3.3.3.1 Expected Heterozygosity (Hs)  37

3.3.3.2 Effective strain population size (Ne)  38

3.3.4 Inferring Superinfection and Co-transmission from within-host IBD and MOI  40

3.4 Discussion  41

3.5 Next Steps in Developing SIM  49

4.1 Background  51

4.2 Materials and Methods  54

4.2.1 Terminology  54

4.2.3 Previous State of the Art Methods: Marker-Matching  54

4.2.3.1 SNP-Matching with 24-SNP Data  55

4.2.3.2 Allele-Matching Methods: The 2022 WHO Algorithm  55

4.2.3.3 Limitations of Marker-matching  55

4.2.4 StrainMatch Methods  56

4.2.4.1 The StrainMatch-T Algorithm  56

4.2.4.2 The StrainMatch-33-T Method  58

4.2.4.3 Unused Method: StrainMatch-B  59

4.2.5 Experiments  60

4.2.5.1 Calibration of StrainMatch-T  60

4.2.5.2 Method Evaluation  60

4.2.5.3 Field Experiment  61

4.3 Results  63

4.3.1 StrainMatch-T Calibration  63

4.3.2 In-Silico Method Comparison  65

4.3.3 Field Dataset Analysis Results  70

4.4 Discussion  76

4.4.1 Assessing Molecular Correction Methods  76

4.4.2 Assessing Field Data  79

4.4.3 Possible Application: An Extended Molecular Correction Pipeline  80

4.4.4 Limitations and Future Work  81

4.5 Conclusion  82

5.1 Introduction  84

5.2 Materials and Methods  87

5.2.1 Datasets  88

5.2.1.1 Lab Datasets  88

5.2.1.1.1 Sample Preparation and HMAS Workflow  89

5.2.1.1.2 Step-Mothur  91

5.2.2 In-Silico Dataset  92

5.2.2.1 Dataset Design  92

5.2.2.2 Making a generative model based on lab data  94

5.2.3 Evaluating SIM  96

5.2.3.1 Strain Reconstruction (Goal #1, Goal #2)  97

5.2.3.2 Strain Detection (Goal #3)  98

5.2.3.2.1 StrainMatch-B  98

5.2.3.2.2 Baseline for Strain Detection: Allele-Match Thresholding  100

5.2.3.2.3 Evaluation  100

5.3 Results  101

5.3.1 Lab data results  101

5.3.2 Making an in-silico noise model  101

5.3.2.1 Relative Abundances of Error Variants  102

5.3.2.2 Read Depth  105

5.3.2.3 Number of Error Variants  107

5.3.2.4 Number of Mutations (Base Pair Changes)  110

5.3.2.5 Summary  113

5.3.3 Analyzing Simulated HMAS Samples with SIM  114

5.3.3.1 Strain Reconstruction Accuracy  114

5.3.3.2 Strain Detection  118

5.4 Discussion  121

5.4.1 Contributions  121

5.4.2 Implications for Public Health Applications  122

5.4.3 Limitations and Future Work  123

5.4.4 Strain Disambiguation Insights  124

6.1 Data availability  128

Appendix 3A StrainRecon and STIM  135

Appendix 3B Statistical analysis of temporal trends of MOI  135

Appendix 3C Relatedness Metrics  137

3C.1 FST  137

3C.2 IBD  137

3C.2.1 IBD Relatedness within and across years  138

3C.2.2 Within-host IBD strain relatedness differentiation  141

Appendix 3D Verification of relationship between within-host IBD and MOI  142

Appendix 3E Effective Population Size  144

Appendix 4A Terminology: Similarity  146

Appendix 4B Molecular Correction Methods  147

4B.1 3/3 WHO Allele Matching Method  147

4B.1.1 Simulated False Positive Rate as a function of MOI and Allele Frequency  147

4B.1.1.2 Simulating Samples from Uganda  148

4B.2 StrainMatch-T  150

4B.2.1 Algorithm  150

4B.2.2 Indeterminacy  151

4B.2.3 Sensitivity Analysis  154

4B.3 Marker-matching using 24-SNP Data  157

4B.3.1 Algorithm  157

4B.3.2 Calibration and Sensitivity Testing  157

4B.4: StrainMatch-B  160

4B.1 Algorithm  160

4B.2 Initial Development with In-Silico 24-SNP Data  161

4B.2.1 Feature Selection  162

4B.2.2 Parameter Tuning  168

4B.2.3 Summary  171

Appendix 4C Field Study Results  172

4C.1.1 Estimated Bounds on True Number of Recrudescences and Reinfections  172

4C.1.2 MOI Analysis  172

Appendix 5A: Validating Assumptions  173

5A.1 Data Fidelity Assumptions  173

5A.2 Allele Frequency Consistency Assumption  177

Appendix 5B: Predicting and Measuring the Error Rate Using Previous Models  181

Appendix 5C: Conover-Iman Statistical Tests for the Effect of Strain Composition on Reconstruction Accuracy  183

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
