Inferring Strains in Mixed Samples using Computational Methods
Restricted; Files Only
Vestal, Gary (Fall 2025)
Abstract
Strain disambiguation—the inference of individual strains in multi-strain genomic samples—offers a promising but underutilized methodology for analyzing complex infection data. This dissertation investigates whether strain disambiguation can be adapted into a template-free, strain-number scalable framework suitable for common challenges in infectious disease epidemiology. Building on the recently developed StrainRecon algorithm, originally designed for 24-SNP malaria genotyping, we develop the Strain Inference Methods (SIM): a generalizable framework for analyzing mixed-strain samples without relying on prior strain panels.
We adapt the SIM framework to three epidemiologically relevant use cases in public health:
Population-level analysis, where we apply SIM to malaria survey data from western Kenya and introduce novel metrics that capture changes in transmission type prevalences and intensity.
Sample-to-sample comparison, where we extend SIM to support molecular correction in drug efficacy trials by developing two new methods (StrainMatch-T and StrainMatch-33-T) for distinguishing recrudescence from reinfection in multi-strain infections.
Analysis of bacterial data, where we adapt SIM for use with highly multiplexed amplicon sequencing (HMAS) by evaluating its performance on synthetic Salmonella samples.
Across these settings, we find that strain disambiguation improves analytical resolution, revealing patterns and relationships that are obscured by conventional methods. Our findings suggest that strain disambiguation shows promise for improving the analysis of complex, mixed-strain datasets in a wide range of epidemiological workflows and public health surveillance applications.
Table of Contents
The Need for Epidemiology
Limitations with Current Approaches
An Ideal Lens
Biological Techniques for Characterizing Strains in Infections
StrainRecon and STIM
3.1 Introduction
3.2 Methods
3.2.1 Ethics statement
3.2.2 Study area and sample collection
3.2.3 Laboratory testing
3.2.4 Definitions
3.2.5 Processing 24-SNP data
3.2.6 Data analysis
3.2.6.1 StrainRecon for constituent strains and STIM for MOI estimation
3.2.6.2 Analysis of MOI
3.2.6.3 Relatedness: FST and IBD
3.2.6.4 Other population genetics metrics
3.3 Results
3.3.1 Temporal trends of MOI
3.3.2 Relatedness: FST and IBD
3.3.2.1 Distribution of different number of SNPs in strains within each year
3.3.2.2 FST Population Relatedness
3.3.2.3 IBD strain relatedness within years and between years
3.3.2.4 IBD Strain Relatedness Within Subjects
3.3.3 Hs and Ne
3.3.3.1 Expected Heterozygosity (Hs)
3.3.3.2 Effective strain population size (Ne)
3.3.4 Inferring Superinfection and Co-transmission from within-host IBD and MOI
3.4 Discussion
3.5 Next Steps in Developing SIM
4.1 Background
4.2 Materials and Methods
4.2.1 Terminology
4.2.3 Previous State of the Art Methods: Marker-Matching
4.2.3.1 SNP-Matching with 24-SNP Data
4.2.3.2 Allele-Matching Methods: The 2022 WHO Algorithm
4.2.3.3 Limitations of Marker-matching
4.2.4 StrainMatch Methods
4.2.4.1 The StrainMatch-T Algorithm
4.2.4.2 The StrainMatch-33-T Method
4.2.4.3 Unused Method: StrainMatch-B
4.2.5 Experiments
4.2.5.1 Calibration of StrainMatch-T
4.2.5.2 Method Evaluation
4.2.5.3 Field Experiment
4.3 Results
4.3.1 StrainMatch-T Calibration
4.3.2 In-Silico Method Comparison
4.3.3 Field Dataset Analysis Results
4.4 Discussion
4.4.1 Assessing Molecular Correction Methods
4.4.2 Assessing Field Data
4.4.3 Possible Application: An Extended Molecular Correction Pipeline
4.4.4 Limitations and Future Work
4.5 Conclusion
5.1 Introduction
5.2 Materials and Methods
5.2.1 Datasets
5.2.1.1 Lab Datasets
5.2.1.1.1 Sample Preparation and HMAS Workflow
5.2.1.1.2 Step-Mothur
5.2.2 In-Silico Dataset
5.2.2.1 Dataset Design
5.2.2.2 Making a generative model based on lab data
5.2.3 Evaluating SIM
5.2.3.1 Strain Reconstruction (Goal #1, Goal #2)
5.2.3.2 Strain Detection (Goal #3)
5.2.3.2.1 StrainMatch-B
5.2.3.2.2 Baseline for Strain Detection: Allele-Match Thresholding
5.2.3.2.3 Evaluation
5.3 Results
5.3.1 Lab data results
5.3.2 Making an in-silico noise model
5.3.2.1 Relative Abundances of Error Variants
5.3.2.2 Read Depth
5.3.2.3 Number of Error Variants
5.3.2.4 Number of Mutations (Base Pair Changes)
5.3.2.5 Summary
5.3.3 Analyzing Simulated HMAS Samples with SIM
5.3.3.1 Strain Reconstruction Accuracy
5.3.3.2 Strain Detection
5.4 Discussion
5.4.1 Contributions
5.4.2 Implications for Public Health Applications
5.4.3 Limitations and Future Work
5.4.4 Strain Disambiguation Insights
6.1 Data availability
Appendix 3A StrainRecon and STIM
Appendix 3B Statistical analysis of temporal trends of MOI
Appendix 3C Relatedness Metrics
3C.1 FST
3C.2 IBD
3C.2.1 IBD Relatedness within and across years
3C.2.2 Within-host IBD strain relatedness differentiation
Appendix 3D Verification of relationship between within-host IBD and MOI
Appendix 3E Effective Population Size
Appendix 4A Terminology: Similarity
Appendix 4B Molecular Correction Methods
4B.1 3/3 WHO Allele Matching Method
4B.1.1 Simulated False Positive Rate as a function of MOI and Allele Frequency
4B.1.1.2 Simulating Samples from Uganda
4B.2 StrainMatch-T
4B.2.1 Algorithm
4B.2.2 Indeterminacy
4B.2.3 Sensitivity Analysis
4B.3 Marker-matching using 24-SNP Data
4B.3.1 Algorithm
4B.3.2 Calibration and Sensitivity Testing
4B.4: StrainMatch-B
4B.1 Algorithm
4B.2 Initial Development with In-Silico 24-SNP Data
4B.2.1 Feature Selection
4B.2.2 Parameter Tuning
4B.2.3 Summary
Appendix 4C Field Study Results
4C.1.1 Estimated Bounds on True Number of Recrudescences and Reinfections
4C.1.2 MOI Analysis
Appendix 5A: Validating Assumptions
5A.1 Data Fidelity Assumptions
5A.2 Allele Frequency Consistency Assumption
Appendix 5B: Predicting and Measuring the Error Rate Using Previous Models
Appendix 5C: Conover-Iman Statistical Tests for the Effect of Strain Composition on Reconstruction Accuracy
About this Dissertation
| School | |
|---|---|
| Department | |
| Degree | |
| Submission | |
| Language | |
| Research Field | |
| Keyword | |
| Committee Chair / Thesis Advisor | |
| Committee Members | |
Primary PDF
File download under embargo until 12 July 2026. Date uploaded: 2025-12-11 10:13:44 -0500.