Statistical Methods for Analyzing Microbiome Data

Open Access

Zhu, Zhengyi (Fall 2022)


Data from studies of the microbiome are accumulating at a rapid rate. The relative ease of conducting a census of bacteria by sequencing the 16S rRNA gene has led to many studies that examine the association between the microbiome and health states or outcomes. Many microbiome studies have complex design features (e.g., paired, clustered, or longitudinal data) or complexities that frequently arise in medical studies (e.g., the presence of confounding covariates). In this dissertation, we propose novel statistical methods for solving three different problems in microbiome studies: testing microbiome associations in matched-set data, combining test results across multiple data scales, and estimating the variance-covariance matrix for longitudinal data with missing values.

In the first topic, we address the need for statistical methods for analyzing microbiome data composed of matched sets, to test hypotheses about traits of interest that vary between members of a set. Matched-set data arise frequently in microbiome studies (e.g., pre- and post-treatment samples from a set of individuals, or data from case participants matched to one or more control participants using important confounding variables). Existing methods cannot accommodate complex data such as those with unequal sample sizes across sets, confounders varying within sets, and continuous traits of interest. By leveraging PERMANOVA, a commonly used distance-based method for testing hypotheses at the community level, and the linear decomposition model (LDM), which unifies the community-level and OTU-level tests into one framework, we present a new strategy for analyzing matched-set data. We propose to include an indicator variable for each set as a covariate, so as to constrain comparisons to samples within a set, and also to permute traits within each set, which accounts for exchangeable sample correlations. The flexible nature of PERMANOVA and the LDM allows discrete or continuous traits or interactions to be tested, within-set confounders to be adjusted for, and unbalanced data to be fully exploited. We design a wide range of simulations to compare our proposed strategy to alternative strategies, including the commonly used one that relies on restricted permutation only. We also use simulations to explore optimal designs for matched-set studies. We apply our method to data from two real studies to illustrate its flexibility for a variety of matched-set microbiome data.
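The two ingredients described above (set indicators as covariates, and permutation of the trait within sets) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's implementation: the function name `within_set_permanova`, its arguments, and the unscaled pseudo-F statistic are hypothetical choices made for this sketch, and it handles a single trait with no additional confounders.

```python
import numpy as np

def within_set_permanova(D, trait, set_id, n_perm=999, seed=0):
    """Sketch: test a trait while confining comparisons to within matched sets.

    D      : (n, n) distance matrix between samples
    trait  : length-n numeric trait (discrete or continuous)
    set_id : length-n labels identifying the matched set of each sample
    """
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    trait = np.asarray(trait, dtype=float)
    set_id = np.asarray(set_id)
    n = trait.shape[0]

    # Gower-centered matrix derived from the squared distances.
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J

    # One indicator column per set; R projects the set effects out,
    # so only within-set variation remains.
    sets = np.unique(set_id)
    Z = (set_id[:, None] == sets[None, :]).astype(float)
    R = np.eye(n) - Z @ np.linalg.pinv(Z)
    total = np.trace(R @ G @ R)

    def stat(x):
        xr = R @ x                      # trait after removing set means
        ss = xr @ xr
        if ss < 1e-12:                  # trait does not vary within sets
            return 0.0
        explained = (xr @ G @ xr) / ss
        # Pseudo-F without degrees-of-freedom scaling; the scaling is
        # constant across permutations, so the p-value is unaffected.
        return explained / (total - explained)

    observed = stat(trait)
    hits = 0
    for _ in range(n_perm):
        permuted = trait.copy()
        for s in sets:                  # shuffle the trait within each set only
            idx = np.flatnonzero(set_id == s)
            permuted[idx] = rng.permutation(permuted[idx])
        hits += stat(permuted) >= observed - 1e-12
    return observed, (hits + 1) / (n_perm + 1)
```

Because the set indicators are projected out before the statistic is computed, sets of unequal size need no special handling in this sketch, and a continuous trait is treated the same way as a binary one.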

In the second topic, we propose an approach to the integrative analysis of different microbiome data scales using the LDM. Previously, the LDM was developed for testing hypotheses about the microbiome (at both the community level and the individual taxon level) on three scales separately: the relative abundance scale, the arcsin-root-transformed relative abundance scale, and the presence-absence scale. The LDM also offered an omnibus test (LDM-omni) that combined the results from the relative abundance and arcsin-root-transformed relative abundance scales. In some scenarios, we have observed that the presence-absence analysis worked better than this original omnibus test, which suggests the need for a new omnibus test that combines results from all three data scales. So that the omnibus global test can use the best scale for each taxon, we propose an omnibus test that uses p-value combination methods to combine the taxon-level LDM p-values into a statistic that can be added to the global LDM test, thus offering optimal power across scenarios with different association mechanisms. The omnibus test is available for the wide range of data types and analyses supported by the LDM.
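To make the idea of combining scale-specific p-values concrete, the sketch below applies one well-known combination rule, the Cauchy combination, to each taxon. The abstract does not say which combination methods the dissertation adopts, so the choice of rule, the function name `cauchy_combine`, and the example p-values are all assumptions made purely for illustration.

```python
import numpy as np

def cauchy_combine(pvalues, axis=-1):
    """Combine p-values with the Cauchy combination rule (equal weights).

    Each p-value is mapped to a standard Cauchy variate, the variates are
    averaged, and the average is mapped back to a p-value.
    """
    p = np.clip(np.asarray(pvalues, dtype=float), 1e-15, 1 - 1e-15)
    t = np.mean(np.tan((0.5 - p) * np.pi), axis=axis)
    return 0.5 - np.arctan(t) / np.pi

# Hypothetical p-values for 3 taxa (rows) on 3 scales (columns):
# relative abundance, arcsin-root, presence-absence.
taxon_p = np.array([
    [0.400, 0.300, 0.002],   # signal visible only on the presence-absence scale
    [0.010, 0.020, 0.500],   # signal visible on the abundance scales
    [0.600, 0.700, 0.800],   # no signal on any scale
])
combined = cauchy_combine(taxon_p, axis=1)   # one combined value per taxon
```

The combined value is dominated by the smallest p-value in each row, which matches the stated goal of letting each taxon use its best scale. In a permutation-based framework such as the LDM, a combined statistic of this kind would presumably be recalibrated by permutation rather than read directly as a p-value.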

In the third topic, we tackle the problem of estimating the variance-covariance matrix of the longitudinal measurements at each taxon. A major challenge in analyzing longitudinal measurements is incomplete data, which result from missing measurements (e.g., patients are followed for a period of time but miss some of the visits). In such cases, the empirical estimate of the variance-covariance matrix may not be positive definite, a key property of a variance-covariance matrix. Thus, there is a need for statistical methods for longitudinal data with missing values that estimate a positive-definite variance-covariance matrix and that accommodate non-normal data distributions, complex missingness patterns, and possible constraints on the data (e.g., centered measurements that sum to 0). We develop an algorithm based on a non-parametric model that iteratively optimizes the variance-covariance matrix estimate towards the empirical one while parameterizing it so that the estimate is always positive semi-definite. We use simulations and data from a real longitudinal microbiome study to illustrate that our proposed algorithm is robust in a wide range of scenarios.
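The general idea of moving an estimate towards the empirical covariance while keeping it positive semi-definite by construction can be illustrated as follows. This sketch is not the dissertation's algorithm: it assumes the simple parameterization Sigma = B B^T, an L2 (Frobenius) loss, and plain gradient descent with backtracking, and the function names `pairwise_cov` and `psd_fit` are hypothetical.

```python
import numpy as np

def pairwise_cov(X):
    """Empirical covariance from pairwise-complete observations.

    NaN marks a missing value. Each entry uses only the rows in which both
    variables are observed (at least one such row per pair is assumed), so
    the resulting matrix is not guaranteed to be positive semi-definite.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    p = X.shape[1]
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            both = observed[:, j] & observed[:, k]
            xj, xk = X[both, j], X[both, k]
            S[j, k] = S[k, j] = np.mean((xj - xj.mean()) * (xk - xk.mean()))
    return S

def psd_fit(S, n_iter=2000, tol=1e-12):
    """Minimize ||B B^T - S||_F^2 over B; B B^T is PSD for every B."""
    S = np.asarray(S, dtype=float)
    B = np.diag(np.sqrt(np.clip(np.diag(S), 1e-8, None)))  # start from the variances

    def loss(B):
        return np.sum((B @ B.T - S) ** 2)

    current = loss(B)
    for _ in range(n_iter):
        grad = 4.0 * (B @ B.T - S) @ B       # gradient of the loss for symmetric S
        g2 = np.sum(grad ** 2)
        step, accepted = 1.0, False
        while step > 1e-16:                  # backtracking (Armijo) line search
            candidate = B - step * grad
            value = loss(candidate)
            if value <= current - 0.1 * step * g2:
                accepted = True
                break
            step *= 0.5
        if not accepted:
            break
        improvement = current - value
        B, current = candidate, value
        if improvement < tol:
            break
    return B @ B.T

# An indefinite "empirical" matrix of the kind pairwise deletion can produce.
S = np.array([[1.0, 2.0], [2.0, 1.0]])
Sigma = psd_fit(S)
```

Writing the estimate as B B^T removes the need for an explicit positive semi-definiteness constraint, at the cost of a non-convex problem in B. Constraints on the data, such as measurements that sum to 0, are not handled in this sketch.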

Table of Contents

1. Introduction

1.1 Overview of microbiome data 

1.2 Statistical analysis of microbiome data 

1.3 Integrative analysis of microbiome data 

1.4 Matched-set microbiome data 

1.5 Longitudinal microbiome data 

2. Constraining PERMANOVA and LDM to within-set comparisons by projection improves the efficiency of analyses of matched sets of microbiome data 

2.1 Introduction 

2.2 Methods 

2.3 Results 

2.3.1 Simulation studies 

2.3.2 Simulation results 

2.3.3 Analysis of the MsFLASH data 

2.3.4 Analysis of the Alzheimer’s disease data 

2.4 Discussion 

2.5 Conclusions 

3. Integrative analysis of relative abundance data and presence-absence data of the microbiome using the LDM 

3.1 Introduction 

3.2 Methods 

3.2.1 Taxon-level omnibus test 

3.2.2 Community-level omnibus test 

3.3 Results 

3.3.1 Simulation studies 

3.3.2 Testing Association in the URT microbiome dataset 

3.4 Conclusion 

4. Estimating a Variance-Covariance Matrix with Incomplete Data 

4.1 Introduction 

4.2 Methods 

4.2.1 An Estimator for Incomplete Data 

4.2.2 Parameterization 

4.2.3 Convexity 

4.2.4 Different scenarios of rank and constraint

Stratum 0 (Full Data Stratum), Constraints, and Choice of Parameters

All Other Strata

4.3 Results 

4.3.1 Simulation and real data analysis 

4.4 Discussion 

A. Appendix for Chapter 2 

B. Appendix for Chapter 3 

B.1 Choosing among p-value combination methods 

B.2 Two models for simulating microbiome-trait associations 

C. Appendix for Chapter 4 

C.1 Derivatives of the Loss Function 

C.1.1 Derivatives of the Stein Loss Function 

C.1.2 Derivatives of L2 Loss Function 


About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
  • English