Statistical Methods for Analyzing Compositional Human Microbiome Data

Open Access

Hu, Yingtian (Summer 2022)

With recent developments in high-throughput sequencing technologies, human microbiome data are becoming more readily available, which drives interest in studying relationships between the human microbiome and host diseases. Although research in this field is booming, the complex features of microbiome data, namely compositionality, high dimensionality, sparsity, overdispersion, and experimental bias, pose many statistical challenges for analysis. In particular, compositionality refers to the fact that the sequencing depth (library size) of each sample is noninformative, so that converting the read counts into relative abundances yields compositional data. In this dissertation, we propose novel statistical methods for analyzing compositional human microbiome data. The dissertation is composed of three topics.
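
To make the notion of compositionality concrete, the following minimal sketch converts a read-count table into relative abundances. The function name and the toy counts are hypothetical and are not taken from the dissertation; the sketch also assumes every sample has a nonzero library size.

```python
import numpy as np

def relative_abundance(counts):
    """Convert a samples-by-taxa read-count table into relative abundances."""
    counts = np.asarray(counts, dtype=float)
    # Library size (sequencing depth) of each sample; assumed nonzero here.
    library_size = counts.sum(axis=1, keepdims=True)
    return counts / library_size

# Hypothetical toy table: 2 samples, 3 taxa.
# The second sample is the first one sequenced ten times deeper.
counts = np.array([[10, 30, 60],
                   [100, 300, 600]])
rel = relative_abundance(counts)
```

Both rows of `rel` are identical even though the library sizes differ tenfold, which is the sense in which sequencing depth carries no information about the underlying composition.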

In the first topic, we address the problem of detecting differentially abundant bacterial taxa, i.e., taxa whose abundances are associated with the trait (condition) of interest. Our goal is to detect the taxa that initially respond to the condition change, not the taxa that show changes in relative abundance because of the compositional constraint. In this case, the null hypothesis tested at a taxon is that the ratio of the relative abundance of that taxon to that of some null taxon is unchanged. Existing methods tend to produce excessive false positive findings because they may improperly handle the sparsity of the data, incorrectly identify the reference taxon, and fail to account for experimental bias. To address these issues, we develop a novel method for compositional analysis of differential abundance, based on a robust version of logistic regression, that we call LOCOM (LOgistic COMpositional analysis). Our method circumvents the use of pseudocounts, does not require the reference taxon to be null, and does not require normalization of the data. Further, it is applicable to a variety of microbiome studies with binary or continuous traits of interest and can account for potentially confounding covariates. We present simulation results that demonstrate the advantages of our proposed method in terms of higher sensitivity and well-controlled false discovery rate (FDR) compared with other methods. We apply our method to two real microbiome datasets and compare it with existing methods. LOCOM identifies more biologically meaningful differentially abundant taxa.
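
The distinction between taxa that respond to the condition and taxa that change only because of the compositional constraint can be illustrated with a small numerical sketch. This is not LOCOM itself; the abundances are hypothetical, and the reference taxon is deliberately chosen to be null for simplicity (LOCOM, as stated above, does not require this).

```python
import numpy as np

def to_relative(abundance):
    """Normalize a vector of abundances so that it sums to one."""
    abundance = np.asarray(abundance, dtype=float)
    return abundance / abundance.sum()

# Hypothetical absolute abundances of 4 taxa under two conditions.
control = np.array([100.0, 200.0, 300.0, 400.0])
case = control.copy()
case[0] *= 5.0  # only taxon 0 truly responds to the condition

rel_control = to_relative(control)
rel_case = to_relative(case)

# Every relative abundance changes, including those of the null taxa.
rel_changed = ~np.isclose(rel_control, rel_case)

# Ratios against a reference taxon (index 3, assumed null in this toy case)
# change only for the taxon that truly responds.
ref = 3
ratio_changed = ~np.isclose(rel_control / rel_control[ref],
                            rel_case / rel_case[ref])
```

Here `rel_changed` flags all four taxa, whereas `ratio_changed` flags only taxon 0, which is why the null hypothesis is phrased in terms of ratios of relative abundances rather than the relative abundances themselves.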

In the second topic, we evaluate the impact of interactive bias on compositional analysis methods for testing differential abundance of taxa. Microbiome data are subject to experimental bias. However, this important feature has often been ignored in the development of statistical methods for analyzing microbiome data. McLaren, Willis and Callahan (2019) proposed a model (which we call the MWC model) for how such bias affects the measured taxonomic profiles, which assumes no taxon-taxon interactions. Our newly developed method, LOCOM, is robust to experimental bias that follows the MWC model. However, there is evidence for taxon-taxon interactions, so it is of interest to re-evaluate LOCOM and other compositional analysis methods in the presence of interactive bias. We propose a model for the experimental bias in the measurement of a taxon that allows for contributions from the other taxa. Using this model, we conduct simulation studies to evaluate the impact of such experimental bias on the performance of LOCOM, as well as other compositional analysis methods. Our simulation results indicate that LOCOM is robust to any main bias and a reasonable range of interactive bias. The other methods tend to have inflated FDR even when there is only main bias. LOCOM maintains the highest sensitivity among all methods even when the other methods cannot control the FDR.
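
A short sketch may help show why a ratio-based analysis can be robust to bias without taxon-taxon interactions. It assumes that such bias acts as a taxon-specific multiplicative efficiency factor followed by renormalization, which is our reading of the MWC model; the factor values, the profiles, and the function names are hypothetical, and the interactive bias model proposed in the dissertation is not reproduced here.

```python
import numpy as np

def measure(true_rel, efficiency):
    """Apply taxon-specific multiplicative bias, then renormalize."""
    biased = np.asarray(true_rel, dtype=float) * np.asarray(efficiency, dtype=float)
    return biased / biased.sum()

def log_ratio_change(a, b, ref):
    """Change between profiles a and b in log-ratios against taxon `ref`."""
    return np.log(b / b[ref]) - np.log(a / a[ref])

# Hypothetical true relative abundances under two conditions.
true_a = np.array([0.1, 0.2, 0.3, 0.4])
true_b = np.array([0.4, 0.1, 0.2, 0.3])

# Hypothetical measurement efficiencies, one per taxon (assumed values).
efficiency = np.array([4.0, 0.5, 1.0, 2.0])

obs_a = measure(true_a, efficiency)
obs_b = measure(true_b, efficiency)

ref = 3
true_change = log_ratio_change(true_a, true_b, ref)
obs_change = log_ratio_change(obs_a, obs_b, ref)
```

The measured profiles differ from the true ones, yet `obs_change` equals `true_change` because each efficiency factor cancels when the same taxon is compared across conditions. Under interactive bias this cancellation is no longer guaranteed, which motivates the re-evaluation described above.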

In the third topic, we study the association between microbiome composition and survival outcomes. Existing methods for survival outcomes are restricted to testing associations at the community level and do not provide results at the individual taxon level. An ad hoc approach that tests taxon-level associations using the Cox proportional hazards model may not perform well in the microbiome setting, with its sparse count data and small sample sizes. Here we extend the linear decomposition model (LDM), a unified approach that allows testing both community-level and taxon-level associations, to handle survival outcomes. We propose to use the martingale residuals or the deviance residuals obtained from the Cox model as continuous covariates in the LDM. We further construct tests that combine the results of analyzing each set of residuals separately. We also extend PERMANOVA, the most commonly used distance-based method for testing community-level hypotheses, to handle survival outcomes in a similar manner. Simulation results demonstrate that the LDM-based tests preserve the FDR for testing individual taxa and have good sensitivity. The LDM-based community-level tests and PERMANOVA-based tests have comparable or better power than competing methods. An analysis of data on the association of the gut microbiome and the time to acute graft-versus-host disease reveals several dozen associated taxa and improved community-level tests.
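
As a rough illustration of the residuals mentioned above, the sketch below computes martingale and deviance residuals for a Cox model with no covariates, in which case the cumulative hazard reduces to the Nelson-Aalen estimator. The covariate-free model, the function name, and the toy data are assumptions made for illustration; the Cox model used in the dissertation may include covariates, and the LDM step is not shown.

```python
import numpy as np

def null_cox_residuals(time, event):
    """Martingale and deviance residuals from a Cox model without covariates.

    time  : observed follow-up times
    event : 1 if the event was observed, 0 if censored
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)

    # Nelson-Aalen cumulative hazard, evaluated at each subject's own time.
    cum_hazard = np.zeros_like(time)
    for t in np.unique(time[event == 1]):
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & (event == 1))
        cum_hazard[time >= t] += deaths / at_risk

    # Martingale residual: observed minus expected number of events.
    martingale = event - cum_hazard

    # Deviance residual: sign(M) * sqrt(-2 * [M + delta * log(delta - M)]).
    # For subjects with an event, delta - M equals the cumulative hazard.
    log_term = np.zeros_like(time)
    observed = event == 1
    log_term[observed] = np.log(cum_hazard[observed])
    inner = np.maximum(-2.0 * (martingale + event * log_term), 0.0)
    deviance = np.sign(martingale) * np.sqrt(inner)
    return martingale, deviance

# Hypothetical toy data: 4 subjects, the second one censored.
martingale, deviance = null_cox_residuals([1, 2, 3, 4], [1, 0, 1, 1])
```

Either residual vector could then play the role of the continuous covariate described above. The martingale residuals sum to zero but are skewed, while the deviance residuals are a more symmetric transformation of them, which is one possible motivation for analyzing both sets and combining the results.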

Table of Contents


1.1 Overview of human microbiome data

1.2 Association analysis of microbiome data

1.2.1 Two biological models

1.2.2 Community-level tests

1.2.3 Taxon-level association tests

1.2.4 Unified approach to testing associations at both the community and taxon levels

1.3 Experimental bias in microbiome data

1.4 Outline

2. LOCOM: A logistic regression model for testing differential abundance in compositional microbiome data with false discovery rate control

2.1 Introduction

2.2 Methodology

2.2.1 Motivation

2.2.2 Multivariate logistic regression model

2.2.3 Testing hypothesis at individual taxa

2.2.4 Testing the global hypothesis

2.3 Simulations

2.3.1 Simulation Studies

2.3.2 Simulation Results

2.4 Data Analysis

2.4.1 URT microbiome data

2.4.2 PPI microbiome data

2.5 Discussions

3. Impact of experimental bias on compositional analysis of microbiome data

3.1 Introduction

3.2 Methodology

3.2.1 MWC model for experimental bias

3.2.2 A general model for experimental bias

3.3 Simulations

3.3.1 Simulation studies

3.3.2 Simulation results

3.4 Discussion

4. Testing microbiome associations with survival times at both the community and individual taxon levels

4.1 Introduction

4.2 Methodology

4.3 Simulations

4.3.1 Simulation designs

4.3.2 Simulation results

4.4 Data analysis

4.4.1 Analysis of the aGVHD data

4.5 Discussion

A Appendix for Chapter 2

B Appendix for Chapter 3

C Appendix for Chapter 4


About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English