Development of Statistical Methods for Multiple-Hypotheses Testing

Li, Yunxiao (Fall 2019)

Permanent URL: https://etd.library.emory.edu/concern/etds/m900nv540?locale=es

Published

Abstract

In this dissertation, I develop three novel statistical methods for solving multiple-hypotheses testing problems.

In the first topic, we propose a bottom-up approach to testing hypotheses that have a tree-structured dependence. Our motivating example comes from testing the associations between a trait of interest and groups of microbes that have been organized into operational taxonomic units (OTUs). Given p-values from association tests for each individual OTU, we would like to know whether a certain species, genus, or higher taxonomic group can be declared to be associated with the trait. This problem requires a bottom-up testing algorithm that starts at the lowest level of the tree (OTUs) and proceeds upward through successively higher taxonomic groups (species, genus, family, etc.). We develop such an algorithm, which controls the error rate of decisions made at higher levels in the tree, conditional on findings at lower levels. We further show that our algorithm controls the false discovery rate under the global null hypothesis that no taxa are associated with the trait. By simulation, we also show that our approach is better at finding driver taxa, the highest-level taxa below which there are dense association signals. We illustrate our approach using data from a study of the microbiome among patients with ulcerative colitis and healthy controls.
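
The notion of a driver taxon described above can be made concrete with a toy example. The sketch below is not the dissertation's testing algorithm: it assumes the OTU-level detections are already given, and it uses a hypothetical rule (`min_fraction`, an assumption introduced here) to decide when the signals below a node count as "dense". The tree layout and all names are illustrative only.

```python
def leaves_under(tree, node):
    """Return all leaf names (e.g. OTUs) below `node`; a node with no children is a leaf."""
    children = tree.get(node, [])
    if not children:
        return [node]
    leaves = []
    for child in children:
        leaves.extend(leaves_under(tree, child))
    return leaves


def driver_nodes(tree, node, detected_leaves, min_fraction=0.8):
    """Return the highest nodes at or below `node` whose leaves are densely detected.

    "Dense" is an assumed rule: at least `min_fraction` of the leaves
    under the node are in `detected_leaves`.
    """
    leaves = leaves_under(tree, node)
    fraction = sum(leaf in detected_leaves for leaf in leaves) / len(leaves)
    if fraction >= min_fraction:
        return [node]  # dense here, so do not descend further
    drivers = []
    for child in tree.get(node, []):
        drivers.extend(driver_nodes(tree, child, detected_leaves, min_fraction))
    return drivers


# Hypothetical taxonomy: one genus, two species, four OTUs.
toy_tree = {
    "genus_A": ["species_1", "species_2"],
    "species_1": ["otu_1", "otu_2"],
    "species_2": ["otu_3", "otu_4"],
}
toy_detected = {"otu_1", "otu_2", "otu_3"}

if __name__ == "__main__":
    print(driver_nodes(toy_tree, "genus_A", toy_detected, min_fraction=0.8))
```

With three of four OTUs detected, the genus falls short of the assumed 0.8 density, so the sketch reports `species_1` (both OTUs detected) and the single detected OTU under `species_2`. Lowering `min_fraction` to 0.75 would report the genus itself.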

In the second topic, we consider resampling-based multiple testing problems. In the multiple testing literature, the standard procedures for correcting multiplicity, such as the Benjamini and Hochberg (1995) procedure, usually require knowledge of the ideal p-values. In many biological applications, such as microbiome studies, the ideal p-values cannot be computed analytically and can only be approximated by resampling methods. Resampling-based p-values coupled with a multiplicity-correction procedure generally produce different lists of rejections when the resampling algorithm is initiated with different random seeds, and hence lack reproducibility. The existing method of Gandy and Hahn (2014, 2016), which aims to control the Monte Carlo (MC) error rate (i.e., the rate of disagreement with the decisions based on ideal p-values), is extremely conservative and tends to make no rejections. We focus on the type-I MC error, which occurs when we reject a hypothesis that should be accepted according to the ideal decisions (in other words, the rejection is not reproducible). We develop a two-step algorithm based on resampling replicates to make decisions while controlling the type-I MC error rate. Through extensive simulation studies, we demonstrate substantial power improvement compared to the existing method.
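
The reproducibility problem described above can be illustrated with a small simulation. The sketch below is not the two-step algorithm of the dissertation. It applies the standard Benjamini-Hochberg step-up rule to Monte Carlo p-values and compares the rejection lists across seeds. The ideal p-values, the number of resamples, and the helper names are all assumptions made for illustration. Each resample is modeled as exceeding the observed statistic with probability equal to the ideal p-value.

```python
import random


def benjamini_hochberg(pvals, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value is below its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


def mc_pvalue(ideal_p, n_resamples, rng):
    """Monte Carlo p-value estimate (exceedances + 1) / (n_resamples + 1)."""
    exceed = sum(rng.random() < ideal_p for _ in range(n_resamples))
    return (exceed + 1) / (n_resamples + 1)


def rejections_for_seed(ideal_pvals, n_resamples, seed, alpha=0.05):
    """Rejection list obtained from one resampling run with the given seed."""
    rng = random.Random(seed)
    estimates = [mc_pvalue(p, n_resamples, rng) for p in ideal_pvals]
    return benjamini_hochberg(estimates, alpha)


if __name__ == "__main__":
    # Hypothetical ideal p-values, several of them close to the BH cutoffs.
    ideal = [0.001, 0.004, 0.012, 0.018, 0.024, 0.03, 0.2, 0.4, 0.6, 0.9]
    print("ideal decisions:", benjamini_hochberg(ideal))
    for seed in range(5):
        print("seed", seed, "->", rejections_for_seed(ideal, 500, seed))
```

Hypotheses whose ideal p-values sit near a BH cutoff are the ones most likely to flip between runs, which is the type of disagreement the MC error rate is meant to quantify.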

In the third topic, we propose a class of algorithms for sequential resampling-based multiple testing. Resampling-based tests are known to be computationally intensive. Sequential algorithms provide efficient and accurate estimation of ideal p-values in resampling-based tests by allowing resampling to stop early as soon as the evidence suggests that a hypothesis should be assigned to the rejection or acceptance region. However, most existing sequential methods in this field (e.g., Sandve et al. (2011)) cannot guarantee reproducibility of test decisions. The only sequential methods that address this issue were developed by Gandy and Hahn (2014, 2016). We develop novel sequential testing algorithms by incorporating a step-wise decision process and improved sequential confidence intervals. The performance of our proposed methods is assessed on both synthetic and real data.
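
The early-stopping idea described above can be sketched for a single hypothesis with a fixed threshold. This is not the dissertation's procedure, which uses group sequential confidence intervals and a step-wise process across hypotheses. The sketch swaps in a simpler, more conservative construction: a Hoeffding interval at each look, with the error budget `delta` split across looks by a union bound. The function name, the batch size, and the callback `draw_exceeds` are assumptions made for illustration.

```python
import math
import random


def sequential_mc_test(draw_exceeds, threshold, delta=0.01, batch=100,
                       max_resamples=100_000):
    """Resample in batches and stop once the interval for the ideal p-value
    lies entirely on one side of `threshold`.

    `draw_exceeds()` returns True when a resampled statistic is at least as
    extreme as the observed one. Returns (decision, n_resamples_used).
    """
    exceed = 0
    n = 0
    look = 0
    while n < max_resamples:
        look += 1
        for _ in range(batch):
            exceed += bool(draw_exceeds())
        n += batch
        # Error spent at this look; the sum over all looks is at most delta.
        eps = delta / (look * (look + 1))
        # Hoeffding half-width for a proportion estimated from n draws.
        half_width = math.sqrt(math.log(2 / eps) / (2 * n))
        p_hat = exceed / n
        if p_hat + half_width < threshold:
            return "reject", n
        if p_hat - half_width > threshold:
            return "accept", n
    return "undecided", n


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical ideal p-values on either side of a 0.05 threshold.
    for ideal_p in (0.001, 0.04, 0.5):
        result = sequential_mc_test(lambda: rng.random() < ideal_p, 0.05)
        print(ideal_p, result)
```

Hypotheses far from the threshold stop after few resamples, while those close to it consume most of the budget and may remain undecided. That is why tighter sequential intervals translate directly into computational savings.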

Table of Contents

1 Background and Literature Review
1.1 Testing Tree-Structured Hypotheses in Microbiome Data
1.2 Resampling-Based Multiple Testing and Monte Carlo Error
1.3 Sequential Resampling-Based Multiple Testing

2 A Bottom-up Approach to Testing Hypotheses That Have a Branching Tree Dependence Structure, with False Assignment Rate Control
2.1 Introduction
2.2 Methods
2.2.1 Preliminaries
2.2.2 Bottom-up Testing
2.2.3 Testing to Control FAR: Unweighted Proposal
2.2.4 Testing to Control FAR: Weighted Procedure
2.2.5 Bottom-up Testing on Incomplete Trees
2.2.6 Bottom-up Testing with Separate FAR Control
2.3 Simulation Studies
2.3.1 Error Rates
2.3.2 Accuracy and Pinpointing Driver Nodes
2.4 IBD Data
2.5 Discussion

3 Controlling Type-I Monte Carlo Error Rate in Resampling-Based Multiple Testing
3.1 Introduction
3.2 Methods
3.2.1 The Empirical Strength Probability (ESP) Approach
3.2.2 The Two-Step Approach
3.3 Simulation Studies
3.3.1 Setup
3.3.2 Simulation Results
3.4 Application to Prostate Cancer Data
3.5 Discussion

4 Sequential Resampling-Based Multiple Testing Procedure That Controls Monte Carlo Error Rate
4.1 Introduction
4.2 Methods
4.2.1 Duality Between Sequential Testing and Confidence Set
4.2.2 Sequential Confidence Intervals Based on Group Sequential Approaches
4.2.3 Step-Wise Procedure to Control Family-Wise MCER
4.3 Simulation Studies
4.3.1 Setup
4.3.2 Simulation Results
4.4 Application to Prostate Cancer Data
4.5 Discussion

5 Summary and Future Directions

A Appendix for Chapter 2
B Appendix for Chapter 3
C Appendix for Chapter 4

Bibliography

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Biostatistics
Subfield / Discipline	Biostatistics - MPH & MSPH
Degree	Ph.D.
Submission	Dissertation
Language	English
Research Field	Statistics Biology, Biostatistics
Keyword	multiple testing; resampling; false discovery rate
Committee Chair / Thesis Advisor	Yijuan Hu, Emory University; Tianwei Yu, Emory University; Glen A. Satten, Centers for Disease Control and Prevention
Committee Members	Zhaohui Qin, Emory University

Primary PDF

Title	Date Uploaded
Development of Statistical Methods for Multiple-Hypotheses Testing	2019-11-13 23:17:13 -0500
