Development of Statistical Methods for Multiple-Hypotheses Testing
Li, Yunxiao (Fall 2019)
Abstract
In this dissertation, I develop three novel statistical methods for solving multiple-hypotheses testing problems.
In the first topic, we propose a bottom-up approach to testing hypotheses that have a tree-structured dependence. Our motivating example comes from testing the associations between a trait of interest and groups of microbes that have been organized into operational taxonomic units (OTUs). Given p-values from association tests for each individual OTU, we would like to know whether a certain species, genus, or higher taxonomic group can be declared to be associated with the trait. This problem calls for a bottom-up testing algorithm that starts at the lowest level of the tree (OTUs) and proceeds upward through successively higher taxonomic groups (species, genus, family, etc.). We develop such a bottom-up testing algorithm that controls the error rate of decisions made at higher levels in the tree, conditioning on findings at lower levels in the tree. We further show that our algorithm controls the false discovery rate under the global null hypothesis that no taxa are associated with the trait. By simulation, we also show that our approach is better at finding driver taxa, the highest-level taxa below which there are dense association signals. We illustrate our approach using data from a study of the microbiome among patients with ulcerative colitis and healthy controls.
In the second topic, we consider resampling-based multiple testing problems. In the multiple testing literature, the standard procedures for correcting for multiplicity, such as the Benjamini and Hochberg (1995) procedure, usually require knowledge of the ideal p-values. In many biological applications such as microbiome studies, the ideal p-values cannot be computed analytically but can only be approximated by resampling methods. Resampling-based p-values coupled with a multiplicity-correction procedure generally produce different lists of rejections when the resampling algorithm is initiated with different random seeds, and hence lack reproducibility. The existing method of Gandy and Hahn (2014, 2016), which aims to control the Monte Carlo (MC) error rate (i.e., the rate of disagreement with the decisions based on ideal p-values), is extremely conservative and tends to make no rejections. We focus on the type-I MC error, which occurs when we reject a hypothesis that should be accepted according to the ideal decisions (in other words, the rejection is not reproducible). We develop a two-step algorithm based on resampling replicates to make decisions while controlling the type-I MC error rate. Through extensive simulation studies, we demonstrate substantial power improvement compared to the existing method.
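The seed dependence described above can be illustrated with a small sketch. This is not the two-step algorithm of the dissertation; it only pairs a textbook Benjamini-Hochberg step-up rule with simulated Monte Carlo p-values. The function names, the `ideal_p` values, and the assumption that each resample exceeds the observed statistic independently with probability equal to the ideal p-value are all illustrative choices, not taken from the source.

```python
import random


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose ordered p-value is below its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])


def mc_p_values(ideal_p, n_resamples, seed):
    """Simulate resampling-based p-values (hypothetical helper).

    Assumption: each resampled statistic exceeds the observed one
    independently with probability equal to the ideal p-value.
    """
    rng = random.Random(seed)
    estimates = []
    for p in ideal_p:
        exceed = sum(rng.random() < p for _ in range(n_resamples))
        # The +1 correction keeps the estimate strictly positive.
        estimates.append((exceed + 1) / (n_resamples + 1))
    return estimates


if __name__ == "__main__":
    # Made-up ideal p-values, several of them close to the BH boundary.
    ideal_p = [0.001, 0.004, 0.009, 0.012, 0.018, 0.03, 0.2, 0.5, 0.7, 0.9]
    print("ideal decisions:", benjamini_hochberg(ideal_p))
    for seed in (1, 2, 3):
        p_hat = mc_p_values(ideal_p, n_resamples=200, seed=seed)
        print("seed", seed, "->", benjamini_hochberg(p_hat))
```

With few resamples, hypotheses whose ideal p-values sit near the BH boundary may be rejected under one seed and accepted under another, which is the kind of disagreement the type-I MC error is meant to capture.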
In the third topic, we propose a class of algorithms for sequential resampling-based multiple testing. Resampling-based tests are known to be computationally intensive. Sequential algorithms provide efficient and accurate estimation of ideal p-values in resampling-based tests by allowing resampling to stop early once the evidence suggests that a hypothesis should be classified into the rejection or acceptance region. However, most existing sequential methods in this field (e.g., Sandve et al. (2011)) cannot guarantee reproducibility of test decisions. The only sequential methods that address this issue were developed by Gandy and Hahn (2014, 2016). We develop novel sequential testing algorithms by incorporating a step-wise decision process and improved sequential confidence intervals. The performance of the proposed methods is assessed using both synthetic and real data.
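The early-stopping idea can be sketched for a single hypothesis with a fixed p-value threshold. This is a deliberately crude illustration, not the dissertation's procedure: it replaces the group-sequential confidence intervals and the step-wise multiple-testing decisions with a Hoeffding interval and a union bound over interim looks. The function name, the `draw_exceedance` callback, and all parameter defaults are assumptions made for this example.

```python
import math
import random


def sequential_mc_decision(draw_exceedance, threshold,
                           max_resamples=10000, batch=100, error=0.01):
    """Resample in batches and stop once a confidence interval for the
    ideal p-value lies entirely on one side of `threshold`.

    draw_exceedance: hypothetical callback returning 1 if a fresh resampled
        statistic is at least as extreme as the observed one, else 0.
    Returns (decision, number_of_resamples_used).
    """
    n_looks = math.ceil(max_resamples / batch)
    delta = error / n_looks  # union bound across all interim looks
    exceed = 0
    n = 0
    while n < max_resamples:
        step = min(batch, max_resamples - n)
        exceed += sum(draw_exceedance() for _ in range(step))
        n += step
        p_hat = exceed / n
        # Hoeffding half-width: P(|p_hat - p| >= half) <= delta.
        half = math.sqrt(math.log(2 / delta) / (2 * n))
        if p_hat + half < threshold:
            return "reject", n       # p-value confidently below threshold
        if p_hat - half > threshold:
            return "accept", n       # p-value confidently above threshold
    return "undecided", n


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated hypothesis with ideal p-value 0.3, tested at threshold 0.05.
    decision, used = sequential_mc_decision(
        lambda: int(rng.random() < 0.3), threshold=0.05)
    print(decision, used)
```

Hypotheses whose ideal p-values are far from the threshold are settled after few resamples, while those near the threshold consume the full budget and may remain undecided. In a multiple-testing setting the threshold itself depends on the other hypotheses, which is one reason a step-wise procedure is needed there.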
Table of Contents
1 Background and Literature Review
1.1 Testing Tree-Structured Hypotheses in Microbiome Data
1.2 Resampling-Based Multiple Testing and Monte Carlo Error
1.3 Sequential Resampling-Based Multiple Testing
2 A Bottom-up Approach to Testing Hypotheses That Have a Branching Tree Dependence Structure, with False Assignment Rate Control
2.1 Introduction
2.2 Methods
2.2.1 Preliminaries
2.2.2 Bottom-up Testing
2.2.3 Testing to Control FAR: Unweighted Proposal
2.2.4 Testing to Control FAR: Weighted Procedure
2.2.5 Bottom-up Testing on Incomplete Trees
2.2.6 Bottom-up Testing with Separate FAR Control
2.3 Simulation Studies
2.3.1 Error Rates
2.3.2 Accuracy and Pinpointing Driver Nodes
2.4 IBD Data
2.5 Discussion
3 Controlling Type-I Monte Carlo Error Rate in Resampling-Based Multiple Testing
3.1 Introduction
3.2 Methods
3.2.1 The Empirical Strength Probability (ESP) Approach
3.2.2 The Two-Step Approach
3.3 Simulation Studies
3.3.1 Setup
3.3.2 Simulation Results
3.4 Application to Prostate Cancer Data
3.5 Discussion
4 Sequential Resampling-Based Multiple Testing Procedure That Controls Monte Carlo Error Rate
4.1 Introduction
4.2 Methods
4.2.1 Duality Between Sequential Testing and Confidence Set
4.2.2 Sequential Confidence Intervals Based on Group Sequential Approaches
4.2.3 Step-Wise Procedure to Control Family-Wise MCER
4.3 Simulation Studies
4.3.1 Setup
4.3.2 Simulation Results
4.4 Application to Prostate Cancer Data
4.5 Discussion
5 Summary and Future Directions
A Appendix for Chapter 2
B Appendix for Chapter 3
C Appendix for Chapter 4
Bibliography
Primary PDF
Title | Date Uploaded |
---|---|
Development of Statistical Methods for Multiple-Hypotheses Testing | 2019-11-13 23:17:13 -0500 |