Novel Statistical Methods for Analyzing Gene Expression Data
Restricted; Files Only
Dai, Qile (Spring 2025)
Abstract
Gene expression analysis lies at the center of modern biomedical research, serving as a bridge between genetic variation, molecular function, and disease mechanisms. However, the rapid expansion of gene expression datasets—spanning bulk, single-cell, and spatial transcriptomics—has exposed critical limitations in analytical flexibility, statistical rigor, and biological interpretability. This dissertation addresses these challenges through three contributions: two methodological innovations and one integrative application.
The first contribution is OTTERS, a flexible transcriptome-wide association study (TWAS) framework that leverages summary-level eQTL data, reducing reliance on individual-level reference datasets and enabling broader, more powerful TWAS analyses. OTTERS was evaluated using both simulated data and real-world applications with eQTLGen summary statistics (n = 31,684). In both settings, OTTERS demonstrated improved power in identifying disease-associated genes, outperforming existing methods such as FUSION, which require individual-level reference panels.
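Section 2.2.1.3 in the table of contents names an ACAT-O test as the source of the OTTERS p-value. As a rough illustration of how a Cauchy combination test of this kind merges several p-values into one, here is a minimal sketch. The function name `acat_combine`, the default equal weights, and the input values are assumptions made for illustration; this is not the dissertation's implementation.

```python
import math


def acat_combine(p_values, weights=None):
    """Combine p-values with a Cauchy combination (ACAT-style) test.

    Hypothetical helper for illustration only. Each p-value is mapped to a
    standard Cauchy variate, the variates are averaged with non-negative
    weights, and the average is mapped back to a p-value.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")

    n = len(p_values)
    if weights is None:
        weights = [1.0] * n  # assumption: equal weights
    if len(weights) != n or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative, one per p-value")

    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not all be zero")

    # Weighted average of Cauchy-transformed p-values.
    stat = sum(
        (w / total) * math.tan((0.5 - p) * math.pi)
        for w, p in zip(weights, p_values)
    )
    # Upper tail probability of a standard Cauchy distribution at `stat`.
    return 0.5 - math.atan(stat) / math.pi


if __name__ == "__main__":
    # Made-up p-values, e.g. one per gene expression imputation model.
    print(acat_combine([0.01, 0.5]))
```

With a single input the function returns that p-value unchanged, and one small p-value among several large ones still yields a small combined p-value, which is why this style of test suits combining results from several candidate models.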
The second contribution is STACCato, a tensor-based regression method for detecting cell–cell communication (CCC) events that differ across biological conditions (e.g., disease status) using single-cell RNA-seq data. Unlike existing approaches, STACCato adjusts for confounding variables (e.g., batch effects and demographics) and models dependencies among CCC events, enabling statistically rigorous inference. When applied to lupus (n = 193), autism (n = 23), and simulated datasets, STACCato consistently outperformed alternative methods, demonstrating both robustness and practical utility.
The third contribution applies STACCato to examine how CCC varies across cortical layers of the brain in Alzheimer’s disease dementia (ADD), using spatially annotated single-nucleus RNA-seq data from the ROS/MAP cohort. By incorporating spatial structure, this analysis reveals microglial dysregulation specifically enriched in cortical Layer 5, highlighting the importance of spatial context in understanding neurodegenerative disease mechanisms.
Together, these contributions advance gene expression analysis by introducing new computational methods and demonstrating their utility across diverse data types and biological contexts, enabling deeper biological insights and facilitating discovery in genomics and precision medicine.
Table of Contents
1 Introduction
1.1 Overview
1.1.1 The role of gene expression analysis in modern biomedical research
1.1.2 Evolution of gene expression profiling technologies
1.1.3 Utilities of gene expression data for studying complex human diseases and drug discovery
1.1.4 Limitations of existing analytical approaches for studying gene expression data
1.1.5 Contributions of this dissertation
1.2 Outline of Research
2 OTTERS: a Powerful TWAS Framework Leveraging Summary-level Reference Data
2.1 Introduction
2.1.1 Introduction to TWAS
2.1.2 Limitations of previous TWAS methods
2.1.3 Introduction to OTTERS
2.1.4 Overview of Chapter 2
2.2 Methods and Materials
2.2.1 Overview of OTTERS framework
2.2.1.1 Traditional two-stage TWAS analysis
2.2.1.2 TWAS Stage I analysis using summary-level reference data
2.2.1.3 OTTERS p-value by ACAT-O Test
2.2.2 Simulation settings
2.2.3 Real data applications
2.2.3.1 GTEx V8 dataset
2.2.3.2 eQTLGen consortium dataset
2.2.3.3 UKBB GWAS data of cardiovascular disease
2.3 Results
2.3.1 Simulation results
2.3.2 GReX imputation accuracy in GTEx V8 blood samples
2.3.3 TWAS of cardiovascular disease
2.3.4 Computational time
2.4 Discussion
2.5 Appendix
2.5.1 Data and materials availability
2.5.2 Tables
2.5.3 Figures
3 STACCato: Identifying condition-related cell-cell communication events using supervised tensor analysis
3.1 Introduction
3.1.1 Introduction to cell-cell communication
3.1.2 Limitation of previous methods to identify condition-related CCC
3.1.3 Introduction to STACCato
3.1.4 Overview of Chapter 3
3.2 Methods and Materials
3.2.1 STACCato framework
3.2.1.1 Overview of STACCato framework
3.2.1.2 Generation of communication scores
3.2.1.3 Regression models for identifying condition-related CCC events
3.2.1.4 Tensor-based regression for joint inference of condition-related CCC events
3.2.1.5 Accounting for dependencies among all CCC events
3.2.2 Applying STACCato to SLE and ASD scRNA-seq datasets
3.2.2.1 scRNA-seq dataset of SLE patients and controls
3.2.2.2 scRNA-seq dataset of ASD patients and controls
3.3 Results
3.3.1 Applying STACCato to identify CCC events associated with SLE
3.3.2 Applying STACCato to identify CCC events associated with ASD
3.3.3 Evaluating the impact of confounding variables for identifying condition-related CCC events
3.3.4 Comparing STACCato to the separate regression approach
3.3.5 Simulation studies
3.3.6 Computational considerations
3.4 Discussion
3.5 Appendix
3.5.1 Supplementary methods
3.5.1.1 Rank selection by STACCato
3.5.1.2 QR-adjusted optimization algorithm used by STACCato
3.5.2 Data and materials availability
3.5.3 Tables
3.5.4 Figures
4 Cell-Cell Communication Patterns in Alzheimer’s Disease Dementia and Mild Cognitive Impairment Vary by Cortical Layers
4.1 Introduction
4.1.1 Previous studies on the molecular and cellular characterization of ADD
4.1.2 The importance of spatial context in understanding ADD pathology
4.1.3 Overview of Chapter 4
4.2 Methods and Materials
4.2.1 ROS/MAP snRNA-seq data
4.2.2 Cortical layer inference for ROS/MAP snRNA-seq data using CeLEry
4.2.3 STACCato analysis of ADD- and MCI-related CCC events
4.2.3.1 snRNA-seq data preprocessing
4.2.3.2 Infer CCC strengths using LIANA+
4.2.3.3 STACCato framework for identifying CCC events
4.2.4 Comparison of CCC events across layers
4.3 Results
4.3.1 Cell distributions across inferred cortical layers in the ROS/MAP snRNA-seq data
4.3.2 Detection of ADD-related CCC events
4.3.3 Top significant ADD-related CCC events
4.3.4 Unique CCC patterns positively associated with ADD risk in Layer 5 involving microglia
4.3.5 MCI-related CCC events with consistent effects across cortical layers and WM
4.3.6 Comparison of ADD and MCI effects on CCC events
4.4 Discussion
4.5 Appendix
4.5.1 Data and materials availability
4.5.2 Tables
4.5.3 Figures
5 Summary and Future Directions
Bibliography
About this Dissertation
School | |
---|---|
Department | |
Subfield / Discipline | |
Degree | |
Submission | |
Language | |
Research Field | |
Keyword | |
Committee Chair / Thesis Advisor | |
Committee Members | |

Primary PDF
Thumbnail | Title | Date Uploaded | Actions |
---|---|---|---|
(no thumbnail) | File download under embargo until 22 May 2031 | 2025-04-22 14:34:44 -0400 | File download under embargo until 22 May 2031 |