Statistical Modeling and Learning in Single Cell RNA Sequencing Data Open Access

Su, Kenong (Spring 2021)


Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool for exploring biology at the resolution of the individual cell, the basic unit of life. It has deepened our understanding of biological questions involving cell populations, gene regulation, and cellular transcriptional states. It also opens the door to investigating complex biological systems such as brain regions and immune responses. Furthermore, it has led to the discovery of new and rare cell types, which benefits the discovery of drug targets and the decoding of disease etiologies in clinical studies.

Although researchers are encouraged by the success of scRNA-seq, difficulties remain with respect to data analysis. Specifically, in scRNA-seq gene expression profiles, the sparsity caused by excessive zero counts, the heterogeneity across and within cell types, and confounding batch effects together make analysis challenging. To address these concerns, we developed algorithms and pipelines for different aspects of scRNA-seq data analysis.

With the advance of high-throughput techniques, it is now possible to perform transcriptome sequencing experimentally for a massive number of cells. To facilitate the analysis of large-scale scRNA-seq data, one commonly performed task is cell clustering, which enables the quantitative characterization of cell types. An essential step in scRNA-seq clustering is to select a set of the most representative genes (referred to as “features”) whose expression patterns are used to cluster the cells. Currently, almost all existing scRNA-seq clustering tools include a simple unsupervised feature selection step (e.g., based on statistical moments of the gene-wise expression distribution) and use an arbitrary number (e.g., 1000) of top-ranked features for clustering. A more rigorous approach to feature selection is therefore warranted. We created an algorithm named FEAture SelecTion (FEAST), specifically designed for selecting the most informative genes in the context of scRNA-seq clustering. We demonstrated that applying FEAST can significantly improve cell clustering accuracy and that it outperforms the feature selection methods embedded in state-of-the-art scRNA-seq clustering methods such as Seurat and SC3.
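As a rough illustration of label-guided feature ranking, the sketch below scores each gene with a one-way ANOVA F-statistic and keeps the top-ranked genes. It assumes that preliminary cluster labels are already available (for example, from an initial consensus clustering). The function names `f_statistic` and `rank_features`, the dictionary input format, and the toy values are hypothetical and chosen for illustration only; this is not the FEAST implementation.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one gene's expression across cell groups."""
    # Collect the expression values of each (preliminary) cluster.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(float(value))

    n_groups, n_cells = len(groups), len(values)
    if n_groups < 2 or n_cells <= n_groups:
        raise ValueError("need at least two groups and more cells than groups")

    grand_mean = mean(float(v) for v in values)
    group_means = {label: mean(vals) for label, vals in groups.items()}

    # Between-group and within-group sums of squares.
    ss_between = sum(
        len(vals) * (group_means[label] - grand_mean) ** 2
        for label, vals in groups.items()
    )
    ss_within = sum(
        (v - group_means[label]) ** 2
        for label, vals in groups.items()
        for v in vals
    )

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_cells - n_groups)
    if ms_within == 0:
        # No within-group variation: perfectly separating or constant gene.
        return float("inf") if ms_between > 0 else 0.0
    return ms_between / ms_within


def rank_features(expr, labels, n_top):
    """Return the n_top gene names with the largest F-statistics.

    expr maps a gene name to its per-cell expression values (hypothetical format).
    """
    scores = {gene: f_statistic(vals, labels) for gene, vals in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


# Toy data: four cells in two preliminary clusters (made-up values).
labels = [0, 0, 1, 1]
expr = {
    "geneA": [1, 3, 5, 7],  # differs between clusters
    "geneB": [1, 2, 1, 2],  # same mean in both clusters
    "geneC": [0, 0, 0, 0],  # never expressed
}
top_genes = rank_features(expr, labels, n_top=1)
```

In this toy example only `geneA` separates the two clusters, so it is ranked first. A ranking of this kind depends on the quality of the preliminary labels, which is why an initial clustering step matters.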

Furthermore, determining the sample size needed for adequate power to detect statistical significance is a crucial step at the design stage of high-throughput experiments. Because of the sparse and heterogeneous characteristics of scRNA-seq data, few tools have been explicitly designed to address this question for scRNA-seq experiments. We developed the POWSC pipeline, a simulation-based approach that provides power evaluation and sample size estimation in the context of differential expression (DE) analysis. POWSC provides a variety of power evaluations, including stratified and marginal power analyses for DE genes characterized by two forms (phase transition or magnitude tuning), under different comparison scenarios. Additionally, we devised the POWCLUST workflow as an extension of POWSC with a focus on assessing power for clustering. POWCLUST is able to recover the underlying information on cell type hierarchies and cell type proportions given a proper sample size estimate.
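The general idea of simulation-based power evaluation can be sketched as follows: simulate two groups of cells for many genes with a known expression difference, test each gene, and report the fraction of genes detected. The sketch makes several simplifying assumptions that are not taken from POWSC: expression is modeled as a zero-inflated Gaussian on a log-like scale, only the mean among expressing cells is shifted, a normal-approximation two-sample test is used, and no multiple-testing correction is applied. All function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def simulate_group(rng, n_cells, mu, zero_frac):
    """Simulate one gene in one group: zero with probability zero_frac, else N(mu, 1)."""
    return [
        0.0 if rng.random() < zero_frac else rng.gauss(mu, 1.0)
        for _ in range(n_cells)
    ]


def two_sided_p(x, y):
    """Two-sample test with unequal variances, using a normal approximation."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    if se == 0:
        return 1.0  # no variation at all, so nothing can be detected
    z = (mean(x) - mean(y)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def estimate_power(n_cells, delta, zero_frac=0.3, n_genes=200, alpha=0.05, seed=0):
    """Fraction of simulated DE genes (mean shift delta) with p-value below alpha."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_genes):
        x = simulate_group(rng, n_cells, 2.0, zero_frac)
        y = simulate_group(rng, n_cells, 2.0 + delta, zero_frac)
        if two_sided_p(x, y) < alpha:
            detected += 1
    return detected / n_genes


# Scan candidate sample sizes (cells per group) for a fixed effect size.
power_by_size = {n: estimate_power(n, delta=0.5) for n in (20, 50, 100, 200)}
```

Scanning `power_by_size` for the smallest number of cells that reaches a target power (for example 0.8) gives a simple sample size recommendation under these assumptions. A realistic workflow would estimate the simulation parameters from pilot data instead of fixing them by hand.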

Overall, I designed new algorithms and pipelines, including FEAST and POWSC, for accurately selecting features and adequately evaluating power in scRNA-seq studies. We show that FEAST helps identify more representative genes and that POWSC can serve as a guideline for scRNA-seq experimental design.

Table of Contents

1 Introduction 

1.1 Single-Cell RNA sequencing and challenges  

1.2 ScRNA-seq clustering analysis

1.3 ScRNA-seq power evaluation

1.4 Outline

2 FEAST: Accurate Feature Selection Improves Single Cell RNA-seq Cell Clustering

2.1 Introduction

2.1.1 Feature selection in scRNA-seq cell clustering  

2.1.2 Feature evaluation in scRNA-seq cell clustering  

2.2 Method

2.2.1 Preprocess and normalization  

2.2.2 The consensus clustering

2.2.3 Gene-level significance inference

2.2.4 Determine the optimized feature set

2.3 Result  

2.3.1 Overview of FEAST  

2.3.2 Datasets  

2.3.3 Consensus clustering improves the signal   

2.3.4 FEAST selects features better than other unsupervised approaches

2.3.5 FEAST optimizes the feature set through validation

2.3.6 FEAST improves the clustering accuracy

2.3.7 Test FEAST on larger datasets

2.4 Discussion

3 POWSC: Simulation, Power Evaluation, and Sample Size Recommendation for Single Cell RNA-seq

3.1 Introduction  

3.2 Method  

3.2.1 Parameter estimator  

3.2.2 Data simulator  

3.2.3 Power assessor

3.3 Result  

3.3.1 Overview  

3.3.2 POWSC accurately simulates scRNA-seq data  

3.3.3 POWSC provides recommended sample size for two-group comparison

3.3.4 POWSC provides recommended sample size for cross cell type comparisons

3.3.5 POWSC offers a strategy to balance sample size and sequencing depth  

3.3.6 POWSC handles the perturbation of cell compositions  

3.3.7 Extend POWSC to the context of clustering  

3.4 Discussion

4 Besides scRNA-seq: the design of iPath

4.1 Introduction

4.2 Method  

4.2.1 Data sources

4.2.2 Overview of the iPath approach

4.2.3 Calculation of iESs  

4.2.4 Definition of perturbed tumor samples  

4.2.5 Performance comparison among sample-level gene set analysis methods

4.3 Result  

4.3.1 Overview  

4.3.2 Identifying perturbed pathways  

4.3.3 Identifying prognostic biomarker pathways  

4.3.4 Pan-cancer view on prognostic biomarker pathways identified

4.3.5 Selected prognostic biomarker pathways identified

4.3.6 Links to distinct patterns shown in pathology imaging

4.3.7 Comparison with GSEA  

4.3.8 Comparison with other sample-level gene set analysis methods

4.3.9 Comparison with the Human Pathology Atlas  

4.3.10 Connection with the mutations in cancer driver genes  

4.4 Discussion

5 Future research plans

5.1 Single-cell multi-omics data integration

5.2 Single-cell spatial transcriptomics   

Appendix A Appendix for Chapter 2   

A.1 Test datasets  

A.2 Test datasets with large number of cells

A.3 The comparison of F-statistics distributions   

A.4 Compare FEAST to other feature selection approaches  

A.5 Top features selected by CV and Kurtosis  

A.6 Feature set validation by TSCAN clustering outcomes  

A.7 Features selected by FEAST improve the clustering accuracy for TSCAN, SIMLR, and SHARP

A.8 Computational performance of FEAST

Appendix B Appendix for Chapter 3

B.1 Power analysis for Form II DE with respect to zero fractions

B.2 Multiple cell types in Glioblastoma

B.3 Multiple cell types in another Glioblastoma (GSE84465)

B.4 Test on 10X platform

Appendix C Appendix for Chapter 4

C.1 The comparison between iPath and Human Pathology Atlas (HPA) on KIRC cancer type

C.2 Selected prognostic C2 and GO pathways in BRCA

C.3 Test iPath on negative-control genesets



About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
  • English