Cell type identification in single-cell genomics and its applications Open Access
Ma, Wenjing (Spring 2023)
Abstract
Advances in techniques for measuring genomic features at single-cell resolution provide great opportunities to uncover cellular heterogeneity. Beginning with single-cell RNA-sequencing (scRNA-seq), which measures transcriptomic information, single-cell techniques have since expanded to encompass epigenomic modalities as well. Among the scientific goals of single-cell genomics studies, precise cell type identification (celltyping) is a fundamental and crucial step in analyzing single-cell genomics data. Supervised celltyping methods have become increasingly popular due to their superior accuracy, robustness, and efficiency. This dissertation focuses primarily on the development and application of supervised celltyping methods.
The dissertation begins by evaluating key factors for supervised celltyping methods developed for scRNA-seq data. Based on extensive real data analyses, we suggest combining all individuals from available datasets to construct the reference dataset and using the multi-layer perceptron (MLP) as the classifier, with the F-test as the feature selection method. This benchmark study not only offers insights and suggestions for method developers but also lays the groundwork for our subsequent research.
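As a rough illustration of the feature selection step recommended above, the following minimal sketch ranks genes by a one-way ANOVA F statistic computed across the cell types of a reference dataset and keeps the top-scoring ones. It is not the dissertation's implementation: the function names (`f_statistic`, `select_top_genes`), the toy expression matrix, and the gene and cell type labels are all hypothetical, and the MLP classifier that would be trained on the selected genes is not shown.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene across cell-type groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0  # F is undefined without at least 2 groups and spare degrees of freedom
    grand_mean = mean(values)
    between = 0.0  # between-group sum of squares
    within = 0.0   # within-group sum of squares
    for group in groups.values():
        group_mean = mean(group)
        between += len(group) * (group_mean - grand_mean) ** 2
        within += sum((v - group_mean) ** 2 for v in group)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_top_genes(expr, labels, gene_names, top_k):
    """Rank genes by F statistic (descending) and return the top_k gene names."""
    scores = {
        gene: f_statistic([row[j] for row in expr], labels)
        for j, gene in enumerate(gene_names)
    }
    return sorted(gene_names, key=lambda g: scores[g], reverse=True)[:top_k]


# Hypothetical toy reference: 6 cells (rows) x 3 genes (columns).
expr = [
    [5.0, 1.0, 2.0],
    [6.0, 1.2, 2.1],
    [5.5, 0.9, 1.9],
    [0.5, 1.1, 2.0],
    [0.4, 1.0, 2.2],
    [0.6, 0.8, 1.8],
]
labels = ["T cell"] * 3 + ["B cell"] * 3
gene_names = ["geneA", "geneB", "geneC"]

selected = select_top_genes(expr, labels, gene_names, top_k=1)
print(selected)
```

In this toy example only `geneA` differs strongly between the two groups, so it receives the largest F statistic. In practice the selected genes would define the input features for the classifier, and the number of genes kept is a tuning choice.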
We then developed Cellcano, a novel computational method with open-source software designed specifically for single-cell chromatin accessibility profiling (scATAC-seq). Cellcano is based on a two-round supervised learning algorithm and provides significantly improved accuracy, robustness, and computational efficiency compared to existing tools. We have also explored using scRNA-seq data as the reference for supervised celltyping and data integration of scATAC-seq data.
Upon accurate identification of distinct cell types, specific markers unique to each cell type can be extracted to enable diverse applications and downstream analyses. Based on cell-type-specific marker genes, we developed a method named LRcell to identify cellular activities associated with psychiatric disorders.
The computational and statistical methods employed in this dissertation are designed to provide a comprehensive understanding of cell-type-specificity. We anticipate that this research will contribute to the understanding of cellular functions in biological mechanisms and disease progression, potentially providing valuable insights for biomedical researchers.
Table of Contents
1 Introduction
1.1 Single-cell genomics
1.2 Supervised celltyping in single-cell genomics
1.3 Applications after accurate celltyping
1.4 Outline
2 Evaluating key factors of supervised celltyping for scRNA-seq data
2.1 Introduction
2.2 Factors under evaluation
2.2.1 The choice of prediction model
2.2.2 Choice of predictive features
2.2.3 Datasets
2.2.4 Evaluation metrics
2.3 Results
2.3.1 F-test on reference datasets along with MLP achieves the best overall performance
2.3.2 Impact of data preprocessing
2.3.3 Condition effect
2.3.4 Pooling references improves the prediction results
2.3.5 Purifying references does not improve the prediction results
2.3.6 Computational performance
2.4 Discussion
3 Cellcano: a supervised celltyping method for scATAC-seq
3.1 Introduction
3.2 Methods
3.2.1 Cellcano model
3.2.2 The Knowledge Distiller model
3.3 Results
3.3.1 The Cellcano framework
3.3.2 The choice of using gene score as input
3.3.3 Properties of Cellcano anchors
3.3.4 Cellcano outperforms existing supervised scATAC-seq celltyping methods
3.3.5 Cellcano works better than prediction with batch effect removed
3.3.6 Cellcano is computationally efficient and scalable
3.4 Discussion
4 Applications of cell-type-specificity
4.1 Integration of single-cell RNA-seq and ATAC-seq data
4.1.1 Introduction
4.1.2 Methods
4.1.3 Results
4.1.4 Discussion
4.2 LRcell: detecting the source of differential expression at the sub-celltype level from bulk RNA-seq data
4.2.1 Introduction
4.2.2 Methods
4.2.3 Results
4.2.4 Discussion
5 Discussion
5.1 Conclusions
5.2 Future research plan
Appendix A Appendix for Chapter 2
A.1 Performance Gain / Loss Calculation
A.2 Data Pre-processing
A.2.1 Analysis details for comparing condition effects
A.2.2 Analysis details for comparing pooling effect
A.2.3 Datasets for pooling saturation analysis
A.2.4 Analysis details for purifications
A.3 Analysis details on pooling saturation
A.4 Number of features has an impact on performance
Appendix B Appendix for Chapter 3
B.1 Data preprocessing by ArchR
B.2 An introduction to different ArchR gene score models
B.3 Majority voting strategy
B.4 Details on dataset processing
Appendix C Appendix for Chapter 4
C.1 Bulk RNA-seq data preprocessing
C.2 DEGs detection using DESeq2
C.3 scRNA-seq data preprocessing
C.4 MSigDB marker genes
C.5 Differentially expressed genes (DEGs) detection using Limma-Voom
Bibliography
Primary PDF
| Title | Date Uploaded |
|---|---|
| Cell type identification in single-cell genomics and its applications | 2023-04-28 16:08:56 -0400 |