EDClust: An EM-MM hybrid method for cell clustering in population-level single cell RNA sequencing Open Access

Wei, Xin (Spring 2021)

Permanent URL: https://etd.library.emory.edu/concern/etds/cn69m5362?locale=en
Published

Abstract

Single-cell RNA sequencing (scRNA-seq) technology has revolutionized genomics research by enabling the measurement of transcriptomic profiles at the level of single cells. One of the most fundamental problems in scRNA-seq data analysis is cell clustering, for which a large number of methods have been developed. As scRNA-seq is increasingly applied in larger-scale studies, researchers face the problem of clustering cells when the scRNA-seq data come from more than one subject. One challenge in analyzing such data is subject-specific systematic variation: heterogeneity across subjects may have a significant impact on clustering accuracy. However, existing methods that address such effects suffer from several limitations. In this work, we develop a novel statistical method named ‘EDClust’ for scRNA-seq cell clustering when data are from multiple subjects. EDClust models the sequence read counts by a mixture of Dirichlet-Multinomial distributions, and explicitly accounts for cell type heterogeneity, subject heterogeneity, and clustering uncertainty. An EM-MM hybrid algorithm is derived for maximizing the data likelihood and clustering the cells. We perform a series of simulation studies to evaluate the proposed method and demonstrate the strong performance of EDClust. Comprehensive benchmarking on four real scRNA-seq datasets spanning various tissue types and species demonstrates a substantial accuracy improvement of EDClust over existing methods.
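As an illustration of the modeling idea named in the abstract, the following is a minimal sketch of a Dirichlet-Multinomial mixture: the log-probability of one cell's read counts under a cluster, and the resulting soft cluster assignment (the E-step quantity in an EM-type algorithm). It is not the thesis's implementation. The function names, the plain per-cluster parameter vectors, and the toy numbers are assumptions made for illustration, and the sketch omits both the subject-specific effects and the MM parameter update that EDClust uses.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of one cell's gene counts under a
    Dirichlet-Multinomial distribution with concentration vector alpha.
    (Hypothetical helper; the parameterization is an assumption.)"""
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient: n! / prod(x_g!)
    out = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of normalizing constants: Gamma(a0) / Gamma(n + a0)
    out += math.lgamma(a0) - math.lgamma(n + a0)
    # Per-gene terms: Gamma(x_g + alpha_g) / Gamma(alpha_g)
    out += sum(math.lgamma(x + a) - math.lgamma(a)
               for x, a in zip(counts, alpha))
    return out


def responsibilities(counts, weights, alphas):
    """Posterior probability that a cell belongs to each cluster, given
    mixing weights and one concentration vector per cluster."""
    logs = [math.log(w) + dm_log_pmf(counts, a)
            for w, a in zip(weights, alphas)]
    m = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two genes, two clusters with opposite expression profiles.
cell = [9, 1]
post = responsibilities(cell, [0.5, 0.5], [[9.0, 1.0], [1.0, 9.0]])
print(post)  # the first cluster receives the larger posterior weight
```

Retaining the posterior weights, rather than a hard label, is what lets a mixture model carry the clustering uncertainty mentioned in the abstract through to the parameter updates.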

Table of Contents

1. Introduction

2. Methods

2.1 Data model

2.2 The EM-MM hybrid algorithm for maximum likelihood

2.3 Feature selection

2.4 Determine the initial values

2.5 Software implementation

3. Simulation studies

4. Real data analyses

4.1 Mouse Retina dataset

4.2 Baron Pancreas dataset

4.3 Human Skin dataset

4.4 Mouse Lung dataset

4.5 Computational performance

5. Discussion

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
