K-mer Based Clustering for UMI Single Cell RNA-seq Data Open Access
Chen, Luxiao (Spring 2018)
Abstract
Motivation:
High-throughput RNA sequencing (RNA-seq) is a technology for quantifying gene expression. It
has been widely used in many areas of biological and clinical research. Traditional ("bulk")
RNA-seq operates on the mRNA from a large number of cells, so the measurement is
the average expression level across the input cells. For heterogeneous samples, bulk RNA-seq
cannot reveal how gene expression varies from cell to cell. Single cell RNA-seq
(scRNA-seq), which has emerged with recent technological developments, profiles expression
in each individual cell and thus provides information for understanding transcriptomic regulation
and variation at the cellular level. Analyzing scRNA-seq data poses a number of challenges,
and cell clustering is an important one.
Methods:
In this work, we study the possibility of using DNA sequence content (k-mer counts)
instead of gene expression values for cell clustering. We first discuss the relationship between
gene counts, UMI RNA-seq transcript counts, and k-mer counts by deriving mathematical
expressions for gene and k-mer counts in terms of transcript counts. We then use simulations to
demonstrate (1) the difficulty of clustering scRNA-seq data with low expression counts, in
particular counts from unique molecular identifiers (UMIs); (2) the potential advantage of using
gene or k-mer counts instead of transcript counts; and (3) how clustering results compare
between the gene count matrix and the k-mer count matrix.
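The idea of expressing k-mer counts in terms of transcript counts can be illustrated with a minimal sketch. It assumes, purely for illustration, that every counted transcript contributes all overlapping k-mers of its full sequence; the transcript names, toy sequences, and the helper functions `kmer_counts` and `kmer_profile` are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a single sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_profile(transcript_counts, transcript_seqs, k):
    """Turn one cell's transcript counts into k-mer counts.

    Each k-mer count is the sum, over transcripts, of
    (transcript count) x (occurrences of the k-mer in that transcript).
    """
    profile = Counter()
    for tid, n in transcript_counts.items():
        for kmer, c in kmer_counts(transcript_seqs[tid], k).items():
            profile[kmer] += n * c
    return profile


# Hypothetical toy data: two short transcripts and one cell's counts.
transcript_seqs = {"tx1": "ACGTAC", "tx2": "GTACCA"}
cell = {"tx1": 2, "tx2": 1}

profile = kmer_profile(cell, transcript_seqs, k=3)
total_transcripts = sum(cell.values())
total_kmers = sum(profile.values())
```

In this toy case the cell has 3 transcript counts in total but 12 k-mer counts, and k-mers shared by both transcripts (such as `GTA`) accumulate counts from each of them.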
Results:
We show that the gene/k-mer count matrix is a transformation of the UMI scRNA-seq transcript
count matrix. The transformation enlarges the values in the expression matrix but may lose the
alternative splicing information stored in transcript counts. Comparing the performance of the
gene count matrix and the k-mer count matrix at different signal-to-noise ratios (SNRs), we find
that long k-mers (k = 8, 9, 10) perform better than short k-mers (k = 5, 6, 7). However, at the
same SNR, the gene count matrix still performs better in most scenarios.
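The loss of alternative splicing information can be seen in a small sketch of the gene-count case. It assumes a gene count is simply the sum of the counts of that gene's transcripts; the isoform names, the mapping, and the helper `gene_counts` are hypothetical and serve only as an illustration.

```python
def gene_counts(transcript_counts, transcript_to_gene):
    """Sum transcript counts over all transcripts that belong to the same gene."""
    genes = {}
    for tid, n in transcript_counts.items():
        gene = transcript_to_gene[tid]
        genes[gene] = genes.get(gene, 0) + n
    return genes


# Hypothetical mapping: two isoforms of one gene.
transcript_to_gene = {"geneA_iso1": "geneA", "geneA_iso2": "geneA"}

# Two cells that use different isoforms of the same gene.
cell_1 = {"geneA_iso1": 3, "geneA_iso2": 0}
cell_2 = {"geneA_iso1": 0, "geneA_iso2": 3}

g1 = gene_counts(cell_1, transcript_to_gene)
g2 = gene_counts(cell_2, transcript_to_gene)
```

Both cells end up with the same gene count for `geneA`, so the difference in isoform usage that was visible at the transcript level is no longer recoverable after the transformation.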
Table of Contents
Introduction
1. Genome and transcriptome
2. High-throughput technology to measure gene expression
3. Single cell sequencing
4. K-mer
5. Relationship between gene counts, transcript counts and k-mer counts
5.1 Gene counts and k-mer counts – transformation of transcript counts
5.2 Disadvantage and potential advantage of gene/k-mer count
5.3 Characteristics of gene/k-mer count transformation
6. Purpose and content of this work
6.1 Purpose
6.2 Contents
Methods
1. General pipeline
2. Materials and software
2.1. Materials
2.2. Software and packages
3. Simulation data preparation
3.1 Generation of gene expression matrix
3.2 Estimation of sequences distribution on each gene
3.3 Selection of k-mer length
3.4 Generation of k-mer expression matrix
3.5 Clustering results comparison
4. Simulation methods description
4.1 Influence of low expression count to clustering results
4.2 Influence of summing up expression counts with same/opposite expression patterns in different cells to clustering results
4.3 Clustering results comparison with different parameters
Results
1. Influence of low expression count to clustering results
2. Influence of summing up expression counts with same/opposite expression patterns in different cells to clustering results
3. Clustering results comparison with different parameters
Discussion
References
Appendix
About this Master's Thesis
Primary PDF
Title | Date Uploaded |
---|---|
K-mer Based Clustering for UMI Single Cell RNA-seq Data | 2018-04-08 23:01:24 -0400 |