K-mer Based Clustering for UMI Single Cell RNA-seq Data Open Access

Chen, Luxiao (Spring 2018)

Permanent URL: https://etd.library.emory.edu/concern/etds/tx31qh70j?locale=pt-BR%2A
Published

Abstract

Motivation:

High-throughput RNA sequencing (RNA-seq) is a technology for quantifying gene expression and has been widely used in biological and clinical studies. Traditional (“bulk”) RNA-seq operates on the mRNA from a large number of cells, so the measurement is an average of the expression levels of the input cells. For heterogeneous samples, bulk RNA-seq therefore cannot provide detailed information on variation in gene expression. Single cell RNA-seq (scRNA-seq), which has emerged with recent technological developments, profiles expression in each individual cell and thus provides information for understanding transcriptomic regulation and variation at the cellular level. Analyzing scRNA-seq data poses a number of challenges, among which cell clustering is an important one.

Methods:

In this work, we study the possibility of using DNA sequence content (k-mer counts) instead of gene expression values for cell clustering. We first discuss the relationship between gene counts, UMI RNA-seq transcript counts and k-mer counts by giving mathematical expressions for gene and k-mer counts in terms of transcript counts. We then perform simulations to demonstrate (1) the difficulty of clustering scRNA-seq data with low expression counts, particularly data generated with unique molecular identifiers (UMIs); (2) the potential advantage of using gene or k-mer counts instead of transcript counts; and (3) how clustering results compare between the gene count matrix and the k-mer count matrix.
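As a rough illustration of what "k-mer counts" means here, the sketch below counts every overlapping k-mer in a handful of read sequences and pools them into one profile per cell. The function names and the toy reads are hypothetical and chosen only for illustration; the thesis's actual pipeline may count k-mers differently (for example, with different handling of ambiguous bases or strand).

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a single sequence.

    Sequences shorter than k yield an empty Counter.
    """
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cell_kmer_profile(reads, k):
    """Pool the k-mer counts of all reads assigned to one cell."""
    profile = Counter()
    for read in reads:
        profile.update(count_kmers(read, k))
    return profile


# Toy reads for one hypothetical cell (not data from the thesis).
reads_cell = ["ACGTAC", "GTACGT"]
profile = cell_kmer_profile(reads_cell, 3)
```

Stacking one such profile per cell, with k-mers as rows and cells as columns, gives a k-mer count matrix that can be passed to a clustering method in place of a gene count matrix.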

Results:

We show that the gene/k-mer count matrix is a transformation of the UMI scRNA-seq transcript count matrix. The transformation can enlarge the values in the expression matrix but may lose the alternative splicing information stored in transcript counts. Comparing the performance of the gene count matrix and the k-mer count matrix at different signal-to-noise ratios (SNRs), we found that long k-mers (k = 8, 9, 10) perform better than short k-mers (k = 5, 6, 7). However, at the same SNR, the gene count matrix still performs better in most scenarios.
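One way to read the "transformation" statement is as a linear map: each gene count is a sum of the counts of that gene's transcripts, and each k-mer count is a sum of transcript counts weighted by how often the k-mer occurs in each transcript. The sketch below assumes this linear form; the weight matrices and counts are made-up toy values, and the exact expressions derived in the thesis may differ.

```python
def transform(weights, transcript_counts):
    """Multiply a (features x transcripts) weight matrix by a
    (transcripts x cells) count matrix, both given as nested lists."""
    n_cells = len(transcript_counts[0])
    return [
        [
            sum(w * row[cell] for w, row in zip(weight_row, transcript_counts))
            for cell in range(n_cells)
        ]
        for weight_row in weights
    ]


# Toy transcript counts: 3 transcripts (rows) x 2 cells (columns).
# Transcripts t1 and t2 are isoforms of gene A; t3 belongs to gene B.
transcript_counts = [
    [2, 0],  # t1: expressed only in cell 1
    [0, 2],  # t2: expressed only in cell 2
    [3, 1],  # t3
]

# Gene membership: gene A = t1 + t2, gene B = t3.
gene_membership = [
    [1, 1, 0],
    [0, 0, 1],
]

# Hypothetical k-mer occurrence counts per transcript.
kmer_occurrences = [
    [1, 1, 2],  # k-mer x: once in t1, once in t2, twice in t3
    [0, 1, 0],  # k-mer y: only in t2
]

gene_counts = transform(gene_membership, transcript_counts)
kmer_count_matrix = transform(kmer_occurrences, transcript_counts)
```

In this toy case gene A has the same count in both cells even though the cells express different isoforms, which is the loss of alternative splicing information, while k-mer x reaches counts larger than any single transcript count, which is the enlargement of values.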

Table of Contents

Introduction
1. Genome and transcriptome
2. High-throughput technology to measure gene expression
3. Single cell sequencing
4. K-mer
5. Relationship between gene counts, transcript counts and k-mer counts
5.1 Gene counts and k-mer counts – transformation of transcript counts
5.2 Disadvantage and potential advantage of gene/k-mer count
5.3 Characteristics of gene/k-mer count transformation
6. Purpose and content of this work
6.1 Purpose
6.2 Contents

Methods
1. General pipeline
2. Materials and software
2.1 Materials
2.2 Software and packages
3. Simulation data preparation
3.1 Generation of gene expression matrix
3.2 Estimation of sequences distribution on each gene
3.3 Selection of k-mer length
3.4 Generation of k-mer expression matrix
3.5 Clustering results comparison
4. Simulation methods description
4.1 Influence of low expression count to clustering results
4.2 Influence of summing up expression counts with same/opposite expression patterns in different cells to clustering results
4.3 Clustering results comparison with different parameters

Results
1. Influence of low expression count to clustering results
2. Influence of summing up expression counts with same/opposite expression patterns in different cells to clustering results
3. Clustering results comparison with different parameters

Discussion

References

Appendix

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English