Exploration of Normalization Methods on Bulk RNA-seq Data and Single-cell RNA-seq Data

Open Access

Wang, Yawei (Spring 2021)

Permanent URL: https://etd.library.emory.edu/concern/etds/9s161740d?locale=en
Published

Abstract

Background: RNA-seq and single-cell RNA-seq are powerful new technologies in biomedical research. To remove the technical biases associated with factors such as sequencing depth and gene length, RNA-seq data from different samples must be normalized so that they are comparable. However, the abundance of zeros in the data, especially in single-cell RNA-seq data, makes normalization extremely challenging.

Methods and Materials: For bulk RNA-seq normalization, I applied a novel normalization method, named the Group method, and compared its performance with that of other bulk RNA-seq normalization methods (Upper Quantile, Quantile, Median, TMM, and DESeq) by calculating the Spearman correlation between normalized RNA-seq data and TaqMan qRT-PCR data. I also compared their effectiveness on simulated data and in differential expression analysis. For single-cell RNA-seq, I merged genes into KEGG pathways and used the Quantile method to normalize the resulting pathway-by-cell data; this approach is named the Pathway-Quantile method. I compared it with log normalization, scran, and Linnorm on 3k PBMC data (without spike-in genes) and human pancreas data (with spike-in genes) by visualizing the results after UMAP dimension reduction with the Seurat package (version 4.0.1).
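
The pathway-level normalization described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes that genes are merged into a pathway by summing their counts per cell, that ties in quantile normalization are broken by row order rather than averaged, and it uses made-up gene and pathway names (G1, P_A, and so on) on a toy three-cell data set.

```python
# Minimal, hypothetical sketch of a "merge genes into pathways, then
# quantile-normalize across cells" workflow. Pure standard library.

def merge_genes_into_pathways(counts, pathways):
    """Collapse gene-level counts into pathway-level counts.

    counts:   dict mapping gene name -> list of per-cell counts
    pathways: dict mapping pathway name -> list of member gene names
    Summation is an assumption; the abstract does not say how genes are merged.
    """
    merged = {}
    for pathway, genes in pathways.items():
        present = [counts[g] for g in genes if g in counts]  # skip unmeasured genes
        if present:
            merged[pathway] = [sum(cell_values) for cell_values in zip(*present)]
    return merged


def quantile_normalize(matrix):
    """Give every column (cell) the same empirical distribution.

    matrix: list of rows (features), each a list of per-cell values.
    Ties are broken by row order, a simplification of the usual averaging.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [list(col) for col in zip(*matrix)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the r-th smallest value across all cells.
    reference = [sum(sc[r] for sc in sorted_cols) / n_cols for r in range(n_rows)]
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])  # stable sort
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized


# Toy data: four hypothetical genes measured in three cells.
counts = {
    "G1": [5, 0, 2],
    "G2": [1, 4, 0],
    "G3": [0, 3, 6],
    "G4": [2, 2, 2],
}
# Hypothetical pathway membership; "G_missing" is not in the count table.
pathways = {
    "P_A": ["G1", "G2"],
    "P_B": ["G3"],
    "P_C": ["G2", "G4", "G_missing"],
}

merged = merge_genes_into_pathways(counts, pathways)
pathway_names = list(merged)
normalized = quantile_normalize([merged[p] for p in pathway_names])

for name, row in zip(pathway_names, normalized):
    print(name, [round(v, 3) for v in row])
```

After normalization every cell holds the same set of values, only assigned to different pathways, which is the defining property of quantile normalization. Merging into pathways first reduces the number of zero entries that the quantile step has to rank.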

Results: For simulated and real bulk RNA-seq data, all normalization methods performed similarly in terms of the Spearman correlation between normalized real RNA-seq data and MAQC TaqMan qRT-PCR data, and the Group method did not outperform the other methods. In differential expression analysis, all methods also showed similar performance. For single-cell RNA-seq data, the Pathway-Quantile method performed better than the pathway-level data, but it was inferior to the other methods when tested on 3k PBMC data.

Conclusion: I found that the Group method is competitive for normalizing bulk RNA-seq data. However, more studies are needed on normalizing single-cell RNA-seq data with the Group-Quantile method.

Table of Contents

1. Introduction

2. Data Source

2.1 Bulk RNA-seq data

2.1.1 Real data

2.1.2 MAQC TaqMan qRT-PCR data

2.1.3 Simulated data

2.2 Single-cell RNA-seq data

2.2.1 Peripheral Blood Mononuclear Cell (PBMC) data

2.2.2 Single-cell RNA-seq data with spike-ins

2.2.3 KEGG pathway gene sets

3. Methods

3.1 Bulk RNA-seq data normalization methods

3.1.1 Traditional normalization methods with real RNA-seq data

3.1.2 Group method

3.1.3 Normalization with simulated data

3.1.4 Differential expression analysis

3.2 Single-cell RNA-seq normalization methods

3.2.1 Log Normalization

3.2.2 Linear Model and Normality Based Normalizing Transformation Method (Linnorm)

3.2.3 Scran method

3.2.4 Group-Quantile method

4. Results

4.1 Results of bulk RNA-seq data

4.2 Results of single-cell RNA-seq data

4.2.1 PBMC data visualization

4.2.2 Spike-in single-cell RNA-seq data visualization

5. Conclusion and Discussion

References

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English