Comparison of Imputation Methods on Metabolomics Data with Triplet Data Open Access

Xue, Xiangning (Spring 2019)

Permanent URL: https://etd.library.emory.edu/concern/etds/wm117q059?locale=pt-BR%2A

Published

Abstract

Missing value imputation in mass spectrometry-based metabolomics data is important for subsequent data analysis. Many methods are available for the problem, most of which were initially developed for microarray or RNA sequencing data. Metabolomics data present unique challenges for missing value imputation. Some missing values are genuinely missing, which we call true missings, while others may represent true non-existence of the metabolite, which we call true zeros. It is difficult to differentiate true missings from true zeros in a dataset, and most current imputation methods impute every missing value. In addition, assessing imputation methods with the knockout-impute scheme may not reflect their true performance on metabolomics data, because the true missingness mechanism is complicated. In this study, we used datasets with triplicate measurements on each sample, which offer a unique advantage over the knockout-impute scheme. Taking one measurement from each sample at a time, the remaining two measurements provide information on whether each missing location is more likely to be a true missing or a true zero. With these data, we were able to evaluate different imputation methods separately on true missings and true zeros. The results show that SVD and LLS tend to perform better on true missings, while scImpute performs better on true zeros but is less reliable on true missings.
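The triplicate idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's procedure: the abstract does not state the classification rule, so the rule below (both remaining replicates observed suggests a true missing, neither observed suggests a true zero, one observed is left ambiguous) is an assumption, as are the function name `classify_missing`, the array layout, and the use of 0 to encode a missing value.

```python
import numpy as np

def classify_missing(replicates, target=0):
    """Label each entry of one held-out replicate using the other two.

    replicates: array of shape (n_features, n_samples, 3); a value of 0
    is assumed to encode a missing measurement (hypothetical convention).
    target: index (0, 1 or 2) of the replicate that is held out.
    """
    replicates = np.asarray(replicates, dtype=float)
    held_out = replicates[:, :, target]
    # The two replicates that were not held out.
    others = np.delete(replicates, target, axis=2)

    missing = held_out == 0
    # How many of the remaining two replicates have an observed value.
    n_observed = (others > 0).sum(axis=2)

    labels = np.full(held_out.shape, "observed", dtype=object)
    # Assumed rule, not taken from the thesis:
    labels[missing & (n_observed == 2)] = "likely_true_missing"
    labels[missing & (n_observed == 0)] = "likely_true_zero"
    labels[missing & (n_observed == 1)] = "ambiguous"
    return labels

# Toy data: 2 features x 2 samples x 3 replicates.
toy = np.array([
    [[0, 5, 6], [0, 0, 0]],
    [[3, 0, 4], [0, 7, 0]],
])
labels = classify_missing(toy, target=0)
```

Repeating the call with `target=1` and `target=2` would cover the "one measurement at a time" rotation described in the abstract; the labels could then be used to score imputed values separately for the two kinds of missingness.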

Table of Contents

1. Introduction
2. Method
2.1 The Data Set
2.2 Creating the Reference Matrices
2.3 Imputation Method
2.3.1 scImpute
2.3.2 K-nearest neighbors (KNN)
2.3.3 Bayesian principal component analysis (BPCA)
2.3.4 Singular Value Decomposition (SVDimpute)
2.3.5 Local least squares (LLS)
2.4 Imputation Scheme and Evaluation Criteria
2.4.1 Imputation Scheme
2.4.2 Normalized Root Mean Squared Error (NRMSE)
2.4.3 Log-transformed root mean squared error (LRMSE)
3. Result
3.1 Relationship between the number of missings and metabolic feature abundance
3.2 Optimal parameters for the imputation methods
3.3 Correlation between imputed value and true value
3.4 Imputation efficiency with different number of missing
4. Conclusion and Discussion
5. Reference

About this Master's Thesis

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Rollins School of Public Health
Department	Biostatistics
Subfield / Discipline	Biostatistics - MPH & MSPH
Degree	M.P.H.
Submission	Master's Thesis
Language	English
Research Field	Statistics Biology, Biostatistics
Keyword	Imputation
Committee Chair / Thesis Advisor	Tianwei Yu, Emory University
Committee Members	Xiangqin Cui, Emory University

Primary PDF

Title	Comparison of Imputation Methods on Metabolomics Data with Triplet Data
Date Uploaded	2019-04-09 00:00:19 -0400
