Comparison of Imputation Methods on Metabolomics Data with Triplet Data

Open Access

Xue, Xiangning (Spring 2019)


Missing value imputation in mass spectrometry-based metabolomics data is important for subsequent data analysis. Many methods are available for the problem, most of which were originally developed for microarray or RNA sequencing data. Metabolomics data present unique challenges for missing value imputation. Some missing entries are genuinely missing measurements, which we call true missings, while others may represent true non-existence of the metabolite, which we call true zeros. It is difficult to distinguish true missings from true zeros in a dataset, and most current imputation methods impute every missing entry. In addition, assessing imputation methods with the knockout-impute scheme may not reflect their true performance on metabolomics data, because the actual missingness mechanism is complicated. In this study, we used datasets with triplicate measurements of each sample, which offer a unique advantage over the knockout-impute scheme. Taking one measurement from each sample at a time, the remaining two measurements indicate whether each missing location is more likely to be a true missing or a true zero. With these data, we evaluated different imputation methods separately on true missings and on true zeros. The results show that SVD and LLS tend to perform better on true missings, whereas scImpute performs better on true zeros but is less reliable for true missings.
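The triplicate idea described above can be illustrated with a minimal sketch. It is not the thesis's actual procedure: the array layout, the function name `classify_missing`, the use of NaN to mark missing entries, and the labelling rule (both other replicates observed suggests a true missing, both missing suggests a true zero, one of each is left ambiguous) are all assumptions made for illustration.

```python
import numpy as np


def classify_missing(triplet, held_out=0):
    """Label each entry of the held-out replicate (hypothetical helper).

    triplet: array of shape (n_features, n_samples, 3); NaN marks a
    missing measurement (an assumed encoding, not taken from the thesis).
    """
    triplet = np.asarray(triplet, dtype=float)
    missing = np.isnan(triplet)

    # Missingness pattern of the replicate that would be imputed.
    target_missing = missing[:, :, held_out]

    # Count how many of the two remaining replicates are also missing.
    others = np.delete(missing, held_out, axis=2)
    n_other_missing = others.sum(axis=2)

    labels = np.full(target_missing.shape, "observed", dtype=object)
    # Assumed rule: seen in both other replicates -> probably a real signal
    # that was simply not recorded in the held-out replicate.
    labels[target_missing & (n_other_missing == 0)] = "likely_true_missing"
    # Assumed rule: absent in all three replicates -> probably not present.
    labels[target_missing & (n_other_missing == 2)] = "likely_true_zero"
    # One observed, one missing: no clear evidence either way.
    labels[target_missing & (n_other_missing == 1)] = "ambiguous"
    return labels


# Toy data: 2 features x 2 samples x 3 replicates.
demo = np.array([
    [[np.nan, 5.0, 7.0], [np.nan, np.nan, np.nan]],
    [[3.0, 3.1, 2.9], [np.nan, np.nan, 4.0]],
])
demo_labels = classify_missing(demo, held_out=0)
print(demo_labels)
```

Repeating the call with `held_out` set to 1 and 2 would cover the other two replicates in turn, matching the "one measurement at a time" description in the abstract.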

Table of Contents

1. Introduction

2. Method

2.1 The Data Set

2.2 Creating the Reference Matrices

2.3 Imputation Method

2.3.1 scImpute

2.3.2 K-nearest neighbors (KNN)

2.3.3 Bayesian principal component analysis (BPCA)

2.3.4 Singular Value Decomposition (SVDimpute)

2.3.5 Local least squares (LLS)

2.4 Imputation Scheme and Evaluation Criteria

2.4.1 Imputation Scheme

2.4.2 Normalized Root Mean Squared Error (NRMSE)

2.4.3 Log-transformed root mean squared error (LRMSE)

3. Result

3.1 Relationship between the number of missings and metabolic feature abundance

3.2 Optimal parameters for the imputation methods

3.3 Correlation between imputed value and true value

3.4 Imputation efficiency with different number of missing

4. Conclusion and Discussion

5. Reference

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English