Applying Weighted Random Forest Algorithm on Metabolic Pathways Selection

Xiao, Yi (Spring 2019)

Permanent URL: https://etd.library.emory.edu/concern/etds/6t053g98b?locale=es
Published

Abstract

Background: Functional analysis of high-resolution liquid chromatography−mass spectrometry (LC−MS) data relies on metabolic pathways or the genome-scale metabolic network, and it is critical for feature selection and interpretation of metabolomics data. One of the main challenges is that feature identities are unknown in LC−MS data. When the mass-to-charge ratio (m/z) values of features are matched to theoretical values, some features match multiple known metabolites, and usually only one of those matches can be true. Current network/pathway analysis methods ignore this uncertainty in metabolite identification, so erroneously matched features can cause pathways unrelated to the disease outcome to be selected.

Methods: We explored three potential methods based on Random Forest to address the multi-match issue. All three approaches attempt to down-weight the contribution of multi-matched features to the pathway. (1) Weighted tree approach 1: lower the tree weight if the percentage of multi-matched features used in the tree is high; (2) weighted tree approach 2: compute the tree weight from both the feature importance scores and the features' multi-match status; (3) weighted sampling approach: use the multi-match status of each feature in a variable-importance Random Forest, which samples features at each node according to a prior probability.
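The idea behind weighted tree approach 1 can be illustrated with a minimal sketch. The abstract does not give the weight formula, so the one below (one minus the fraction of multi-matched features in the tree) is only an assumption, and the names `tree_weight` and `weighted_forest_prediction` are hypothetical.

```python
from typing import AbstractSet, Sequence


def tree_weight(features_used: Sequence[str], multi_matched: AbstractSet[str]) -> float:
    """Assumed weight: 1 minus the fraction of multi-matched features in the tree."""
    if not features_used:
        return 1.0
    n_multi = sum(1 for f in features_used if f in multi_matched)
    return 1.0 - n_multi / len(features_used)


def weighted_forest_prediction(tree_predictions: Sequence[float],
                               weights: Sequence[float]) -> float:
    """Aggregate per-tree predictions as a weighted average."""
    total = sum(weights)
    if total == 0:
        # Every tree was fully down-weighted: fall back to the plain mean.
        return sum(tree_predictions) / len(tree_predictions)
    return sum(w * p for w, p in zip(weights, tree_predictions)) / total


# Toy example: three trees, two multi-matched features (all values illustrative).
multi_matched = {"f2", "f4"}
trees = [["f1", "f3"], ["f1", "f2"], ["f2", "f4"]]
weights = [tree_weight(t, multi_matched) for t in trees]   # 1.0, 0.5, 0.0
prediction = weighted_forest_prediction([2.0, 4.0, 10.0], weights)
```

In this toy case the third tree, built only from multi-matched features, contributes nothing to the aggregated prediction, while an unweighted forest would average all three trees equally.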

Results: Through a series of simulation studies, we found that (1) with weighted tree approach 1, the differentiation between true and false pathways is not significantly different from that of the unweighted random forest; (2) with weighted tree approach 2, the weighted random forest shows significantly lower MSE but still does not outperform the unweighted Random Forest in pathway selection; (3) the weighted sampling approach works best at distinguishing pathways with multi-matched true features from pathways with no multi-matched true features.

Conclusion: Random forest prediction accuracy is not sensitive to changes in tree weight based on feature information. The weighted sampling approach works better. We decided to use multi-match information and importance scores to adjust the sampling probability. We expect false pathways with more multi-matched features to have lower prediction accuracy than true pathways in which only some of the true features are multi-matched.
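One way to combine multi-match information with importance scores into a sampling probability is sketched below. The specific form (importance divided by the number of candidate matches, then normalized) and the `penalty` parameter are assumptions made for illustration; the abstract does not state the formula used in the thesis.

```python
import random
from typing import List, Sequence


def sampling_probabilities(importance: Sequence[float],
                           n_matches: Sequence[int],
                           penalty: float = 1.0) -> List[float]:
    """Assumed prior: importance / n_matches**penalty, normalized to sum to 1."""
    raw = [max(imp, 0.0) / (m ** penalty) for imp, m in zip(importance, n_matches)]
    total = sum(raw)
    if total == 0:
        # No usable information: fall back to uniform sampling.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def sample_candidate_features(probs: Sequence[float], k: int,
                              rng: random.Random) -> List[int]:
    """Draw up to k distinct feature indices, each draw proportional to probs."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen: List[int] = []
    for _ in range(min(k, len(remaining))):
        if sum(weights) <= 0:
            break
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen


# Toy example: the second feature matches two metabolites, so its prior is halved.
probs = sampling_probabilities([0.4, 0.4, 0.2], [1, 2, 1])
candidates = sample_candidate_features(probs, k=2, rng=random.Random(0))
```

Under this assumed prior, a feature with two candidate matches is half as likely to be offered at a node as an equally important feature with a single match, so pathways dominated by multi-matched features receive less support from the forest.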

Table of Contents

1. Introduction
  1.1 Development of statistical methods for Gene Set Analysis and its application in metabolomics
  1.2 Random forest application and advancement
  1.3 Weighted random forest
  1.4 Applying Random Forest in pathway analysis
2. Method
  2.1 Weighted random forest
  2.2 Choice of weight
  2.3 Simulation study
    2.3.1 Feature level simulations
    2.3.2 Simulating the Pathways
    2.3.3 Tune Parameters
3. Result
  3.1 Tree level accuracy is impacted by the number of true predictors involved in the tree
  3.2 Examine the number of multi-match feature and number of true predictors
  3.3 Comparison between wRF and RF in pathway selection
4. Discussion and Conclusion
  4.1 Adjust on the formula by adding feature importance
  4.2 Simple comparison between the schemes
  4.3 Conclusion
Reference

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English