Genotype prediction based on the gene expression data using Random Forest Open Access

Yu, Yanlong (Spring 2020)

Background: With the rapid development of high-throughput technologies, thousands of single nucleotide polymorphisms (SNPs) have been identified as associated with human diseases. SNPs located in regulatory regions are often expression quantitative trait loci (eQTLs) that can modulate gene expression, so gene expression can be affected by SNP mutations. Because gene expression data are easier to access than genotype data, we explore the relationship between genotype and gene expression and predict SNP genotypes from gene expression data.

Method: We used random forests as the model for the classification and prediction problems. We first generated a simulated dataset based on the real data to test the strategy, using the out-of-bag (OOB) error rate as the evaluation metric. We next tested hundreds of SNPs and compared their AUC values. For the SNPs that achieved the highest AUC scores, we conducted a feature importance test.
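The workflow described in the Method paragraph can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual pipeline. It uses scikit-learn, a synthetic expression matrix, a binary genotype coding, a 70/30 train/test split, and 500 trees, all of which are hypothetical choices made here for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy data (NOT the thesis data): 300 samples x 50 genes.
# Genotype is coded 0/1 here purely for illustration.
n_samples, n_genes = 300, 50
genotype = rng.integers(0, 2, size=n_samples)
expression = rng.normal(size=(n_samples, n_genes))
# Assumption: gene 0 behaves like an eQTL target, shifted by genotype.
expression[:, 0] += 2.0 * genotype

X_train, X_test, y_train, y_test = train_test_split(
    expression, genotype, test_size=0.3, stratify=genotype, random_state=0
)

# oob_score=True makes the forest score each training sample using only
# the trees whose bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_   # out-of-bag accuracy
oob_error = 1.0 - oob_accuracy     # out-of-bag error rate

# AUC on held-out samples, from the predicted probability of class 1.
auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

# Rank genes by impurity-based feature importance, most important first.
top_genes = np.argsort(forest.feature_importances_)[::-1][:5]

print(f"OOB error rate: {oob_error:.3f}")
print(f"Test AUC: {auc:.3f}")
print("Top genes by importance:", top_genes.tolist())
```

In a per-SNP analysis, a loop would repeat this fit once per SNP and collect the AUC and OOB values for comparison. The impurity-based importance shown here is only one possible feature importance measure. The source does not say which one was used.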

Result: For the simulated data, the OOB estimate of the error rate is 21%. For the real data, the mean AUC score across the 917 SNPs is 0.559 (std=0.108) and the mean OOB score is 0.658 (std=0.056). The max AUC score is 0.933 and the max OOB score is 0.860. Most of the AUC scores are between 0.5 and 0.7, and most of the OOB scores are between 0.6 and 0.7. We also located important features for the SNPs with the highest AUC.

Conclusion: This study shows that, for some SNPs, it is possible to use gene expression data to infer their genotypes. However, the majority of the SNPs cannot be predicted accurately. We also find some features that significantly influence the SNP prediction. Further study is needed.

Table of Contents

Introduction

Method

Simulation

Data analysis

Discussion

Reference

Appendix

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English