Genotype prediction based on the gene expression data using Random Forest
Open Access
Yu, Yanlong (Spring 2020)
Background: With the rapid development of high-throughput technologies, thousands of single nucleotide polymorphisms (SNPs) have been identified as associated with human diseases. SNPs located in regulatory regions are often expression quantitative trait loci (eQTLs) that can modulate gene expression, so gene expression can be affected by SNP mutations. Because gene expression data is easier to access than genotype data, we want to explore the relationship between genotype and gene expression and to predict SNP genotypes from gene expression data.
Method: We used random forests as our model for the classification and prediction tasks. We first generated a simulated dataset based on the real data to test the strategy, using the out-of-bag (OOB) error rate as the metric. We next tested hundreds of SNPs and obtained their AUC values for comparison. For the SNPs that achieved the highest AUC scores, we conducted a feature importance test.
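The workflow described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the thesis pipeline: the simulated data, the binary genotype coding, the effect size, the number of trees, and the choice to compute AUC from out-of-bag probabilities are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical simulated data: 400 samples, 50 genes, one SNP.
# Assumption: genotype is coded as binary (0 or 1) for simplicity.
n_samples, n_genes = 400, 50
genotype = rng.integers(0, 2, size=n_samples)
expression = rng.normal(size=(n_samples, n_genes))

# Assumption: the SNP acts as an eQTL for gene 0 only,
# shifting its expression by 2 standard deviations.
expression[:, 0] += 2.0 * genotype

# Fit a random forest that predicts genotype from expression.
# oob_score=True scores each sample using only the trees
# whose bootstrap sample did not contain it.
forest = RandomForestClassifier(
    n_estimators=300, oob_score=True, random_state=0
)
forest.fit(expression, genotype)

# OOB error rate = 1 - OOB accuracy.
oob_error = 1.0 - forest.oob_score_

# AUC computed from the OOB class-1 probabilities.
oob_prob = forest.oob_decision_function_[:, 1]
oob_auc = roc_auc_score(genotype, oob_prob)

# Impurity-based feature importance: which gene drives the prediction?
top_gene = int(np.argmax(forest.feature_importances_))

print(f"OOB error: {oob_error:.3f}")
print(f"OOB AUC:   {oob_auc:.3f}")
print(f"Most important gene index: {top_gene}")
```

In this sketch the forest is refit once per SNP, so repeating the block over many SNPs yields one AUC and one OOB score per SNP, which can then be summarized by their mean and standard deviation.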
Result: For the simulated data, the OOB estimate of the error rate is 21%. For the real data, the mean AUC score across the 917 SNPs is 0.559 (std = 0.108) and the mean OOB score is 0.658 (std = 0.056). The max AUC score is 0.933 and the max OOB score is 0.860. Most of the AUC scores are between 0.5 and 0.7, and most of the OOB scores are between 0.6 and 0.7. We also located important features for the SNPs with the highest AUC.
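To help read the AUC values, recall that AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The short sketch below computes AUC directly from that definition. The function name, the binary labels, and the toy scores are hypothetical and are not taken from the thesis data.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels: two samples per class (hypothetical).
labels = [0, 0, 1, 1]

perfect = auc_from_scores(labels, [0.1, 0.2, 0.8, 0.9])        # every positive outranks every negative
uninformative = auc_from_scores(labels, [0.5, 0.5, 0.5, 0.5])  # all ties
partial = auc_from_scores(labels, [0.1, 0.6, 0.4, 0.9])        # 3 of 4 pairs ranked correctly

print(perfect, uninformative, partial)
```

Under this reading, a mean AUC of 0.559 indicates that for most SNPs the expression-based ranking is only slightly better than chance, while the best SNP at 0.933 is ranked correctly in most pairs.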
Conclusion: This study shows that, for some SNPs, it is possible to use gene expression data to infer the genotype. However, the majority of the SNPs cannot be predicted accurately. We also found some features that significantly influence the SNP prediction. Further study is needed.
Table of Contents
Data analysis
|Genotype prediction based on the gene expression data using Random Forest ()||2020-04-23 23:12:35 -0400||