Genotype prediction based on the gene expression data using Random Forest Open Access

Yu, Yanlong (Spring 2020)

Permanent URL: https://etd.library.emory.edu/concern/etds/mw22v651x?locale=en

Published

Abstract

Background: With the rapid development of high-throughput technologies, thousands of single nucleotide polymorphisms (SNPs) have been found to be associated with human diseases. SNPs located in regulatory regions are often expression quantitative trait loci (eQTLs) that can modulate gene expression, so gene expression can be affected by SNP mutations. Because gene expression data are easier to access than genotype data, we want to explore the relationship between genotype and gene expression and to predict SNP genotypes from gene expression data.

Method: We used random forests as our model for the classification and prediction problems. We first generated a simulated dataset based on the real data to test the strategy, using the out-of-bag (OOB) error rate as the metric. We next tested hundreds of SNPs and obtained their AUC values for comparison. For the SNPs that achieved the highest AUC scores, we conducted a feature importance test.
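As a rough illustration of the two metrics named in the method, the sketch below bags depth-one trees (stumps) with a random feature subset per tree, then computes an OOB error rate and an AUC from the out-of-bag votes. It is not the thesis's implementation. The binary genotype label, the Gaussian expression values, the effect size, and every function name are assumptions made for the example, and the stump forest is a deliberately small stand-in for a full random forest, which in practice would come from an established library.

```python
import random

def make_data(n=150, n_genes=5, shift=2.0, seed=0):
    """Simulate expression for n samples; only gene 0 depends on the genotype."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        g = rng.randint(0, 1)  # assumption: binary genotype label
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        row[0] += shift * g  # assumption: genotype shifts expression of gene 0
        X.append(row)
        y.append(g)
    return X, y

def fit_stump(X, y, rows, features):
    """Return (feature, threshold, class predicted above threshold) with fewest errors."""
    best = None
    for f in features:
        for t in sorted({X[i][f] for i in rows}):
            # Errors made when predicting class 1 for values above t.
            err = sum((X[i][f] > t) != (y[i] == 1) for i in rows)
            for hi, e in ((1, err), (0, len(rows) - err)):
                if best is None or e < best[0]:
                    best = (e, f, t, hi)
    return best[1:]

def predict_stump(stump, x):
    f, t, hi = stump
    return hi if x[f] > t else 1 - hi

def oob_forest(X, y, n_trees=100, mtry=2, seed=1):
    """Grow stumps on bootstrap samples and collect votes from out-of-bag samples only."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes = [[0, 0] for _ in range(n)]  # per-sample OOB votes for class 0 and class 1
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        features = rng.sample(range(p), mtry)        # random feature subset
        stump = fit_stump(X, y, rows, features)
        in_bag = set(rows)
        for i in range(n):
            if i not in in_bag:
                votes[i][predict_stump(stump, X[i])] += 1
    return votes

def oob_error_and_auc(votes, y):
    """OOB error from majority vote; AUC from the fraction of OOB votes for class 1."""
    scored = [(v[1] / (v[0] + v[1]), label)
              for v, label in zip(votes, y) if v[0] + v[1] > 0]
    error = sum((s > 0.5) != (label == 1) for s, label in scored) / len(scored)
    pos = [s for s, label in scored if label == 1]
    neg = [s for s, label in scored if label == 0]
    # AUC: probability that a random positive outscores a random negative (ties count 1/2).
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return error, wins / (len(pos) * len(neg))

X, y = make_data()
votes = oob_forest(X, y)
error, auc = oob_error_and_auc(votes, y)
print(f"OOB error: {error:.3f}, OOB AUC: {auc:.3f}")
```

Each sample is scored only by trees whose bootstrap sample did not contain it, so both numbers estimate performance on unseen data without a separate hold-out set.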

Result: For the simulated data, the OOB estimate of the error rate is 21%. For the real data, the mean AUC score across the 917 SNPs is 0.559 (std = 0.108) and the mean OOB score is 0.658 (std = 0.056). The maximum AUC score is 0.933 and the maximum OOB score is 0.860. Most AUC scores fall between 0.5 and 0.7, and most OOB scores fall between 0.6 and 0.7. We also identified important features for the SNPs with the highest AUC.

Conclusion: This study shows that, for some SNPs, it is possible to infer the genotype from gene expression data. However, the majority of SNPs cannot be predicted accurately. We also found some features that significantly influence SNP prediction. Further study is needed.

Table of Contents

Introduction

Method

Simulation

Data analysis

Discussion

Reference

Appendix

About this Master's Thesis

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Rollins School of Public Health
Department	Biostatistics
Subfield / Discipline	Biostatistics - MPH & MSPH
Degree	M.S.P.H.
Submission	Master's Thesis
Language	English
Research Field	Marine Geology Biology, Biostatistics
Keyword	Random Forest Gene expression SNPs Genotype
Committee Chair / Thesis Advisor	Qin Zhaohui "Steve", Emory University
Committee Members	Liu Yuan, Emory University

Primary PDF

Title	Date Uploaded
Genotype prediction based on the gene expression data using Random Forest	2020-04-23 23:12:35 -0400
