Methylation Imputation from HM450K Array to EPIC Array with Autoencoder and Nonnegative Matrix Factorization

Open Access

Shen, Yang (Spring 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/n009w3752?locale=en
Published

Abstract

DNA methylation is an essential epigenetic modification that plays a crucial role in gene expression regulation and cellular differentiation. DNA methylation profiling has been widely used to study the development of various human diseases, including cancer, cardiovascular disease, and neurological disorders. The Infinium HumanMethylation450 (HM450K) array and the Infinium MethylationEPIC (EPIC) array are two commonly used high-throughput technologies that enable genome-wide DNA methylation profiling. The HM450K array covers approximately 450,000 CpG sites, the EPIC array covers more than 850,000, and around 440,000 CpG sites are shared between the two arrays. In this study, our goal is to impute methylation levels at EPIC array sites from HM450K array data, avoiding expensive re-measurement with the EPIC array when HM450K data are already available. Convolutional autoencoders and nonnegative matrix factorization (NMF) are machine-learning techniques commonly used in the analysis of large-scale genomic data. Our approach uses a convolutional autoencoder and an NMF model to capture the latent structure in the DNA methylation data and to generate imputed values for all CpG sites on the EPIC array. We focused mainly on chromosome 18 to keep the model simple. The overall root-mean-square error (RMSE) was 0.0196, lower than the 0.04 obtained from a simple linear regression model on nearby CpG sites. The model can be adapted to other chromosomes by adjusting the dimensions of the autoencoder output to accommodate different chromosome sizes.
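
To make the NMF imputation idea and the RMSE metric concrete, the following is a minimal sketch rather than the thesis's pipeline: it omits the convolutional autoencoder entirely and uses masked multiplicative-update NMF on synthetic data. The matrix sizes, the rank, the block of hidden entries, and the function names `masked_nmf` and `rmse` are all assumptions chosen for illustration. Rows stand in for samples and columns for CpG sites; the hidden block plays the role of EPIC-only sites in samples measured only on HM450K.

```python
import numpy as np


def masked_nmf(X, mask, rank, n_iter=500, eps=1e-9, seed=0):
    """Fit X ~ W @ H with W, H >= 0, using only entries where mask == 1.

    Uses multiplicative updates for the masked squared-error objective.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.uniform(0.1, 1.0, size=(n, rank))
    H = rng.uniform(0.1, 1.0, size=(rank, p))
    MX = mask * X  # observed entries only; hidden entries contribute zero
    for _ in range(n_iter):
        W *= (MX @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (mask * (W @ H)) + eps)
    return W, H


def rmse(a, b):
    """Root-mean-square error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic "beta values" in [0, 1] with low-rank structure (assumption).
rng = np.random.default_rng(42)
n_samples, n_cpgs, true_rank = 60, 40, 3
X = rng.uniform(size=(n_samples, true_rank)) @ rng.uniform(size=(true_rank, n_cpgs))
X /= X.max()

# Hide the last 10 columns ("EPIC-only" sites) for the last 15 samples
# ("HM450K-only" samples); everything else is treated as observed.
mask = np.ones_like(X)
mask[45:, 30:] = 0.0
hidden = mask == 0

W, H = masked_nmf(X, mask, rank=true_rank)
X_hat = np.clip(W @ H, 0.0, 1.0)  # methylation levels are bounded in [0, 1]
rmse_nmf = rmse(X_hat[hidden], X[hidden])

# Naive baseline: impute each hidden entry with its column's observed mean.
col_mean = (mask * X).sum(axis=0) / mask.sum(axis=0)
X_mean = np.broadcast_to(col_mean, X.shape)
rmse_mean = rmse(X_mean[hidden], X[hidden])

print(f"RMSE (masked NMF):  {rmse_nmf:.4f}")
print(f"RMSE (column mean): {rmse_mean:.4f}")
```

The column-mean baseline here is only a reference point for the sketch; it is not the nearby-CpG linear regression baseline reported in the abstract, and the RMSE values printed on synthetic data are not comparable to the thesis's results.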

Table of Contents

1. Introduction

2. Method

2.1 Data Pre-processing

2.2 Dimension reduction: Convolutional Autoencoder

2.3 Nonnegative Matrix Factorization

3. Results

4. Discussions

References

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
