Predicting anticancer drug sensitivity from high dimensional genomic data Open Access

Sreedhar, Shalini (Spring 2019)

Acute myeloid leukemia (AML) is a heterogeneous cancer with at least 11 genetic classes and more than 20 subsets. Because the disease is so variable, there is a strong need for treatment based on each individual's genetic makeup. This type of precision medicine for AML is relatively new, made possible by the recent decrease in cost and increase in efficiency of genetic sequencing. The primary dataset used in this study to make drug sensitivity predictions is the BeatAML dataset, which provides RNA sequencing, gene mutation, and drug sensitivity information for 451 cell line samples and 122 small molecule drugs. The dataset was preprocessed with standard scaling, and its dimensionality was reduced with principal component analysis. A deep neural network model was created to predict drug sensitivity from the gene sequencing data. The problem was first formulated as a regression problem in order to predict specific sensitivity values for each drug. It was then simplified to binary classification in an attempt to improve the accuracy of the predictions. Five drugs were chosen as the focus, and their sensitivity values were discretized into 2 categories (levels) of sensitivity. This resulted in a high training accuracy (average = 0.98) and a lower testing accuracy (average = 0.62), a gap that indicates overfitting. Generalization, dimensionality reduction, and equal testing and training sets were identified as the most important considerations when dealing with datasets that have small sample sizes and large feature sizes. Future studies of anticancer drug sensitivity prediction should focus on regularization techniques in order to improve test set performance. Feature importance was evaluated as a way of determining the biological significance captured by these models. Pathway analysis was performed for each drug on the genes with the highest importance in predicting drug sensitivity.
The strongest correlations between the most important features and the pathway targeted by the drug were found for the drugs trametinib and selumetinib. Further work is needed to interpret these networks, both to improve understanding of how predictions are made and to increase the likelihood of their adoption in industry.
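The preprocessing and discretization steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it uses NumPy on synthetic random data rather than BeatAML, and the sample count, feature count, number of components, and median threshold are all assumptions chosen for the example. The helper names (`fit_scaler_pca`, `transform`, `binarize`) are hypothetical.

```python
import numpy as np

def fit_scaler_pca(X_train, n_components):
    """Fit standard scaling and PCA using the training samples only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    Z = (X_train - mean) / std
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components]

def transform(X, mean, std, components):
    """Scale with the training statistics, then project onto the axes."""
    return ((X - mean) / std) @ components.T

def binarize(y, threshold):
    """Discretize continuous sensitivity values into 2 levels (0 or 1)."""
    return (y > threshold).astype(int)

# Synthetic stand-in data (assumption): 60 samples, 500 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))
y = rng.normal(size=60)  # stand-in for one drug's sensitivity values

# Equal training and testing sets: 30 samples each.
idx = rng.permutation(60)
train, test = idx[:30], idx[30:]

mean, std, comps = fit_scaler_pca(X[train], n_components=10)
X_train_red = transform(X[train], mean, std, comps)
X_test_red = transform(X[test], mean, std, comps)

# Threshold taken from the training set only (median split is an assumption).
threshold = np.median(y[train])
y_train_cls = binarize(y[train], threshold)
y_test_cls = binarize(y[test], threshold)
```

Fitting the scaler, the principal axes, and the threshold on the training samples alone keeps information from the test set out of the model, which matters when the number of features far exceeds the number of samples.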

Table of Contents

1    Introduction

1.1     Acute myeloid leukemia

1.2     Deep learning in genomics

1.3     Dataset

2    Methods

2.1     Initial model

2.2     Generalization methods

2.3     Creating equal training and testing samples

2.4     Dealing with sparse dataset

2.5     Choosing specific drugs

2.6     Classification problem

3    Results

3.0     Drug sensitivity correlations

3.0     Drug selection and pathway analysis

3.1     Regression neural network model

3.2     Classification neural network model

4    Discussion

4.1     Conclusions

References

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English