Machine learning Application in Longitudinal Polycystic Kidney Disease (PKD) Function Prediction Open Access

Wang, Xinhang (Spring 2020)

Permanent URL: https://etd.library.emory.edu/concern/etds/k930bz290?locale=en
Published

Abstract

Background: Autosomal dominant polycystic kidney disease (ADPKD) is one of the most common genetic chronic kidney diseases. Disease evaluation is based on kidney function, represented by the estimated glomerular filtration rate (eGFR). The pathological progression of ADPKD is related to genetic factors and to demographic and clinical information. Kidney function in ADPKD patients typically remains in the normal range for a long period and then deteriorates sharply, which makes progression hard to predict in the early stages. The CRISP study longitudinally monitored eGFR and other factors in 242 early-stage ADPKD patients.

Methods: We evaluated multiple machine learning methods for predicting eGFR values and yearly eGFR change in the CRISP cohort. Predictors included demographic variables, biomarkers, and imaging data. Different numbers of years of records were used to evaluate the predictive power of historical information. The cohort was divided into subgroups to test model performance on patients with different kidney function levels. Several expensive predictors were included in or excluded from the models, and important predictors were identified by their contribution to the prediction of eGFR and its decline.
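The evaluation setup described above (fit on one year of data, score held-out predictions of next-year eGFR value and of yearly change with R²) can be sketched as follows. This is a minimal illustration, not the thesis's code: the cohort is simulated, the distribution parameters, the 70/30 split, and the function names (`simulate_cohort`, `fit_simple_linear`, `r_squared`, `evaluate`) are all assumptions made for the example, and only simple linear regression with a single predictor is shown.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_simple_linear(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate_cohort(n=242, seed=0):
    """Synthetic stand-in for a cohort (assumed values, NOT CRISP data)."""
    rng = random.Random(seed)
    year1 = [rng.gauss(90, 20) for _ in range(n)]
    # Assumption: yearly change is a small decline plus noise,
    # independent of the Year 1 value.
    year2 = [v + rng.gauss(-2, 6) for v in year1]
    return year1, year2


def evaluate(seed=0):
    """Return held-out R² for predicting eGFR value and eGFR change."""
    year1, year2 = simulate_cohort(seed=seed)
    split = int(0.7 * len(year1))  # assumed 70/30 train/test split
    x_train, x_test = year1[:split], year1[split:]
    y_train, y_test = year2[:split], year2[split:]

    # Target 1: the Year 2 eGFR value itself.
    b0, b1 = fit_simple_linear(x_train, y_train)
    r2_value = r_squared(y_test, [b0 + b1 * x for x in x_test])

    # Target 2: the Year 1 -> Year 2 change.
    d_train = [y - x for x, y in zip(x_train, y_train)]
    d_test = [y - x for x, y in zip(x_test, y_test)]
    c0, c1 = fit_simple_linear(x_train, d_train)
    r2_change = r_squared(d_test, [c0 + c1 * x for x in x_test])
    return r2_value, r2_change


if __name__ == "__main__":
    r2_value, r2_change = evaluate()
    print(f"R2 (value):  {r2_value:.2f}")
    print(f"R2 (change): {r2_change:.2f}")
```

Because the simulated change is unrelated to the Year 1 value, the value target scores a high R² here while the change target scores close to zero. Whether the same mechanism applies to the real cohort is not something this sketch can establish.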

Results: Machine learning models predicting the Year 2 eGFR value from Year 1 data achieved R² above 0.64, whereas models predicting eGFR change had R² around 0. In subgroup analysis, R² was largest (0.64) when predicting the eGFR value of patients with abnormal kidney function. R² was below 0.47 when predicting the Year 6 eGFR value from Year 2 information. When predicting the Year 3 eGFR value, adding more years of historical data or health information improved R² slightly, by 1-3%. Excluding PKD genotype or total kidney volume did not decrease R².
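One way to read the contrast between the two targets is through the definition of R². The following is an illustrative sketch rather than a derivation taken from the thesis. The symbols are introduced here for the example: $y_i$ is next-year eGFR, $x_i$ is current-year eGFR, $d_i = y_i - x_i$ is the yearly change, and hats denote predictions. It assumes the change prediction is obtained from the value prediction as $\hat{d}_i = \hat{y}_i - x_i$.

```latex
R^2_{\text{value}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
R^2_{\text{change}} = 1 - \frac{\sum_i (d_i - \hat{d}_i)^2}{\sum_i (d_i - \bar{d})^2},
\qquad
d_i - \hat{d}_i = (y_i - x_i) - (\hat{y}_i - x_i) = y_i - \hat{y}_i .
```

Under that assumption the two numerators coincide, so the scores differ only through their denominators: the between-patient spread of eGFR versus the spread of yearly change. When the latter is much smaller, the same prediction errors can produce a high R² for the value and an R² near 0 for the change.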

Discussion: Predicting the eGFR value from the previous year's information is more powerful than predicting the yearly eGFR change. In subgroup analysis, the predictive models performed better for patients with abnormal kidney function. Prediction power for eGFR values decreased when projecting further into the future. Including health information and the previous year's eGFR change as predictors yielded a small improvement. The expensive predictors, PKD genotype and total kidney volume, can be replaced by biomarker variables without affecting prediction power.

Table of Contents

I. Introduction

II. Methods

2.1 Data and Preprocessing

2.2 Machine Learning Methods

2.2.1 Simple Linear Regression

2.2.2 Lasso Regression

2.2.3 Random Forest

2.2.4 Support Vector Machine

2.3 Model Construction and Study Design

2.3.1 Predictors

2.3.2 Model Construction

2.3.3 Study Design

2.3.3.1 Prediction of eGFR value & eGFR change using one year of data

2.3.3.2 Patients subgrouping according to kidney function level

2.3.3.3 Prediction of eGFR value using multiple years of information

2.3.3.4 Important variables identification

III. Results

3.1 Patients eGFR can be predicted well with previous year's information

3.2 Prediction accuracy decreases for more distant future time-points

3.3 Previous year eGFR change contributes to improve the prediction power

3.4 Subgrouping of patients lead to separated prediction performance

3.5 Effect of adding additional year’s historical information

3.6 Expensive genotypes and image predictors can be replaced

3.7 Including Health information improves the prediction power slightly

3.8 Important predictors for eGFR value and eGFR change

IV. Discussion

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English