Early Detection of Neonatal Infection in NICU Using Machine Learning Models

Chi, Zhuohan (Spring 2025)

Permanent URL: https://etd.library.emory.edu/concern/etds/cn69m5672?locale=fr
Published

Abstract

Neonatal infections remain a critical threat in intensive care settings, often progressing rapidly and silently within the first hours of admission. This study develops and evaluates an explainable machine learning framework for early prediction of infection risk in neonates, using high-resolution data from the MIMIC-III database. Two time windows were explored, 30 and 120 minutes after ICU admission, during which physiological and hematological variables were aggregated and preprocessed. Missing data were systematically analyzed and imputed using Iterative Imputation, and a comprehensive set of classification models was compared using stratified five-fold cross-validation. CatBoost achieved the highest F1-score (0.7634) in the 30-minute window, while Gradient Boosting outperformed the other models in the 120-minute window (F1-score ≈ 0.7983), reflecting the impact of data availability on predictive performance. Feature importance and SHAP analysis identified heart rate, white blood cell count, and temperature as significant contributors. These findings support a two-stage decision-support system that adapts to early and later clinical data, potentially improving timely diagnosis and reducing neonatal morbidity and mortality.
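
As a rough illustration of the evaluation scheme named in the abstract (stratified five-fold cross-validation scored by F1), the sketch below shows how class-balanced folds can be built and how an F1-score is computed from predictions. It is not the thesis code: the function names `stratified_kfold` and `f1_score`, the toy labels, and the 20% positive rate are all hypothetical, and a real analysis would normally rely on an established library rather than this hand-rolled version.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class ratios in every fold.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's indices round-robin so every fold gets its share.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels (assumed): 20 infected (1) and 80 non-infected (0) neonates.
    labels = [1] * 20 + [0] * 80
    for fold, (train, test) in enumerate(stratified_kfold(labels), start=1):
        positives = sum(labels[i] for i in test)
        print(f"fold {fold}: {len(test)} test rows, {positives} positive")
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

In practice the per-fold F1-scores would be averaged for each candidate model, which is the kind of comparison behind the reported values of 0.7634 and 0.7983; stratifying the folds matters because the infected class is a minority and an unlucky random split could leave a fold with very few positive cases.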

Table of Contents

CHAPTER 1: INTRODUCTION

CHAPTER 2: METHOD

2.1 Data Source

2.2 Data Preprocessing

2.3 Missing Value Imputation

2.4 Model Selection

2.5 Hyperparameter Optimization

2.6 Model Selection

CHAPTER 3: RESULT

3.1 Row Removal Threshold Result

3.2 Best Imputation Method

3.3 Best Machine Learning Model

3.4 Incremental Coverage Analysis

CHAPTER 4: DISCUSSION

4.1 Overview of Findings

4.2 Interpretations and Clinical Implications

4.3 Comparison with Existing Literature

4.4 Limitation

APPENDIX

Table 1: Baseline characteristics of the study cohort

Table 2: Comparison between Different Infection Detection Methods

Figure 1: Research Pipeline

Figure 2: Missing Value Distribution

Figure 3: Top Five ROC Curve in Model Selection of 120 Minutes Dataset

Figure 4: Top Five ROC Curve in Model Selection of 120 Minutes Dataset

REFERENCE

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Subfield / Discipline
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified
