Unraveling the Impact of Fuzzy Similarity Algorithms on Missing Data Imputation of Heart Bypass Surgery Cohort

Shen, Hong-Jui (Spring 2024)

Permanent URL: https://etd.library.emory.edu/concern/etds/db78td600?locale=en


Objective: This thesis introduces the Fuzzy C-Means based Random Forest (FCRF) method, developed to address the limitations of existing data imputation techniques in public health datasets. Aimed at enhancing imputation accuracy, FCRF integrates fuzzy logic and similarity learning to navigate complex missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
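To make the three mechanisms named above concrete, the following is a minimal sketch of how each could be simulated on a toy table. It is not taken from the thesis: the column names `age` and `stay_days`, the function `make_masks`, and the 30% missingness rate are all hypothetical, and the MAR and MNAR rules are deliberately simple deterministic caricatures (hide the rows with the largest values) rather than the probabilistic mechanisms a real study would likely use.

```python
import random


def make_masks(rows, target, driver, rate=0.3, seed=0):
    """Return sets of row indices to hide in `target` under each mechanism.

    rows   : list of dicts with numeric values (hypothetical toy data)
    target : column that will receive missing values
    driver : fully observed column that drives MAR missingness
    """
    rng = random.Random(seed)
    n = len(rows)
    k = round(rate * n)  # number of entries hidden under each mechanism

    # MCAR: every row is equally likely to be hidden.
    mcar = set(rng.sample(range(n), k))

    # MAR: missingness depends only on an observed variable (the driver).
    by_driver = sorted(range(n), key=lambda i: rows[i][driver], reverse=True)
    mar = set(by_driver[:k])

    # MNAR: missingness depends on the unobserved value of the target itself.
    by_target = sorted(range(n), key=lambda i: rows[i][target], reverse=True)
    mnar = set(by_target[:k])

    return {"MCAR": mcar, "MAR": mar, "MNAR": mnar}


# Toy data with made-up columns; values are random and carry no clinical meaning.
data_rng = random.Random(1)
rows = [
    {"age": data_rng.uniform(40, 85), "stay_days": data_rng.uniform(3, 20)}
    for _ in range(20)
]
masks = make_masks(rows, target="stay_days", driver="age")
```

Under MNAR the hidden values are exactly the ones the analyst cannot see, which is why that scenario is usually the hardest for imputation methods that rely only on observed data.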

Method: The performance of FCRF is evaluated against traditional imputation methods (Mean, K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations (MICE), and Iterative Imputation) using Average RMSE, Normalized RMSE, Mean Absolute Error (MAE), Weighted F1-Score, and Normalized Accuracy. This comparative analysis spans the MCAR, MAR, and MNAR scenarios to assess each method's effectiveness comprehensively.
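As an illustration of the metrics listed above, here is a minimal sketch that computes them using their standard textbook definitions, comparing imputed values against the true values that were hidden. It is not the thesis's evaluation code. Normalizing RMSE by the range of the true values is an assumption (other conventions divide by the mean or standard deviation), and Normalized Accuracy is omitted because its definition is not given here.

```python
import math
from collections import Counter


def rmse(true, pred):
    """Root mean squared error over the imputed entries."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error over the imputed entries."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def nrmse(true, pred):
    """RMSE divided by the range of the true values (assumed convention)."""
    return rmse(true, pred) / (max(true) - min(true))


def weighted_f1(true, pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(true)
    total = 0.0
    for label, n_label in support.items():
        tp = sum(1 for t, p in zip(true, pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(true, pred) if t != label and p == label)
        fn = n_label - tp
        total += n_label * (2 * tp / (2 * tp + fp + fn))
    return total / len(true)


# Continuous example: three hidden values, all imputed with their mean (4.0).
true_cont = [2.0, 4.0, 6.0]
imputed_cont = [4.0, 4.0, 4.0]
cont_scores = {
    "rmse": rmse(true_cont, imputed_cont),
    "mae": mae(true_cont, imputed_cont),
    "nrmse": nrmse(true_cont, imputed_cont),
}

# Categorical example with made-up labels.
true_cat = ["yes", "yes", "no", "no"]
imputed_cat = ["yes", "no", "no", "no"]
cat_score = weighted_f1(true_cat, imputed_cat)
```

Error metrics such as RMSE and MAE apply to continuous variables, while the weighted F1-score applies to categorical and binary variables, which matches the split in Sections 2.71 and 2.72 of the table of contents.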

Results: FCRF performs competitively across all scenarios and excels in complex MNAR situations where conventional methods falter. Its methodological design, which combines clustering and predictive modeling, offers nuanced capabilities beneficial for public health research.

Conclusion: FCRF marks a significant advancement in data imputation, promising more accurate and reliable analyses for public health research. Future work will explore FCRF's impact on standard error and variance estimates to ensure the method's robustness and to prevent potential biases in statistical inferences. This research contributes to enhancing data integrity and supporting informed decision-making in public health.

Table of Contents

Chapter 1. Introduction and literature review

1.1 Relevance of Data Imputation In Public Health

1.2 Literature Review

1.3 Ethical Considerations

Chapter 2. Methodology

2.1 Software and Package Utilization

2.2 Data Collection and Processing

2.3 Variable Selection

2.4 Data Preprocessing

2.41 Missing Completely At Random (MCAR)

2.42 Missing At Random (MAR)

2.43 Missing Not At Random (MNAR)

2.5 Theoretical Foundation and Algorithmic Framework

2.6 Imputation Process

2.7 Evaluation Metrics

2.71 Continuous Variables Evaluation

2.72 Categorical and Binary Variables Evaluation

Chapter 3. Results

3.1 MCAR Data Scenario

3.2 MAR Data Scenario

3.3 MNAR Data Scenario

3.4 Discussion

Chapter 4. Conclusion and Future work



About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English
