Unraveling the Impact of Fuzzy Similarity Algorithms on Missing Data Imputation of Heart Bypass Surgery Cohort

Open Access

Shen, Hong-Jui (Spring 2024)

Permanent URL: https://etd.library.emory.edu/concern/etds/db78td600?locale=en

Published

Abstract

Objective: This thesis introduces the Fuzzy C-Means based Random Forest (FCRF) method, developed to address the limitations of existing data imputation techniques in public health datasets. Aimed at enhancing imputation accuracy, FCRF integrates fuzzy logic and similarity learning to navigate complex missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
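To make the three mechanisms concrete, the following is a minimal sketch of how missingness masks could be simulated for one variable. It is not the procedure used in the thesis: the function names `make_missing_masks` and `_scaled_probs`, the `rate` parameter, and the rank-based probability rule are all assumptions chosen for illustration.

```python
import numpy as np


def _scaled_probs(v, rate):
    """Hypothetical helper: probabilities that rise with the rank of v.

    The probabilities average to `rate` as long as rate <= 0.5.
    """
    ranks = np.argsort(np.argsort(v))      # rank of each entry, 0..n-1
    frac = (ranks + 0.5) / len(v)          # values in (0, 1) with mean 0.5
    return np.clip(2 * rate * frac, 0.0, 1.0)


def make_missing_masks(x, z, rate=0.2, seed=0):
    """Return boolean masks (True = missing) for x under three mechanisms.

    x: variable that will receive missing values
    z: fully observed covariate that drives MAR missingness
    """
    rng = np.random.default_rng(seed)
    n = len(x)

    # MCAR: every entry has the same chance of being missing.
    mcar = rng.random(n) < rate

    # MAR: the chance depends only on the observed covariate z.
    mar = rng.random(n) < _scaled_probs(z, rate)

    # MNAR: the chance depends on the value of x itself,
    # which is exactly the value that ends up unobserved.
    mnar = rng.random(n) < _scaled_probs(x, rate)

    return mcar, mar, mnar
```

Under this sketch, larger values of `x` are more likely to be missing in the MNAR mask, so the observed values are a biased sample of `x`. That bias is what makes MNAR harder than MCAR for methods that learn only from observed entries.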

Method: The performance of FCRF is evaluated against traditional imputation methods (Mean, K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations (MICE), and Iterative Imputation) using Average RMSE, Normalized RMSE, Mean Absolute Error (MAE), Weighted F1-Score, and Normalized Accuracy. The comparison covers the MCAR, MAR, and MNAR scenarios so that each method's effectiveness is assessed under every missing data mechanism.
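As a rough illustration of the metrics named above, the sketch below computes RMSE, a normalized RMSE, MAE, and a support-weighted F1-score on imputed versus true values. The function names are hypothetical, and dividing RMSE by the range of the true values is an assumption, since the abstract does not say how the normalization is done. Normalized Accuracy is omitted because its definition is not given here.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error for continuous variables."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def nrmse(y_true, y_pred):
    """RMSE divided by the range of the true values (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    return rmse(y_true, y_pred) / float(y_true.max() - y_true.min())


def mae(y_true, y_pred):
    """Mean absolute error for continuous variables."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn)            # weight by class support
    return float(total / len(y_true))
```

In an imputation study these functions would typically be applied only to the entries that were masked, comparing the imputed values with the held-out originals.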

Results: FCRF performs competitively across all scenarios and excels in complex MNAR situations, where conventional methods falter. Its design, which combines clustering with predictive modeling, offers capabilities well suited to public health research.

Conclusion: FCRF marks a significant advancement in data imputation, promising more accurate and reliable analyses for public health research. Future work will examine FCRF's effect on standard error and variance estimates to confirm the method's robustness and to prevent bias in statistical inference. This research contributes to enhancing data integrity and supporting informed decision-making in public health.

Table of Contents

Chapter 1. Introduction and Literature Review

1.1 Relevance of Data Imputation in Public Health

1.2 Literature Review

1.3 Ethical Considerations

Chapter 2. Methodology

2.1 Software and Package Utilization

2.2 Data Collection and Processing

2.3 Variable Selection

2.4 Data Preprocessing

2.41 Missing Completely At Random (MCAR)

2.42 Missing At Random (MAR)

2.43 Missing Not At Random (MNAR)

2.5 Theoretical Foundation and Algorithmic Framework

2.6 Imputation Process

2.7 Evaluation Metrics

2.71 Continuous Variables Evaluation

2.72 Categorical and Binary Variables Evaluation

Chapter 3. Results

3.1 MCAR Data Scenario

3.2 MAR Data Scenario

3.3 MNAR Data Scenario

3.4 Discussion

Chapter 4. Conclusion and Future Work

Appendix

References

About this Master's Thesis

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Rollins School of Public Health
Department	Biostatistics
Subfield / Discipline	Biostatistics - MPH & MSPH
Degree	M.S.P.H.
Submission	Master's Thesis
Language	English
Research Field	Biology, Biostatistics
Keyword	Data Imputation, Public Health, Fuzzy C-Means, Random Forest, Missing Data, Machine Learning, Similarity Learning.
Committee Chair / Thesis Advisor	Rameshbabu Manyam, Emory University
Committee Members	Tarrant McPherson, Emory University

Primary PDF

Title	Date Uploaded
Unraveling the Impact of Fuzzy Similarity Algorithms on Missing Data Imputation of Heart Bypass Surgery Cohort	2024-04-08 11:45:24 -0400
