Joel, L.O., Doorsamy, W. orcid.org/0000-0001-9043-9882 and Paul, B.S. (2025) A comparative study of imputation techniques for missing values in healthcare diagnostic datasets. International Journal of Data Science and Analytics. ISSN 2364-415X
Abstract
Missing values are a common feature of real-world datasets, particularly in healthcare data. They pose a challenge when applying machine learning algorithms, as most models perform poorly in the presence of incomplete data. The goal of this study is to evaluate the performance of seven imputation techniques on three healthcare datasets: Mean Imputation, Median Imputation, Last Observation Carried Forward (LOCF), K-Nearest Neighbor (KNN) Imputation, Interpolation, MissForest, and Multiple Imputation by Chained Equations (MICE). Missing data were introduced at four levels (10%, 15%, 20%, and 25%), and each imputation technique was used to fill in the gaps. The methods were compared using root mean squared error (RMSE) and mean absolute error (MAE). The results indicate that MissForest imputation performed best, followed by MICE. Additionally, we examined whether feature selection should be performed before or after imputation, using recall, precision, F1-score, and accuracy as evaluation metrics. The results suggest that performing imputation before feature selection is the better order. Since research on the order of imputation and feature selection is limited, and the question remains under debate among researchers, we hope the findings of this study will encourage data scientists and researchers to prioritize imputation before feature selection when working with datasets containing missing values.
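The evaluation loop described in the abstract (mask a fraction of values, impute, score with RMSE and MAE) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code. It assumes values are removed completely at random and that errors are scored only on the masked cells; the abstract states neither detail. The data are synthetic, all function names are hypothetical, and only mean and median imputation are shown because the other techniques need external libraries.

```python
import math
import random
import statistics


def mask_at_random(rows, rate, rng):
    """Copy `rows`, replacing each cell with None with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_columns(rows, stat):
    """Fill None cells with a per-column statistic of the observed values."""
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        fills.append(stat(observed))
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def rmse_mae(truth, masked, imputed):
    """Return (RMSE, MAE) computed over the cells that were masked out."""
    errors = [t - p
              for t_row, m_row, p_row in zip(truth, masked, imputed)
              for t, m, p in zip(t_row, m_row, p_row)
              if m is None]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


def evaluate(rows, rates=(0.10, 0.15, 0.20, 0.25), seed=0):
    """Score mean and median imputation at each missingness level."""
    rng = random.Random(seed)
    results = {}
    for rate in rates:
        masked = mask_at_random(rows, rate, rng)
        for name, stat in (("mean", statistics.fmean),
                           ("median", statistics.median)):
            imputed = impute_columns(masked, stat)
            results[(rate, name)] = rmse_mae(rows, masked, imputed)
    return results


# Synthetic stand-in for a complete dataset (assumption: 500 rows, 6 features).
data_rng = random.Random(42)
data = [[data_rng.gauss(50.0, 10.0) for _ in range(6)] for _ in range(500)]

results = evaluate(data)
for (rate, name), (rmse, mae) in sorted(results.items()):
    print(f"{rate:.0%} missing, {name}: RMSE={rmse:.3f}, MAE={mae:.3f}")
```

Scoring only the masked cells keeps the untouched observed values from diluting the error, which is one common convention for this kind of comparison; a study may define the metrics differently.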
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Joel, L.O.; Doorsamy, W.; Paul, B.S. |
Copyright, Publisher and Additional Information: | © The Author(s) 2025. This is an open access article under the terms of the Creative Commons Attribution License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. |
Keywords: | Missing data imputation; Healthcare datasets; Machine learning; Imputation techniques; MissForest |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Electronic & Electrical Engineering (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 07 Jul 2025 14:50 |
Last Modified: | 07 Jul 2025 14:50 |
Status: | Published |
Publisher: | Springer Nature |
Identification Number: | 10.1007/s41060-025-00825-9 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:228672 |