Alqulaity, M. and Yang, P. orcid.org/0000-0002-8553-7127 (2024) Enhanced conditional GAN for high-quality synthetic tabular data generation in mobile-based cardiovascular healthcare. Sensors, 24 (23). 7673. ISSN 1424-8220
Abstract
The generation of synthetic tabular data has emerged as a critical task in various fields, particularly in healthcare, where data privacy concerns limit the availability of real datasets for research and analysis. This paper presents an enhanced Conditional Generative Adversarial Network (GAN) architecture designed for generating high-quality synthetic tabular data, with a focus on cardiovascular disease datasets that encompass mixed data types and complex feature relationships. The proposed architecture employs specialized sub-networks to process continuous and categorical variables separately, leveraging metadata such as Gaussian Mixture Model (GMM) parameters for continuous attributes and embedding layers for categorical features. By integrating these specialized pathways, the generator produces synthetic samples that closely mimic the statistical properties of the real data. Comprehensive experiments were conducted to compare the proposed architecture with two established models: Conditional Tabular GAN (CTGAN) and Tabular Variational AutoEncoder (TVAE). The evaluation used the Kolmogorov–Smirnov (KS) test for continuous variables, the Jaccard coefficient for categorical variables, and pairwise correlation analyses. Results indicate that the proposed approach attains a mean KS statistic of 0.3900, which outperforms CTGAN (0.4803) and is comparable to TVAE (0.3858). Notably, our approach shows the lowest KS statistics for key continuous features, such as total cholesterol (KS = 0.0779), weight (KS = 0.0861), and diastolic blood pressure (KS = 0.0957), indicating its effectiveness in closely replicating real data distributions. Additionally, it achieved a Jaccard coefficient of 1.00 for eight out of eleven categorical variables, effectively preserving categorical distributions. These findings indicate that the proposed architecture captures both distributions and dependencies, providing a robust solution for supporting mobile personalized cardiovascular disease prevention systems.
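The two per-column metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `ks_statistic` and `jaccard` are hypothetical, the toy values are invented for illustration and are not taken from the paper, and the Jaccard coefficient is assumed here to be computed over the sets of distinct category values in a column, which may differ from the paper's exact definition.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    # The gap between two step functions can only peak at an observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def jaccard(real, synthetic):
    """Jaccard coefficient over distinct category values (assumed definition)."""
    r, s = set(real), set(synthetic)
    union = r | s
    return len(r & s) / len(union) if union else 1.0


# Toy columns, invented for illustration only (not data from the paper).
real_weight = [61.0, 72.5, 80.0, 95.0]
synthetic_weight = [60.0, 74.0, 81.5, 93.0]
print(ks_statistic(real_weight, synthetic_weight))
print(jaccard(["yes", "no"], ["no", "yes", "yes"]))
```

Under these assumptions, a KS statistic near 0 means the synthetic column's distribution closely tracks the real one, and a Jaccard coefficient of 1.00 means the synthetic column contains exactly the same set of categories as the real column.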
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Alqulaity, M. and Yang, P. orcid.org/0000-0002-8553-7127 |
Copyright, Publisher and Additional Information: | © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
Keywords: | tabular data; generative adversarial networks; synthetic data generation; cardiovascular disease; medical informatics; machine learning in healthcare |
Dates: |
|
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 02 Dec 2024 09:03 |
Last Modified: | 02 Dec 2024 09:03 |
Published Version: | https://www.mdpi.com/1424-8220/24/23/7673 |
Status: | Published |
Publisher: | MDPI AG |
Refereed: | Yes |
Identification Number: | 10.3390/s24237673 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:220302 |