Enhancing supervised analysis of imbalanced untargeted metabolomics datasets using a CWGAN-GP framework for data augmentation

Francisco Traquete, Marta Sousa Silva, António E.N. Ferreira
DOI: https://doi.org/10.1016/j.compbiomed.2024.109414
IF: 7.7
2024-11-22
Computers in Biology and Medicine
Abstract: Untargeted metabolomics is an extremely useful approach for the discrimination of biological systems and biomarker identification. However, data analysis workflows are complex and face many challenges. Two of these challenges are the need for large sample sizes and the possibility of severe class imbalance, which is particularly common in clinical studies. The latter can make statistical models less generalizable, increase the risk of overfitting and skew the analysis in favour of the majority class. One possible approach to mitigate this problem is data augmentation. However, the use of artificial data requires adequate data augmentation methods and criteria for assessing the quality of the generated data. In this work, we used Conditional Wasserstein Generative Adversarial Networks with Gradient Penalty (CWGAN-GPs) for data augmentation of metabolomics data. Using a set of benchmark datasets, we applied several criteria for the evaluation of the quality of generated data and assessed the performance of supervised predictive models trained with datasets that included such data. CWGAN-GP models generated realistic data with identical characteristics to real samples, mostly avoiding mode collapse. Furthermore, in cases of class imbalance, the performance of predictive models improved when the minority class was supplemented with generated samples. This improvement is evident for high-quality datasets with well-separated classes. Conversely, model improvements were quite modest for datasets with high class overlap. This trend was confirmed by using synthetic datasets with different class separation levels. Data augmentation is a viable procedure to alleviate class imbalance problems but is not universally beneficial in metabolomics.
engineering, biomedical; computer science, interdisciplinary applications; mathematical & computational biology; biology