Synthetic data in biomedicine via generative artificial intelligence

Boris van Breugel, Tennison Liu, Dino Oglic, Mihaela van der Schaar
DOI: https://doi.org/10.1038/s44222-024-00245-7
2024-10-09
Nature Reviews Bioengineering
Abstract: The creation and application of data in biomedicine and healthcare often face privacy constraints, bias, distributional shifts, underrepresentation of certain groups and data scarcity. Some of these challenges may be addressed by synthetic data, which can be generated by deep generative models. In this Review, we highlight how data-driven synthetic data can be created not only to overcome privacy concerns associated with real data, but also to expand and improve real data. In particular, generative-model-based data augmentation can address data scarcity; synthetic data can improve data fairness and reduce bias by accounting for underrepresented groups; and unseen scenarios may be simulated with synthetic data. We further examine how biomedically relevant data, such as molecular, imaging and tabular data, may be created by foundation models through query-specific generation. We outline the challenges associated with ownership, publication, sharing and access of synthetic data. Importantly, we discuss approaches that can be applied to measure the quality of data generated by deep generative models to improve trust in synthetic data and the results derived from such data.