Synthetic Simplicity: Unveiling Bias in Medical Data Augmentation

Krishan Agyakari Raja Babu, Rachana Sathish, Mrunal Pattanaik, Rahul Venkataramani
2024-07-31
Abstract: Synthetic data is becoming increasingly integral in data-scarce fields such as medical imaging, serving as a substitute for real data. However, its inherent statistical characteristics can significantly impact downstream tasks, potentially compromising deployment performance. In this study, we empirically investigate this issue and uncover a critical phenomenon: downstream neural networks often exploit spurious distinctions between real and synthetic data when there is a strong correlation between the data source and the task label. This exploitation manifests as simplicity bias, where models overly rely on superficial features rather than genuine task-related complexities. Through principled experiments, we demonstrate that the source of data (real vs. synthetic) can introduce spurious correlating factors leading to poor performance during deployment when the correlation is absent. We first demonstrate this vulnerability on a digit classification task, where the model spuriously utilizes the source of data instead of the digit to provide an inference. We provide further evidence of this phenomenon in a medical imaging problem related to cardiac view classification in echocardiograms, particularly distinguishing between 2-chamber and 4-chamber views. Given the increasing role of utilizing synthetic datasets, we hope that our experiments serve as effective guidelines for the utilization of synthetic datasets in model training.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a failure mode of synthetic data augmentation in medical imaging: downstream neural networks may rely on superficial differences between synthetic and real data (simplicity bias) rather than on task-relevant features, which degrades performance once the model is deployed. Specifically, when the data source (real or synthetic) is strongly correlated with the task label, the model can learn to separate classes by source instead of by content, and it then performs poorly when that correlation is absent at deployment. The paper demonstrates this on a digit classification task and on cardiac view classification in echocardiograms (2-chamber vs. 4-chamber views), and offers its experiments as guidance for using synthetic data in model training. This matters particularly in medicine, where errors in image interpretation carry higher risk.
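
The condition described above, a strong correlation between data source and task label, can be checked before training. The following is a minimal sketch of such a check, not the authors' procedure: the function names, the "2ch"/"4ch" label strings, and the toy 90%/0% split are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def synthetic_fraction_per_label(labels, sources, synthetic_tag="synthetic"):
    """Return, for each label, the fraction of its samples that are synthetic.

    `labels` and `sources` are parallel sequences; `synthetic_tag` is the
    (assumed) marker used for synthetic samples.
    """
    totals = Counter(labels)
    synthetic = Counter(
        label for label, source in zip(labels, sources) if source == synthetic_tag
    )
    return {label: synthetic[label] / totals[label] for label in totals}


def max_fraction_gap(fractions):
    """Largest difference in synthetic fraction between any two labels.

    A gap near 0 means source and label are roughly independent; a gap
    near 1 means the source alone almost determines the label.
    """
    values = list(fractions.values())
    return max(values) - min(values)


# Hypothetical toy training set: one class is entirely real,
# the other is 90% synthetic.
labels = ["2ch"] * 100 + ["4ch"] * 100
sources = ["real"] * 100 + ["synthetic"] * 90 + ["real"] * 10

fractions = synthetic_fraction_per_label(labels, sources)
gap = max_fraction_gap(fractions)
print(fractions, gap)
```

A large gap only flags the risk; whether a model actually exploits the source would still have to be tested on data where the correlation is absent, as the paper's experiments do.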