Generation of synthetic data using breast cancer dataset and classification with resnet18

Dilsat Berin Aytar, Semra Gunduc
DOI: https://doi.org/10.48550/arXiv.2405.16286
2024-05-25
Abstract: As technology advances rapidly, data has become an essential resource in many fields. When collected, organized, and analyzed correctly, it is a powerful tool for decision-making and process improvement across a wide range of sectors. Synthetic data is needed for several reasons, including the limitations of real data, the expense of collecting labeled data, and privacy and security concerns in specific situations and domains. It is particularly valuable in the health sector, where security, ethics, legal restrictions, and the sensitivity and privacy of the data all constrain the use of real data. The GAN (Generative Adversarial Network) is a deep learning model developed to generate synthetic data. In this study, the Breast Histopathology dataset was used to generate synthetic patch images labeled malignant and negative using MSG-GAN (Multi-Scale Gradients for Generative Adversarial Networks), a form of GAN, to aid in cancer identification. The ResNet18 model was then used to classify both synthetic and real data via transfer learning. Finally, the study examined whether the synthetic images behave like the real data and are comparable to the original data.
Machine Learning
What problem does this paper attempt to address?
The paper addresses data insufficiency and privacy protection in breast cancer pathology image datasets by generating synthetic data to assist in cancer recognition. Specifically, the study uses MSG-GAN (a variant of Generative Adversarial Networks) to generate synthetic images labeled as malignant (IDC+) and non-malignant (IDC-) from a breast cancer histopathology dataset. A pre-trained ResNet18 model is then used to classify both synthetic and real data through transfer learning. The main objectives of the study are:

1. **Generate high-fidelity synthetic images**: Use MSG-GAN to generate synthetic breast cancer pathology images that are highly similar to real images, overcoming the limitations and privacy issues of real data.
2. **Evaluate the quality of synthetic images**: Classify the generated synthetic images using the ResNet18 model to verify whether they mimic the behavior of real data and perform well in classification tasks.
3. **Improve the accuracy of cancer recognition**: Enhance the diversity and richness of the dataset by generating more synthetic data, thereby improving the model's performance in practical applications.

The paper validates the quality of the synthetic data through four different classification experiments and evaluates the results using accuracy, precision, recall, and F1 score. The results show that when synthetic data is used as the training set, the model learns the data distribution well, and that when real data is used for testing, the synthetic data still simulates real data relatively well despite some differences. This indicates that synthetic data can, to some extent, replace real data for training and classification tasks.
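The four evaluation metrics named above can be illustrated for the binary IDC+ / IDC- case. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics`, the label encoding (1 for IDC+, 0 for IDC-), and the toy label lists are all assumptions made for illustration and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumption: labels are encoded as 1 (IDC+) and 0 (IDC-).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with IDC+ treated as the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels invented for illustration only; they are not results from the paper.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(binary_metrics(toy_true, toy_pred))
```

With these toy labels there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so each of the four metrics works out to 0.75. In a setup like the one described, the same function could be applied to the predictions from each train/test combination of real and synthetic patches, so that the scores can be compared side by side.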