Data augmentation in microscopic images for material data mining

Boyuan Ma, Xiaoyan Wei, Chuni Liu, Xiaojuan Ban, Haiyou Huang, Hao Wang, Weihua Xue, Stephen Wu, Mingfei Gao, Qing Shen, Michele Mukeshimana, Adnan Omer Abuassba, Haokai Shen, Yanjing Su
DOI: https://doi.org/10.1038/s41524-020-00392-6
IF: 12.256
2020-08-18
npj Computational Materials
Abstract: Recent progress in material data mining has been driven by high-capacity models trained on large datasets. However, collecting experimental data (real data) has been extremely costly owing to the amount of human effort and expertise required. Here, we develop a novel transfer learning strategy to address problems of small or insufficient data. This strategy realizes the fusion of real and simulated data and the augmentation of training data in a data mining procedure. For a specific task of grain instance image segmentation, this strategy aims to generate synthetic data by fusing the images obtained from simulating the physical mechanism of grain formation and the “image style” information in real images. The results show that the model trained with the acquired synthetic data and only 35% of the real data can already achieve competitive segmentation performance of a model trained on all of the real data. Because the time required to perform grain simulation and to generate synthetic data are almost negligible as compared to the effort for obtaining real data, our proposed strategy is able to exploit the strong prediction power of deep learning without significantly increasing the experimental burden of training data preparation.
materials science, multidisciplinary; chemistry, physical
What problem does this paper attempt to address?
The paper addresses a common problem in materials science: high-quality data for training machine learning models, particularly deep learning models, are scarce because collecting experimental data (real data) is time-consuming and demands considerable expertise. The authors propose a transfer learning strategy that augments the training set by fusing real data with data generated through simulation (simulated data). The strategy is demonstrated on grain instance image segmentation, the task of identifying and isolating individual grains in microscopic images. Synthetic training images are generated by combining images obtained from simulating the physical mechanism of grain formation with "image style" information taken from real images. A model trained on the synthetic data plus only 35% of the real data achieves segmentation performance competitive with that of a model trained on all of the real data. Because grain simulation and synthetic data generation take almost negligible time compared with acquiring real data, the strategy exploits the predictive power of deep learning while reducing the time and labor needed to collect and annotate real microscopic images of materials.
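The data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation. A Voronoi tessellation stands in for the physics-based grain simulation, which is an assumption made here for brevity. The "image style" fusion step is omitted because it would require a trained image-to-image model. The function names `simulate_grain_labels`, `boundary_mask`, and `build_training_set` are hypothetical.

```python
import numpy as np


def simulate_grain_labels(size=128, n_grains=40, seed=0):
    """Toy grain label map (assumption: Voronoi cells stand in for a
    physical grain-growth simulation). Each pixel gets the index of its
    nearest randomly placed seed point."""
    rng = np.random.default_rng(seed)
    seeds = rng.uniform(0, size, size=(n_grains, 2))
    ys, xs = np.mgrid[0:size, 0:size]
    pixels = np.stack([ys, xs], axis=-1).reshape(-1, 1, 2).astype(float)
    # Squared distance from every pixel to every seed point.
    d2 = ((pixels - seeds[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).reshape(size, size)


def boundary_mask(labels):
    """Boolean mask that is True where a pixel's grain label differs
    from its right or lower neighbour (a simple grain-boundary map)."""
    mask = np.zeros(labels.shape, dtype=bool)
    mask[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    mask[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return mask


def build_training_set(real_ids, synthetic_ids, real_fraction=0.35, seed=0):
    """Keep a random fraction of the real samples and append all
    synthetic samples. The default 0.35 mirrors the 35% figure quoted
    in the summary; the sampling scheme itself is an assumption."""
    rng = np.random.default_rng(seed)
    n_real = int(round(real_fraction * len(real_ids)))
    kept = rng.choice(len(real_ids), size=n_real, replace=False)
    return [real_ids[i] for i in sorted(kept)] + list(synthetic_ids)


if __name__ == "__main__":
    labels = simulate_grain_labels(size=64, n_grains=10)
    mask = boundary_mask(labels)
    train = build_training_set(list(range(100)), ["syn_0", "syn_1"])
    print(labels.shape, int(mask.sum()), len(train))
```

In a full pipeline, each simulated label map would be passed through a style-transfer model fitted to real micrographs before being added to the training set. The simulated boundaries would then serve as free segmentation labels for the resulting synthetic images.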