A robust synthetic data generation framework for machine learning in High-Resolution Transmission Electron Microscopy (HRTEM)

Luis Rangel DaCosta, Katherine Sytwu, Catherine Groschner, Mary Scott
2023-09-12
Abstract: Machine learning techniques are attractive options for developing highly accurate automated analysis tools for nanomaterials characterization, including high-resolution transmission electron microscopy (HRTEM). However, successfully implementing such machine learning tools can be difficult due to the challenges in procuring sufficiently large, high-quality training datasets from experiments. In this work, we introduce Construction Zone, a Python package for rapidly generating complex nanoscale atomic structures, and develop an end-to-end workflow for creating large simulated databases for training neural networks. Construction Zone enables fast, systematic sampling of realistic nanomaterial structures, and can be used as a random structure generator for simulated databases, which is important for generating large, diverse synthetic datasets. Using HRTEM imaging as an example, we train a series of neural networks on various subsets of our simulated databases to segment nanoparticles and holistically study the data curation process to understand how various aspects of the curated simulated data -- including simulation fidelity, the distribution of atomic structures, and the distribution of imaging conditions -- affect model performance across several experimental benchmarks. Using our results, we are able to achieve state-of-the-art segmentation performance on experimental HRTEM images of nanoparticles from several experimental benchmarks and, further, we discuss robust strategies for consistently achieving high performance with machine learning in experimental settings using purely synthetic data.
Materials Science, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses the difficulty of applying machine learning to nanomaterial characterization in high-resolution transmission electron microscopy (HRTEM). Machine-learning methods can provide efficient, automated analysis tools, but in practice they are hard to deploy because sufficiently large, high-quality experimental training datasets are difficult and costly to obtain. The main challenges are:

1. **Acquisition of high-quality training datasets**: Training high-performance machine-learning models requires a large amount of high-quality labeled data. Manually assembling a sufficiently large and diverse experimental dataset is time-consuming and prone to human and experimental biases, which degrade model performance.
2. **Diversity and coverage of datasets**: Experimental data rarely cover the full range of experimental conditions and sample types, which limits how well a model generalizes. In nanomaterial characterization in particular, the complexity and diversity of sample structures place higher demands on the dataset.
3. **Effective use of synthetic data**: Synthetic data can sidestep the difficulty of collecting experimental data, but generating high-quality, diverse synthetic data and ensuring that it is effective for training models remains a key issue.

To address these problems, the authors developed Construction Zone, a Python package for rapidly generating complex nanoscale atomic structures, and combined it with HRTEM simulation to build large synthetic databases. With this workflow, they systematically sample realistic nanomaterial structures and study how dataset characteristics (simulation fidelity, the distribution of atomic structures, and the distribution of imaging conditions) affect model performance.
Finally, the authors show that neural networks trained purely on synthetic data can segment nanoparticles in experimental HRTEM images with high accuracy, reaching state-of-the-art performance on several experimental benchmarks.
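To make the idea of a random structure generator concrete, the following is a minimal sketch in plain NumPy. It does not use Construction Zone and does not reflect its API; the function names `fcc_sphere` and `sample_structures`, the radius range, and the lattice constant (an illustrative value close to that of gold) are all assumptions chosen for illustration. Each sample is a spherical FCC nanoparticle with a randomly drawn radius and a random orientation.

```python
import numpy as np


def fcc_sphere(lattice_const, radius, rng):
    """Carve a sphere out of an FCC lattice and rotate it randomly.

    Hypothetical helper for illustration; returns an (N, 3) array of positions.
    """
    # Four-atom FCC basis in fractional coordinates of the cubic cell.
    basis = np.array([[0.0, 0.0, 0.0],
                      [0.5, 0.5, 0.0],
                      [0.5, 0.0, 0.5],
                      [0.0, 0.5, 0.5]])
    # Enough unit cells along each axis to enclose the sphere.
    n = int(np.ceil(radius / lattice_const)) + 1
    idx = np.arange(-n, n + 1)
    cells = np.array(np.meshgrid(idx, idx, idx, indexing="ij")).reshape(3, -1).T
    pos = (cells[:, None, :] + basis[None, :, :]).reshape(-1, 3) * lattice_const
    # Keep only atoms inside the sphere centred at the origin.
    pos = pos[np.linalg.norm(pos, axis=1) <= radius]
    # Random rotation from the QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix column signs for a uniform distribution
    if np.linalg.det(q) < 0:      # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return pos @ q.T


def sample_structures(n_samples, seed=0):
    """Draw random nanoparticles; the parameter ranges are assumptions."""
    rng = np.random.default_rng(seed)
    structures = []
    for _ in range(n_samples):
        radius = rng.uniform(10.0, 25.0)  # angstroms, assumed range
        positions = fcc_sphere(4.08, radius, rng)  # assumed lattice constant
        structures.append({"radius": radius, "positions": positions})
    return structures


if __name__ == "__main__":
    for s in sample_structures(3, seed=1):
        print(f"radius = {s['radius']:.1f} A, atoms = {len(s['positions'])}")
```

In a full workflow of the kind described above, each sampled structure would then be passed to an HRTEM image simulation under sampled imaging conditions to produce a labeled training image; that step is omitted here.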