Manuel González Lastre, Pablo Pou, Miguel Wiche, Daniel Ebeling, Andre Schirmeisen, Rubén Pérez
Abstract: Non-Contact Atomic Force Microscopy with CO-functionalized metal tips (referred to as HR-AFM) provides access to the internal structure of individual molecules adsorbed on a surface with totally unprecedented resolution. Previous works have shown that deep learning (DL) models can retrieve the chemical and structural information encoded in a 3D stack of constant-height HR-AFM images, leading to molecular identification. In this work, we overcome their limitations by using a well-established description of the molecular structure in terms of topological fingerprints, the 1024-bit Extended Connectivity Chemical Fingerprints of radius 2 (ECFP4), that were developed for substructure and similarity searching. ECFPs provide local structural information of the molecule, each bit correlating with a particular substructure within the molecule. Our DL model is able to extract this optimized structural descriptor from the 3D HR-AFM stacks and use it, through virtual screening, to identify molecules from their predicted ECFP4 with a retrieval accuracy on theoretical images of 95.4%. Furthermore, this approach, unlike previous DL models, assigns a confidence score, the Tanimoto similarity, to each of the candidate molecules, thus providing information on the reliability of the identification.
What problem does this paper attempt to address?
The paper addresses how to identify molecules from high-resolution atomic force microscopy (HR-AFM) images. Specifically, the authors propose extracting a molecular fingerprint, the 1024-bit extended-connectivity fingerprint ECFP4, from a stack of HR-AFM images and then using that fingerprint to identify the molecule through virtual screening. The method recovers not only structural information about the molecule but also information about its chemical composition, thus overcoming the limitations of previous HR-AFM-based molecular recognition methods.
### Main contributions:
1. **Molecular fingerprint extraction**: The authors developed a deep-learning model that extracts an optimized molecular structure descriptor (ECFP4) from HR-AFM image stacks and identifies molecules through virtual screening.
2. **High accuracy**: The model reaches a retrieval accuracy of 95.4% on theoretical images and assigns a confidence score (the Tanimoto similarity) to each candidate molecule, which indicates how reliable each identification is.
3. **Combining global information**: To further improve accuracy, the authors introduced a second deep-learning model that predicts the chemical formula of the molecule from the same HR-AFM image stack, raising the recognition accuracy to 97.6%.
4. **Experimental verification**: The authors tested the method on a limited set of experimental images; the results suggest that it can carry over to practical applications.
### Method overview:
- **Dataset**: The QUAM-AFM dataset, which contains 165 million HR-AFM images of quasi-planar organic molecules selected from PubChem.
- **Model architecture**: Two convolutional neural network (CNN) models were developed:
  - One predicts the molecular fingerprint (ECFP4). It adopts the EfficientNet-B0 architecture, with the input layer modified to accept a stack of 10 constant-height HR-AFM images.
  - The other predicts the chemical formula. It adopts a similar architecture, but its final layer is a dense layer with 10 neurons and ReLU activation.
- **Training and evaluation**: The models are trained and evaluated mainly on simulated images, using a binary cross-entropy loss with balanced positive weights. During training, several data augmentation techniques are applied to mimic the effect of experimental conditions.
- **Virtual screening**: The Tanimoto similarity is computed between the predicted fingerprint and the fingerprint of every molecule in the reference database, and the molecules with the highest similarity are selected as candidates.
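The training loss mentioned in the list above can be illustrated with a minimal NumPy sketch. It assumes that "balanced positive weights" means weighting the positive term of each fingerprint bit by the ratio of negative to positive examples for that bit, the same convention as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`; the summary does not state the exact formula or framework the authors used. The function names, the smoothing term `eps`, and the random data are hypothetical.

```python
import numpy as np

def balanced_pos_weights(targets: np.ndarray, eps: float = 1.0) -> np.ndarray:
    """Per-bit weight (#negatives + eps) / (#positives + eps).

    Assumed definition of "balanced positive weights"; eps is a hypothetical
    smoothing term that avoids division by zero for bits that are never set.
    """
    positives = targets.sum(axis=0)
    negatives = targets.shape[0] - positives
    return (negatives + eps) / (positives + eps)

def weighted_bce(logits: np.ndarray, targets: np.ndarray, pos_weight: np.ndarray) -> float:
    """Binary cross-entropy on logits, with the positive term scaled per bit."""
    # Numerically stable log-sigmoid: log(sigmoid(x)) = -logaddexp(0, -x)
    log_p = -np.logaddexp(0.0, -logits)
    # log(1 - sigmoid(x)) = -logaddexp(0, x)
    log_not_p = -np.logaddexp(0.0, logits)
    loss = -(pos_weight * targets * log_p + (1.0 - targets) * log_not_p)
    return float(loss.mean())

# Illustration on random stand-in data (not real fingerprints or model outputs).
rng = np.random.default_rng(0)
targets = (rng.random((256, 1024)) < 0.05).astype(float)  # sparse 1024-bit targets
logits = rng.normal(size=targets.shape)                   # fake network outputs
pos_weight = balanced_pos_weights(targets)
print(weighted_bce(logits, targets, pos_weight))
```

Because fingerprint bits are mostly zero, an unweighted loss would reward a model for predicting all zeros; scaling the positive term counteracts that imbalance.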
### Results and discussion:
- **Prediction performance**: On the test set, the median Tanimoto similarity between the predicted and true ECFP4 fingerprints is 0.95, so the model recovers most fingerprint bits correctly from the HR-AFM images.
- **Molecular recognition**: Through virtual screening, the model identifies the molecule in most cases. Even a predicted fingerprint that is not completely correct usually carries enough information for identification.
- **Experimental verification**: Preliminary tests on experimental images indicate that the method also performs well under real measurement conditions.
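The virtual screening step, and the reason an imperfect fingerprint can still retrieve the right molecule, can be sketched as follows. For two bit vectors with sets of active bits A and B, the Tanimoto similarity is |A ∩ B| / |A ∪ B|. The sketch uses random sparse bit vectors as stand-ins for real ECFP4 fingerprints, and the function names, database size, bit density, and number of flipped bits are all hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

def tanimoto(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Tanimoto similarity between one binary fingerprint and each database row."""
    query = query.astype(bool)
    database = database.astype(bool)
    shared = (database & query).sum(axis=1)  # bits set in both
    union = (database | query).sum(axis=1)   # bits set in either
    # Convention chosen here: two all-zero fingerprints get a score of 0.
    return np.divide(shared, union, out=np.zeros(len(database)), where=union > 0)

def screen(query: np.ndarray, database: np.ndarray, top_k: int = 5):
    """Return the top_k (index, score) pairs, best match first."""
    scores = tanimoto(query, database)
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Random sparse stand-ins for a reference database of 1024-bit fingerprints.
rng = np.random.default_rng(0)
database = rng.random((1000, 1024)) < 0.05
true_index = 123

# Simulate an imperfect prediction by flipping 10 bits of the true fingerprint.
predicted = database[true_index].copy()
flipped = rng.choice(1024, size=10, replace=False)
predicted[flipped] = ~predicted[flipped]

hits = screen(predicted, database)
print(hits[0])  # best candidate and its Tanimoto score
```

The score attached to each candidate plays the role of the confidence score described in the abstract: a top hit well separated from the runners-up suggests a reliable identification, while several candidates with similar scores suggest an ambiguous one.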
In conclusion, the paper proposes a molecular recognition method based on HR-AFM images. By combining deep learning with molecular fingerprints, it improves both the accuracy of the identification and, through the Tanimoto confidence score, the ability to judge its reliability.