MICER: A Pre-trained Encoder-Decoder Architecture for Molecular Image Captioning

Jiacai Yi, Chengkun Wu, Xiaochen Zhang, Xinyi Xiao, Yanlong Qiu, Wentao Zhao, Tingjun Hou, Dongsheng Cao
DOI: https://doi.org/10.1093/bioinformatics/btac545
IF: 5.8
2022-08-06
Bioinformatics
Abstract:
Motivation: Automatic recognition of chemical structures from molecular images provides an important avenue for the rediscovery of chemicals. Traditional rule-based approaches that rely on expert knowledge and fail to consider all the stylistic variations of molecular images usually suffer from cumbersome recognition processes and low generalization ability. Deep learning-based methods that integrate different image styles and automatically learn valuable features are flexible, but currently under-researched and have limitations, and are therefore not fully exploited.
Results: MICER, an encoder-decoder-based, reconstructed architecture for molecular image captioning, combines transfer learning, attention mechanisms, and several strategies to strengthen effectiveness and plasticity in different datasets. The effects of stereochemical information, molecular complexity, data volume, and pre-trained encoders on MICER performance were evaluated. Experimental results show that the intrinsic features of the molecular images and the sub-model match have a significant impact on the performance of this task. These findings inspire us to design the training dataset and the encoder for the final validation model, and the experimental results suggest that the MICER model consistently outperforms the state-of-the-art methods on four datasets. MICER was more reliable and scalable due to its interpretability and transfer capacity and provides a practical framework for developing comprehensive and accurate automated molecular structure identification tools to explore unknown chemical space.
Availability: https://github.com/Jiacai-Yi/MICER
Supplementary information: Supplementary data are available at Bioinformatics online.
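The abstract names attention mechanisms as one component of the encoder-decoder design but does not say which form is used. As a rough illustration only, the sketch below shows one generic soft-attention step of the kind commonly found in image captioning decoders: each image region's feature vector is scored against the current decoder state, the scores are normalized with a softmax, and the weighted average becomes the context for predicting the next output token. The scaled dot-product scoring, the function names (`softmax`, `dot`, `attend`), and the toy values are all assumptions made for this example; none of them are taken from the MICER paper or its repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(region_features, decoder_state):
    """One soft-attention step (scaled dot-product scoring is an assumption).

    region_features: list of feature vectors, one per image region.
    decoder_state:   the decoder's current hidden state (same dimension).
    Returns the attention weights and the weighted-average context vector.
    """
    dim = len(decoder_state)
    scores = [dot(f, decoder_state) / math.sqrt(dim) for f in region_features]
    weights = softmax(scores)
    context = [
        sum(w * f[i] for w, f in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Toy input: three image regions with 2-dimensional features (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(regions, state)
```

In this toy input the first and third regions align equally well with the decoder state, so they receive equal weight, while the second region receives less. In an actual captioning model the region features would come from a pre-trained image encoder and the weights would be recomputed at every decoding step.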
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?