ReactionDataExtractor: A Tool for Automated Extraction of Information from Chemical Reaction Schemes

Damian M. Wilary, Jacqueline M. Cole
DOI: https://doi.org/10.1021/acs.jcim.1c01017
IF: 6.162
2021-09-15
Journal of Chemical Information and Modeling
Abstract: Chemical reaction schemes are commonly used for visual encapsulation of chemical information. Figures of reaction schemes contain chemical transformations, the chemical species involved, and the reaction conditions. From a data-mining point of view, they constitute rich sources, densely packed with knowledge. Yet the challenge of automatically extracting data from them has remained largely untackled. This work presents ReactionDataExtractor, a software tool for the automatic extraction of information from multistep reaction schemes. Its capabilities include segmentation of reaction steps, of regions containing reaction conditions, and of chemical diagrams, as well as optical character and structure recognition. It uses a combination of rules and unsupervised machine-learning approaches, with bespoke algorithms that detect arrows, structures, labels, and conditions. It can serve as a low-maintenance tool for database generation, capable of extracting data from large quantities of images supplied by the user. When assessed on a self-generated evaluation set, the tool achieved precision and recall of between 67% and 91% in the six core areas of data extraction. The ReactionDataExtractor tool is released under the MIT license and is available to download from http://www.reactiondataextractor.org.
Supporting Information: Available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.1c01017. It comprises a list of all data and strict definitions of the evaluation metrics at each stage (PDF) and a list of the figures used in the training set (ZIP).
chemistry, multidisciplinary, medicinal, computer science, interdisciplinary applications, information systems
What problem does this paper attempt to address?