Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature

Sarveswara Rao Vangala, Sowmya Ramaswamy Krishnan, Navneet Bung, Dhandapani Nandagopal, Gomathi Ramasamy, Satyam Kumar, Sridharan Sankaran, Rajgopal Srinivasan, Arijit Roy
DOI: https://doi.org/10.1186/s13321-024-00928-8
2024-11-28
Journal of Cheminformatics
Abstract: With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, synthesizing such molecules remains a challenge for chemists. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extracting high-quality chemical reaction data from patent documents. A comparative study on the same set of patents used in an earlier study showed that the proposed automated approach can enhance the current datasets by adding 26% new reactions. Several challenges were identified during reaction mining, and alternative solutions were proposed for some of them. A detailed analysis also identified several wrong entries in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in the future.
chemistry, multidisciplinary, computer science, interdisciplinary applications, information systems