Automatic extraction of FAIR data from publications using LLM

Guillaume Godin, Luc Patiny
DOI: https://doi.org/10.26434/chemrxiv-2023-05v1b-v2
2023-12-04
Abstract: Since the beginning of modern science, researchers have used a specific format to communicate their findings in a standardized language. Such formats help to ensure that results can be replicated and published. With the rise of digitalization, artificial intelligence has become increasingly important in combination with the scientific literature as a source of data. This synergy serves as a foundation for robust models following the central principles of FAIR (Findable, Accessible, Interoperable, Reusable) data. With access to more precise data, it is reasonable to anticipate the development of improved models. Specifically, large neural networks have been shown to be highly sensitive to the quality of the data used. Therefore, enhancing data quality can potentially lead to a reduction in the size of neural networks. Large Language Models (LLMs) have proven to be highly effective at replicating human tasks. This is a significant improvement that not only automates processes but also leads to better results. By combining human and LLM assistance, we can produce higher-quality content and solve repetitive tasks that would otherwise take years to complete. These generative AI assistants can follow instructions to transform and extrapolate existing text. Our contribution outlines a method for automatically extracting experimental data on molecules from the literature. First, through prompt engineering, we demonstrate that this process can be made more cost-effective. Second, we use automated fact-checking principles to verify both the quality of the original data and the data retrieved by the LLM. Ultimately, our aim is to provide guidance for the publication of organic chemistry experimental data to assist researchers and enhance FAIR data.
Chemistry
What problem does this paper attempt to address?
The paper explores how to automatically extract data that follows the FAIR principles (Findable, Accessible, Interoperable, and Reusable) from the chemical synthesis literature. The authors use large language models (LLMs), such as ChatGPT, to process experimental sections, which often contain structured data such as molecule names, reaction yields, and product forms. By creating flexible template structures (similar to the YAML format), they guide the LLM during extraction and keep the output data consistent. They also apply automated fact-checking to verify the quality and accuracy of the retrieved raw data.
Challenges mentioned in the paper include inconsistencies in how different authors and journals report experimental data, and the complexity of IUPAC nomenclature, which makes converting molecule names to structures difficult. By adjusting the LLM prompts and using a YAML format, the authors improve the efficiency and accuracy of automated extraction, reducing redundancy in the data structures and thus lowering costs.
In their experiments, they analyzed certain volumes of the open-access journal "Molecules" and found that data could be effectively extracted from approximately 72% of the experimental paragraphs. They also noted that, while using tools like ChatGPT costs less than manual extraction, data quality still fluctuates and may require further optimization and control measures. The paper concludes by discussing future research directions, including improving LLMs, reducing costs, and employing more advanced techniques to make data extraction more accurate and efficient, in support of chemical research and the publication of FAIR data.
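The template-guided extraction and fact-checking idea summarized above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the template fields, the function names (`build_prompt`, `parse_flat_yaml`, `check_record`), and the specific checks are hypothetical and are not taken from the paper. The call to the LLM itself is left out; the sketch only builds the prompt and checks a reply against the source paragraph.

```python
import re

# Hypothetical YAML-like template; the field names are illustrative only
# and are not the schema used in the paper.
TEMPLATE = """\
name: <product name as written in the text>
yield_percent: <number only>
appearance: <e.g. white solid>
"""


def build_prompt(paragraph: str) -> str:
    """Combine instructions, the template, and the source paragraph."""
    return (
        "Extract the data below from the experimental paragraph.\n"
        "Answer only with the filled-in template.\n\n"
        f"{TEMPLATE}\nParagraph:\n{paragraph}\n"
    )


def parse_flat_yaml(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines (a tiny subset of YAML, no nesting)."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


def check_record(record: dict[str, str], paragraph: str) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []

    # The extracted name should occur verbatim (ignoring case) in the source.
    name = record.get("name", "")
    if not name:
        problems.append("name is missing")
    elif name.lower() not in paragraph.lower():
        problems.append("name does not appear in the source paragraph")

    # The yield must be a number between 0 and 100.
    raw_yield = record.get("yield_percent", "")
    try:
        value = float(raw_yield)
    except ValueError:
        problems.append("yield_percent is not a number")
        return problems
    if not 0 <= value <= 100:
        problems.append("yield_percent is outside 0-100")

    # Every number followed by '%' in the source is a candidate yield.
    reported = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", paragraph)]
    if value not in reported:
        problems.append("yield_percent does not match any percentage in the source")
    return problems
```

In use, the string returned by `build_prompt` would be sent to an LLM, the reply passed through `parse_flat_yaml`, and the resulting record handed to `check_record` together with the original paragraph. Comparing extracted values against the source text is only one simple form of fact-checking; it catches values that do not appear in the paragraph but says nothing about errors already present in the published data.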