Retrieval augmented generation for building datasets from scientific literature

Piyush Ranjan Maharana, Kavita Joshi
DOI: https://doi.org/10.26434/chemrxiv-2024-qjx32
2024-11-20
Abstract: In this work, we show that employing Retrieval Augmented Generation (RAG) with a Large Language Model (LLM) enables one to extract accurate data from scientific literature and construct datasets. The pipeline developed is simple and transferable to other scientific domains and can automate accurate structured data extraction. Quantization enables us to run LLMs on consumer hardware. Both Llama3-8B and Gemma2-9B with RAG give structured output consistently and with high accuracy as compared to direct prompting. Using the newly developed protocol, a dataset of metal hydrides for solid-state hydrogen storage was created. The accuracy obtained was > 93% in the cases tested. Thus, we demonstrate a pipeline to create datasets from scientific literature at minimal computational cost and high accuracy.
Chemistry
What problem does this paper attempt to address?
This paper addresses how to extract data from scientific literature efficiently and accurately and assemble it into high-quality datasets. The authors show that combining Retrieval Augmented Generation (RAG) with a Large Language Model (LLM) makes it possible to extract accurate data from papers and construct datasets from it. The approach aims to reduce the labor of manual data extraction, improve its accuracy and efficiency, and keep the computational cost low. The paper notes that most existing scientific data is embedded in research papers, and that the sharp growth in the number of publications across fields has made manual extraction extremely difficult, so automated tools for extracting key data are needed. With RAG, the LLM is given the relevant content of a specific paper and asked to generate structured output, which automates the extraction. The paper also examines how different quantization schemes affect model performance, and how to run these large language models on consumer-grade hardware to reduce the dependence on high-performance computing resources.
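The retrieve-then-prompt idea described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the summary does not say which retriever, embedding model, or prompt they use, so a simple bag-of-words cosine similarity stands in for the retrieval step, the function names (`retrieve`, `build_prompt`) and the field names are hypothetical, and the example text chunks are invented placeholders rather than data from any paper. The call to the LLM itself is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question.

    Assumption: word-count similarity is a stand-in for whatever
    retriever a real RAG pipeline would use.
    """
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_chunks, fields):
    """Assemble a prompt that asks for structured (JSON) output.

    `fields` is a hypothetical list of keys the dataset should contain.
    """
    context = "\n".join(f"- {c}" for c in context_chunks)
    keys = ", ".join(fields)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        f"Reply with a JSON object with the keys: {keys}."
    )


# Invented placeholder chunks, not taken from any real paper.
chunks = [
    "The authors thank the funding agency for support.",
    "The hydride sample showed a hydrogen storage capacity of X wt% at temperature T.",
    "Samples were prepared by ball milling under argon.",
]
question = "What is the hydrogen storage capacity of the hydride?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top, ["material", "capacity", "temperature"])
print(prompt)
```

The resulting prompt would then be sent to a locally run model; restricting the context to the retrieved chunks is what distinguishes this from direct prompting.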