A Quick, trustworthy spectral knowledge Q&A system leveraging retrieval-augmented generation on LLM

Jiheng Liang, Ziru Yu, Zujie Xie, Xiangyang Yu
2024-10-11
Abstract: Large language models (LLMs) have demonstrated significant success in a range of natural language processing (NLP) tasks in the general domain. Their emergence has introduced new methodologies across diverse fields, including the natural sciences. Researchers aim to implement automated, concurrent processes driven by LLMs to replace conventional manual, repetitive, and labor-intensive work. In spectral analysis and detection, researchers must independently acquire pertinent knowledge about each research object, including the spectroscopic techniques and chemometric methods employed in experiments and analysis. Although spectroscopic detection is recognized as an effective analytical method, this fundamental process of knowledge retrieval remains both time-intensive and repetitive. In response to this challenge, we first introduce the Spectral Detection and Analysis Based Paper (SDAAP) dataset, which is the first open-source textual knowledge dataset for spectral analysis and detection and contains annotated literature data as well as corresponding knowledge instruction data. We then design an automated Q&A framework based on the SDAAP dataset, which retrieves relevant knowledge and generates high-quality responses by extracting entities from the input as retrieval parameters. Notably, within this framework the LLM is used only as a tool to provide generalizability, while the retrieval-augmented generation (RAG) technique is used to accurately capture the source of the knowledge. This approach not only improves the quality of the generated responses but also ensures the traceability of the knowledge. Experimental results show that our framework generates responses with more reliable expertise than the baseline.
Information Retrieval
What problem does this paper attempt to address?
The paper addresses a bottleneck in spectral detection: researchers must independently acquire the relevant knowledge to decide which spectroscopic techniques and chemometric methods to use in experiments, a process that is both time-consuming and repetitive. Large language models (LLMs) perform well on natural language processing tasks and have been introduced into the natural sciences to reduce time- and labor-intensive work, but they often lack expertise in specialized domains such as spectral detection. Moreover, most existing related datasets focus on the biological sciences and medicine, while the spectral analysis field lacks open-source datasets. To address these issues, the authors first introduce the "Spectral Detection and Analysis Based Paper" (SDAAP) dataset, the first open-source textual knowledge dataset for spectral analysis and detection, which includes annotated literature data and corresponding knowledge instruction data. They then design an automated question-answering framework based on the SDAAP dataset. This framework parses the entities and question format of a query, uses the parsing results as query parameters to retrieve relevant spectral detection knowledge, and generates high-quality answers. This approach not only improves the quality of the generated answers but also ensures the traceability of the knowledge, thereby addressing the insufficient and unreliable knowledge of existing LLMs when applied to specialized fields.
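The retrieve-then-answer flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Record` schema, the placeholder corpus entries, and the function names are hypothetical and are not taken from the SDAAP dataset or the authors' implementation. Simple keyword matching stands in for the LLM-based entity extraction, and the final answer is assembled from retrieved records rather than generated by an LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One annotated literature entry (hypothetical schema, not SDAAP's)."""
    source: str            # identifier that makes an answer traceable
    research_object: str
    technique: str
    method: str


# Placeholder entries invented for illustration only; not real papers.
CORPUS = [
    Record("example-001", "milk", "near-infrared spectroscopy", "PLS regression"),
    Record("example-002", "milk", "Raman spectroscopy", "PCA"),
    Record("example-003", "soil", "near-infrared spectroscopy", "random forest"),
]

FIELDS = ("research_object", "technique", "method")


def extract_entities(question: str) -> dict:
    """Find known entity values in the question.

    Keyword matching is a stand-in (assumption) for LLM-based extraction.
    """
    text = question.lower()
    entities = {}
    for field in FIELDS:
        known = {getattr(record, field) for record in CORPUS}
        hits = sorted(value for value in known if value.lower() in text)
        if hits:
            entities[field] = hits
    return entities


def retrieve(entities: dict) -> list:
    """Return records that match every extracted entity field."""
    if not entities:
        return []
    return [
        record
        for record in CORPUS
        if all(getattr(record, field) in values for field, values in entities.items())
    ]


def answer(question: str) -> str:
    """Build an answer in which every statement cites its source record."""
    records = retrieve(extract_entities(question))
    if not records:
        return "No matching literature found."
    return "\n".join(
        f"{r.research_object}: {r.technique} with {r.method} [{r.source}]"
        for r in records
    )


if __name__ == "__main__":
    print(answer("Which techniques are used to analyse milk?"))
```

Because each line of the answer carries the identifier of the record it came from, a reader can check the claim against its source, which is the traceability property the summary emphasizes.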