Evaluating ChatGPT on Nuclear Domain-Specific Data

Muhammad Anwar, Mischa de Costa, Issam Hammad, Daniel Lau
2024-08-26
Abstract: This paper examines the application of ChatGPT, a large language model (LLM), for question-and-answer (Q&A) tasks in the highly specialized field of nuclear data. The primary focus is on evaluating ChatGPT's performance on a curated test dataset, comparing the outcomes of a standalone LLM with those generated through a Retrieval Augmented Generation (RAG) approach. LLMs, despite their recent advancements, are prone to generating incorrect or 'hallucinated' information, which is a significant limitation in applications requiring high accuracy and reliability. This study explores the potential of utilizing RAG in LLMs, a method that integrates external knowledge bases and sophisticated retrieval techniques to enhance the accuracy and relevance of generated outputs. In this context, the paper evaluates ChatGPT's ability to answer domain-specific questions, employing two methodologies: A) direct response from the LLM, and B) response from the LLM within a RAG framework. The effectiveness of these methods is assessed through a dual mechanism of human and LLM evaluation, scoring the responses for correctness and other metrics. The findings underscore the improvement in performance when incorporating a RAG pipeline in an LLM, particularly in generating more accurate and contextually appropriate responses for nuclear domain-specific queries. Additionally, the paper highlights alternative approaches to further refine and improve the quality of answers in such specialized domains.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the limited accuracy and reliability of large language models (LLMs) such as ChatGPT when answering domain-specific questions in the nuclear energy field. Specifically, it focuses on the following aspects:

1. **Limited Professional Knowledge**: Without domain-specific training, existing commercial and open-source LLMs have limited knowledge of the nuclear energy field and cannot meet the needs of daily operations.
2. **Context Window Limitations**: Even when information is supplied at inference time, the LLM's context window is limited, so it cannot encompass the vast amount of nuclear energy data accumulated over decades.
3. **Generation of Incorrect Information**: LLMs are prone to "hallucinations," i.e., information that appears reasonable but does not align with actual knowledge. This is a serious issue in the nuclear energy field, which requires high accuracy and reliability.

To address these issues, the paper proposes and evaluates the Retrieval-Augmented Generation (RAG) method for improving LLM performance in the nuclear energy field. RAG retrieves relevant information from an external knowledge base at runtime and supplies it to the LLM as context, improving the accuracy and relevance of the answers. The paper evaluates the effectiveness of RAG by comparing answers generated directly by the LLM with those generated using the RAG method.
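The difference between the two answering modes can be illustrated with a minimal sketch. It is not the paper's pipeline: the retriever here is a simple bag-of-words cosine similarity (a stand-in for whatever retrieval technique the authors used), the three-entry `KNOWLEDGE_BASE` is illustrative placeholder text, and the names `retrieve` and `build_prompt` are hypothetical. The sketch only builds the prompt that would be sent to the LLM; the model call itself is omitted.

```python
import math
import re
from collections import Counter

# Illustrative placeholder documents, not data from the paper.
KNOWLEDGE_BASE = [
    "The moderator slows neutrons in a reactor core.",
    "Control rods absorb neutrons to regulate the fission rate.",
    "The annual staff picnic is scheduled for June.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, k=2):
    """Return up to k documents with non-zero similarity to the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine(q_vec, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, context=None):
    """Build a direct prompt (method A) or a context-grounded prompt (method B)."""
    if not context:
        return f"Question: {question}\nAnswer:"
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    question = "What do control rods absorb?"
    print(build_prompt(question))                                      # method A
    print(build_prompt(question, retrieve(question, KNOWLEDGE_BASE)))  # method B
```

In a real deployment the retriever would typically use embedding-based search over a much larger document store, and both prompts would be sent to the same LLM so that the answers can be scored side by side.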