Abstract: Large language models (LLMs) have made a significant impact on the field of general artificial intelligence. General-purpose LLMs exhibit strong logic and reasoning skills and general world knowledge but can sometimes generate misleading results when prompted on specific subject areas. LLMs trained with domain-specific knowledge can reduce the generation of misleading information (i.e., hallucinations) and enhance the precision of LLMs in specialized contexts. Training new LLMs on specific corpora, however, can be resource intensive. Here we explored the use of a retrieval-augmented generation (RAG) model, which we tested on literature specific to a biomedical research area. OpenAI's GPT-3.5, GPT-4, Microsoft's Prometheus, and a custom RAG model were used to answer 19 questions pertaining to diffuse large B-cell lymphoma (DLBCL) disease biology and treatment. Eight independent reviewers assessed LLM responses on accuracy, relevance, and readability, rating responses on a 3-point scale for each category. These scores were then used to compare LLM performance. The performance of the LLMs varied across scoring categories. On accuracy and relevance, the RAG model outperformed the other models, with higher scores on average and the most top scores across questions. GPT-4 was closer to the RAG model on relevance than on accuracy. By the same measures, GPT-4 and GPT-3.5 had the highest readability scores of the LLMs compared. GPT-4 and GPT-3.5 also had more answers with hallucinations than the other LLMs, due to non-existent references and inaccurate responses to clinical questions. Our findings suggest that an oncology research-focused RAG model may outperform general-purpose LLMs in accuracy and relevance when answering subject-related questions. This framework can be tailored to Q&A in other subject areas. Further research will help clarify the impact of LLM architectures, RAG methodologies, and prompting techniques on answering questions across different subject areas.
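
The comparison described in the abstract relies on two summaries per scoring category: an average score per model, and a count of how many questions each model scored highest on. The abstract does not give the exact aggregation procedure, so the following is only a minimal sketch of one plausible way to compute both. The ratings are made-up toy values, not data from the study, and the function names `mean_score_per_model` and `top_score_counts` are hypothetical.

```python
from statistics import mean

# Hypothetical toy ratings, NOT data from the study.
# ratings[model][question] = reviewer scores on a 3-point scale (1 to 3).
ratings = {
    "RAG":     {"Q1": [3, 3, 2], "Q2": [3, 2, 3]},
    "GPT-4":   {"Q1": [2, 3, 2], "Q2": [3, 3, 3]},
    "GPT-3.5": {"Q1": [2, 2, 1], "Q2": [2, 3, 2]},
}

def mean_score_per_model(ratings):
    """Average the per-question reviewer means for each model."""
    return {
        model: mean(mean(scores) for scores in by_question.values())
        for model, by_question in ratings.items()
    }

def top_score_counts(ratings):
    """Count the questions on which each model has the highest mean score.

    Ties are credited to every tied model (an assumption of this sketch).
    """
    counts = {model: 0 for model in ratings}
    questions = next(iter(ratings.values())).keys()
    for q in questions:
        per_model = {model: mean(ratings[model][q]) for model in ratings}
        best = max(per_model.values())
        for model, score in per_model.items():
            if score == best:
                counts[model] += 1
    return counts

if __name__ == "__main__":
    print(mean_score_per_model(ratings))
    print(top_score_counts(ratings))
```

In a setup like the study's, the same two functions would be run once per category (accuracy, relevance, readability), with one list of reviewer scores for each of the 19 questions.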
