Biomedical knowledge graph-optimized prompt generation for large language models

Karthik Soman, Peter W Rose, John H Morris, Rabia E Akbas, Brett Smith, Braian Peetoom, Catalina Villouta-Reyes, Gabriel Cerono, Yongmei Shi, Angela Rizk-Jackson, Sharat Israni, Charlotte A Nelson, Sui Huang, Sergio E Baranzini
2024-05-14
Abstract: Large Language Models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine. Solutions such as pre-training and domain-specific fine-tuning add substantial computational overhead, requiring further domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework by leveraging a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo and GPT-4, to generate meaningful biomedical text rooted in established knowledge. Compared to the existing RAG technique for Knowledge Graphs, the proposed method utilizes minimal graph schema for context extraction and uses embedding methods for context pruning. This optimization in context extraction results in more than 50% reduction in token consumption without compromising accuracy, making for a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters for domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models, such as GPT-3.5 and GPT-4. In summary, the proposed framework combines explicit and implicit knowledge of KG and LLM in a token-optimized fashion, thus enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions in a cost-effective fashion.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the difficulties that large language models (LLMs) face in knowledge-intensive domains such as biomedicine, in particular "hallucinations": generated text that is grammatically correct but factually inaccurate. To overcome this, the paper proposes a framework called Knowledge Graph-based Retrieval Augmented Generation (KG-RAG). KG-RAG combines a large-scale biomedical knowledge graph (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4 to generate biomedical text grounded in established knowledge. The key features of the framework are:

1. **Optimized Context Extraction**: KG-RAG uses a minimal graph schema to extract context and embedding methods to prune it. This reduces token consumption by more than 50% without compromising accuracy, which makes Retrieval Augmented Generation (RAG) cost-effective on proprietary LLMs.
2. **Performance Improvement**: Across diverse biomedical prompts, KG-RAG consistently enhances LLM performance by producing responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (if available) to support the claims.
3. **Benchmarking**: The framework is evaluated on human-curated datasets of biomedical true/false and multiple-choice questions (MCQ). On the challenging MCQ dataset, the performance of the Llama-2 model improved by 71%. KG-RAG also improved the performance of the proprietary GPT-3.5 and GPT-4 models.

In summary, the goal of this research is to make general-purpose LLMs better at domain-specific questions, in a cost-effective way, by combining explicit knowledge from the knowledge graph with the implicit knowledge of the LLM.
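The embedding-based context pruning described under Optimized Context Extraction can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model, the names `prune_context`, `min_similarity`, and `top_k` are hypothetical, the threshold values are arbitrary, and the example triples are hand-written illustrations rather than records retrieved from SPOKE. The sketch only shows the general idea of scoring each piece of graph context against the question and passing the most relevant pieces to the LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_context(question, triples, min_similarity=0.3, top_k=2):
    """Keep at most top_k triples whose similarity to the question passes the threshold.

    Both parameters are hypothetical knobs chosen for this illustration.
    """
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(t)), t) for t in triples]
    kept = [(s, t) for s, t in scored if s >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in kept[:top_k]]


def token_count(texts):
    """Crude whitespace token count, used only to compare context sizes."""
    return sum(len(t.split()) for t in texts)


# Hand-written example context (illustrative, not SPOKE output).
question = "Which gene is associated with cystic fibrosis?"
triples = [
    "Disease cystic fibrosis ASSOCIATES Gene CFTR",
    "Disease cystic fibrosis PRESENTS Symptom chronic cough",
    "Compound ivacaftor TREATS Disease cystic fibrosis",
    "Disease asthma PRESENTS Symptom wheezing",
    "Gene BRCA1 ASSOCIATES Disease breast cancer",
]

pruned = prune_context(question, triples)
print(pruned)
print(token_count(triples), "->", token_count(pruned))
```

In a real pipeline the pruned triples would be placed in the LLM prompt in place of the full neighborhood of the disease node, which is where the token savings come from. The size of the reduction depends on the embedding model and thresholds, so the numbers printed by this toy example say nothing about the reduction reported in the paper.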