Retrieval Augmented Generation and Representative Vector Summarization for large unstructured textual data in Medical Education

S. S. Manathunga, Y. A. Illangasekara

2023-08-01

Abstract: Large Language Models (LLMs) are increasingly used for a variety of tasks, including content generation and chatbots. Despite their impressive performance on general tasks, LLMs need to be aligned when applied to domain-specific tasks, to mitigate hallucination and the production of harmful answers. Retrieval Augmented Generation (RAG) makes it easy to attach a non-parametric knowledge base to an LLM and to manipulate it. This paper discusses applications of RAG in the field of medical education. It also proposes a combined extractive and abstractive summarization method for large unstructured textual data, based on representative vectors.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses two key issues that large language models (LLMs) face when applied to a specific domain such as medical education: hallucination and the risk of generating harmful answers. To tackle these problems, it combines Retrieval Augmented Generation (RAG) with Representative Vector Summarization (RVS). Specifically:

1. **RAG**: A non-parametric knowledge base (such as a vector database) is attached to the LLM to improve its accuracy and reliability on domain-specific tasks. The model can reference external data sources when generating answers, which reduces the production of incorrect information.
2. **RVS**: A new summarization method is proposed for large amounts of unstructured text. It first selects a fixed number (\(k\)) of representative text fragments from the knowledge base and then uses these fragments to generate the final summary. This sidesteps the context window limitation encountered when an LLM is asked to summarize a long document directly, and it helps the summary reflect the main points of the original text.

The paper also describes how these techniques are implemented in docGPT, a document intelligence program written in Python, and validates the approach experimentally on medical reference books. Overall, the study aims to provide an efficient tool for information retrieval and summarization in medical education.
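The retrieval and representative-selection steps described above can be illustrated with a minimal sketch. It is not taken from docGPT. The function names (`embed`, `retrieve`, `representative_indices`, `build_summary_prompt`) are hypothetical, the hashed bag-of-words embedding is a toy stand-in for a real embedding model, and k-means clustering is assumed as the way to pick the \(k\) representative fragments, since the summary above does not say how they are chosen.

```python
import zlib

import numpy as np


def embed(texts, dim=64):
    """Toy hashed bag-of-words embedding; a stand-in for a real embedding model."""
    vecs = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        for token in text.lower().split():
            vecs[row, zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    # Unit-length rows, so a dot product is a cosine similarity (non-empty texts).
    return vecs / np.maximum(norms, 1e-12)


def retrieve(query_vec, chunk_vecs, top_n=3):
    """RAG retrieval step: indices of the top_n chunks most similar to the query."""
    scores = chunk_vecs @ query_vec
    return [int(i) for i in np.argsort(-scores)[:top_n]]


def representative_indices(vectors, k, iters=20, seed=0):
    """Assumed RVS selection: run k-means, keep the chunk nearest each centroid."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct chunks (fancy indexing returns a copy).
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    # One nearest chunk per centroid; duplicates collapse, so at most k indices.
    return sorted({int(i) for i in dists.argmin(axis=0)})


def build_summary_prompt(chunks, indices):
    """Join the selected fragments into a prompt for an LLM (wording is illustrative)."""
    selected = "\n\n".join(chunks[i] for i in indices)
    return "Summarize the following passages:\n\n" + selected
```

In use, the chunks of a reference book would be embedded once. `retrieve` then supplies context for a question, while `representative_indices` followed by `build_summary_prompt` yields a prompt short enough to fit the model's context window regardless of the book's length.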

Question-Answering Based Summarization of Electronic Health Records using Retrieval Augmented Generation

Development and Testing of Retrieval Augmented Generation in Large Language Models -- A Case Study Report

Retrieval-Augmented Generation for Large Language Models: A Survey

BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine

Rationale-Guided Retrieval Augmented Generation for Medical Question Answering

Meta Knowledge for Retrieval Augmented Large Language Models

MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering

Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models

Enhanced Electronic Health Records Text Summarization Using Large Language Models

Transforming Healthcare Education: Harnessing Large Language Models for Frontline Health Worker Capacity Building using Retrieval-Augmented Generation

Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates

Towards a Robust Retrieval-Based Summarization System

Ontology-Constrained Generation of Domain-Specific Clinical Summaries

Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models

Explainable Biomedical Hypothesis Generation via Retrieval Augmented Generation enabled Large Language Models

Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts

Bailicai: A Domain-Optimized Retrieval-Augmented Generation Framework for Medical Applications