Retrieval Augmented Generation and Representative Vector Summarization for large unstructured textual data in Medical Education

S. S. Manathunga, Y. A. Illangasekara
2023-08-01
Abstract: Large Language Models are increasingly being used for various tasks, including content generation and as chatbots. Despite their impressive performance on general tasks, LLMs need to be aligned when applied to domain-specific tasks to mitigate the problems of hallucination and harmful answers. Retrieval Augmented Generation (RAG) makes it easy to attach a non-parametric knowledge base to an LLM and to manipulate it. Applications of RAG in the field of medical education are discussed in this paper. A combined extractive and abstractive summarization method for large unstructured textual data using representative vectors is proposed.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper primarily aims to address two key issues faced by large language models (LLMs) when applied in specific domains (such as medical education): hallucination and the risk of generating harmful answers. To tackle these problems, the paper proposes a method that combines Retrieval Augmented Generation (RAG) and Representative Vector Summarization (RVS). Specifically:

1. **RAG**: By integrating a non-parametric knowledge base (such as a vector database) with large language models, the accuracy and reliability of the model in handling domain-specific tasks are improved. This approach allows the model to reference external data sources when generating answers, thereby reducing the production of incorrect information.
2. **RVS**: For large amounts of unstructured text data, a new summarization method is proposed. This method first selects a certain number (\(k\)) of representative text fragments from the knowledge base and then uses these fragments to generate the final summary. This not only helps to overcome the context window limitation encountered when directly using LLMs to generate long document summaries but also ensures that the summary content more accurately reflects the main points of the original text.

Additionally, the paper introduces how to implement the above techniques using docGPT, a document intelligence program written in Python, and validates its effectiveness on medical reference books through experiments. Overall, this study aims to provide an efficient tool for information retrieval and summarization in the field of medical education.
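
To make the two ideas above concrete, here is a minimal Python sketch using only the standard library. It is not docGPT's code and makes several assumptions that the summary does not state: the `embed` function is a toy bag-of-words stand-in for a real embedding model, retrieval is plain cosine similarity over an in-memory list rather than a vector database, and the \(k\) representative fragments are chosen by running k-means and taking the chunk nearest each centroid. All function names and the sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy bag-of-words embedding, L2-normalised (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve(query_vec, chunk_vecs, top_n=2):
    """RAG retrieval step: indices of the top_n chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: dot(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:top_n]


def representative_chunks(chunk_vecs, k, iters=20):
    """Assumed RVS selection rule: k-means, then the chunk nearest each centroid."""
    # Deterministic initialisation: the first k chunks serve as starting centroids.
    centroids = [list(v) for v in chunk_vecs[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in chunk_vecs:
            nearest = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    reps = {min(range(len(chunk_vecs)),
                key=lambda i: sq_dist(chunk_vecs[i], centroids[c]))
            for c in range(k)}
    return sorted(reps)


# Hypothetical knowledge base: four short text chunks on two topics.
chunks = [
    "insulin lowers blood glucose",
    "aspirin reduces fever and pain",
    "insulin therapy treats diabetes",
    "aspirin inhibits platelet aggregation",
]
vocab = sorted({w for c in chunks for w in c.lower().split()})
chunk_vecs = [embed(c, vocab) for c in chunks]

# RAG: fetch the chunk most relevant to a question; it would be added to the prompt.
hits = retrieve(embed("how does aspirin reduce pain", vocab), chunk_vecs, top_n=1)

# RVS: pick k representative chunks; an LLM would then summarise only these.
reps = representative_chunks(chunk_vecs, k=2)
```

In this toy run, `hits` points at the aspirin chunk about fever and pain, and `reps` holds one chunk from each topic. Because only the \(k\) selected chunks are passed on for summarization, the prompt size depends on \(k\) rather than on the size of the knowledge base, which is how this kind of selection sidesteps the context window limit.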