Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models

Zhongzhen Huang, Kui Xue, Yongqi Fan, Linjie Mu, Ruoyu Liu, Tong Ruan, Shaoting Zhang, Xiaofan Zhang
2024-04-27
Abstract: Large-scale language models (LLMs) have achieved remarkable success across various language tasks but suffer from hallucinations and temporal misalignment. To mitigate these shortcomings, retrieval-augmented generation (RAG) has been used to supply external knowledge that supports answer generation. However, applying such models to the medical domain faces several challenges due to the lack of domain-specific knowledge and the intricacy of real-world scenarios. In this study, we explore LLMs with a RAG framework for knowledge-intensive tasks in the medical field. To evaluate the capabilities of LLMs, we introduce MedicineQA, a multi-round dialogue benchmark that simulates the real-world medication consultation scenario and requires LLMs to answer with evidence retrieved from a medicine database. MedicineQA contains 300 multi-round question-answering pairs, each embedded within a detailed dialogue history, highlighting the challenge this knowledge-intensive task poses to current LLMs. We further propose a new Distill-Retrieve-Read framework in place of the previous Retrieve-then-Read. Specifically, the distillation and retrieval process uses a tool calling mechanism to formulate search queries that emulate the keyword-based inquiries used by search engines. Experimental results show that our framework brings notable performance improvements and surpasses previous counterparts in evidence retrieval accuracy. This advancement sheds light on applying RAG to the medical domain.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the challenges of applying large language models (LLMs) to medication consultation, in particular their tendency to produce inaccurate facts (hallucinations) and their temporal misalignment. Retrieval-augmented generation (RAG), an existing technique that supplies external knowledge to support answer generation, mitigates these issues, but applying it in the medical field is difficult because of the lack of domain-specific knowledge and the complexity of real-world scenarios. To evaluate LLMs in this setting, the paper introduces MedicineQA, a benchmark of 300 multi-round question-answering pairs, each embedded in a detailed dialogue history, that simulates real-world medication consultation and requires answers grounded in evidence retrieved from a medicine database. The researchers also propose a Distill-Retrieve-Read framework that replaces the conventional Retrieve-then-Read approach: a tool calling mechanism formulates search queries that emulate the keyword-based queries used by search engines. Experimental results show that the framework brings notable performance improvements and surpasses previous counterparts in evidence retrieval accuracy. The work offers insight into applying RAG in the medical domain.
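The three stages named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`distill`, `retrieve`, `read`), the tool name `search_medicine`, the toy database, and the keyword-overlap scoring are all assumptions made for illustration. In the paper the distillation step is performed by an LLM through tool calling; here a rule-based stand-in plays that role so the example runs without a model.

```python
import re

# Toy evidence store. Names and texts are made-up placeholders, not medical facts.
MEDICINE_DB = [
    {"id": "d1", "text": "drug_a dosage: placeholder dosage text"},
    {"id": "d2", "text": "drug_a contraindications: placeholder contraindication text"},
    {"id": "d3", "text": "drug_b dosage: placeholder dosage text"},
]

# Hypothetical vocabularies used by the rule-based stand-in for the LLM.
KNOWN_DRUGS = {"drug_a", "drug_b"}
KNOWN_ATTRIBUTES = {"dosage", "contraindications"}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def distill(history, question):
    """Distill: compress the dialogue into a keyword query, emitted as a tool call.

    Stand-in for the LLM: the most recently mentioned drug resolves
    references such as "it" in the final question.
    """
    drug = None
    for turn in history + [question]:
        for tok in tokenize(turn):
            if tok in KNOWN_DRUGS:
                drug = tok
    attrs = [t for t in tokenize(question) if t in KNOWN_ATTRIBUTES]
    keywords = ([drug] if drug else []) + attrs
    return {"tool": "search_medicine", "arguments": {"query": " ".join(keywords)}}


def retrieve(tool_call, k=1):
    """Retrieve: rank database entries by keyword overlap with the query."""
    query = set(tokenize(tool_call["arguments"]["query"]))
    ranked = sorted(
        MEDICINE_DB,
        key=lambda d: len(query & set(tokenize(d["text"]))),
        reverse=True,
    )
    return ranked[:k]


def read(question, evidence):
    """Read: build the evidence-grounded prompt that an LLM would answer from."""
    lines = [f"[{d['id']}] {d['text']}" for d in evidence]
    return "Evidence:\n" + "\n".join(lines) + f"\nQuestion: {question}"


if __name__ == "__main__":
    history = ["I was prescribed drug_a last week.", "Take it as your doctor directed."]
    question = "What are the contraindications for it?"
    call = distill(history, question)
    print(call["arguments"]["query"])
    print(read(question, retrieve(call)))
```

The point of the sketch is the ordering: the final question alone ("the contraindications for it") lacks the drug name, so retrieving on it directly, as in Retrieve-then-Read, would miss the relevant entry, whereas the distilled query carries the entity from the dialogue history.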