MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, Yixin Wang
2024-10-05
Abstract: Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To bridge this gap, this paper introduces the first agent explicitly designed for the medical field, named Multi-modal Medical Agent (MMedAgent). We curate an instruction-tuning dataset comprising six medical tools solving seven tasks across five modalities, enabling the agent to choose the most suitable tools for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model GPT-4o. Furthermore, MMedAgent is efficient at updating and integrating new medical tools. Code and models are available.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the limited generality of Multi-Modal Large Language Models (MLLMs) in the medical field. Existing MLLMs perform well on specific tasks but fall short when a single model must handle multiple tasks across different medical imaging modalities. To overcome this limitation, the paper proposes MMedAgent, which it presents as the first multi-modal AI agent designed specifically for the medical field. The agent is trained on an instruction-tuning dataset so that it can select an appropriate specialized tool based on the user's input and integrate that tool's output into its response. According to the paper, MMedAgent outperforms state-of-the-art open-source methods and the closed-source model GPT-4o on a variety of medical tasks, and it can be updated to integrate new medical tools efficiently.
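The select-then-integrate loop described in this summary can be illustrated with a minimal sketch. It is not the paper's implementation: in MMedAgent the tool choice is made by the instruction-tuned model, whereas here a keyword rule stands in for it. The tool names, the `Tool` class, and the functions `make_registry`, `select_tool`, and `answer` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """A specialized model exposed to the agent as a callable tool (hypothetical)."""
    name: str
    run: Callable[[str], str]


def make_registry() -> Dict[str, Tool]:
    # Stand-in tools; real tools would wrap specialized medical models.
    return {
        "segmentation": Tool("segmentation", lambda image: f"mask for {image}"),
        "report_generation": Tool("report_generation", lambda image: f"report for {image}"),
    }


def select_tool(request: str) -> str:
    # Assumption: a keyword rule replaces the learned tool selection.
    text = request.lower()
    if "segment" in text:
        return "segmentation"
    if "report" in text:
        return "report_generation"
    return ""


def answer(request: str, image: str, registry: Dict[str, Tool]) -> str:
    # Step 1: choose a tool from the user request.
    name = select_tool(request)
    if name not in registry:
        return "No tool selected; answering directly."
    # Step 2: run the tool and fold its output into the reply.
    result = registry[name].run(image)
    return f"[{name}] {result}"
```

Because tools live in a registry keyed by name, adding a new tool in this sketch means adding one entry and one selection rule, which loosely mirrors the summary's point about integrating new tools.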