MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

Chenyu Wang, Weixin Luo, Qianyu Chen, Haonan Mai, Jindi Guo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, Shenghua Gao
2024-01-24
Abstract: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs' perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users' real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions' information. Therefore, in this paper, we propose MLLM-Tool, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model's capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our MLLM-Tool is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the limitations of large language models (LLMs) in tool usage. Current LLMs rely primarily on a single text query to understand user intent, which can leave that intent ambiguous or misread. To overcome this, the paper proposes a new system, MLLM-Tool, which combines a multimodal encoder with open-source large language models so that the LLM can perceive multimodal input instructions and select the tool that matches the task.

### Main Contributions

1. **Development of MLLM-Tool**: A multimodal tool agent system that perceives visual or auditory instruction information by integrating a multimodal encoder with open-source large language models.
2. **Construction of a Multimodal Input Benchmark**: A benchmark for evaluating how well LLMs recognize and select external tools. It includes content-aware multimodal instructions and one-to-many instruction-answer pairs.
3. **Design of Evaluation Metrics**: Metrics are designed for the task, and extensive ablation experiments over dataset subsets and several popular LLMs show the fine-tuned MLLM-Tool reaching 88.19% accuracy in tool selection.

### Problems Addressed

- **Text Ambiguity**: Current LLMs rely solely on text input and cannot resolve ambiguity or vagueness in that text. For example, when a user asks to generate an image but provides no visual input, the LLM may be unable to determine the specific generation conditions.
- **Multimodal Input**: By accepting multimodal inputs (such as images, videos, and audio), MLLM-Tool can better understand user intent and select appropriate tools.
- **Multiple Option Support**: Existing tool learning benchmarks do not support multiple potential solutions for the same instruction, whereas MLLM-Tool's dataset provides multiple options per instruction, which is closer to real-world scenarios.

### Methodology

1. **Dataset Construction**:
   - **API Collection**: Crawled high-quality APIs from the HuggingFace platform, ensuring each API has a detailed description and example code.
   - **Instruction-Answer Pair Generation**: Used GPT-4 to generate 20 queries for each API call, retaining 10 high-quality queries after manual screening.
   - **Instruction Matching**: Constructed one-to-many instruction-answer pairs, supporting multiple potential solutions for the same instruction.
2. **System Architecture**:
   - **Multimodal Encoder**: Used ImageBind as the primary multimodal encoder, unifying data from different modalities into a single embedding space.
   - **Large Language Models**: Selected high-performing LLMs (such as Vicuna and Llama) and fine-tuned them with Low-Rank Adaptation (LoRA) to reduce the number of learnable parameters.
3. **Evaluation Metrics**:
   - **Accuracy**: Checks whether the API output by the model is in the ground-truth API list for the instruction.
   - **Hallucination Rate**: Measures how often the model outputs an API that does not exist in the corpus, that is, a fabricated one.
   - **Recall Rate**: Measures the proportion of ground-truth APIs the model covers across multiple inferences.
   - **Format Accuracy**: Checks that the output is well-formed JSON, which makes the results easier for users to browse and select from.

### Experimental Results

Experiments on the constructed dataset showed that MLLM-Tool achieved an accuracy of 88.19% in tool selection, demonstrating its effectiveness in handling multimodal inputs and understanding user intent. The experiments also support the system's ability to resolve text ambiguity and to handle instructions with multiple valid options.

### Conclusion

By combining multimodal encoders with large language models, MLLM-Tool addresses the limitations of current LLMs in tool usage, particularly in handling text ambiguity and multimodal inputs. The system offers new directions and methods for future tool agent research.
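The four evaluation metrics listed under Methodology can be illustrated with a minimal sketch. This is not the authors' evaluation code: the JSON schema (a single `api_name` field), the function names, the sample data, and the choice to divide every rate by the total number of samples are all assumptions made for illustration.

```python
import json


def parse_api_name(raw_output):
    """Return the predicted API name, or None when the output is not
    well-formed JSON with a string "api_name" field (hypothetical schema)."""
    try:
        obj = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    name = obj.get("api_name")
    return name if isinstance(name, str) else None


def score_single_shot(samples, corpus):
    """Accuracy, hallucination rate, and format accuracy over one output per query.

    Each sample is {"output": <raw model text>, "gold": <list of valid APIs>}.
    All three rates use the total sample count as denominator (an assumption).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    well_formed = correct = hallucinated = 0
    for sample in samples:
        predicted = parse_api_name(sample["output"])
        if predicted is None:
            continue  # malformed: neither correct nor counted as a hallucination
        well_formed += 1
        if predicted not in corpus:
            hallucinated += 1  # the model named an API that does not exist
        elif predicted in sample["gold"]:
            correct += 1  # one-to-many: any ground-truth API counts as a hit
    total = len(samples)
    return {
        "accuracy": correct / total,
        "hallucination_rate": hallucinated / total,
        "format_accuracy": well_formed / total,
    }


def recall_over_runs(outputs, gold):
    """Fraction of ground-truth APIs covered by several outputs for one query."""
    predicted = {parse_api_name(raw) for raw in outputs}
    return len(predicted & set(gold)) / len(gold)


if __name__ == "__main__":
    # Made-up API names, purely for illustration.
    corpus = {"org-a/image-caption", "org-b/image-caption", "org-c/speech-to-text"}
    samples = [
        {"output": '{"api_name": "org-a/image-caption"}',
         "gold": ["org-a/image-caption", "org-b/image-caption"]},
        {"output": '{"api_name": "org-x/made-up"}', "gold": ["org-c/speech-to-text"]},
        {"output": "I would use a captioning model.", "gold": ["org-a/image-caption"]},
    ]
    print(score_single_shot(samples, corpus))
```

Treating a prediction as correct when it matches any entry in the ground-truth list is what makes the one-to-many instruction-answer pairs usable for scoring; a stricter single-answer comparison would penalize equally valid tools.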
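To make the claim that LoRA reduces the number of learnable parameters concrete, the sketch below compares the parameter count of one dense weight matrix with that of its low-rank update. In LoRA, a weight of shape `d_out x d_in` is frozen and only two factors of shapes `d_out x r` and `r x d_in` are trained. The layer size 4096 and rank 8 are illustrative assumptions, not settings reported in this summary.

```python
def dense_param_count(d_out, d_in):
    """Trainable parameters when the full d_out x d_in weight is fine-tuned."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of a LoRA update B @ A, with B of shape
    d_out x rank and A of shape rank x d_in (the base weight stays frozen)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out = d_in = 4096  # assumed projection size, for illustration only
    rank = 8             # assumed LoRA rank, for illustration only
    full = dense_param_count(d_out, d_in)
    low_rank = lora_param_count(d_out, d_in, rank)
    print(f"full fine-tuning: {full} parameters")
    print(f"LoRA update:      {low_rank} parameters")
    print(f"trainable share:  {low_rank / full:.4%}")
```

Because the count grows with `rank * (d_out + d_in)` rather than `d_out * d_in`, the saving widens as layers get larger, which is why the technique suits fine-tuning large models such as those named above.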