MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
2023-11-08
Abstract: Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model to perform diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to distinguish each task instruction more easily and also improve the model's learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and code are available at https://minigpt-v2.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty of using a single multimodal large language model for diverse vision-language tasks, such as image description, visual question answering (VQA), and visual grounding. The goal is a unified interface that completes these tasks effectively from simple multimodal instructions. To that end, the paper proposes MiniGPT-v2, which is trained with a unique identifier for each task; the identifiers reduce the ambiguity of multimodal instructions, which helps the model distinguish tasks and improves its learning efficiency on each one. After three-stage training, MiniGPT-v2 achieves strong performance on many VQA and visual grounding benchmarks compared with other vision-language generalist models.
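
The task-identifier idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the identifier strings, the image placeholder, and the function name `build_instruction` are all hypothetical and chosen only for illustration, since the summary above does not specify the actual token format. The sketch only shows how prefixing an instruction with a per-task tag makes the intended task explicit.

```python
# Hypothetical identifier strings; the summary does not list the real tokens.
TASK_IDENTIFIERS = {
    "caption": "[caption]",
    "vqa": "[vqa]",
    "grounding": "[grounding]",
}

# Assumed stand-in for the position where image features would be inserted.
IMAGE_PLACEHOLDER = "<Img><ImageHere></Img>"


def build_instruction(task: str, text: str) -> str:
    """Return a prompt whose task is made explicit by an identifier prefix."""
    if task not in TASK_IDENTIFIERS:
        raise ValueError(f"unknown task: {task!r}")
    return f"{IMAGE_PLACEHOLDER} {TASK_IDENTIFIERS[task]} {text}"


if __name__ == "__main__":
    # The same free-form text can be routed to different tasks by the tag alone.
    print(build_instruction("vqa", "What color is the car?"))
    print(build_instruction("grounding", "the red car"))
```

Under this assumed scheme, an instruction such as "the red car" is ambiguous on its own (describe it, locate it, or answer a question about it), while the prefixed form states which behavior is expected.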