VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Chris Kelly, Luhui Hu, Bang Yang, Yu Tian, Deshun Yang, Cindy Yang, Zaoshan Huang, Zihao Li, Jiayin Hu, Yuexian Zou
2024-03-14
Abstract: With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) utilizing LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals to call suitable foundation models; (2) integrating multi-source outputs from foundation models automatically and generating comprehensive responses for users; (3) adapting to a wide range of applications such as text-conditioned image understanding/generation/editing and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available.
Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, Foundation model
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to combine large language models (LLMs) with vision foundation models to achieve open-world visual perception. The authors propose VisionGPT, a vision-language understanding agent built on a generalized multimodal framework. It integrates state-of-the-art foundation models and automates the collaboration between them, with the aim of improving efficiency, versatility, and generalization on visual tasks. Key features of VisionGPT include:

1. **Utilizing large language models as the core**: An LLM (e.g., LLaMA-2) parses the user's request, breaks the natural-language instruction down into detailed action proposals, and then invokes the appropriate vision foundation models.
2. **Multi-source output integration**: Outputs from the different foundation models are automatically integrated into a comprehensive response for the user.
3. **Adaptation to various applications**: The framework applies to scenarios such as text-conditioned image understanding, generation, and editing, as well as visual question answering.

In this way, VisionGPT can handle complex visual tasks, such as instance segmentation, and improve processing efficiency by having models work together. The framework is also designed so that newer foundation models can be integrated as they become available.
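The flow described above (plan with an LLM, call foundation models, merge their outputs) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Action`, `REGISTRY`, `plan`, `integrate`, `run_agent`) is hypothetical, the "foundation models" are stub functions that return placeholder strings, and the planner uses keyword rules as a stand-in for an LLM call. The assumption that segmentation is preceded by detection is also made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    """One step of a hypothetical action proposal."""
    model: str      # key into the model registry
    argument: str   # text argument passed to that model


# Stub "foundation models": each takes (image_id, argument) and returns text.
# Real detectors, segmenters, or captioners would sit behind the same signature.
def detect(image_id: str, argument: str) -> str:
    return f"detect[{argument}] on {image_id}: 2 boxes"


def segment(image_id: str, argument: str) -> str:
    return f"segment[{argument}] on {image_id}: 2 masks"


def caption(image_id: str, argument: str) -> str:
    return f"caption of {image_id}: a placeholder description"


REGISTRY: Dict[str, Callable[[str, str], str]] = {
    "detect": detect,
    "segment": segment,
    "caption": caption,
}


def plan(request: str) -> List[Action]:
    """Stand-in for the LLM planner: keyword rules instead of a model call."""
    text = request.lower()
    actions: List[Action] = []
    if "segment" in text:
        # Everything after the word "segment" is treated as the target phrase.
        target = text.split("segment", 1)[1].strip(" .?!") or "everything"
        # Assumption for this sketch: segmentation is preceded by detection.
        actions.append(Action("detect", target))
        actions.append(Action("segment", target))
    if "describe" in text or "caption" in text:
        actions.append(Action("caption", ""))
    return actions


def integrate(outputs: List[str]) -> str:
    """Merge per-model outputs into a single numbered response."""
    if not outputs:
        return "No suitable model found for this request."
    return "\n".join(f"{i}. {out}" for i, out in enumerate(outputs, 1))


def run_agent(request: str, image_id: str) -> str:
    """Plan, dispatch each action to its registered model, then integrate."""
    actions = plan(request)
    outputs = [REGISTRY[a.model](image_id, a.argument) for a in actions]
    return integrate(outputs)


if __name__ == "__main__":
    print(run_agent("Describe the image and segment the cat.", "img_001"))
```

Keeping the models behind a registry keyed by name is one way to reflect the extensibility claim: under this sketch's assumptions, adding a model means registering one more callable and teaching the planner to emit its key.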