VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Chris Kelly, Luhui Hu, Bang Yang, Yu Tian, Deshun Yang, Cindy Yang, Zaoshan Huang, Zihao Li, Jiayin Hu, Yuexian Zou

2024-03-14

Abstract: With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-source or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) using LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals that call suitable foundation models; (2) automatically integrating multi-source outputs from foundation models and generating comprehensive responses for users; (3) adapting to a wide range of applications such as text-conditioned image understanding, generation, and editing, and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available.

Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, Foundation model

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how to combine large language models (LLMs) with vision foundation models to achieve open-world visual perception. The authors propose VisionGPT, a vision-language understanding agent built on a generalized multimodal framework. It integrates state-of-the-art foundation models and automates the collaboration between them to improve efficiency, flexibility, and generalization on visual tasks.

Key features of VisionGPT include:

1. **Utilizing large language models as the core**: An LLM (e.g., LLaMA-2) parses the user's request, breaks the natural language instruction down into a specific action plan, and then invokes the appropriate vision foundation models.
2. **Multi-source output integration**: Outputs from the different foundation models are automatically integrated into a comprehensive response to the user.
3. **Adaptation to various applications**: The framework applies to scenarios such as text-conditioned image understanding, generation, and editing.

In this way, VisionGPT can handle complex visual tasks, such as instance segmentation, and improve processing efficiency by having models work together. VisionGPT is also flexible: new foundation models can be integrated as they become available.
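The plan, call, and integrate flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Action`, `REGISTRY`, `plan`, `execute`, and `integrate` are hypothetical, the "foundation models" are placeholder functions that return strings, and the planner uses keyword matching where VisionGPT would use an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    """One step of an action proposal: which model to call, and on what input."""
    model: str
    argument: str


# Hypothetical stand-ins for foundation models. A real system would wrap
# actual detection, segmentation, or generation models behind this interface.
def fake_detector(image: str) -> str:
    return f"detected objects in {image}"


def fake_segmenter(image: str) -> str:
    return f"instance masks for {image}"


# Registry mapping a capability name to the callable that provides it.
REGISTRY: Dict[str, Callable[[str], str]] = {
    "detect": fake_detector,
    "segment": fake_segmenter,
}


def plan(request: str, image: str) -> List[Action]:
    """Stand-in for the LLM pivot: keyword matching, not a language model."""
    text = request.lower()
    return [Action(name, image) for name in REGISTRY if name in text]


def execute(actions: List[Action]) -> List[str]:
    """Call the registered model for each proposed action, in order."""
    return [REGISTRY[a.model](a.argument) for a in actions]


def integrate(request: str, outputs: List[str]) -> str:
    """Merge the per-model outputs into a single response for the user."""
    if not outputs:
        return f"No suitable model found for: {request}"
    return f"Request: {request}\n" + "\n".join(f"- {o}" for o in outputs)


def answer(request: str, image: str) -> str:
    """Full pipeline: plan the actions, run them, integrate the results."""
    return integrate(request, execute(plan(request, image)))


if __name__ == "__main__":
    print(answer("Detect and segment the dogs", "dog.jpg"))
```

Keeping the models behind a registry is one way to reflect the flexibility the summary mentions: under this assumed design, adding a new model means adding one entry, without changing the planner or the integrator.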

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

GroundingGPT: Language Enhanced Multi-modal Grounding Model

InfMLLM: A Unified Framework for Visual-Language Tasks

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Effectiveness Assessment of Recent Large Vision-Language Models