Abstract: In the current landscape of artificial intelligence, foundation models serve as the bedrock for advancements in both language and vision domains. OpenAI's GPT-4 has emerged as the pinnacle of large language models (LLMs), while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models such as Meta's SAM and DINO, and YOLOS. However, the financial and computational burdens of training new models from scratch remain a significant barrier to progress. In response to this challenge, we introduce UnifiedVisionGPT, a novel framework designed to consolidate and automate the integration of SOTA vision models, thereby facilitating the development of vision-oriented AI. UnifiedVisionGPT distinguishes itself through four key features: (1) it provides a versatile multimodal framework adaptable to a wide range of applications, building upon the strengths of multimodal foundation models; (2) it seamlessly integrates various SOTA vision models to create a comprehensive multimodal platform, capitalizing on the best components of each model; (3) it prioritizes vision-oriented AI, ensuring a more rapid progression in the CV domain compared to the current trajectory of LLMs; and (4) it automates the selection of SOTA vision models, generating optimal results based on diverse multimodal inputs such as text prompts and images. This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our implementation, along with the unified multimodal framework and comprehensive dataset, is publicly available at https://github.com/LHBuilder/SA-Segment-Anything.

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

GiT: Towards Generalist Vision Transformer through Universal Language Interface

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

12-in-1: Multi-task vision and language representation learning

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

Optimizing Multi-Task Learning for Enhanced Performance in Large Language Models

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis

GroundingGPT: Language Enhanced Multi-modal Grounding Model

InfMLLM: A Unified Framework for Visual-Language Tasks