Abstract: In the current landscape of artificial intelligence, foundation models serve as the bedrock for advances in both the language and vision domains. OpenAI's GPT-4 has emerged as the pinnacle of large language models (LLMs), while the computer vision (CV) domain offers many state-of-the-art (SOTA) models, such as Meta's SAM and DINO, and YOLOS. However, the financial and computational cost of training new models from scratch remains a significant barrier to progress. In response, we introduce UnifiedVisionGPT, a novel framework designed to consolidate and automate the integration of SOTA vision models, thereby facilitating the development of vision-oriented AI. UnifiedVisionGPT has four key features: (1) it provides a versatile multimodal framework adaptable to a wide range of applications, building on the strengths of multimodal foundation models; (2) it integrates various SOTA vision models into a comprehensive multimodal platform, capitalizing on the best components of each model; (3) it prioritizes vision-oriented AI, ensuring more rapid progress in the CV domain than the current trajectory of LLMs; and (4) it automates the selection of SOTA vision models, generating optimal results from diverse multimodal inputs such as text prompts and images. This paper outlines the architecture and capabilities of UnifiedVisionGPT and demonstrates its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our implementation, along with the unified multimodal framework and comprehensive dataset, is publicly available at https://github.com/LHBuilder/SA-Segment-Anything.

Towards General Purpose Vision Systems

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

GiT: Towards Generalist Vision Transformer through Universal Language Interface

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks

Learning A Low-Level Vision Generalist via Visual Task Prompt

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

A General Purpose Neural Architecture for Geospatial Systems

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Effectiveness Assessment of Recent Large Vision-Language Models

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Challenges and Prospects in Vision and Language Research

Vision Transformer Adapters for Generalizable Multitask Learning

Medical Vision Generalist: Unifying Medical Imaging Tasks in Context

Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Grounded Intuition of GPT-Vision's Abilities with Scientific Images