Abstract: This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks using only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require specific modules, such as bounding box heads for detection and pixel decoders for segmentation, which greatly hinders the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that enables successful auto-regressive decoding to unify various visual tasks, from image-level understanding (e.g., captioning), through sparse perception (e.g., detection), to dense prediction (e.g., segmentation). Based on the above designs, the entire model is composed solely of a ViT, without any task-specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, GiT sets a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. This mirrors a similar effect observed in LLMs. When training is further enriched with 27 datasets, GiT achieves strong zero-shot results on various tasks. Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at \url{

What problem does this paper attempt to address?

The paper proposes GiT (Generalist Vision Transformer), a framework that addresses various vision tasks with a simple multi-layer Transformer architecture and attempts to bridge the architectural gap between vision and language processing. Specifically, the goals of GiT are:

1. **Unified foundational framework for vision modeling**: With a concise multi-layer Transformer structure and a general language interface, GiT integrates a range of vision-centric tasks, including ones that generalist models often overlook, such as object detection and semantic segmentation.
2. **Multi-task capabilities similar to large language models (LLMs)**: GiT relies on parameter sharing and a unified learning objective to handle multiple tasks in one model, achieving the best generalist performance on five representative benchmarks, with the tasks mutually reinforcing one another.
3. **Strong generalization ability**: GiT adopts a one-stage joint training strategy, similar to the approach used for large language models, and is trained on 27 publicly available datasets, achieving strong zero-shot and few-shot performance across various tasks.

To achieve these goals, GiT employs the following key techniques:

- **General language interface**: All vision tasks are cast into a unified representation within an autoregressive framework, with targets represented as token sequences drawn from a standard vocabulary.
- **Multi-task templates and parallel decoding**: By dividing the image into multiple sub-regions and decoding each sub-region simultaneously, GiT can efficiently handle tasks at different perceptual scales.
- **Multi-layer Transformer architecture**: GiT is built on a ViT structure that uses window attention to handle language sequences and high-resolution images, with some global attention blocks included to facilitate feature propagation.

Through these designs, GiT not only simplifies the model structure but also demonstrates strong performance across various vision tasks.
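
To make the idea of a language interface with parallel sub-region decoding more concrete, here is a minimal Python sketch of how a detection target could be written as a token sequence and assigned to an image sub-region. It is an illustration under stated assumptions, not GiT's actual implementation: the bin count, the `<n>` token format, the grid size, the centre-based assignment rule, and the function names `box_to_tokens`, `tokens_to_box`, and `assign_to_grid` are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

NUM_BINS = 1000  # assumed number of coordinate bins; not taken from the paper


def box_to_tokens(box: Box, image_size: Tuple[int, int], label: str,
                  num_bins: int = NUM_BINS) -> List[str]:
    """Quantize a pixel-space box into four bin tokens plus a class-name token."""
    width, height = image_size
    x1, y1, x2, y2 = box
    scale = num_bins - 1
    normalized = [x1 / width, y1 / height, x2 / width, y2 / height]
    # Clamp to the valid bin range so out-of-image coordinates stay in vocabulary.
    bins = [min(scale, max(0, round(c * scale))) for c in normalized]
    return [f"<{b}>" for b in bins] + [label]


def tokens_to_box(tokens: List[str], image_size: Tuple[int, int],
                  num_bins: int = NUM_BINS) -> Tuple[Box, str]:
    """Invert box_to_tokens, up to quantization error."""
    width, height = image_size
    scale = num_bins - 1
    x1, y1, x2, y2 = (int(t.strip("<>")) / scale for t in tokens[:4])
    return (x1 * width, y1 * height, x2 * width, y2 * height), tokens[4]


def assign_to_grid(box: Box, image_size: Tuple[int, int],
                   grid: int = 5) -> Tuple[int, int]:
    """Return the (row, col) of the sub-region that contains the box centre."""
    width, height = image_size
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = min(grid - 1, int(cx / width * grid))
    row = min(grid - 1, int(cy / height * grid))
    return row, col


if __name__ == "__main__":
    box = (100.0, 40.0, 300.0, 200.0)
    image_size = (400, 400)
    tokens = box_to_tokens(box, image_size, "person")
    print(tokens)
    print(tokens_to_box(tokens, image_size))
    print(assign_to_grid(box, image_size))
```

In a scheme like this, each sub-region would own the targets whose centres fall inside it, so the sequences for different sub-regions can be decoded independently and in parallel. Coarser or finer grids trade decoding cost against how many targets a single sub-region has to emit.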

GiT: Towards Generalist Vision Transformer through Universal Language Interface

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation.

GhostViT: Expediting Vision Transformers Via Cheap Operations

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

ViT-MVT: A Unified Vision Transformer Network for Multiple Vision Tasks.

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

Super Vision Transformer

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

MG-ViT: A Multi-Granularity Method for Compact and Efficient Vision Transformers

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Global Context Vision Transformers

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

FViT: A Focal Vision Transformer with Gabor Filter

ViTAR: Vision Transformer with Any Resolution