Abstract: The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representations with a full receptive field at every layer across all image patches, in contrast to CNNs, whose receptive fields grow gradually across layers, and to other alternatives (e.g., large kernels and atrous convolution). In this work, for the first time we explore the global context learning potential of ViTs for dense visual prediction (e.g., semantic segmentation). Our motivation is that, by learning global context with a full receptive field layer by layer, ViTs may capture stronger long-range dependency information, which is critical for dense prediction tasks. We first demonstrate that, by encoding an image as a sequence of patches, a vanilla ViT without local convolution and resolution reduction can yield a stronger visual representation for semantic segmentation. For example, our model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on the day of submission) and performs competitively on Cityscapes. However, the basic ViT architecture falls short in broader dense prediction applications, such as object detection and instance segmentation, due to its lack of a pyramidal structure, high computational demand, and insufficient local context. To tackle general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection, instance segmentation, and semantic segmentation) as well as image classification.
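To make the cost argument in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It counts query-key pairs for full attention over a patch sequence versus attention restricted to windows plus attention across windows. The function names, the 512x512 input, the patch size of 16, the window size of 8, and the modelling of global attention as one summary token per window are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def patch_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def window_partition(rows: int, cols: int, window: int) -> List[List[int]]:
    """Group row-major patch indices into non-overlapping window x window blocks."""
    if rows % window or cols % window:
        raise ValueError("patch grid must be a multiple of the window size")
    windows = []
    for top in range(0, rows, window):
        for left in range(0, cols, window):
            windows.append([
                r * cols + c
                for r in range(top, top + window)
                for c in range(left, left + window)
            ])
    return windows


def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs when every patch attends to every patch."""
    return num_tokens ** 2


def local_global_pairs(windows: List[List[int]]) -> int:
    """Pairs for attention inside each window plus attention across windows.

    Assumption: global attention is modelled as one summary token per window.
    """
    local = sum(len(w) ** 2 for w in windows)
    across = len(windows) ** 2
    return local + across


rows, cols = patch_grid(512, 512, 16)      # hypothetical input and patch size
windows = window_partition(rows, cols, 8)  # hypothetical window size
print(full_attention_pairs(rows * cols))   # full receptive field at every layer
print(local_global_pairs(windows))         # local plus cross-window attention
```

Under these assumed sizes the patch sequence has 1,024 tokens, so full attention scores 1,048,576 pairs per layer, while the windowed variant scores 65,792. This illustrates the kind of saving that motivates a local-global design, but the actual HLG layers may differ from this counting model.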

Camouflaged Object Segmentation with Transformer

TransVOS: Video Object Segmentation with Transformers

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

A Simple yet Effective Network based on Vision Transformer for Camouflaged Object and Salient Object Detection

CoSformer: Detecting Co-Salient Object with Transformers

Vision Transformer with Convolutions Architecture Search

Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

When CNN meet with ViT: decision-level feature fusion for camouflaged object detection

Segmenting Transparent Object in the Wild with Transformer

Unifying Global-Local Representations in Salient Object Detection with Transformer

Vision Transformers: From Semantic Segmentation to Dense Prediction

An Extendable, Efficient and Effective Transformer-based Object Detector

DctViT: Discrete Cosine Transform Meet Vision Transformers

SOTR: Segmenting Objects with Transformers

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Efficient Transformer for Remote Sensing Image Segmentation

GroupTransNet: Group transformer network for RGB-D salient object detection