Abstract: Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. One of the key challenges behind this task is leveraging the referring expression to highlight relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder to extract helpful multi-modal context. This way, accurate segmentation results can be obtained with a lightweight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. For video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in parallel, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT, a single framework that handles both image and video inputs, with enhanced segmentation capabilities for the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
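
As a rough illustration of the dense attention idea described in the abstract, the sketch below lets every pixel feature attend over the word features of the referring expression and then adds the gathered linguistic cues back to the visual features. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`pixel_word_attention`, `softmax`), the random projection matrices, the feature sizes, and the use of plain element-wise addition for fusion are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pixel_word_attention(visual, words, w_q, w_k, w_v):
    """Collect per-pixel linguistic cues with scaled dot-product attention.

    visual: (H, W, C) visual feature map from an intermediate encoder stage
    words:  (T, D) word features from a language encoder
    w_q:    (C, d) projection for pixel queries   (hypothetical parameter)
    w_k:    (D, d) projection for word keys       (hypothetical parameter)
    w_v:    (D, C) projection for word values     (hypothetical parameter)
    Returns the (H, W, C) linguistic feature map and (H, W, T) attention weights.
    """
    h, w, c = visual.shape
    queries = visual.reshape(h * w, c) @ w_q             # (HW, d)
    keys = words @ w_k                                   # (T, d)
    values = words @ w_v                                 # (T, C)
    scores = queries @ keys.T / np.sqrt(keys.shape[1])   # (HW, T)
    weights = softmax(scores, axis=-1)                   # each pixel's weights sum to 1
    lang = weights @ values                              # (HW, C)
    return lang.reshape(h, w, c), weights.reshape(h, w, -1)


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 5, 8))    # H=4, W=5, C=8
words = rng.normal(size=(6, 12))       # T=6 words, D=12
w_q = rng.normal(size=(8, 16))
w_k = rng.normal(size=(12, 16))
w_v = rng.normal(size=(12, 8))

lang, weights = pixel_word_attention(visual, words, w_q, w_k, w_v)

# Assumption: fusion by element-wise addition, as a simple stand-in for
# whatever fusion the actual framework applies inside the encoder.
fused = visual + lang
```

In an early-fusion design of the kind the abstract describes, a fused map like `fused` would be passed on to the next encoder stage rather than to a separate cross-modal decoder, so later visual layers already operate on language-conditioned features.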

Language-Aware Vision Transformer for Referring Segmentation