Abstract: Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. One of the key challenges behind this task is leveraging the referring expression to highlight relevant positions in the image or video frames. A common paradigm for tackling this problem in both the image and the video domains is to use a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, in parallel with the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder to extract helpful multi-modal context. This way, accurate segmentation results can be obtained with a light-weight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. For video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in parallel, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT, a single framework that handles both image and video inputs, with enhanced segmentation capabilities for the unified referring segmentation task.
Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
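
As a rough illustration of what "collecting pixel-specific linguistic cues" with dense attention can mean, the following minimal sketch lets every pixel feature attend over all word features and gather its own weighted mix of them. It is not the LAVT implementation: the function name `pixel_word_attention`, the use of plain scaled dot-product attention without learned projections, and the assumption that visual and word features share the same channel dimension are all illustrative choices made here.

```python
import numpy as np

def pixel_word_attention(visual, words):
    """Hypothetical helper: per-pixel attention over word features.

    visual: (H, W, C) array of pixel features.
    words:  (T, C) array of word features (same C is an assumption).
    Returns the attended language features (H, W, C) and the
    attention weights (H, W, T).
    """
    h, w, c = visual.shape
    queries = visual.reshape(h * w, c)             # one query per pixel
    scores = queries @ words.T / np.sqrt(c)        # (H*W, T) similarities
    scores -= scores.max(axis=1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the T words
    attended = weights @ words                     # (H*W, C) per-pixel language cue
    return attended.reshape(h, w, c), weights.reshape(h, w, -1)

# Toy usage with random features: a 4x5 feature map and a 3-word expression.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 5, 8))
words = rng.normal(size=(3, 8))
attended, weights = pixel_word_attention(visual, words)
fused = visual + attended  # simple additive fusion, chosen only for illustration
```

Each pixel receives its own distribution over the words, so different image regions can pick up different parts of the expression; how the gathered cues are merged back into the visual stream is a separate design decision that this sketch does not attempt to reproduce.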

TEVL: Trilinear Encoder for Video-language Representation Learning

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Unified Video-Language Pre-training with Synchronized Audio

Efficient Transfer Learning for Video-language Foundation Models

T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA

VideoTRM: Pre-training for Video Captioning Challenge 2020

Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Jointly Modeling Embedding and Translation to Bridge Video and Language

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

TraVL: Transferring Pre-trained Visual-Linguistic Models for Cross-Lingual Image Captioning

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Unifying Specialized Visual Encoders for Video Language Models

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Language-Aware Vision Transformer for Referring Segmentation

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks