Abstract: Referring segmentation is a fundamental vision-language task that aims to segment an object from an image or video according to a natural language description. One of the key challenges of this task is using the referring expression to highlight the relevant positions in the image or video frames. One paradigm for tackling this problem in both the image and the video domains is to use a powerful vision-language ("cross-modal") decoder to fuse features extracted independently by a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer's overwhelming success in many other vision-language tasks. Taking a different approach in this work, we show that significantly better cross-modal alignment can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Building on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder to extract helpful multi-modal context. In this way, accurate segmentation results can be obtained with a lightweight mask predictor. One of the key components of the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. For video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in parallel, which can exploit spatio-temporal dependencies at different levels of granularity. We further introduce unified LAVT, a single framework capable of handling both image and video inputs, with enhanced segmentation capabilities for the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.
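
To make the idea of a dense attention mechanism that collects pixel-specific linguistic cues more concrete, the following is a minimal sketch in plain Python. It is not the implementation used in this work: the function names `pixel_word_attention` and `fuse` are hypothetical, the scaled dot-product form is an assumption, and learned projections, normalization, and any gating are omitted. Element-wise addition stands in for the fusion step only as an illustration.

```python
import math


def pixel_word_attention(pixels, words):
    """For each pixel feature, return an attention-weighted sum of word features.

    pixels: list of N feature vectors of length d (one per spatial position).
    words:  list of T feature vectors of length d (one per word).
    Hypothetical helper: no learned projections are applied.
    """
    d = len(words[0])
    scale = 1.0 / math.sqrt(d)
    cues = []
    for p in pixels:
        # Scaled dot-product score between this pixel and every word.
        scores = [scale * sum(pi * wi for pi, wi in zip(p, w)) for w in words]
        # Numerically stable softmax over the words.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Pixel-specific linguistic cue: weighted average of the word features.
        cue = [sum(a * w[k] for a, w in zip(weights, words)) for k in range(d)]
        cues.append(cue)
    return cues


def fuse(pixels, cues):
    """Element-wise addition, an assumed stand-in for the early-fusion step."""
    return [[p + c for p, c in zip(pv, cv)] for pv, cv in zip(pixels, cues)]


# Toy example: two pixels and two words in a 2-dimensional feature space.
pixels = [[1.0, 0.0], [0.0, 1.0]]
words = [[4.0, 0.0], [0.0, 4.0]]
cues = pixel_word_attention(pixels, words)
fused = fuse(pixels, cues)
```

In this toy example each pixel attends most strongly to the word whose feature vector points in the same direction, so the cue gathered at each position differs from pixel to pixel. In an encoder with several stages, a step of this kind would be applied to intermediate visual features rather than once after encoding.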
