Abstract: Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently employed Transformer-like architectures in the computer vision (CV) field and have demonstrated their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as on multiple sensory data streams (images, point clouds, and vision-language data). Because of their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements over multiple benchmarks compared with modern convolutional neural networks (CNNs). In this survey, we comprehensively review over 100 different visual Transformers according to three fundamental CV tasks and different data stream types, and propose a taxonomy to organize the representative methods according to their motivations, structures, and application scenarios. Because of their differences in training settings and dedicated vision tasks, we also evaluate and compare all these existing visual Transformers under different configurations. Furthermore, we reveal a series of essential but unexploited aspects that may empower such visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between the visual Transformers and the sequential ones. Finally, two promising research directions are suggested for future investment. We will continue to update the latest articles and their released source codes at.

VSET: A Multimodal Transformer for Visual Speech Enhancement

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Multiresolution and Multimodal Speech Recognition with Transformers

TransAVS: End-to-End Audio-Visual Segmentation with Transformer

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

AVSegFormer: Audio-Visual Segmentation with Transformer

SETransformer: Speech Enhancement Transformer

AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Siamese Vision Transformers are Scalable Audio-visual Learners

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

A Survey of Visual Transformers

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

Contextual Dependency Vision Transformer for Spectrogram-Based Multivariate Time Series Analysis