Abstract: Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. Because they encode fine-level features inefficiently, ViTs still perform worse than state-of-the-art CNNs when trained from scratch on a midsize dataset like ImageNet. Through experimental analysis, we attribute this to two causes: 1) the simple tokenization of input images fails to model important local structure such as edges and lines, leading to low training sample efficiency; 2) the redundant attention backbone design of ViTs leads to limited feature richness for fixed computation budgets and limited training samples. To overcome these limitations, we present a new simple and generic architecture, termed Vision Outlooker (VOLO), which implements a novel outlook attention operation that dynamically conducts local feature aggregation in a sliding-window manner across the input image. Unlike self-attention, which focuses on modeling global dependencies of local features at a coarse level, our outlook attention targets the encoding of finer-level features, which are critical for recognition but ignored by self-attention. Outlook attention breaks the bottleneck of self-attention, whose computation cost scales quadratically with the input spatial dimension, and is thus much more memory efficient. Compared to our Tokens-To-Token Vision Transformer (T2T-ViT), VOLO can more efficiently encode fine-level features that are essential for high-performance visual recognition. Experiments show that with only 26.6M learnable parameters, VOLO achieves 84.2% top-1 accuracy on ImageNet-1K without using extra training data, 2.7% better than T2T-ViT with a comparable number of parameters. When the model size is scaled up to 296M parameters, its performance can be further improved to 87.1%, setting a new record for ImageNet-1K classification. In addition, we also use the proposed VOLO models as pretrained models and report superior performance on downstream tasks, such as semantic segmentation. Code is available at https://github.com/sail-sg/volo.
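The scaling claim above can be made concrete with a rough count of attention weights. The sketch below is illustrative only and is not taken from the VOLO code: the function names are hypothetical, a single attention head is assumed, and the local variant assumes that each spatial location produces a K²×K² weight matrix over its K×K window, which is our reading of the outlook attention design rather than something the abstract states.

```python
def global_attention_weights(height: int, width: int) -> int:
    """Weights in one global self-attention map: every token attends to every token."""
    tokens = height * width
    return tokens * tokens


def local_window_weights(height: int, width: int, kernel_size: int = 3) -> int:
    """Weights for a sliding-window scheme (assumption: K^2 x K^2 weights per location)."""
    tokens = height * width
    window = kernel_size * kernel_size
    return tokens * window * window


if __name__ == "__main__":
    # Doubling the side length multiplies the token count by 4.
    for side in (14, 28, 56):
        g = global_attention_weights(side, side)
        l = local_window_weights(side, side)
        print(f"{side}x{side}: global={g} local={l} ratio={g / l:.1f}")
```

Under these assumptions, doubling the feature-map side length multiplies the global count by 16 but the local count only by 4, which is the quadratic-versus-linear gap in the number of tokens that the abstract refers to.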

Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics.

Vision-Language Adaptive Mutual Decoder for OOV-STR

Divert More Attention to Vision-Language Tracking

VOLO: Vision Outlooker for Visual Recognition

Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model

Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation

OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

Towards Better Vision-Inspired Vision-Language Models

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

RES-StS: Referring Expression Speaker via Self-training with Scorer for Goal-Oriented Vision-Language Navigation

Case report: adverse granulomatous reaction (Granuloma formation) and pseudomonas superinfection after lip augmentation by the new filler DermaLive®

GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

VoCo-LLaMA: Towards Vision Compression with Large Language Models