Abstract: In this paper, we show that plain vision transformers have surprisingly good properties for body pose estimation in several respects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs a plain, non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Building on this flexibility, we propose a novel ViTPose+ model that handles heterogeneous body keypoint categories across different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark in both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, and MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, and AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.

Transgaze: exploring plain vision transformers for gaze estimation

ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Gaze Estimation using Transformer

Glance-and-Gaze Vision Transformer

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Gaze Estimation Based on Convolutional Structure and Sliding Window-Based Attention Mechanism

FViT: A Focal Vision Transformer with Gabor Filter

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

ViTPose++: Vision Transformer for Generic Body Pose Estimation

A Bio-Inspired Visual Perception Transformer for Cross-Domain Semantic Segmentation of High-Resolution Remote Sensing Images

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

Global Context Vision Transformers

DeepViT: Towards Deeper Vision Transformer

Vision Transformer with Sparse Scan Prior

Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer

GiT: Towards Generalist Vision Transformer through Universal Language Interface

ASAFormer: Visual tracking with convolutional vision transformer and asymmetric selective attention

SimViT: Exploring a Simple Vision Transformer with sliding windows