Abstract: Texture, a significant visual attribute in images, has been extensively investigated across various image recognition applications. Convolutional Neural Networks (CNNs), which have been successful in many computer vision tasks, are currently among the best texture analysis approaches. Meanwhile, Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition, causing a paradigm shift in the field. However, ViTs have so far not been scrutinized for texture recognition, which hinders a proper appreciation of their potential in this specific setting. For this reason, this work explores various pre-trained ViT architectures when transferred to tasks that rely on textures. We review 21 different ViT variants and perform an extensive evaluation and comparison with CNNs and hand-engineered models on several tasks, such as assessing robustness to changes in texture rotation, scale, and illumination, and distinguishing color textures, material textures, and texture attributes. The goal is to understand the potential of these models, and the differences among them, when directly applied to texture recognition, using pre-trained ViTs primarily for feature extraction and employing linear classifiers for evaluation. We also evaluate their efficiency, since computational cost is one of their main drawbacks relative to other methods. Our results show that ViTs generally outperform both CNNs and hand-engineered models, especially with stronger pre-training and on tasks involving in-the-wild textures (images from the internet). We highlight the following promising models: ViT-B with DINO pre-training, BeiTv2, and the Swin architecture, as well as the EfficientFormer as a low-cost alternative. In terms of efficiency, despite having more GFLOPs and parameters, ViT-B and BeiT(v2) can achieve a lower feature extraction time on GPUs than ResNet50.
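
The evaluation protocol mentioned above (a frozen pre-trained backbone used only as a feature extractor, followed by a linear classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the paper's setup: `extract_features` is a hypothetical stand-in that uses a fixed random projection instead of a pre-trained ViT, the striped images are synthetic, and the linear classifier is a simple ridge least-squares fit written with NumPy.

```python
import numpy as np


def make_toy_textures(rng, n_per_class, size=16, noise=0.3):
    """Build two synthetic texture classes: vertical vs. horizontal stripes."""
    pattern = (np.arange(size) % 4 < 2).astype(float)
    vertical = np.tile(pattern, (size, 1))  # every row repeats the pattern
    horizontal = vertical.T
    clean = np.stack([vertical] * n_per_class + [horizontal] * n_per_class)
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    images = clean + noise * rng.standard_normal(clean.shape)
    return images, labels


def extract_features(images, projection):
    """Hypothetical frozen backbone: flatten, project, apply ReLU.

    In practice this would be a forward pass through a pre-trained model
    whose weights are never updated.
    """
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ projection, 0.0)


def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a ridge-regularized least-squares linear classifier."""
    x = np.hstack([features, np.ones((len(features), 1))])  # bias column
    targets = np.eye(num_classes)[labels]                   # one-hot targets
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def predict(features, weights):
    """Return the class with the highest linear score for each sample."""
    x = np.hstack([features, np.ones((len(features), 1))])
    return np.argmax(x @ weights, axis=1)


def run_demo(seed=0):
    """Train the linear probe on toy data and return test accuracy."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((16 * 16, 64)) / 16.0  # frozen weights
    train_x, train_y = make_toy_textures(rng, 100)
    test_x, test_y = make_toy_textures(rng, 50)
    weights = fit_linear_probe(extract_features(train_x, projection), train_y, 2)
    preds = predict(extract_features(test_x, projection), weights)
    return float(np.mean(preds == test_y))


if __name__ == "__main__":
    print(f"linear-probe test accuracy: {run_demo():.3f}")
```

Only `fit_linear_probe` learns anything here; the extractor's weights stay fixed, which is the property that makes the comparison a test of the features themselves and not of fine-tuning.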

A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis