Abstract: Image classification is a fundamental computer vision task that frequently serves as a benchmark for gauging progress in the field. Over the past few years, significant progress has been made in image classification due to the emergence of deep learning. However, challenges remain, such as modeling fine-grained visual information, high computation costs, model parallelism, and inconsistent evaluation protocols across datasets. In this paper, we conduct a comprehensive survey of existing papers on Vision Transformers for image classification. We first introduce the popular image classification datasets that influenced the design of models. Then, we present Vision Transformer models in chronological order, starting with early attempts at adapting attention mechanisms to vision tasks, followed by the adoption of Vision Transformers, which have demonstrated success in capturing intricate patterns and long-range dependencies within images. Finally, we discuss open problems and shed light on opportunities for image classification to facilitate new research ideas.

What problem does this paper attempt to address?

The paper primarily explores the application of Vision Transformers in image classification tasks and provides a comprehensive review of existing related research. The paper aims to address the following core issues:

1. **Evaluating and improving the performance of Vision Transformers in image classification tasks**: With the development of deep learning technology, especially the application of attention mechanisms, Vision Transformers have demonstrated powerful capabilities in image classification tasks. However, how to better utilize these models and how to overcome the challenges they encounter in practical applications are important directions for current research.
2. **Systematically reviewing the development history of Vision Transformers and their application in image classification**: The paper chronologically outlines the development process of Vision Transformer models and provides detailed introductions to representative models such as Vision Transformer (ViT), Swin Transformer, DeiT, CaiT, and iGPT. This section aims to provide readers with a comprehensive perspective on the development of Vision Transformer technology.
3. **Identifying the challenges in current research and future research opportunities**: Although Vision Transformers have achieved significant results in many benchmarks, there are still some unresolved issues, such as high data requirements, high computational costs, and limited model generalization capabilities. The paper discusses these issues and proposes possible research directions to promote further development in this field.
4. **Comparative analysis of the accuracy and efficiency of different methods**: The paper also compares the performance of various widely adopted methods on the same datasets, which helps identify which methods are more effective or efficient in specific scenarios.

In summary, by comprehensively reviewing the application and development of Vision Transformers in image classification tasks, this paper examines how to optimize the performance of these models, identifies current challenges, and explores future research directions.

A Comprehensive Study of Vision Transformers in Image Classification Tasks

A Survey on Vision Transformer

Transformers in Vision: A Survey

Three things everyone should know about Vision Transformers

A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

3D Vision with Transformers: A Survey

A Survey on Visual Transformer

A Survey of Visual Transformers

Vision Transformers: State of the Art and Research Challenges

A Comprehensive Survey of Transformers for Computer Vision

A survey of the Vision Transformers and their CNN-Transformer based Variants

Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review

Transformers Meet Visual Learning Understanding: A Comprehensive Review

Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review

Vision Transformers for Remote Sensing Image Classification

Vision Transformers in Medical Computer Vision -- A Contemplative Retrospection

Analyzing Vision Transformers for Image Classification in Class Embedding Space

Transformers in computational visual media: A survey

Vision Transformers for Computational Histopathology

Vision transformers in domain adaptation and domain generalization: a study of robustness