A Comprehensive Study of Vision Transformers in Image Classification Tasks

Mahmoud Khalil, Ahmad Khalil, Alioune Ngom
2023-12-05
Abstract: Image classification is a fundamental task in computer vision that frequently serves as a benchmark for gauging advancements in the field. Over the past few years, significant progress has been made in image classification due to the emergence of deep learning. However, challenges still exist, such as modeling fine-grained visual information, high computation costs, the parallelism of the model, and inconsistent evaluation protocols across datasets. In this paper, we conduct a comprehensive survey of existing papers on Vision Transformers for image classification. We first introduce the popular image classification datasets that influenced the design of models. Then, we present Vision Transformer models in chronological order, starting with early attempts at adapting attention mechanisms to vision tasks, followed by the adoption of Vision Transformers, which have demonstrated success in capturing intricate patterns and long-range dependencies within images. Finally, we discuss open problems and shed light on opportunities for image classification to facilitate new research ideas.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper primarily explores the application of Vision Transformers in image classification tasks and provides a comprehensive review of existing related research. The paper aims to address the following core issues:
1. **Evaluating and improving the performance of Vision Transformers in image classification tasks**: With the development of deep learning technology, especially the application of attention mechanisms, Vision Transformers have demonstrated powerful capabilities in image classification tasks. However, how to better utilize these models and how to overcome the challenges they encounter in practical applications are important directions for current research.
2. **Systematically reviewing the development history of Vision Transformers and their application in image classification**: The paper chronologically outlines the development process of Vision Transformer models and provides detailed introductions to representative models such as Vision Transformer (ViT), Swin Transformer, DeiT, CaiT, and iGPT. This section aims to provide readers with a comprehensive perspective on the development of Vision Transformer technology.
3. **Identifying the challenges in current research and future research opportunities**: Although Vision Transformers have achieved significant results in many benchmarks, there are still some unresolved issues, such as high data requirements, high computational costs, and limited model generalization capabilities. The paper discusses these issues and proposes possible research directions to promote further development in this field.
4. **Comparative analysis of the accuracy and efficiency of different methods**: The paper also compares the performance of various widely adopted methods on the same datasets, which helps identify which methods are more effective or efficient in specific scenarios.
In summary, by comprehensively reviewing the application and development of Vision Transformers in image classification tasks, this paper examines how to optimize the performance of these models, identifies current challenges, and explores future research directions.