RepViT: Revisiting Mobile CNN From ViT Perspective

Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding, Hengjun Pu
DOI: https://doi.org/10.48550/arXiv.2307.09283
2023-07-18
Computer Vision and Pattern Recognition
Abstract: Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80% top-1 accuracy with nearly 1ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Our largest model, RepViT-M3, obtains 81.4% accuracy with only 1.3ms latency. The code and trained models are available at https://github.com/jameslahm/RepViT.
What problem does this paper attempt to address?
The paper aims to narrow the performance gap between lightweight Convolutional Neural Networks (CNNs) and lightweight Vision Transformers (ViTs), and to highlight the potential of lightweight CNNs for deployment on mobile devices. Lightweight ViTs achieve strong accuracy at low latency on mobile hardware, which is usually credited to multi-head self-attention, but the architectural differences between them and lightweight CNNs have not been adequately examined. The paper therefore starts from a standard lightweight CNN, MobileNetV3, and incrementally improves its mobile-friendliness by integrating the efficient architectural choices of lightweight ViTs. The result is RepViT, a new family of pure lightweight CNNs that outperforms existing state-of-the-art lightweight ViTs and shows favorable latency across various vision tasks. On ImageNet, RepViT achieves over 80% top-1 accuracy with nearly 1 ms latency measured on an iPhone 12, which the authors state is, to the best of their knowledge, a first for a lightweight model.