RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization

Mingshu Zhao, Yi Luo, Yong Ouyang
2024-07-20
Abstract: In the realm of resource-constrained mobile vision tasks, the pursuit of efficiency and performance consistently drives innovation in lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). While ViTs excel at capturing global context through self-attention mechanisms, their deployment in resource-limited environments is hindered by computational complexity and latency. Conversely, lightweight CNNs are favored for their parameter efficiency and low latency. This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications. We introduce RepNeXt, a novel model series that integrates multi-scale feature representations and incorporates both serial and parallel structural reparameterization (SRP) to enhance network depth and width without compromising inference speed. Extensive experiments demonstrate RepNeXt's superiority over current leading lightweight CNNs and ViTs, providing advantageous latency across various vision benchmarks. RepNeXt-M4 matches RepViT-M1.5's 82.3% accuracy on ImageNet within 1.5 ms on an iPhone 12, outperforms its AP$^{box}$ by 1.3 on MS-COCO, and reduces parameters by 0.7M. Code and models are available at https://github.com/suous/RepNeXt.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the trade-off between efficiency and performance in mobile vision tasks under resource constraints. Specifically, it seeks to combine the complementary advantages of lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in a versatile vision backbone suitable for resource-limited applications. To this end, the paper proposes RepNeXt, a multi-scale CNN that uses serial and parallel structural reparameterization (SRP) to increase network depth and width during training without sacrificing inference speed. The key contributions of the paper are:
1. A simple yet effective vision backbone, RepNeXt, whose design is consistent across inner stage blocks and downsampling layers, and which achieves competitive or superior performance to existing methods using only basic operational units.
2. A serial and parallel SRP mechanism that increases the network's depth and width during training, enhancing representational capacity without affecting inference speed.
3. Evidence that a carefully designed, simple multi-scale CNN without channel attention modules can outperform sophisticated architectures and operators found through neural architecture search across various vision tasks.
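The reason structural reparameterization costs nothing at inference is that convolution is linear, so extra training-time branches can be folded into a single kernel afterwards. The sketch below illustrates two generic merges under simplifying assumptions: a single channel, stride 1, and zero padding. The helper names `conv2d`, `merge_parallel`, and `fuse_conv_bn` are hypothetical, and the branch layout is chosen for illustration only; it is not taken from the RepNeXt code.

```python
import math
import random

def conv2d(x, k, bias=0.0):
    """Single-channel 'same' convolution (stride 1, zero padding, odd kernel)."""
    n, h, w = len(k), len(x), len(x[0])
    p = n // 2

    def at(r, c):  # zero padding outside the image
        return x[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    return [[bias + sum(k[i][j] * at(r + i - p, c + j - p)
                        for i in range(n) for j in range(n))
             for c in range(w)] for r in range(h)]

def merge_parallel(k3, b3, k1, b1):
    """Parallel merge: fold a 1x1 branch into the centre tap of a 3x3 branch."""
    k = [row[:] for row in k3]
    k[1][1] += k1[0][0]
    return k, b3 + b1

def fuse_conv_bn(k, b, gamma, beta, mean, var, eps=1e-5):
    """Serial merge: fold an inference-mode batch norm into the preceding conv."""
    s = gamma / math.sqrt(var + eps)
    return [[v * s for v in row] for row in k], (b - mean) * s + beta

def close(a, b):
    """Element-wise comparison of two 2D lists up to floating-point error."""
    return all(math.isclose(u, v, abs_tol=1e-9)
               for ra, rb in zip(a, b) for u, v in zip(ra, rb))

random.seed(0)
x = [[random.gauss(0, 1) for _ in range(6)] for _ in range(5)]
k3 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]
k1 = [[random.gauss(0, 1)]]
b3, b1 = 0.3, -0.1

# Training-time form: two parallel branches whose outputs are summed.
y3, y1 = conv2d(x, k3, b3), conv2d(x, k1, b1)
train_out = [[u + v for u, v in zip(r3, r1)] for r3, r1 in zip(y3, y1)]
k, b = merge_parallel(k3, b3, k1, b1)
parallel_ok = close(train_out, conv2d(x, k, b))

# Training-time form: a conv followed by batch norm with fixed statistics.
gamma, beta, mean, var = 1.7, 0.2, -0.4, 0.9
bn_out = [[gamma * (v - mean) / math.sqrt(var + 1e-5) + beta for v in row]
          for row in y3]
kf, bf = fuse_conv_bn(k3, b3, gamma, beta, mean, var)
serial_ok = close(bn_out, conv2d(x, kf, bf))
```

Both `parallel_ok` and `serial_ok` compare the multi-branch training-time output with the output of the single merged convolution. The same algebra extends to multi-channel kernels, where the merge is applied per output channel.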