RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization

Mingshu Zhao,Yi Luo,Yong Ouyang

2024-07-20

Abstract: In the realm of resource-constrained mobile vision tasks, the pursuit of efficiency and performance consistently drives innovation in lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). While ViTs excel at capturing global context through self-attention mechanisms, their deployment in resource-limited environments is hindered by computational complexity and latency. Conversely, lightweight CNNs are favored for their parameter efficiency and low latency. This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications. We introduce RepNeXt, a novel model series that integrates multi-scale feature representations and incorporates both serial and parallel structural reparameterization (SRP) to enhance network depth and width without compromising inference speed. Extensive experiments demonstrate RepNeXt's superiority over current leading lightweight CNNs and ViTs, providing advantageous latency across various vision benchmarks. RepNeXt-M4 matches RepViT-M1.5's 82.3% accuracy on ImageNet within 1.5 ms on an iPhone 12, outperforms its AP$^{box}$ by 1.3 on MS-COCO, and reduces parameters by 0.7M. Code and models are available at https://github.com/suous/RepNeXt.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the trade-off between efficiency and performance in mobile vision tasks under resource constraints. Specifically, it seeks to combine the complementary advantages of lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in a versatile vision backbone suitable for resource-limited applications. To this end, it proposes RepNeXt, a novel multi-scale CNN that increases network depth and width through serial and parallel structural reparameterization (SRP) without sacrificing inference speed. The key contributions of the paper are:
1. Proposing a simple yet effective vision backbone, RepNeXt, whose design is consistent across both the inner stage blocks and the downsampling layers, and which achieves competitive or superior performance to existing methods using only basic operational units.
2. Using serial and parallel SRP to increase the network's depth and width during training, which enhances representational capacity without affecting inference speed.
3. Demonstrating that a simple multi-scale CNN without channel attention modules can, through careful design, outperform complex architectures and complex operators obtained through neural architecture search across a range of vision tasks.
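To make the idea behind SRP concrete, the following is a minimal sketch, not taken from the RepNeXt code. It assumes a single-channel 1D signal, zero "same" padding, and a per-channel affine step standing in for an inference-mode normalization layer. All function names (`conv1d_same`, `train_time_forward`, `reparameterize`) and the example values are hypothetical. Because every branch is linear, parallel branches (a 3-tap convolution, a 1-tap convolution, and an identity shortcut) can be summed into one kernel, and the affine step that follows in series can be folded into that kernel and its bias.

```python
def conv1d_same(x, kernel, bias=0.0):
    """Cross-correlate x with an odd-length kernel, zero-padded to keep the length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
        for i in range(len(x))
    ]


def train_time_forward(x, branches, scale, shift):
    """Multi-branch form: run every branch, sum the outputs, then apply the affine step."""
    outputs = [conv1d_same(x, kernel, bias) for kernel, bias in branches]
    summed = [sum(values) for values in zip(*outputs)]
    return [scale * v + shift for v in summed]


def reparameterize(branches, scale, shift):
    """Collapse the branches and the affine step into one (kernel, bias) pair."""
    size = max(len(kernel) for kernel, _ in branches)
    merged_kernel = [0.0] * size
    merged_bias = 0.0
    # Parallel merge: centre each smaller kernel inside the largest one and add.
    for kernel, bias in branches:
        offset = (size - len(kernel)) // 2
        for i, value in enumerate(kernel):
            merged_kernel[offset + i] += value
        merged_bias += bias
    # Serial merge: fold the affine step that follows into the weights and bias.
    return [scale * v for v in merged_kernel], scale * merged_bias + shift


# Hypothetical example values: 3-tap conv, 1-tap conv, identity shortcut.
x = [1.0, -2.0, 3.0, 0.5, 4.0]
branches = [([0.2, -0.5, 0.1], 0.3), ([0.7], -0.1), ([1.0], 0.0)]
scale, shift = 1.5, -0.25

reference = train_time_forward(x, branches, scale, shift)
kernel, bias = reparameterize(branches, scale, shift)
fused = conv1d_same(x, kernel, bias)
print(max(abs(a - b) for a, b in zip(reference, fused)))
```

The printed value is the largest difference between the multi-branch output and the single-convolution output, which should be zero up to floating-point rounding. The merged form performs one convolution regardless of how many branches were used, which is why such branches add capacity during training without adding inference cost.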

RepViT: Revisiting Mobile CNN From ViT Perspective

RepECN: Making ConvNets Better Again for Efficient Image Super-Resolution

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations

InceptionNeXt: When Inception Meets ConvNeXt

RapidNet: Multi-Level Dilated Convolution Based Mobile Backbone

Lightweight Vision Transformer with Cross Feature Attention

RDPNet: a Single-Path Lightweight CNN with Re-Parameterization for CPU-type Edge Devices

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

nnMobileNet: Rethinking CNN for Retinopathy Research

Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks

RepGhost: A Hardware-Efficient Ghost Module via Re-parameterization

Rethinking Vision Transformers for MobileNet Size and Speed

ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices

CSFNet: a compact and efficient convolution-transformer hybrid vision model

ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding

Lightweight Super-Resolution Reconstruction Vision Transformers of Remote Sensing Image Based on Structural Re-Parameterization

Reparameterizable Multibranch Bottleneck Network for Lightweight Image Super-Resolution