Abstract: This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of the 5M magnitude lightweight model on various downstream tasks. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterparts have been recognized by attention-based design. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i2RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth and ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 that surpass equal-order CNN-/Attention-based models significantly. At the same time, EMOv2-5M equipped RetinaNet achieves 41.5 mAP for object detection tasks that surpasses the previous EMO-5M by +2.6. When employing the more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M magnitude models to a new level. Code is available at https://github.com/zhangzjn/EMOv2.
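The abstract's download-latency argument for the 5M scale can be made concrete with some rough arithmetic. The sketch below is only illustrative: it assumes uncompressed 32-bit weights, and the bandwidth values are hypothetical placeholders rather than figures from the paper.

```python
# Back-of-the-envelope download size and time for a model with a given
# parameter count. Assumptions (not from the paper): weights are stored as
# uncompressed float32, and the bandwidths below are hypothetical examples.

def model_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


def download_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Transfer time in seconds; bandwidth is in megabits per second."""
    return size_mb * 8 / bandwidth_mbps


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only to show the calculation.
    for label, mbps in [("slow link", 20.0), ("fast link", 200.0)]:
        for params in (1_000_000, 2_000_000, 5_000_000):
            size = model_size_mb(params)
            print(f"{label}: {params} params, {size:.1f} MB, "
                  f"{download_seconds(size, mbps):.2f} s")
```

Halving the bytes per parameter (for example with 16-bit weights) halves both the size and the transfer time, which is why parameter count is a reasonable proxy for download cost.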

What problem does this paper attempt to address?

This paper addresses the performance optimization of lightweight models in resource-constrained scenarios, in particular how to improve performance on various downstream tasks when the number of parameters is fixed. Specifically, the paper focuses on developing parameter-efficient and lightweight models for dense prediction tasks while trading off the number of parameters, the number of floating-point operations (FLOPs), and performance. The goal is to establish a new frontier for lightweight models at the 5M scale that suits various downstream tasks, such as visual recognition, dense prediction, and image generation.

### Specific problems that the paper attempts to solve include:

1. **Design and Optimization of Lightweight Models**:
   - **Parameter Efficiency**: How to reduce the number of parameters while maintaining the performance of the model.
   - **Computational Efficiency**: How to reduce the number of floating-point operations (FLOPs) of the model to match the computing power of mobile devices.
   - **Performance Improvement**: How to improve the performance of the model on various tasks by improving the model structure when the number of parameters is fixed.
2. **Introduction of Global Modeling Ability**:
   - **Limitations of Traditional CNNs**: Traditional lightweight CNN models (such as models based on the Inverted Residual Block (IRB)) perform poorly in high-resolution downstream tasks because they lack global modeling ability.
   - **Introduction of Attention Mechanisms**: Introducing the Multi-Head Self-Attention (MHSA) mechanism enhances the global modeling ability of the model and thereby improves performance.
3. **Unified Design of Lightweight Models**:
   - **Meta Mobile Block (MMBlock)**: Rethink the basic modules of lightweight models from a unified perspective, and abstract the IRB in CNNs and the MHSA/FFN modules in Transformers into a general Meta Mobile Block.
   - **Improved Inverted Residual Mobile Block (i2RMB)**: Based on MMBlock, a more modern, improved Inverted Residual Mobile Block is designed, which further improves the performance of the model.
4. **Application of the Model in Multiple Tasks**:
   - **Image Classification**: On the ImageNet dataset, the EMOv2-5M model achieves a Top-1 accuracy of 79.4%, significantly exceeding CNN and attention models of the same scale, and reaches 82.9% with a more robust training recipe.
   - **Object Detection**: Using the RetinaNet framework, the EMOv2-5M model achieves 41.5 mAP in the object detection task, which is 2.6 points higher than the previous EMO-5M model.
   - **Video Recognition**: On the Kinetics-400 dataset, the V-EMO-v2 model achieves a Top-1 accuracy of 65.2% with 5.9M parameters, which is significantly better than other lightweight models.
   - **Image Segmentation and Generation**: Based on the UNet and DiT architectures, the U-EMO-v2 and D-EMO-v2 models are constructed respectively, and significant performance improvements are achieved in multiple downstream tasks.

### Summary

This paper addresses how to improve model performance in resource-constrained scenarios through the design and optimization of lightweight models, especially through the introduction of attention mechanisms and a unified module design. The EMOv2 model proposed in the paper performs well across multiple tasks and offers new ideas and methods for the design of lightweight models.
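The idea of a single block that covers both the IRB and the Transformer's MHSA/FFN modules can be illustrated with a toy sketch. It assumes the block expands the channels, applies a swappable operator, shrinks back, and adds one residual. All names here (`meta_mobile_block`, `identity_op`, `local_mix_op`, `global_mix_op`) are hypothetical, normalization and activations are omitted, and the operators are simplified stand-ins. This is not the authors' implementation, which is available in the linked repository.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]                 # one feature vector per spatial position
Operator = Callable[[Tokens], Tokens]


def linear(tokens: Tokens, weight: List[Vector]) -> Tokens:
    """Apply one linear map (weight is out_dim x in_dim) to every token."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight]
            for tok in tokens]


def identity_op(tokens: Tokens) -> Tokens:
    """No token mixing: the block reduces to an FFN-like expand/shrink."""
    return tokens


def local_mix_op(tokens: Tokens) -> Tokens:
    """Depthwise-conv-like stand-in: per-channel mean over a 3-token window."""
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - 1): i + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def global_mix_op(tokens: Tokens) -> Tokens:
    """Attention-like stand-in: uniform weights over all tokens (not real MHSA)."""
    mean = [sum(col) / len(tokens) for col in zip(*tokens)]
    return [list(mean) for _ in tokens]


def meta_mobile_block(tokens: Tokens, w_expand: List[Vector],
                      w_shrink: List[Vector], op: Operator) -> Tokens:
    """Expand channels, apply the operator, shrink back, add one residual."""
    hidden = op(linear(tokens, w_expand))
    out = linear(hidden, w_shrink)
    return [[x + y for x, y in zip(tok, o)] for tok, o in zip(tokens, out)]


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]                      # 2 tokens, 2 channels
    w_e = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # expand 2 -> 4
    w_s = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]       # shrink 4 -> 2
    for op in (identity_op, local_mix_op, global_mix_op):
        print(op.__name__, meta_mobile_block(x, w_e, w_s, op))
```

Only the `op` argument changes between the three variants, which is the point of the abstraction: the expand, shrink, and residual scaffolding is shared, and the choice of operator decides whether the block mixes information locally, globally, or not at all.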

EMOv2: Pushing 5M Vision Model Frontier

Rethinking Mobile Block for Efficient Attention-based Models

MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

EMU: Effective Multi-Hot Encoding Net for Lightweight Scene Text Recognition with a Large Character Set

A Lightweight YOLOv5-Based Model with Feature Fusion and Dilation Convolution for Image Segmentation

Lightweight Vision Transformer with Cross Feature Attention

MixMobileNet: A Mixed Mobile Network for Edge Vision Applications

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Efficient Optimized YOLOv8 Model with Extended Vision

Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Lightweight Image Super-Resolution with Expectation-Maximization Attention Mechanism

Real-time object detection method based on YOLOv5 and efficient mobile network

EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization

FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

Towards Vision Mixture of Experts for Wildlife Monitoring on the Edge

ExtremeMETA: High-speed Lightweight Image Segmentation Model by Remodeling Multi-channel Metamaterial Imagers

An improved lightweight object detection algorithm for YOLOv5