CSFNet: a compact and efficient convolution-transformer hybrid vision model
Jian Feng, Peng Wu, Renjie Xu, Xiaoming Zhang, Tao Wang, Xuan Li
DOI: https://doi.org/10.1007/s11042-024-18417-3
IF: 2.577
2024-02-13
Multimedia Tools and Applications
Abstract: The Vision Transformer (ViT) has demonstrated impressive performance in various visual tasks, but its high computational requirements limit its applicability on edge devices. Conversely, convolutional neural networks (CNNs) are commonly used in mobile applications, but their static weights and weak global modeling hinder their performance. In this work, we propose CSFNet, a lightweight hybrid model for classification and dense prediction that combines a strong local inductive bias with long-range modeling. To associate local and global information, we introduce two layered structures. First, the Local-Attention Block (LAB), with adaptive kernels and an adaptive channel expansion ratio, aggregates n × n local information layer by layer, capturing multi-stage detail features and providing an efficient local inductive bias. Second, we introduce a linear-complexity Channel-Spatial Fusion Attention (CSFA) that projects the attention matrix along both the channel and token dimensions. The relationships between tokens are aggregated stage by stage to encode contextual associations efficiently, using low-rank matrices and element-wise operations to reduce computational complexity. Experimental results show that the proposed CSFNet-XXS/XS/S models, with 1.4M/2.4M/5.6M parameters and 0.3G/0.5G/1.1G multiply-adds (MAdds), achieve 70.23%/74.91%/78.82% top-1 accuracy on ImageNet-1k, which is competitive with recent mainstream methods. Furthermore, CSFNet performs well on small-scale datasets, MS-COCO2017, and ADE-20K.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
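
The abstract describes CSFA as having linear complexity but does not give its formulation. As a rough illustration of why attention-style products can be made linear in the number of tokens, the sketch below compares two multiplication orders for a softmax-free product of queries, keys, and values: (QKᵀ)V, whose cost grows with N², and Q(KᵀV), whose cost grows with N. This is not the CSFA operator. The function names, the toy shapes, and the omission of softmax and scaling are all assumptions made for illustration; with a softmax in place, the reordering is no longer exact, and practical linear-attention designs rely on their own approximations.

```python
def matmul(a, b):
    """Multiply two matrices (lists of rows); also return the multiply-add count."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
           for i in range(n)]
    return out, n * k * m


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def quadratic_order(q, k, v):
    """Compute (Q K^T) V: builds an N x N token-token matrix first."""
    scores, cost_scores = matmul(q, transpose(k))   # N x N
    out, cost_out = matmul(scores, v)               # N x d
    return out, cost_scores + cost_out


def linear_order(q, k, v):
    """Compute Q (K^T V): builds a d x d channel-channel matrix first."""
    context, cost_context = matmul(transpose(k), v)  # d x d
    out, cost_out = matmul(q, context)               # N x d
    return out, cost_context + cost_out


# Toy inputs (hypothetical): N tokens with D channels, integer-valued so the
# two orders can be compared exactly.
N, D = 6, 2
q = [[(i + j) % 3 for j in range(D)] for i in range(N)]
k = [[(2 * i + j) % 4 for j in range(D)] for i in range(N)]
v = [[(i * j + 1) % 5 for j in range(D)] for i in range(N)]

out_quad, cost_quad = quadratic_order(q, k, v)
out_lin, cost_lin = linear_order(q, k, v)

print("same result:", out_quad == out_lin)
print("multiply-adds, (QK^T)V:", cost_quad)   # 2 * N * N * D
print("multiply-adds, Q(K^T V):", cost_lin)   # 2 * N * D * D
```

Because matrix multiplication is associative, both orders give the same output here, while the multiply-add counts are 2·N²·D and 2·N·D² respectively. When the token count N is much larger than the channel width D, as is typical for image feature maps, the second order is far cheaper.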