CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Chunlei Meng, Jiacheng Yang, Wei Lin, Bowen Liu, Hongda Zhang, Chun Ouyang, Zhongxue Gan
2024-10-15
Abstract: Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction. However, aggregating these architectures in existing methods often results in inefficiencies. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of both detailed local information and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance (TOP-1 Acc 86.76%), fewer parameters (20.32M), and greater efficiency (FLOPs 2.83B), making it a highly efficient and lightweight solution for visual tasks on small-scale datasets.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses how to combine the complementary strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for feature extraction without sacrificing efficiency. Specifically:

1. **Balancing Local and Global Feature Extraction**:
   - **Advantages of CNNs**: CNNs excel at local feature extraction, efficiently capturing local spatial hierarchies in images, which makes them suitable for a wide range of image classification tasks.
   - **Advantages of ViTs**: Through self-attention, ViTs capture long-range dependencies and are well suited to extracting global contextual information.
2. **Limitations of Existing Methods**:
   - **Limitations of CNNs**: The limited receptive field of small convolutional kernels restricts their ability to capture global information.
   - **Limitations of ViTs**: Although ViTs capture long-range dependencies well, they are weaker at handling local information and are prone to overfitting on small-scale datasets.
3. **Challenges in Integrated Architectures**:
   - **Separate Modules**: Existing integrated architectures typically treat CNNs and ViTs as independent modules and merge their features through fusion blocks, which may lead to information loss.
   - **Computational Complexity**: Multi-scale feature extraction and self-attention are both computationally expensive, which hurts the model's efficiency and performance.

### Solution

To overcome these issues, the paper proposes the CNN-Transformer Aggregation Network (CTA-Net), which is built around two key modules:

1. **Lightweight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) Module**:
   - **Multi-Scale Feature Fusion**: Extracts multi-scale features using convolutional kernels of different sizes, making the model more sensitive to features at different scales.
   - **Lightweight Self-Attention Mechanism**: Uses depthwise separable convolutions and linear projections to reduce the number of parameters and the computational load, thereby improving model efficiency.
2. **Reverse Reconstruction CNN-Variants (RRCV) Module**:
   - **Reverse Embedding and Reconstruction**: Maps the token vectors produced by the Transformer back into a feature map, reduces its channel dimension with pointwise convolutions, and re-embeds the processed feature map into the Transformer framework, avoiding the information loss caused by intermediate fusion blocks.
   - **CNN Variants**: Tests standard CNNs, residual modules, and depthwise separable convolution modules to compare how different convolution strategies affect local feature extraction.

### Experimental Results

- **Performance Improvement**: On multiple small-scale datasets (e.g., CIFAR-10, CIFAR-100, APTOS2019, RFMiD2020), CTA-Net outperforms existing CNN and ViT variant models in TOP-1 accuracy, parameter count, and computational efficiency. The abstract reports a TOP-1 accuracy of 86.76%.
- **Resource Efficiency**: CTA-Net maintains this performance with a lower parameter count (20.32M) and lower computational cost (2.83B FLOPs), making it particularly suitable for resource-constrained environments.

### Conclusion

CTA-Net combines the strengths of CNNs and ViTs, addressing the shortcomings of existing methods in local and global feature extraction, computational efficiency, and model performance, and provides an efficient and lightweight solution for visual tasks on small-scale datasets.
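As a rough illustration of why the depthwise separable convolutions mentioned for LMF-MHSA reduce the parameter count, the following sketch compares the number of weights in a standard convolution with those in a depthwise convolution followed by a pointwise convolution. The channel and kernel sizes are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative sizes only (assumed, not from the paper).
c_in, c_out, k = 64, 64, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
print(standard, separable)  # 36864 4672
```

With these assumed sizes the separable form needs roughly one eighth of the weights, and the gap widens as the kernel size or the number of output channels grows.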
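The reverse embedding and reconstruction step described for RRCV can be pictured as a round trip between a token sequence and a feature map. The sketch below is a minimal pure-Python illustration under stated assumptions: the function names, the toy token values, and the fixed weight matrix are all hypothetical, the weights would be learned in practice, and the CNN variant that the paper applies to the feature map is reduced here to a single pointwise (1 x 1) convolution.

```python
def tokens_to_feature_map(tokens, height, width):
    """Reverse embedding: lay N = height * width token vectors out on an H x W grid."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def pointwise_conv(feature_map, weight):
    """1 x 1 convolution: apply one linear map (out_ch x in_ch) at every grid position."""
    return [
        [[sum(w * x for w, x in zip(w_row, pixel)) for w_row in weight] for pixel in row]
        for row in feature_map
    ]


def feature_map_to_tokens(feature_map):
    """Re-embedding: flatten the grid back into a token sequence in row-major order."""
    return [pixel for row in feature_map for pixel in row]


# Four toy tokens with 4 channels each, arranged as a 2 x 2 grid.
tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0],
          [2.0, 2.0, 2.0, 2.0], [4.0, 3.0, 2.0, 1.0]]
# Hypothetical 2 x 4 weight that reduces 4 channels to 2 by averaging channel pairs.
weight = [[0.5, 0.5, 0.0, 0.0],
          [0.0, 0.0, 0.5, 0.5]]

grid = tokens_to_feature_map(tokens, height=2, width=2)
reduced = pointwise_conv(grid, weight)
new_tokens = feature_map_to_tokens(reduced)
print(new_tokens)  # [[1.5, 3.5], [0.5, 0.5], [2.0, 2.0], [3.5, 1.5]]
```

The token count is preserved while the channel dimension shrinks from 4 to 2, so the output can be passed to a subsequent Transformer stage without a separate fusion block.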