Abstract: Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction. However, aggregating these architectures in existing methods often results in inefficiencies. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of detailed local and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance (TOP-1 Acc 86.76%), fewer parameters (20.32M), and greater efficiency (FLOPs 2.83B), making it a highly efficient and lightweight solution for visual tasks on small-scale datasets.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the complementary nature and integration efficiency of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in feature extraction. Specifically:

1. **Balancing Local and Global Feature Extraction**:
   - **Advantages of CNNs**: CNNs excel in local feature extraction, efficiently capturing local spatial hierarchies in images, making them suitable for various image classification tasks.
   - **Advantages of ViTs**: ViTs, through self-attention mechanisms, can capture long-range dependencies and are adept at extracting global contextual information.
2. **Limitations of Existing Methods**:
   - **Limitations of CNNs**: The limited receptive field of small convolutional kernels restricts their ability to capture global information.
   - **Limitations of ViTs**: Although ViTs perform well in capturing long-range dependencies, they are weaker in handling local information and are prone to overfitting on small-scale datasets.
3. **Challenges in Integrated Architectures**:
   - **Separate Modules**: Existing integrated architectures typically treat CNNs and ViTs as independent modules, integrating features through fusion blocks, which may lead to information loss.
   - **Computational Complexity**: The high computational complexity of multi-scale feature extraction and self-attention mechanisms affects the model's efficiency and performance.

### Solution

To overcome the above issues, the paper proposes the CNN-Transformer Aggregation Network (CTA-Net), which includes the following two key modules:

1. **Lightweight Multi-Scale Feature Fusion Multi-Head Self-Attention Module (LMF-MHSA)**:
   - **Multi-Scale Feature Fusion**: Extracts multi-scale features using different convolutional kernel sizes, enhancing the model's sensitivity to features of various scales.
   - **Lightweight Self-Attention Mechanism**: Optimizes computational resources through depthwise separable convolutions and linear projections, reducing the number of parameters and computational load, thereby improving model efficiency.
2. **Reverse Reconstruction CNN Variant Module (RRCV)**:
   - **Reverse Embedding and Reconstruction**: Embeds the vectors generated by the Transformer back into the feature map, reduces dimensions through pointwise convolutions, and re-embeds the processed feature map into the Transformer framework, avoiding information loss caused by intermediate fusion blocks.
   - **CNN Variants**: Tests standard CNNs, residual modules, and depthwise separable convolution modules, verifying the optimization effects of different convolution strategies in local feature extraction.

### Experimental Results

- **Performance Improvement**: CTA-Net achieves significant performance improvements on multiple small-scale datasets (e.g., CIFAR-10, CIFAR-100, APTOS2019, RFMiD2020), particularly excelling in TOP-1 accuracy, parameter count, and computational efficiency compared to existing CNN and ViT variant models.
- **Resource Efficiency**: While maintaining high performance, CTA-Net has a lower parameter count (20.32M) and computational complexity (2.83B FLOPs), making it particularly suitable for resource-constrained environments.

### Conclusion

CTA-Net effectively combines the strengths of CNNs and ViTs, addressing the shortcomings of existing methods in local and global feature extraction, computational efficiency, and model performance, providing an efficient and lightweight solution for visual tasks on small-scale datasets.
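### Illustrative Sketch: Parameter Cost of Depthwise Separable Convolutions

The summary attributes part of the LMF-MHSA parameter reduction to depthwise separable convolutions applied across several kernel sizes. The arithmetic behind that kind of saving can be checked with a few lines of Python. This is a minimal sketch, not the paper's configuration: the channel counts (64 and 128) and the kernel sizes (3, 5, 7) are assumptions chosen for illustration, and biases are omitted.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative sizes only; these are NOT taken from the paper.
C_IN, C_OUT = 64, 128

for k in (3, 5, 7):
    std = standard_conv_params(C_IN, C_OUT, k)
    sep = depthwise_separable_params(C_IN, C_OUT, k)
    print(f"k={k}: standard={std}, separable={sep}, ratio={sep / std:.3f}")
```

Under these assumed sizes, the cost of a standard convolution grows with the square of the kernel size times both channel counts, while the separable version pays the kernel-size cost only once per input channel. That is why larger kernels in a multi-scale branch can stay cheap when they are implemented depthwise.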
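### Illustrative Sketch: Token-to-Feature-Map Round Trip

The RRCV description above involves three data movements: transformer tokens are placed back onto a spatial grid, a pointwise (1 x 1) convolution reduces the channel dimension, and the result is flattened into tokens again. The sketch below shows only that data flow in plain Python. It is not the authors' implementation: the row-major token ordering, the toy 2 x 2 grid, the hand-written weight matrix, and all function names are hypothetical, and the CNN variant that would sit between the steps is omitted.

```python
def tokens_to_feature_map(tokens, height, width):
    """Reshape a flat list of H*W token vectors into an H x W grid (row-major)."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def pointwise_conv(feature_map, weight):
    """Apply a 1 x 1 convolution: the same linear map at every position.

    `weight` has shape (c_out, c_in); bias is omitted for brevity.
    """
    return [
        [[sum(w * x for w, x in zip(row_w, pixel)) for row_w in weight]
         for pixel in row]
        for row in feature_map
    ]


def feature_map_to_tokens(feature_map):
    """Flatten an H x W grid back into a row-major token sequence."""
    return [pixel for row in feature_map for pixel in row]


# Hypothetical toy input: four 4-dimensional tokens forming a 2 x 2 grid.
tokens = [[float(i + j) for j in range(4)] for i in range(4)]
# Hypothetical 1 x 1 kernel that halves the channels by averaging pairs.
weight = [[0.5, 0.5, 0.0, 0.0],
          [0.0, 0.0, 0.5, 0.5]]

grid = tokens_to_feature_map(tokens, height=2, width=2)
reduced = pointwise_conv(grid, weight)
out_tokens = feature_map_to_tokens(reduced)
print(len(out_tokens), len(out_tokens[0]))
```

The number of tokens is unchanged by the round trip; only the per-token channel dimension shrinks, which is what lets the processed features re-enter a token-based pipeline without a separate fusion block.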

CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction
