Abstract: Neural network architecture design requires making many crucial decisions. A common desideratum is that similar decisions, with minor modifications, can be reused across a variety of tasks and applications. To satisfy that, architectures must provide promising latency and performance trade-offs, support a variety of tasks, scale efficiently with respect to the amounts of data and compute, leverage available data from other tasks, and efficiently support various hardware. To this end, we introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks. We revisit the key design principles of hybrid architectures and propose a simple and effective *asymmetric* architecture, in which the distribution of convolutional and transformer blocks is *asymmetric*: more convolutional blocks in the earlier stages, followed by more transformer blocks in the later stages. AsCAN supports a variety of tasks (recognition, segmentation, and class-conditional image generation) and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models. Notably, even without any computation optimization for transformer blocks, our models still yield faster inference speed than existing works featuring efficient attention mechanisms, highlighting the advantages and the value of our approach.

What problem does this paper attempt to address?

This paper addresses key decisions in neural network architecture design, in particular how to combine convolutional neural network (CNN) blocks with Transformer blocks. Specifically, the authors propose AsCAN (Asymmetric Convolution-Attention Networks), a new hybrid architecture, with the following goals:

1. **Efficient performance-latency trade-off**: By distributing convolutional and Transformer blocks asymmetrically, the model achieves a better balance between performance and inference speed across tasks.
2. **Support for multiple tasks**: The architecture covers image recognition, segmentation, and class-conditional image generation, and it can be scaled to large-scale text-to-image generation.
3. **Efficient use of computing resources**: Even without any optimization of the attention mechanism, the model runs faster than existing works that use efficient attention, which indicates that it maps well onto hardware.
4. **Improved training efficiency**: A multi-stage training pipeline is proposed to reduce the training cost of large-scale text-to-image diffusion models.

### Key innovation points

- **Asymmetric architecture design**: Use more convolutional blocks in the early stages and more Transformer blocks in the later stages. The convolutional blocks capture local features and the Transformer blocks capture global dependencies.
- **Applicable to multiple tasks**: The architecture performs well on image classification and can also be applied to image generation tasks such as text-to-image generation.
- **Efficient inference speed**: Even without optimizing the attention mechanism, the model still achieves faster inference than existing works.
- **Multi-stage training strategy**: Pre-training on a small-scale dataset first and then fine-tuning on a large-scale dataset improves training efficiency and reduces the consumption of computing resources.
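The asymmetric design described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Stage` class, the helper functions, the four-blocks-per-stage layout, and the feature-map resolutions are all hypothetical choices made for illustration. The cost proxy relies only on the general fact that self-attention cost grows roughly with the square of the number of tokens, which is one reason placing Transformer blocks at later, lower-resolution stages tends to be cheaper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One network stage (hypothetical description, not the paper's code)."""
    resolution: int  # side length of the feature map; tokens = resolution ** 2
    num_conv: int    # number of convolutional blocks ("C")
    num_attn: int    # number of Transformer blocks ("T")

    def layout(self) -> str:
        # Convolutional blocks first, Transformer blocks after them.
        return "C" * self.num_conv + "T" * self.num_attn


def asymmetric_stages(resolutions, blocks_per_stage=4):
    """Assumed rule: each later stage swaps one more C block for a T block."""
    stages = []
    for index, resolution in enumerate(resolutions):
        num_attn = min(index, blocks_per_stage)
        stages.append(Stage(resolution, blocks_per_stage - num_attn, num_attn))
    return stages


def reversed_stages(stages):
    """Same resolutions, but with the attention-heavy stages placed first."""
    attn_counts = [s.num_attn for s in reversed(stages)]
    return [
        Stage(s.resolution, s.num_conv + s.num_attn - attn, attn)
        for s, attn in zip(stages, attn_counts)
    ]


def relative_attention_cost(stages):
    """Rough proxy: each T block costs (number of tokens) squared."""
    return sum(s.num_attn * (s.resolution ** 2) ** 2 for s in stages)


# Hypothetical resolutions for a 4-stage backbone.
stages = asymmetric_stages([56, 28, 14, 7])
layouts = [s.layout() for s in stages]
early_attention = reversed_stages(stages)

print("asymmetric layout:", layouts)
print("attention cost, asymmetric:", relative_attention_cost(stages))
print("attention cost, attention-first:", relative_attention_cost(early_attention))
```

Under these assumptions, the attention-first variant has the same number of blocks of each type but a much larger attention cost, because its Transformer blocks operate on the high-resolution early stages.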
### Conclusion

With the AsCAN architecture, this paper addresses the performance-latency trade-off of current hybrid architectures and reports superior performance across multiple tasks. In addition, the proposed multi-stage training strategy significantly reduces the training cost of large-scale models.

AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation

A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

MCANet: Hierarchical cross-fusion lightweight transformer based on multi-ConvHead attention for object detection

HiCAN: Hierarchical Convolutional Attention Network for Sequence Modeling

SCTANet: A Spatial Attention-Guided CNN-Transformer Aggregation Network for Deep Face Image Super-Resolution

Convolutional Attention Networks for Scene Text Recognition

CANet: Comprehensive Attention Network for video-based action recognition

Asymmetric Network Combining CNN and Transformer for Building Extraction from Remote Sensing Images

Image Super-Resolution Based on Adaptive Cascading Attention Network

Class attention network for image recognition

Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search

Constructive Autoassociative Neural Network for Facial Recognition

Efficient Lightweight Attention Network for Face Recognition

ASANet: Asymmetric Semantic Aligning Network for RGB and SAR image land cover classification

DAS: A Deformable Attention to Capture Salient Information in CNNs

CVANet: Cascaded visual attention network for single image super-resolution

THCANet: Two-layer hop cascaded asymptotic network for robot-driving road-scene semantic segmentation in RGB-D images

A CNN-Transformer Network Combining CBAM for Change Detection in High-Resolution Remote Sensing Images

iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency

Attention based lightweight asymmetric network for real-time semantic segmentation

A Symmetric Efficient Spatial and Channel Attention (ESCA) Module Based on Convolutional Neural Networks