AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation

Anil Kag, Huseyin Coskun, Jierun Chen, Junli Cao, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Jian Ren
2024-11-08
Abstract: Neural network architecture design requires making many crucial decisions. A common desideratum is that similar decisions, with little modification, can be reused in a variety of tasks and applications. To satisfy that, architectures must provide promising latency and performance trade-offs, support a variety of tasks, scale efficiently with respect to the amounts of data and compute, leverage available data from other tasks, and efficiently support various hardware. To this end, we introduce AsCAN, a hybrid architecture combining both convolutional and transformer blocks. We revisit the key design principles of hybrid architectures and propose a simple and effective asymmetric architecture, where the distribution of convolutional and transformer blocks is asymmetric, containing more convolutional blocks in the earlier stages, followed by more transformer blocks in later stages. AsCAN supports a variety of tasks (recognition, segmentation, class-conditional image generation) and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models. Notably, even without any computation optimization for transformer blocks, our models still yield faster inference speed than existing works featuring efficient attention mechanisms, highlighting the advantages and the value of our approach.
Computer Vision and Pattern Recognition, Machine Learning
### What problems does this paper attempt to solve?

This paper addresses key decisions in neural network architecture design, in particular how to combine convolutional neural networks (CNNs) and Transformers. The authors propose AsCAN (Asymmetric Convolution-Attention Networks), a new hybrid architecture, with the following goals:

1. **Efficient performance-latency trade-off**: By distributing convolutional blocks and Transformer blocks asymmetrically, the model achieves a better balance between performance and inference speed across tasks.
2. **Support for multiple tasks**: The architecture covers image recognition, segmentation, and class-conditional image generation, and it can be scaled to large-scale text-to-image generation.
3. **Efficient use of compute**: Even without any optimization of the attention mechanism, the model runs faster than existing works that use efficient attention, which demonstrates its efficiency on hardware.
4. **Improved training efficiency**: A multi-stage training pipeline is proposed to reduce the training cost of large-scale text-to-image diffusion models.

### Key innovation points

- **Asymmetric architecture design**: Use more convolutional blocks in the early stages and more Transformer blocks in the later stages. This asymmetric design helps capture both local features and global dependencies.
- **Applicable to multiple tasks**: It performs well on image classification and can also be applied to image generation tasks, such as text-to-image generation.
- **Efficient inference speed**: Even without optimizing the attention mechanism, the model still achieves faster inference.
- **Multi-stage training strategy**: By pre-training on a small-scale dataset first and then fine-tuning on a large-scale dataset, training efficiency is improved and the consumption of computing resources is reduced.
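
The asymmetric design described above can be illustrated with a small sketch. It only models how convolutional ("C") and Transformer ("T") blocks might be distributed across stages. The class `StageSpec`, the helper names, and all block counts are hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StageSpec:
    """One stage of a hypothetical hybrid backbone (illustrative only)."""
    num_conv: int         # number of convolutional blocks in this stage
    num_transformer: int  # number of Transformer blocks in this stage

    def layout(self) -> str:
        # Assumed ordering: convolutional blocks first, then Transformer blocks.
        return "C" * self.num_conv + "T" * self.num_transformer


def conv_fraction(stage: StageSpec) -> float:
    """Share of convolutional blocks within a stage."""
    total = stage.num_conv + stage.num_transformer
    if total == 0:
        raise ValueError("a stage needs at least one block")
    return stage.num_conv / total


def is_asymmetric(stages: List[StageSpec]) -> bool:
    """True if the conv share never increases with depth and is strictly
    lower in the last stage than in the first."""
    fractions = [conv_fraction(s) for s in stages]
    non_increasing = all(a >= b for a, b in zip(fractions, fractions[1:]))
    return non_increasing and fractions[0] > fractions[-1]


def describe(stages: List[StageSpec]) -> List[str]:
    """Per-stage layout strings, e.g. 'CCCT'."""
    return [s.layout() for s in stages]


# Hypothetical block counts; NOT the configuration used in the paper.
ASYMMETRIC = [StageSpec(4, 0), StageSpec(3, 1), StageSpec(1, 3), StageSpec(0, 4)]
# A symmetric baseline: the same conv/Transformer mix in every stage.
SYMMETRIC = [StageSpec(2, 2)] * 4

if __name__ == "__main__":
    print(describe(ASYMMETRIC))
    print(is_asymmetric(ASYMMETRIC), is_asymmetric(SYMMETRIC))
```

In this sketch the early stages are dominated by convolutional blocks and the late stages by Transformer blocks, whereas the symmetric baseline keeps the same mix at every depth. A real model would replace the layout strings with actual block modules.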
### Conclusion

By introducing the AsCAN architecture, this paper addresses the performance-latency trade-off in current hybrid architectures and shows superior performance across multiple tasks. In addition, the proposed multi-stage training strategy significantly reduces the training cost of large-scale models.