HyCTor: A Hybrid CNN-Transformer Network Accelerator with Flexible Weight/Output Stationary Dataflow and Multi-Core Extension
Shuai Yuan, Weifeng He, Zhenhua Zhu, Fangxin Liu, Zhuoran Song, Guohao Dai, Guanghui He, Yanan Sun
DOI: https://doi.org/10.1109/tcad.2024.3490173
IF: 2.9
2024-01-01
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: Hybrid convolutional neural network (CNN) and Transformer networks are emerging in computer vision, combining convolutional, linear, and attention layers to achieve high accuracy with moderate model sizes. Accelerators for hybrid networks must simultaneously optimize the static matrix multiplication (MM) in convolutional and linear layers and the dynamic MM in attention layers. However, existing accelerators are primarily designed for either CNNs or Transformers, which increases data movement when supporting dynamic MM and can leave hardware under-utilized for static MM. To enhance computational performance and energy efficiency for hybrid networks, we propose HyCTor, an accelerator featuring flexible output-stationary (OS) and weight-stationary (WS) dataflows, along with a multi-core extension for higher throughput. The parallel array of HyCTor supports inter-layer slicing and intra-layer splicing to improve utilization for static MM, and enables seamless switching between the OS and WS dataflows to minimize data movement in dynamic MM. By leveraging structured sparsity in the OS dataflow and unstructured sparsity in the WS dataflow, computational efficiency is further boosted for each layer through flexible dataflow selection based on the sparsity ratio. In addition, a novel QuadLoop-Mesh topology is proposed to address the complex data dependencies in hybrid networks and minimize data transmission distances in the multi-core HyCTor. Experimental results on ResNet-18, ViTB, and TransIAR-AF show that the proposed single-core HyCTor achieves speedups of 1.83×, 1.65×, and 2.41× over state-of-the-art (SOTA) accelerators, with a 100% utilization rate in most layers, and speedups of 3.82× to 38.5× over an RTX 4090 GPU. The energy efficiency of HyCTor is improved by 1.81× to 8.77× compared with SOTA accelerators. Moreover, the 4-core HyCTor achieves speedups of 3.32×, 2.58×, and 2.91×, while the 16-core HyCTor achieves speedups of 7.05×, 4.05×, and 9.64×, compared to the 1-core HyCTor on the three networks.
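
For readers unfamiliar with the two dataflows named in the abstract, the following is a minimal software sketch of the generic, textbook meaning of output-stationary and weight-stationary loop orderings for a matrix multiplication. It is not HyCTor's hardware implementation, and it does not model sparsity, the parallel array, or dataflow switching. The function names are hypothetical and chosen only for illustration.

```python
def matmul_output_stationary(a, b):
    """Output-stationary ordering (illustrative assumption, not HyCTor's design).

    Each output element is accumulated in place until it is complete,
    so partial sums never leave the local accumulator.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # partial sum stays local for the whole reduction
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc  # each output is written exactly once
    return c


def matmul_weight_stationary(a, b):
    """Weight-stationary ordering (illustrative assumption, not HyCTor's design).

    Each weight is fetched once and reused against every input row,
    while partial sums are updated repeatedly in the output buffer.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for p in range(k):
        for j in range(n):
            w = b[p][j]  # weight held fixed while inputs stream past it
            for i in range(m):
                c[i][j] += a[i][p] * w
    return c
```

Both orderings produce the same result; they differ only in which operand stays resident and which data must move. That trade-off is the reason an accelerator might choose a dataflow per layer.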