An FPGA-Based Reconfigurable Accelerator for Convolution-Transformer Hybrid EfficientViT

Haikuo Shao, Huihong Shi, Wendong Mao, Zhongfeng Wang
2024-03-29
Abstract: Vision Transformers (ViTs) have achieved significant success in computer vision. However, their intensive computations and massive memory footprint challenge ViTs' deployment on embedded devices, calling for efficient ViTs. Among them, EfficientViT, the state-of-the-art one, features a Convolution-Transformer hybrid architecture, enhancing both accuracy and hardware efficiency. Unfortunately, existing accelerators cannot fully exploit the hardware benefits of EfficientViT due to its unique architecture. In this paper, we propose an FPGA-based accelerator for EfficientViT to advance the hardware efficiency frontier of ViTs. Specifically, we design a reconfigurable architecture to efficiently support various operation types, including lightweight convolutions and attention, boosting hardware utilization. Additionally, we present a time-multiplexed and pipelined dataflow to facilitate both intra- and inter-layer fusions, reducing off-chip data access costs. Experimental results show that our accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency at 200 MHz on the Xilinx ZCU102 FPGA, which significantly outperforms prior works.
Hardware Architecture, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to deploy Vision Transformers (ViTs) more efficiently on embedded devices, focusing on EfficientViT, a state-of-the-art efficient ViT model. EfficientViT combines convolution and Transformer components to improve both accuracy and hardware efficiency. However, existing accelerators cannot fully exploit its hardware advantages because of its unique architecture. The paper proposes a reconfigurable accelerator built on a Field-Programmable Gate Array (FPGA) to improve the hardware efficiency of ViTs. The design uses a reconfigurable architecture that efficiently supports various operation types, including lightweight convolutions and attention, to improve hardware utilization. In addition, the paper presents a time-multiplexed and pipelined dataflow that fuses computation both within and across layers, reducing the cost of off-chip data access. Experimental results show that the accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency at 200 MHz on the Xilinx ZCU102 FPGA, outperforming prior works. With these optimizations, the accelerator can execute EfficientViT efficiently, addressing the challenge of deploying it on resource-constrained devices.
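To put the reported figures in perspective, the short Python sketch below derives two quantities that the abstract does not state directly: the power draw implied by the throughput and energy-efficiency numbers, and the average number of operations completed per clock cycle at 200 MHz. It assumes that the peak throughput and peak energy efficiency were measured at the same operating point, which the abstract does not confirm, so the results (roughly 7.4 W and about 3901 operations per cycle) are back-of-the-envelope estimates rather than reported measurements. The function names are illustrative only.

```python
# Figures quoted in the abstract (both are "up to" peak values).
THROUGHPUT_GOPS = 780.2          # giga-operations per second
EFFICIENCY_GOPS_PER_W = 105.1    # giga-operations per second per watt
CLOCK_MHZ = 200.0                # clock frequency on the ZCU102


def implied_power_watts(throughput_gops: float, efficiency_gops_per_w: float) -> float:
    """Power implied by throughput / energy efficiency.

    Assumption: both figures refer to the same operating point.
    """
    return throughput_gops / efficiency_gops_per_w


def ops_per_cycle(throughput_gops: float, clock_mhz: float) -> float:
    """Average operations completed per clock cycle at the given frequency."""
    ops_per_second = throughput_gops * 1e9
    cycles_per_second = clock_mhz * 1e6
    return ops_per_second / cycles_per_second


if __name__ == "__main__":
    power = implied_power_watts(THROUGHPUT_GOPS, EFFICIENCY_GOPS_PER_W)
    per_cycle = ops_per_cycle(THROUGHPUT_GOPS, CLOCK_MHZ)
    print(f"Implied power: {power:.2f} W")
    print(f"Operations per cycle: {per_cycle:.0f}")
```

The per-cycle figure gives a rough sense of how much parallel arithmetic the design must sustain; how those operations map onto the FPGA's resources depends on details the abstract does not provide.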