CoDA: A Co-Design Framework for Versatile and Efficient Attention Accelerators

Wenjie Li, Aokun Hu, Ningyi Xu, Guanghui He
DOI: https://doi.org/10.1109/tc.2024.3398488
IF: 3.183
2024-01-01
IEEE Transactions on Computers
Abstract: As a primary component of Transformers, the attention mechanism suffers from quadratic computational complexity. To achieve efficient implementations, hardware accelerator designs for attention have attracted considerable research interest. However, most existing accelerators support only a single type of application and a single type of attention, making it difficult to meet the demands of diverse application scenarios. Additionally, they mainly focus on dynamic pruning of attention matrices, which requires deploying pre-processing units and thereby reduces overall hardware efficiency. This paper presents CoDA, an algorithm, dataflow, and architecture co-design framework for versatile and efficient attention accelerators. The designed accelerator supports both NLP and CV applications and can be configured into a mode supporting either low-rank attention or low-rank plus sparse attention. We apply algorithmic transformations to low-rank attention to significantly reduce computational complexity. To prevent these transformations from increasing storage overhead, we carefully design the dataflows and adopt a block-wise fashion. Down-scaling softmax is further supported by architecture and dataflow co-design. Moreover, we propose a softmax sharing strategy to reduce the area cost. Our experimental results demonstrate that the proposed accelerator outperforms state-of-the-art designs in terms of throughput, area efficiency, and energy efficiency.