MARCA: Mamba Accelerator with ReConfigurable Architecture

Jinhao Li, Shan Huang, Jiaming Xu, Jun Liu, Li Ding, Ningyi Xu, Guohao Dai
2024-09-16
Abstract: We propose MARCA, a Mamba accelerator with a reconfigurable architecture, built on three novel approaches. (1) A reduction-alternative PE array architecture for both linear and element-wise operations. For linear operations, the reduction tree connected to the PE arrays is enabled and executes the reduction operation. For element-wise operations, the reduction tree is disabled and the output bypasses it. (2) A reusable nonlinear function unit based on the reconfigurable PE. We decompose the exponential function into element-wise operations and a shift operation using a fast biased exponential algorithm, and the activation function (SiLU) into a range detection and element-wise operations using a piecewise approximation algorithm. Thus, the reconfigurable PEs are reused to execute nonlinear functions with negligible accuracy loss. (3) An intra-operation and inter-operation buffer management strategy. We propose an intra-operation buffer management strategy to maximize input data sharing for linear operations within operations, and an inter-operation strategy for element-wise operations between operations. We conduct extensive experiments on Mamba model families of different sizes. MARCA achieves up to 463.22$\times$/11.66$\times$ speedup and up to 9761.42$\times$/242.52$\times$ energy efficiency compared to Intel Xeon 8358P CPU and NVIDIA Tesla A100 GPU implementations, respectively.
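The reduction-alternative PE array in approach (1) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the names `pe_row` and `reduction_tree`, the `reduce` flag, and the pairwise adder-tree layout are all hypothetical choices made for illustration.

```python
from typing import List, Union

Number = Union[int, float]


def reduction_tree(values: List[Number]) -> Number:
    """Hypothetical pairwise adder tree: sums adjacent pairs level by level."""
    if not values:
        return 0.0
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:          # pad odd-sized levels with a zero operand
            level.append(0.0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def pe_row(a: List[Number], b: List[Number], reduce: bool):
    """Hypothetical row of PEs; each PE multiplies one pair of operands.

    reduce=True  -> linear mode: the tree sums the lane outputs (a dot product).
    reduce=False -> element-wise mode: the tree is bypassed and every lane
                    output is returned unchanged.
    """
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same length")
    lane_out = [x * y for x, y in zip(a, b)]
    return reduction_tree(lane_out) if reduce else lane_out


dot = pe_row([1, 2, 3, 4], [5, 6, 7, 8], reduce=True)    # linear mode
prod = pe_row([1, 2, 3, 4], [5, 6, 7, 8], reduce=False)  # element-wise mode
```

In this model the same multipliers serve both modes, and only the path after them changes. That is the property the abstract attributes to the reconfigurable array.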
Hardware Architecture, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses three challenges encountered when accelerating Mamba computation:

1. **Incompatibility between element-wise operations and Tensor Cores**: Linear operations (such as matrix multiplication) and element-wise operations are the two main types of operations in Mamba. As the sequence length increases, the share of time spent on element-wise operations rises significantly (for example, it exceeds 60% when the input length is 2048). Because these operations require no reduction, they are a poor fit for existing Tensor Core based architectures (for example, the speed drops to only 1/16).
2. **Nonlinear function units occupy a large area**: Even optimized nonlinear function units (such as exponential function units) still occupy more than 30% of the area of a processing element (PE), which results in a large area overhead.
3. **Element-wise operations require many memory accesses but offer limited data sharing**: In Mamba, linear and element-wise operations differ greatly in computational intensity (by nearly 3 orders of magnitude) and in read-write ratio (by more than 3 orders of magnitude). Because element-wise operations share little data, existing methods (such as blocking) are ineffective for them.

In response to these challenges, the authors propose MARCA, a Mamba accelerator with a reconfigurable architecture, together with three methods:

1. **Reduction-alternative PE array architecture supporting linear and element-wise operations**: For linear operations, the reduction tree connected to the PE array is enabled to perform the reduction; for element-wise operations, the reduction tree is disabled and the results bypass it and are output directly.
2. **Reusable nonlinear function units based on the reconfigurable PE**: The exponential function is decomposed into element-wise operations and a shift operation through the fast biased exponential algorithm, and the activation function (SiLU) is decomposed into a range detection and element-wise operations through the piecewise approximation algorithm. In this way, the reconfigurable PEs can execute nonlinear functions with almost no loss of precision.
3. **Intra-operation and inter-operation buffer management strategies**: An intra-operation buffer management strategy maximizes input data sharing in linear operations, and an inter-operation buffer management strategy maximizes output data sharing in element-wise operations.

Through these methods, MARCA achieves significant performance and energy-efficiency improvements on Mamba models of different scales.
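The decomposition of nonlinear functions described above can be sketched in software. The source does not give the exact algorithms, so both functions below are stand-ins chosen as assumptions: `fast_exp` follows the well-known Schraudolph-style trick of writing a scaled, biased value into the exponent field of an IEEE 754 float32, and `hard_silu` uses the hard-swish form as one example of a piecewise approximation. The constant `486411`, the segment boundaries at ±3, and all function names are illustrative and are not taken from the paper.

```python
import math
import struct

LOG2_E = 1.0 / math.log(2.0)
# float32 exponent bias (127) minus a small correction that centres the error
# (assumed constant from the Schraudolph-style method, not from the paper).
BIAS = 127 - 486411 / (1 << 23)


def fast_exp(x: float) -> float:
    """Approximate e**x with one multiply-add and one shift."""
    t = x * LOG2_E + BIAS             # element-wise multiply-add
    t = min(max(t, 0.0), 254.99)      # keep the result a finite float32
    bits = int(t * (1 << 23))         # shift left by 23 bits into the exponent field
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def hard_silu(x: float) -> float:
    """Piecewise stand-in for SiLU(x) = x * sigmoid(x) (hard-swish form)."""
    if x <= -3.0:                     # range detection: saturated low
        return 0.0
    if x >= 3.0:                      # range detection: saturated high
        return x
    return x * (x + 3.0) / 6.0        # element-wise multiply-add in the middle


def silu(x: float) -> float:
    """Reference SiLU, used only for comparison."""
    return x / (1.0 + math.exp(-x))


for v in (-2.0, 0.0, 1.0, 4.0):
    print(v, fast_exp(v), math.exp(v), hard_silu(v), silu(v))
```

Both stand-ins use only comparisons, multiplies, adds, and a shift, which is why such functions can in principle be mapped onto the same PEs that perform element-wise operations. They are deliberately coarse: this `fast_exp` has a relative error of a few percent and `hard_silu` deviates from SiLU by up to about 0.14 near ±3, so they should not be taken as representative of the accuracy the paper reports.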