SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors

Mariam Rakka, Jinhao Li, Guohao Dai, Ahmed Eltawil, Mohammed E. Fouda, Fadi Kurdahi
2024-11-27
Abstract: Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.
Hardware Architecture, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper addresses the computational and memory overheads that large language models (LLMs) face when deployed on resource-constrained devices, in particular the bottleneck posed by the Softmax operation. Specifically:

1. **Computational and Memory Overhead**: Despite progress in compression techniques, non-linear operators such as Softmax and LayerNorm remain bottlenecks because they are highly sensitive to quantization. This makes it difficult to deploy LLMs on resource-constrained edge devices.
2. **Bottleneck of Softmax Operation**: Softmax is part of the attention mechanism and accounts for a large share of the running time on long-sequence inputs. For example, in the Llama2-7b model with a sequence length of 16384, Softmax takes up as much as 38% of the running time. Moreover, once GEMM-based operations have been accelerated, non-GEMM operations (including Softmax) become the main bottleneck in execution time.
3. **Quantization Challenges**: Although quantization reduces memory usage and accelerates computation, Softmax is highly sensitive to it, particularly with respect to activation quantization. Existing research lacks both an integer-only low-precision Softmax approximation that preserves the inference performance of LLMs and efficient hardware customized for such integer computation.

To solve these problems, the authors propose **SoftmAP**, a software-hardware co-design method that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. This method significantly improves energy efficiency without sacrificing performance, making the deployment of LLMs on resource-constrained devices more feasible.

### Main Contributions

1. **Precision Sensitivity Analysis**: The paper presents the first precision sensitivity analysis of integer-only low-precision Softmax approximation and identifies the optimal mixed low-precision implementation that does not degrade the perplexity of LLMs.
2. **Acceleration Mapping**: A mapping is proposed that accelerates the integer-only low-precision Softmax, at the optimal mixed precision, on associative processors (APs).
3. **Performance Evaluation**: The energy and latency of the integer-only low-precision Softmax for the Llama2-7b, Llama2-13b, and Llama2-70b models are evaluated on the AP, an RTX3090 GPU, and an A100 GPU. The results show that the AP achieves significant improvements in energy and latency over the GPUs.

Through these contributions, SoftmAP provides an effective approach to the problem of deploying LLMs on resource-constrained devices.
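To make "integer-only Softmax" concrete, here is a minimal Python sketch. The summary does not say which approximation SoftmAP uses, so the sketch borrows the second-order polynomial fit of the exponential from I-BERT as a stand-in. The function names, the 8-bit output width, and the input scale of 1/16 are all illustrative assumptions, not details taken from the paper.

```python
import math

# Float coefficients of the fit exp(p) ~= A * (p + B)**2 + C on (-ln 2, 0],
# as published for I-BERT. Using them here is an assumption of this sketch.
A, B, C = 0.3585, 1.353, 0.344


def make_constants(scale):
    """Offline step: fold the float coefficients into integers.

    `scale` is the real value of one integer step of the input logits.
    Floating point is used only here, never on the per-element path.
    """
    q_ln2 = int(math.floor(math.log(2) / scale))
    q_b = int(math.floor(B / scale))
    q_c = int(math.floor(C / (A * scale * scale)))
    if q_ln2 <= 0:
        raise ValueError("scale must be smaller than ln(2)")
    return q_ln2, q_b, q_c


def int_softmax(q, scale, out_bits=8):
    """Approximate softmax over integer logits `q` (real value = q[i] * scale).

    Returns integers in [0, 2**out_bits - 1]; dividing by 2**out_bits - 1
    gives approximate probabilities. The runtime path uses only integer
    add, multiply, shift, and divide.
    """
    q_ln2, q_b, q_c = make_constants(scale)
    q_max = max(q)
    exps = []
    for v in q:
        d = v - q_max                  # shifted logit, always <= 0
        z = (-d) // q_ln2              # how many times to halve the result
        p = d + z * q_ln2              # remainder, in (-q_ln2, 0]
        poly = (p + q_b) ** 2 + q_c    # proportional to exp(p * scale)
        exps.append(poly >> z)         # exp(d) = exp(p) * 2**(-z)
    total = sum(exps)
    top = (1 << out_bits) - 1
    return [(e * top) // total for e in exps]


if __name__ == "__main__":
    # Real-valued logits 1.0, 2.0, 3.0 quantized with scale 1/16.
    print(int_softmax([16, 32, 48], scale=1 / 16))
```

The common factor `A * scale**2` cancels between numerator and denominator, which is why it never has to be applied at runtime. A hardware mapping would also need to bound the shift amount `z` and choose bit widths for each intermediate value; choices of that kind are what a precision sensitivity analysis would inform, and this sketch leaves them unconstrained.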