Abstract: Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and LayerNorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.
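
For reference, the energy-delay product (EDP) cited in the abstract is conventionally defined as energy consumed multiplied by execution time, so the headline result can be read as the ratio below. The symbols $E$ and $t$ are notation introduced here for illustration and are not taken from the paper.

$$
\mathrm{EDP} = E \cdot t, \qquad \frac{\mathrm{EDP}_{\mathrm{GPU}}}{\mathrm{EDP}_{\mathrm{AP}}} \approx 10^{3} \ \text{(best reported case)}
$$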

What problem does this paper attempt to address?

This paper addresses the computational and memory overheads that large language models (LLMs) incur when deployed on resource-constrained devices, with a particular focus on the Softmax bottleneck. Specifically:

1. **Computational and Memory Overhead**: Despite progress in compression techniques, non-linear operators such as Softmax and LayerNorm remain bottlenecks because they are very sensitive to quantization. This makes it difficult to deploy LLMs on resource-constrained edge devices.
2. **Bottleneck of the Softmax Operation**: Softmax is part of the attention mechanism and accounts for a large share of running time on long-sequence inputs. For example, in the Llama2-7b model at a sequence length of 16384, Softmax takes up as much as 38% of the running time. Moreover, once GEMM-based operations are accelerated, non-GEMM operations (including Softmax) become the main bottleneck in execution time.
3. **Quantization Challenges**: Although quantization reduces memory usage and accelerates computation, Softmax is very sensitive to it, particularly where activations are quantized. Existing research lacks an integer low-precision Softmax approximation that preserves LLM inference performance, as well as efficient hardware customized for such integer computation.

To solve these problems, the authors propose **SoftmAP**, a software-hardware co-design method that implements integer low-precision Softmax using In-Memory Compute (IMC) hardware. This method can significantly improve energy efficiency without sacrificing performance, making the deployment of LLMs on resource-constrained devices more feasible.

### Main Contributions

1. **Precision Sensitivity Analysis**: For the first time, a precision sensitivity analysis of integer low-precision Softmax approximation was carried out, and the best mixed low-precision implementation that does not affect the perplexity of LLMs was determined.
2. **Acceleration Mapping**: A mapping method was proposed to accelerate the integer low-precision Softmax, at that mixed precision, on associative processors (APs).
3. **Performance Evaluation**: The energy and latency of integer low-precision Softmax for the Llama2-7b, Llama2-13b and Llama2-70b models were evaluated on the AP, the RTX3090 GPU and the A100 GPU. The results show that the AP achieves significant improvements in energy and latency over the GPUs.

Through these contributions, SoftmAP provides an effective method for addressing the deployment problems of LLMs on resource-constrained devices.
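
To make the idea of an integer-only Softmax concrete, here is a minimal Python sketch. It is not SoftmAP's algorithm or its AP mapping, which this summary does not specify; it assumes a generic I-BERT-style scheme in which the exponential is split into a power of two (a right shift) and a second-order polynomial on the remainder. The function name `int_softmax`, the polynomial constants, and the 8-bit output width are all illustrative assumptions.

```python
import math


def int_softmax(q, scale, out_bits=8):
    """Integer-only softmax sketch over a list of quantized logits.

    q       : list of ints; the real logit is q[i] * scale
    scale   : float quantization step, used only to precompute constants
    returns : list of ints in [0, 2**out_bits - 1];
              probability ~= out[i] / (2**out_bits - 1)
    """
    # Offline constants (assumed values for a 2nd-order fit of exp on (-ln2, 0]).
    # Floating point appears only here, never in the per-element loop.
    a, b, c = 0.3585, 1.353, 0.344
    q_ln2 = int(math.log(2) / scale)      # ln(2) expressed in integer steps
    q_b = int(b / scale)
    q_c = int(c / (a * scale * scale))
    if q_ln2 < 1:
        raise ValueError("scale must be smaller than ln(2)")

    q_max = max(q)
    exps = []
    for v in q:
        d = q_max - v                 # non-negative distance from the maximum
        z = d // q_ln2                # exp(-d) = 2**(-z) * exp(r), r in (-ln2, 0]
        r = -(d - z * q_ln2)          # integer remainder, in (-q_ln2, 0]
        poly = (r + q_b) ** 2 + q_c   # integer polynomial approximating exp(r)
        exps.append(poly >> z)        # multiply by 2**(-z) with a right shift
    total = sum(exps)                 # > 0: the maximum element always contributes
    top = (1 << out_bits) - 1
    return [(e * top) // total for e in exps]


# Usage: logits quantized with a step of 0.01, output as 8-bit probabilities.
quantized_logits = [-3, 120, 75, 0, -400, 119]
probs = [p / 255 for p in int_softmax(quantized_logits, 0.01)]
```

The per-element work uses only integer add, multiply, shift and divide, which is the property that makes this family of approximations attractive for integer-only hardware. The bit widths of the intermediate values are left unconstrained here, whereas a precision sensitivity analysis like the one described above would fix them per stage.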

SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors
