Abstract: Large language models (LLMs) with Transformer architectures have become dominant in natural language processing, multimodal generative artificial intelligence, and agent-oriented artificial intelligence. The self-attention module is the dominant sub-structure inside Transformer-based LLMs. Computing it on general-purpose graphics processing units (GPUs) places a heavy demand on I/O bandwidth for transferring intermediate calculation results between memories and processing units. To tackle this challenge, this work develops a fully customized vanilla self-attention accelerator, AttentionLego, as the basic building block for constructing spatially expandable LLM processors. AttentionLego provides a basic implementation in fully customized digital logic incorporating Processing-In-Memory (PIM) technology. It is based on PIM-based matrix-vector multiplication and a look-up table-based Softmax design. The open-source code is available online:

What problem does this paper attempt to address?

The problem this paper attempts to solve is the high I/O bandwidth requirement of the self-attention module in large language models (LLMs). Specifically, when self-attention runs on general-purpose graphics processing units (GPUs), intermediate results are transferred frequently between memory and processing units. This consumes substantial I/O bandwidth, which limits computational efficiency and increases energy consumption. To address this challenge, the authors develop a fully customized vanilla self-attention accelerator, AttentionLego, as the basic building block for constructing spatially scalable LLM processors. By incorporating Processing-In-Memory (PIM) technology, AttentionLego processes data directly in memory, reducing data transfer between memory and processing units with the aim of improving computational efficiency.

### Key technical points:

1. **Processing-In-Memory (PIM) technology**: By placing computation inside or next to the memory arrays on the same chip, PIM reduces data transfers between processor and memory, which lowers latency and energy cost.
2. **Matrix-vector multiplication and lookup-table-based Softmax design**: AttentionLego uses PIM for matrix-vector multiplication and implements the Softmax function with a lookup table instead of computing exponentials directly.
3. **Modular design**: AttentionLego is designed to be stacked so that it can adapt to language models of different scales, supporting spatial scalability.

### Core components of the solution:

- **Input Process module**: Calculates \(XW_Q\), \(XW_K\), and \(XW_V\).
- **Score module**: Calculates \(QK^\top\).
- **Softmax module**: Executes the Softmax non-linear activation function.
- **Direct Memory Access module (DMA module)**: Controls data transfers between the modules and external storage.
- **Top Controller**: Manages and coordinates communication, data flow, and functional operations between the modules inside the chip.

Through these means, AttentionLego aims to provide an efficient and energy-saving basic building block for constructing LLM accelerators, in support of natural language processing, multimodal generative artificial intelligence, and related fields.
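The dataflow through the Input Process, Score, and Softmax modules can be illustrated in software. The following is a minimal sketch, not the authors' hardware design: the table size (`LUT_BITS`), the table range (`LUT_RANGE`), the \(1/\sqrt{d_k}\) scaling, and all function names are assumptions chosen for illustration, and the final weighted sum over \(V\) is the standard last step of vanilla self-attention rather than a module named in the summary above.

```python
import math

# Hypothetical parameters for illustration only; the summary does not
# state the table size or number format used by AttentionLego.
LUT_BITS = 8                      # index width of the exponent table
LUT_RANGE = 8.0                   # table covers exp(-x) for x in [0, LUT_RANGE]
LUT_SIZE = 1 << LUT_BITS
EXP_LUT = [math.exp(-i * LUT_RANGE / (LUT_SIZE - 1)) for i in range(LUT_SIZE)]


def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def lut_exp_neg(x):
    """Approximate exp(-x) for x >= 0 with a nearest-entry table lookup."""
    x = min(max(x, 0.0), LUT_RANGE)            # clamp to the table range
    idx = round(x * (LUT_SIZE - 1) / LUT_RANGE)
    return EXP_LUT[idx]


def lut_softmax(scores):
    """Softmax using the lookup table in place of exp()."""
    m = max(scores)                            # subtract the max so every input is >= 0
    exps = [lut_exp_neg(m - s) for s in scores]
    total = sum(exps)                          # >= 1.0, because the max maps to exp(0)
    return [e / total for e in exps]


def attention(X, Wq, Wk, Wv):
    """Single-head vanilla self-attention over a list of token vectors X."""
    # Input Process: one matrix-vector product per token and per projection.
    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    scale = 1.0 / math.sqrt(len(Q[0]))         # assumed standard 1/sqrt(d_k) scaling
    out = []
    for q in Q:
        # Score: one row of Q K^T.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Softmax: table-based normalization of that row.
        probs = lut_softmax(scores)
        # Standard final step: probability-weighted sum of the value vectors.
        out.append([sum(p * v[j] for p, v in zip(probs, V)) for j in range(len(V[0]))])
    return out
```

For example, `attention([[1.0, 0.0], [0.0, 1.0]], I, I, I)` with `I = [[1.0, 0.0], [0.0, 1.0]]` returns one output vector per input token. Processing one query row at a time mirrors the matrix-vector (rather than matrix-matrix) formulation mentioned in the abstract; a larger table would reduce the approximation error of `lut_softmax` at the cost of more storage.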

AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology

Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition

Efficient and Economic Large Language Model Inference with Attention Offloading

Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

LoMA: Lossless Compressed Memory Attention

Efficient Memory Management for Large Language Model Serving with PagedAttention

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

HAIMA: A Hybrid SRAM and DRAM Accelerator-in-Memory Architecture for Transformer

Efficient LLM inference solution on Intel GPU

Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine

16.2 A 28nm 53.8TOPS/W 8b Sparse Transformer Accelerator with In-Memory Butterfly Zero Skipper for Unstructured-Pruned NN and CIM-Based Local-Attention-Reusable Engine

PIM-AI: A Novel Architecture for High-Efficiency LLM Inference

Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors

Gated Linear Attention Transformers with Hardware-Efficient Training

HARDSEA: Hybrid Analog-ReRAM Clustering and Digital-SRAM In-Memory Computing Accelerator for Dynamic Sparse Self-Attention in Transformer

ReTransformer

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning