AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology

Rongqing Cong, Wenyang He, Mingxuan Li, Bangning Luo, Zebin Yang, Yuchao Yang, Ru Huang, Bonan Yan
2024-01-21
Abstract: Large language models (LLMs) with Transformer architectures have become phenomenal in natural language processing, multimodal generative artificial intelligence, and agent-oriented artificial intelligence. The self-attention module is the most dominating sub-structure inside Transformer-based LLMs. Computation using general-purpose graphics processing units (GPUs) inflicts reckless demand for I/O bandwidth for transferring intermediate calculation results between memories and processing units. To tackle this challenge, this work develops a fully customized vanilla self-attention accelerator, AttentionLego, as the basic building block for constructing spatially expandable LLM processors. AttentionLego provides basic implementation with fully-customized digital logic incorporating Processing-In-Memory (PIM) technology. It is based on PIM-based matrix-vector multiplication and look-up table-based Softmax design. The open-source code is available online:
Hardware Architecture, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the high I/O bandwidth demand of the self-attention module in large language models (LLMs). On general-purpose graphics processing units (GPUs), intermediate results are transferred repeatedly between memory and processing units. These transfers consume I/O bandwidth, limit computational efficiency, and increase energy consumption. To address this, the authors develop AttentionLego, a fully customized vanilla self-attention accelerator intended as the basic building block for spatially scalable LLM processors. AttentionLego uses Processing-In-Memory (PIM) technology to compute on data where it is stored, reducing transfers between memory and processing units with the aim of improving computational efficiency.

### Key technical points:

1. **Processing-In-Memory (PIM) technology**: By integrating processing units and memory on the same physical chip, PIM avoids most data transfers between processor and memory, which reduces latency.
2. **Matrix-vector multiplication and lookup-table-based Softmax design**: AttentionLego performs matrix-vector multiplication with PIM and implements the Softmax function with a lookup table.
3. **Modular design**: AttentionLego is designed to be stacked, so that it can be adapted to language models of different scales (spatial scalability).

### Core components of the solution:

- **Input Process module**: Calculates \(XW_Q\), \(XW_K\), and \(XW_V\).
- **Score module**: Calculates \(QK^\top\).
- **Softmax module**: Executes the Softmax non-linear function.
- **Direct Memory Access module (DMA module)**: Controls data transfers between modules and external storage.
- **Top Controller**: Manages and coordinates communication, data flow, and functional operations between the modules inside the chip.

With these components, AttentionLego aims to provide an efficient, energy-saving building block for LLM accelerators, in support of natural language processing, multimodal generative artificial intelligence, and related fields.
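
The data flow through the Input Process, Score, and Softmax modules can be sketched in plain Python as follows. This is a minimal software illustration, not the AttentionLego hardware design. The function names, the 256-entry table, the \([-8, 0]\) table range, the \(1/\sqrt{d_k}\) scaling, and the final multiplication by \(V\) are assumptions taken from standard vanilla self-attention, not from the paper's source code.

```python
import math

# Hypothetical sketch: all names and table parameters are illustrative
# assumptions, not taken from the AttentionLego code base.


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out).

    In a PIM design, W would stay resident in the memory array and only
    x and the result would move; here it is an ordinary software loop.
    """
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def build_exp_lut(entries=256, x_min=-8.0):
    """Precompute exp(x) on a uniform grid over [x_min, 0]."""
    step = -x_min / (entries - 1)
    table = [math.exp(x_min + k * step) for k in range(entries)]
    return table, x_min, step


def lut_softmax(scores, table, x_min, step):
    """Softmax whose exponentials are read from the lookup table."""
    peak = max(scores)
    exps = []
    for s in scores:
        z = max(s - peak, x_min)          # shifted input is <= 0; clamp to table range
        idx = round((z - x_min) / step)   # nearest table entry
        exps.append(table[idx])
    total = sum(exps)
    return [e / total for e in exps]


def attention(X, W_q, W_k, W_v):
    """Vanilla self-attention over the rows of X."""
    # Input Process step: Q = X W_Q, K = X W_K, V = X W_V
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    V = [matvec(x, W_v) for x in X]
    d_k = len(K[0])
    table, x_min, step = build_exp_lut()
    out = []
    for q in Q:
        # Score step: one row of Q K^T, with the standard 1/sqrt(d_k) scaling
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Softmax step: table lookup instead of evaluating exp directly
        probs = lut_softmax(scores, table, x_min, step)
        # Weighted sum of the value rows (assumed final step)
        out.append([sum(probs[t] * V[t][j] for t in range(len(V)))
                    for j in range(len(V[0]))])
    return out
```

Calling `attention(X, W_q, W_k, W_v)` with `X` as a list of token rows and the weights as nested lists returns one output row per token. The table resolution trades accuracy for storage: a coarser grid needs fewer entries but quantizes the exponentials more heavily.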