Abstract: Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and the extensive computation they require make LLM deployment the main source of carbon emissions from today's AI applications. Compared to modern GPUs such as the H$100$, serving LLMs on older GPUs such as the M$40$ would be significantly more carbon-sustainable (as shown in Figure~\ref{fig:tisser}, the M$40$ has only one third of the carbon emissions of the H$100$). However, the limited High Bandwidth Memory (HBM) available on such GPUs often cannot hold an LLM, given the gigantic model size and the intermediate activation data, which makes serving challenging. For instance, a LLaMA2 model with $70$B parameters typically requires $128$GB for inference, which substantially exceeds the $24$GB of HBM in a $3090$ GPU and remains infeasible even with an additional $64$GB of DRAM. To address this challenge, this paper proposes M2Cache, which combines mixed-precision inference built on a model modularization algorithm with multi-level caching to enable LLM inference on outdated hardware with resource constraints. (Here, precision denotes numerical precision such as FP16, INT8, and INT4.) Specifically, M2Cache first modularizes the neurons in an LLM and ranks them by importance. It then applies a dynamic sparse mixed-precision quantization mechanism in weight space to reduce computational demands and communication overhead at each decoding step, which collectively lowers the operational carbon emissions associated with LLM inference. Moreover, M2Cache introduces a three-level cache management system spanning HBM, DRAM, and SSDs that complements the dynamic sparse mixed-precision inference. To enhance communication efficiency, M2Cache maintains a neuron-level mixed-precision LRU cache in HBM, a larger layer-aware cache in DRAM, and the full model on SSD.
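
The three-level lookup described in the abstract can be illustrated with a toy sketch. This is not the M2Cache implementation: the class name `TieredNeuronCache`, the `(layer, neuron, precision)` key layout, the whole-layer DRAM fill, and the plain dictionary standing in for SSD storage are all assumptions made for illustration only.

```python
from collections import OrderedDict


class TieredNeuronCache:
    """Toy three-level lookup: HBM (LRU) -> DRAM (per layer) -> SSD (full model).

    Hypothetical sketch; names and policies are assumptions, not the paper's code.
    """

    def __init__(self, ssd_store, hbm_capacity):
        self.ssd = ssd_store          # dict keyed by (layer, neuron, precision); stands in for SSD
        self.dram = {}                # layer -> dict holding that layer's entries
        self.hbm = OrderedDict()      # LRU order: least recently used entry first
        self.hbm_capacity = hbm_capacity
        self.stats = {"hbm": 0, "dram": 0, "ssd": 0}

    def _load_layer_to_dram(self, layer):
        # Assumed "layer-aware" policy: bring the whole layer into DRAM at once.
        self.dram[layer] = {k: v for k, v in self.ssd.items() if k[0] == layer}

    def get(self, layer, neuron, precision):
        key = (layer, neuron, precision)
        if key in self.hbm:
            self.hbm.move_to_end(key)         # mark as most recently used
            self.stats["hbm"] += 1
            return self.hbm[key]
        if layer in self.dram:
            self.stats["dram"] += 1
        else:
            self._load_layer_to_dram(layer)   # slowest path: read from SSD
            self.stats["ssd"] += 1
        value = self.dram[layer][key]
        self.hbm[key] = value
        if len(self.hbm) > self.hbm_capacity:
            self.hbm.popitem(last=False)      # evict the least recently used neuron
        return value


# Usage: four INT8 neurons in layer 0, with room for only two of them in "HBM".
ssd = {(0, n, "INT8"): [float(n)] for n in range(4)}
cache = TieredNeuronCache(ssd, hbm_capacity=2)
for neuron in (0, 1, 0, 2):
    cache.get(0, neuron, "INT8")
```

In this run the first access falls through to SSD, later misses are served from DRAM, and the repeated access to neuron 0 hits HBM, so neuron 1 is the entry evicted when neuron 2 arrives.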

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching
