Abstract: Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and the extensive computation they require make LLM deployment the main source of carbon emissions from today's AI applications. Compared with modern GPUs such as the H100, it would be significantly more carbon-sustainable to serve LLMs on older GPUs such as the M40 (as shown in Figure 1, the M40 has only one third of the H100's carbon emissions). However, the limited High Bandwidth Memory (HBM) available on such GPUs often cannot hold an LLM because of the gigantic model size and intermediate activation data, which makes serving challenging. For instance, a LLaMA2 model with 70B parameters typically requires 128GB for inference, which substantially exceeds the 24GB of HBM in a 3090 GPU and remains infeasible even with an additional 64GB of DRAM. To address this challenge, this paper proposes mixed-precision inference with a model modularization algorithm and multi-level caching (M2Cache) to enable LLM inference on outdated, resource-constrained hardware. (Here, precision denotes numerical precision such as FP16, INT8, or INT4.) Specifically, M2Cache first modularizes the neurons in an LLM and creates their importance ranking. Then, it adopts a dynamic sparse mixed-precision quantization mechanism in weight space to reduce computational demands and communication overhead at each decoding step. Together, these lower the operational carbon emissions associated with LLM inference. Moreover, M2Cache introduces a three-level cache management system with HBM, DRAM, and SSDs that complements the dynamic sparse mixed-precision inference. To enhance communication efficiency, M2Cache maintains a neuron-level mixed-precision LRU cache in HBM, a larger layer-aware cache in DRAM, and the full model on SSD.
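The memory gap described in the abstract can be illustrated with back-of-the-envelope arithmetic. The sketch below is not from the paper: it counts weights only (no activations or KV cache), assumes 2, 1, and 0.5 bytes per parameter for FP16, INT8, and INT4, and uses hypothetical helper names. Under these assumptions, FP16 weights for a 70B-parameter model land in the same range as the 128GB figure quoted above; the exact number depends on what is counted.

```python
# Assumed storage cost per parameter for each numerical precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

# Capacities quoted in the abstract (3090 GPU memory and host DRAM).
HBM_GB = 24
DRAM_GB = 64


def weight_footprint_gb(num_params, precision):
    """Weight-only footprint in GB (10^9 bytes); ignores activations and KV cache."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits(num_params, precision, budget_gb):
    """True if the weights alone fit within the given memory budget."""
    return weight_footprint_gb(num_params, precision) <= budget_gb


if __name__ == "__main__":
    params = 70e9  # LLaMA2 with 70B parameters
    for precision in BYTES_PER_PARAM:
        size = weight_footprint_gb(params, precision)
        print(
            f"{precision}: {size:.0f} GB | "
            f"fits in HBM: {fits(params, precision, HBM_GB)} | "
            f"fits in HBM+DRAM: {fits(params, precision, HBM_GB + DRAM_GB)}"
        )
```

Even at a uniform INT4, the weights alone exceed the 24GB of GPU memory, which is why the abstract combines quantization with sparsity and offloading to DRAM and SSD rather than relying on quantization by itself.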

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper targets the high carbon emissions and the resource limitations that large language models (LLMs) face during inference. It focuses on three key issues:

1. **High carbon emissions**:
   - As the number of parameters in LLMs grows, the carbon emissions from deploying and running them grow significantly. For example, the carbon emissions of modern high-performance GPUs such as the H100 are much higher than those of older GPUs such as the M40.
   - Using older GPUs for LLM inference can significantly reduce carbon emissions, because the carbon footprint of manufacturing and operating older GPUs is lower.
2. **Resource limitations**:
   - The high-bandwidth memory (HBM) capacity of older GPUs (such as the M40 and 3090) is limited and cannot support inference with large-scale LLMs. For example, a LLaMA2 model with 70B parameters requires 128GB of memory, while the 3090 GPU has only 24GB of HBM.
   - With insufficient HBM, large-scale models and their intermediate activation data cannot be loaded, which makes inference difficult.
3. **Performance and efficiency**:
   - Maintaining relatively high inference speed and accuracy on resource-constrained older GPUs is a challenge.
   - Existing methods such as pruning, quantization, and key-value cache (KV cache) optimization partially alleviate the HBM limitation, but they can compress parameters too aggressively and hurt model accuracy.

### Solutions

To solve these problems, the paper proposes a mixed-precision, multi-level cache system (M2Cache) built on two core designs:

1. **Dynamic sparse mixed-precision inference**:
   - **Neuron identification**: A low-rank predictor determines which neurons are needed for a specific text generation task.
   - **Selective loading**: Only the identified active neurons are loaded from DRAM into GPU memory, which optimizes memory usage.
   - **Activity-based quantization**: Neurons with lower activity are quantized to lower bit widths to save HBM space. This reduces the demand for communication bandwidth while keeping key neurons at higher precision.
2. **Prediction-driven multi-level cache**:
   - **Three-level cache system**: Three storage media, GPU HBM, DRAM, and SSD, hold the most frequently accessed active neurons, a larger layer-aware cache, and the complete model parameters, respectively.
   - **Pre-loading strategy**: A prediction mechanism pre-loads neurons that are likely to be needed, which eases the bandwidth bottleneck between DRAM and GPU.
   - **LRU cache mechanism**: The GPU cache uses an LRU policy to keep frequently accessed active neurons, further improving inference efficiency.

### Experimental results

Compared with the state-of-the-art offloading framework DeepSpeed Zero-Infinity, M2Cache achieves significant performance improvements and carbon emission reductions across multiple models and hardware configurations:

- **Inference latency**: M2Cache reduces inference latency by up to 7× on LLaMA-7B and achieves up to a 14× speedup on LLaMA-13B.
- **Carbon emissions**: M2Cache reduces carbon emissions by up to 7.67×.
- **Multi-level cache**: The prediction-driven multi-level cache mechanism alone reduces inference latency by approximately 2.15× and carbon emissions by approximately 2.17×.

### Main contributions

1. **In-depth analysis of inference overhead**: The paper studies the overhead of LLM inference on memory-constrained devices and identifies the challenges introduced by quantization and offloading methods.
2. **Dynamic sparse mixed-precision quantization**: It introduces mixed-precision quantization driven by neuron scoring in dynamic sparse inference, improving LLM performance when memory resources are limited.
3. **Innovative multi-level cache system**: It introduces a multi-level cache system, including GPU-DRAM and DRAM-SSD caches, that further enhances inference performance.
4. **Better sustainability**: M2Cache is described as the first work to make LLM inference on older hardware sustainable, improving sustainability without sacrificing model accuracy.
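The activity-based quantization and the neuron-level LRU cache described under Solutions can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the tier ratios, the byte costs, the precision-aware hit rule, and the names `assign_precision` and `NeuronLRUCache` are all hypothetical, and neuron scores are taken as given rather than produced by a low-rank predictor.

```python
from collections import OrderedDict

# Assumed bytes per weight for each precision tier.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Assumed split of active neurons across tiers, most important first.
DEFAULT_RATIOS = (("fp16", 0.25), ("int8", 0.25), ("int4", 0.5))


def assign_precision(scores, ratios=DEFAULT_RATIOS):
    """Rank neurons by score and give higher-ranked neurons higher precision."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    plan, start = {}, 0
    for i, (precision, fraction) in enumerate(ratios):
        # The last tier absorbs whatever remains after rounding.
        is_last = i == len(ratios) - 1
        end = len(ranked) if is_last else start + round(fraction * len(ranked))
        for neuron in ranked[start:end]:
            plan[neuron] = precision
        start = end
    return plan


class NeuronLRUCache:
    """Byte-budgeted LRU cache standing in for the HBM neuron cache."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # key -> (precision, size_bytes)
        self.hits = 0
        self.misses = 0

    def access(self, key, precision, neuron_dim):
        """Return True on a hit; on a miss, load the neuron and evict LRU entries."""
        size = int(neuron_dim * BYTES_PER_WEIGHT[precision])
        cached = self.entries.get(key)
        if cached is not None and cached[0] == precision:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1  # in a real system this triggers a DRAM -> HBM copy
        if cached is not None:
            # Cached at a different precision: drop the stale copy first.
            self.used_bytes -= cached[1]
            del self.entries[key]
        while self.entries and self.used_bytes + size > self.capacity_bytes:
            _, (_, evicted_size) = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size
        if size <= self.capacity_bytes:
            self.entries[key] = (precision, size)
            self.used_bytes += size
        return False


if __name__ == "__main__":
    scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.3}  # hypothetical neuron scores
    plan = assign_precision(scores)
    cache = NeuronLRUCache(capacity_bytes=4096)
    for neuron, precision in plan.items():
        cache.access((0, neuron), precision, neuron_dim=1024)
    print(plan, cache.used_bytes, cache.hits, cache.misses)
```

Budgeting the cache in bytes rather than in entries is one way to let low-precision neurons occupy less of the HBM budget, so more of them stay resident for the same capacity.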

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

