Abstract: Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and the extensive computation they require make LLM deployment the main source of carbon emissions from today's AI applications. Compared with modern GPUs such as the H$100$, serving LLMs on older GPUs such as the M$40$ would be significantly more carbon-sustainable (as shown in Figure~\ref{fig:tisser}, the M$40$ has only one third of the H$100$'s carbon emissions). However, the limited High Bandwidth Memory (HBM) available on such GPUs often cannot hold an LLM because of the gigantic model size and intermediate activation data, which makes serving challenging. For instance, a LLaMA2 model with $70$B parameters typically requires $128$GB for inference, which substantially exceeds the $24$GB of HBM in a $3090$ GPU and remains infeasible even with an additional $64$GB of DRAM. To address this challenge, this paper proposes M2Cache, which combines mixed-precision inference with a model modularization algorithm and multi-level caching to enable LLM inference on outdated hardware with resource constraints. (Here, precision denotes numerical precision such as FP16, INT8, and INT4.) Specifically, M2Cache first modularizes the neurons in the LLM and ranks them by importance. Then, it adopts a dynamic sparse mixed-precision quantization mechanism in weight space to reduce computational demands and communication overhead at each decoding step. Together, these reductions lower the operational carbon emissions associated with LLM inference. Moreover, M2Cache introduces a three-level cache management system spanning HBM, DRAM, and SSDs that complements the dynamic sparse mixed-precision inference. To enhance communication efficiency, M2Cache maintains a neuron-level mixed-precision LRU cache in HBM, a larger layer-aware cache in DRAM, and the full model on SSD.
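The three-level lookup path described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `LRUCache` and `ThreeLevelStore`, the `(layer, neuron, precision)` key, the dictionary standing in for the SSD, and the LRU eviction used for the DRAM tier are all assumptions made for illustration. The abstract only states that the HBM cache is a neuron-level mixed-precision LRU cache, that the DRAM cache is layer-aware, and that the full model resides on SSD.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity LRU cache; stands in for the neuron-level HBM cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


class ThreeLevelStore:
    """Hypothetical HBM -> DRAM -> SSD lookup for neuron weights."""

    def __init__(self, ssd, hbm_capacity, dram_layers):
        # ssd: dict mapping layer -> {(neuron, precision): weights}.
        # A plain dict stands in for the full model stored on SSD.
        self.ssd = ssd
        self.hbm = LRUCache(hbm_capacity)   # neuron-level entries
        self.dram = OrderedDict()           # whole layers (layer-aware tier)
        self.dram_layers = dram_layers      # max layers resident in DRAM
        self.stats = {"hbm": 0, "dram": 0, "ssd": 0}

    def _load_layer(self, layer):
        if layer in self.dram:
            self.dram.move_to_end(layer)
            self.stats["dram"] += 1
        else:
            # Assumption: a DRAM miss pulls the whole layer from SSD.
            self.dram[layer] = self.ssd[layer]
            self.stats["ssd"] += 1
            if len(self.dram) > self.dram_layers:
                self.dram.popitem(last=False)
        return self.dram[layer]

    def fetch(self, layer, neuron, precision):
        key = (layer, neuron, precision)
        hit = self.hbm.get(key)
        if hit is not None:
            self.stats["hbm"] += 1
            return hit
        weights = self._load_layer(layer)[(neuron, precision)]
        self.hbm.put(key, weights)
        return weights
```

In this sketch, calling `fetch` twice for the same neuron and precision is served from the HBM tier on the second call, while a request for a different neuron in the same layer falls through to the DRAM tier without touching the SSD stand-in.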

AICAS Grand Challenge 2024: Software and Hardware Co-optimization for General Large Language Model Inference on CPU

Inference Performance Optimization for Large Language Models on CPUs

A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

Distributed Inference Performance Optimization for LLMs on CPUs

EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

A Hardware Evaluation Framework for Large Language Model Inference

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting

GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

A Speed Odyssey for Deployable Quantization of LLMs

New Solutions on LLM Acceleration, Optimization, and Application

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Efficient and Economic Large Language Model Inference with Attention Offloading

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

A Comprehensive Evaluation of FPGA-Based Spatial Acceleration of LLMs

Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization