Abstract: Fast-growing large-scale language models are delivering unprecedented performance on almost all natural language processing tasks. However, the effectiveness of large language models relies on an exponentially increasing number of parameters. The overwhelming computational complexity incurs high inference latency that negatively affects user experience. Existing methods to improve inference efficiency, such as tensor parallelism and quantization, aim to reduce per-layer computing latency, yet overlook the cumulative latency due to the number of layers. Recent works that reduce the cumulative latency through layer removal, however, lead to a significant performance drop. Motivated by the similarity of inputs among adjacent layers, we propose to identify quasi-independent layers, which can be computed concurrently to significantly decrease inference latency. We also introduce a bypassing technique to mitigate the effect of information loss. Experiments with the proposed approach on the LLaMA models confirm that Concurrent Computation of Quasi-Independent Layers (CQIL) can reduce latency by up to 48.3% on LLaMA-33B, while maintaining a close level of performance.

What problem does this paper attempt to address?

This paper addresses the high inference latency of large language models (LLMs). As model scale grows, these models deliver unprecedented performance on almost all natural language processing tasks, but their large parameter counts bring very high computational complexity, which in turn causes high inference latency and directly degrades the user experience. Existing methods for improving inference efficiency, such as tensor parallelism and quantization, mainly reduce the computational latency of each layer and ignore the cumulative latency that comes from the growing number of layers. Some recent work reduces the cumulative latency by removing certain layers, but this leads to significant performance degradation. The paper therefore proposes a new method, Concurrent Computation of Quasi-Independent Layers (CQIL), which aims to significantly reduce inference latency by identifying quasi-independent layers that can be computed concurrently while maintaining model performance.

The main contributions of the paper include:

1. Proposing a new method, CQIL, which improves the inference efficiency of LLMs by computing quasi-independent layers concurrently, addressing the latency that accumulates as the number of layers grows.
2. Showing that this method lets pre-trained LLMs be adapted to behave in a manner similar to an ensemble model with minimal performance loss, which may provide deeper insight into the characteristics of each layer in LLMs.
3. Reporting experimental results in which CQIL significantly reduces the inference latency of the LLaMA models with little impact on model performance, with a maximum reduction of 48.3% on LLaMA-33B.

In addition, the paper explores the potential of applying CQIL to ensemble models and discusses its compatibility with existing model compression techniques such as pruning.
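To make the idea of concurrent layer computation concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a residual formulation in which each layer contributes an update `f(x)` that is added to its input, and that layers in one concurrent group all read the same input and have their updates summed; the summary above does not state how outputs are merged, so this merge rule is an assumption. The names `sequential_forward`, `grouped_forward`, and `make_toy_layer` are hypothetical, and the bypassing technique is not modelled.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# A "layer" here returns only its residual update f(x), not x + f(x).
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity, used to compare the inputs of adjacent layers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sequential_forward(x: Vector, layers: Sequence[Layer]) -> Vector:
    """Baseline: each layer waits for the output of the previous one."""
    for layer in layers:
        x = add(x, layer(x))
    return x


def grouped_forward(x: Vector, groups: Sequence[Sequence[Layer]]) -> Vector:
    """Assumed concurrent variant: layers in a group share one input.

    The updates inside a group do not depend on each other, so in a real
    system they could run on separate devices; here they run in a loop.
    """
    for group in groups:
        updates = [layer(x) for layer in group]  # all read the same x
        for update in updates:
            x = add(x, update)
    return x


def make_toy_layer(scale: float) -> Layer:
    """Toy layer: a small update built from the reversed input vector."""
    return lambda x: [scale * v for v in reversed(x)]


layers = [make_toy_layer(s) for s in (0.01, 0.02, 0.01, 0.02)]
x0 = [1.0, -2.0, 0.5]

# Inputs of the first two layers are nearly identical when updates are small.
x1 = add(x0, layers[0](x0))
similarity = cosine_similarity(x0, x1)

baseline = sequential_forward(x0, layers)
# Treat the two middle layers as one concurrent group (an arbitrary choice).
concurrent = grouped_forward(x0, [[layers[0]], [layers[1], layers[2]], [layers[3]]])
max_gap = max(abs(a - b) for a, b in zip(baseline, concurrent))
print(f"adjacent-input cosine similarity: {similarity:.4f}")
print(f"max |sequential - grouped| difference: {max_gap:.6f}")
```

In this toy setting the grouped pass needs three dependent steps instead of four, and its output stays close to the sequential output because each layer's update is small relative to its input. Which layers of a real model can be grouped this way, and how much accuracy is lost, is what the paper evaluates; the sketch makes no claim about that.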

CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers

Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy

Not All Layers of LLMs Are Necessary During Inference

Cross-layer Attention Sharing for Large Language Models

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

LiveMind: Low-latency Large Language Models with Simultaneous Inference

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

QIANets: Quantum-Integrated Adaptive Networks for Reduced Latency and Improved Inference Times in CNN Models

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

On Optimal Caching and Model Multiplexing for Large Model Inference

Inference Performance Optimization for Large Language Models on CPUs

Efficient and Economic Large Language Model Inference with Attention Offloading

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

SDQ: Sparse Decomposed Quantization for LLM Inference

CORM: Cache Optimization with Recent Message for Large Language Model Inference