Abstract: Large language models (LLMs) power many state-of-the-art systems in natural language processing. However, these models are extremely computationally expensive, even at inference time, raising the natural question: when is the extra cost of deploying a larger model worth the anticipated boost in capabilities? Better understanding this tradeoff fundamentally could benefit from an inference efficiency metric that is both (i) easily comparable across models from different providers, and (ii) representative of the true cost of running queries in an isolated performance environment. Unfortunately, access to LLMs today is largely restricted to black-box text generation APIs and raw runtimes measured through this interface do not satisfy these desiderata: model providers can apply various software and hardware optimizations orthogonal to the model, and models served on shared infrastructure are susceptible to performance contention. To circumvent these problems, we propose a new metric for comparing inference efficiency across models. This metric puts models on equal footing as though they were served (i) on uniform hardware and software, and (ii) without performance contention. We call this metric the idealized runtime, and we propose a methodology to efficiently estimate this metric for autoregressive Transformer models. We also propose cost-aware variants that incorporate the number of accelerators needed to serve the model. Using these metrics, we compare ten state-of-the-art LLMs to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our methodology also facilitates the efficient comparison of different software and hardware stacks.

What problem does this paper attempt to address?

This paper focuses on the cost and performance of large language models (LLMs) at inference time and proposes a new metric for comparing the inference efficiency of autoregressive Transformer models from different providers. Specifically, it addresses the following core issues:

1. **Evaluating Inference Efficiency**: Although current LLMs achieve strong results on natural language processing tasks, they are extremely computationally expensive at inference time. Researchers therefore need to understand when the extra cost of deploying a larger model is worth the anticipated gain in capabilities.
2. **Limitations of Existing Metrics**: Existing metrics (such as raw runtime or model size) do not accurately reflect the true inference cost of models from different providers, because providers may apply different software optimizations and hardware configurations. In addition, because models are often accessed through black-box APIs, directly measured runtime is affected by further factors such as caching, custom hardware, and performance contention on shared infrastructure, which complicates comparisons between models.
3. **Proposing Idealized Runtime**: To overcome these limitations, the paper proposes a new metric called "idealized runtime," which compares models as though they were served on uniform hardware and software and without performance contention, so that factors external to the model do not distort the comparison.
4. **Analyzing the Trade-off Between Inference Efficiency and Capability**: Using the proposed metric, the paper analyzes the trade-off between inference efficiency and capability for ten state-of-the-art LLMs, with the goal of giving model creators and researchers a deeper understanding of the actual inference costs that result from specific training processes and model architectures.

In summary, the paper's core objective is to enable a fair comparison of the inference efficiency of large language models from different providers, and it proposes the idealized runtime metric as its solution.
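To make the idea of an idealized runtime and its cost-aware variant more concrete, the following is a minimal Python sketch, not the paper's actual estimation procedure. It assumes, purely for illustration, that a model's runtime on fixed, dedicated hardware can be approximated by a linear prompt-processing term plus a constant cost per generated token, and that the cost-aware variant scales this runtime by the number of accelerators. The names `RuntimeProfile`, `idealized_runtime`, and `idealized_cost`, as well as all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeProfile:
    """Hypothetical per-model profile, measured once on uniform, dedicated hardware."""
    prompt_seconds_per_token: float  # assumed linear cost of processing the prompt
    output_seconds_per_token: float  # assumed constant cost per generated token
    num_accelerators: int            # accelerators needed to serve the model


def idealized_runtime(profile: RuntimeProfile, prompt_tokens: int, output_tokens: int) -> float:
    """Estimate runtime in seconds as if the query ran without contention.

    Assumption (not from the source): runtime = prompt term + per-output-token term.
    """
    prompt_time = prompt_tokens * profile.prompt_seconds_per_token
    generation_time = output_tokens * profile.output_seconds_per_token
    return prompt_time + generation_time


def idealized_cost(profile: RuntimeProfile, prompt_tokens: int, output_tokens: int) -> float:
    """Cost-aware variant in accelerator-seconds: runtime scaled by accelerator count."""
    return idealized_runtime(profile, prompt_tokens, output_tokens) * profile.num_accelerators


# Made-up profiles for two hypothetical models of different sizes.
small_model = RuntimeProfile(0.0001, 0.02, num_accelerators=1)
large_model = RuntimeProfile(0.0005, 0.05, num_accelerators=8)

for name, profile in [("small", small_model), ("large", large_model)]:
    runtime = idealized_runtime(profile, prompt_tokens=1000, output_tokens=100)
    cost = idealized_cost(profile, prompt_tokens=1000, output_tokens=100)
    print(f"{name}: {runtime:.2f} s, {cost:.2f} accelerator-seconds")
```

Under these assumptions, two models can be compared on the same query shape independently of whichever API serves them, and the accelerator-scaled figure shows how a larger model's serving footprint can widen the gap beyond what runtime alone suggests.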

Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

The Efficiency Misnomer

Efficient and Economic Large Language Model Inference with Attention Offloading

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Inference Performance Optimization for Large Language Models on CPUs

From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Inference Acceleration for Large Language Models on CPUs

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

MELTing point: Mobile Evaluation of Language Transformers

Model Compression and Efficient Inference for Large Language Models: A Survey

Enhancing Parameter Efficiency in Model Inference Using an Ultralight Inter-Transformer Linear Structure

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

Inference Optimization of Foundation Models on AI Accelerators

Efficiently Scaling Transformer Inference

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models