Abstract: Large language models (LLMs) power many state-of-the-art systems in natural language processing. However, these models are extremely computationally expensive, even at inference time, raising the natural question: when is the extra cost of deploying a larger model worth the anticipated boost in capabilities? A better understanding of this tradeoff could benefit from an inference efficiency metric that is both (i) easily comparable across models from different providers, and (ii) representative of the true cost of running queries in an isolated performance environment. Unfortunately, access to LLMs today is largely restricted to black-box text generation APIs, and raw runtimes measured through this interface do not satisfy these desiderata: model providers can apply various software and hardware optimizations orthogonal to the model, and models served on shared infrastructure are susceptible to performance contention. To circumvent these problems, we propose a new metric for comparing inference efficiency across models. This metric puts models on equal footing as though they were served (i) on uniform hardware and software, and (ii) without performance contention. We call this metric the "idealized runtime", and we propose a methodology to efficiently estimate it for autoregressive Transformer models. We also propose cost-aware variants that incorporate the number of accelerators needed to serve the model. Using these metrics, we compare ten state-of-the-art LLMs to provide the first analysis of inference efficiency-capability tradeoffs. We make several observations from this analysis, including that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than of the underlying model. Our methodology also facilitates the efficient comparison of different software and hardware stacks.

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

A Survey on Efficient Inference for Large Language Models

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Inference Performance Optimization for Large Language Models on CPUs

Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design

Model Compression and Efficient Inference for Large Language Models: A Survey

Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Towards Optimizing with Large Language Models

Search for Efficient Large Language Models

A Survey on Evaluation of Large Language Models

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

Evaluating Large Language Models at Evaluating Instruction Following

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

ICLEval: Evaluating In-Context Learning Ability of Large Language Models