SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees

Saehan Jo, Immanuel Trummer
2024-03-12
Abstract: The advancement of Large Language Models (LLMs) has significantly boosted performance in natural language processing (NLP) tasks. However, the deployment of high-performance LLMs incurs substantial costs, primarily due to the increased number of parameters aimed at enhancing model performance. This has made the use of state-of-the-art LLMs more expensive for end-users. AI service providers, such as OpenAI and Anthropic, often offer multiple versions of LLMs with varying prices and performance. However, end-users still face challenges in choosing an LLM for their tasks that balances result quality with cost.
Machine Learning, Artificial Intelligence, Computation and Language, Databases
What problem does this paper attempt to address?
This paper addresses how to reduce the inference cost of large language models (LLMs) while preserving result quality on natural language processing (NLP) tasks. As LLM parameter counts grow, performance improves significantly, but deployment and usage costs rise substantially as well, which places a heavy burden on end-users. AI service providers typically offer multiple LLM versions at different prices and performance levels, yet users struggle to choose a model suited to their tasks, especially when no ground-truth labels are available as a reference.
To solve this problem, the paper proposes SMART, a framework that minimizes the inference cost of NLP tasks by automatically scaling down the model while ensuring that result quality meets the user's accuracy requirement. SMART lets users specify an accuracy constraint: the probability that an output is equivalent to the output of the most powerful LLM must be at least a user-defined threshold. SMART identifies models that satisfy this accuracy level by evaluating the performance of multiple LLMs, and it optimizes the trade-off between evaluation overhead and expected cost savings. It further reduces inference cost by strategically combining LLMs with different performance and cost profiles. Overall, SMART aims to substantially lower user costs while delivering output quality comparable to that of the most powerful LLM. In experiments on three real-world datasets, SMART achieves cost savings of up to 25.6 times compared with GPT-4.
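The selection idea summarized above (choose the cheapest model whose agreement with the most powerful LLM meets the user's threshold) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the summary does not state which statistical test SMART uses, so the sketch assumes a one-sided Hoeffding bound as a stand-in, and the names `Candidate`, `hoeffding_lower_bound`, and `cheapest_qualifying`, along with the prices and sample sizes, are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Candidate:
    name: str                # hypothetical model label
    cost_per_item: float     # hypothetical price per processed item
    matches: Sequence[bool]  # True where the output equals the reference LLM's output


def hoeffding_lower_bound(matches: Sequence[bool], delta: float) -> float:
    """Lower bound on the true agreement rate that holds with probability >= 1 - delta.

    Assumption: a one-sided Hoeffding bound stands in for whatever test SMART uses.
    """
    n = len(matches)
    if n == 0:
        return 0.0
    p_hat = sum(matches) / n
    return max(0.0, p_hat - math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


def cheapest_qualifying(
    candidates: List[Candidate], threshold: float, delta: float = 0.05
) -> Optional[Candidate]:
    """Return the cheapest candidate whose lower bound meets the threshold.

    Returns None when no candidate qualifies; the caller would then fall back
    to the most powerful (reference) model.
    """
    qualifying = [
        c for c in candidates
        if hoeffding_lower_bound(c.matches, delta) >= threshold
    ]
    if not qualifying:
        return None
    return min(qualifying, key=lambda c: c.cost_per_item)
```

For instance, with 200 evaluated items per model, a cheap model that agrees on 170 items would fail a 0.85 threshold under this bound, while a pricier model that agrees on 196 would pass. The sketch ignores the cost of the evaluation itself, which the summary says SMART weighs against expected savings.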