Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah
2024-04-23
Abstract: Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower cost (e.g., edge) devices tend to lag behind in terms of response quality. Therefore in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the trade-off between response quality and inference cost when using large language models (LLMs). Large language models perform well on most natural language processing tasks, but their size means they must be deployed on expensive cloud servers. Smaller models can run on lower-cost hardware (such as edge devices), but they generally lag behind large models in response quality. The paper therefore proposes a hybrid inference method that combines the strengths of both kinds of model to save cost while maintaining response quality. To achieve this, the authors propose a router that assigns each query to the small or the large model according to the predicted query difficulty and the desired quality level. The desired quality level can be adjusted dynamically at test time, so quality can be traded for cost as the scenario requires. Experimental results show that, compared with using only the large model, this method makes up to 40% fewer calls to the large model with no drop in response quality.
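The routing idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the router produces a score in [0, 1] that is higher for easier queries, and that a query goes to the small model when its score meets a tunable threshold. The names `HybridRouter`, `toy_score`, and `large_model_fraction` are hypothetical, and `toy_score` is a word-count placeholder standing in for a trained router model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HybridRouter:
    """Routes each query to the "small" or "large" model.

    score_fn is a hypothetical stand-in for a learned router: it returns a
    score in [0, 1], where higher means the query is predicted to be easier
    (the small model is more likely to answer it well enough).
    threshold is the quality knob: raising it sends more queries to the
    large model (higher cost), lowering it sends more to the small model.
    """

    score_fn: Callable[[str], float]
    threshold: float

    def route(self, query: str) -> str:
        # Easy enough for the small model if the score meets the threshold.
        return "small" if self.score_fn(query) >= self.threshold else "large"


def toy_score(query: str) -> float:
    """Placeholder difficulty proxy (assumption): shorter queries are easier."""
    n_words = len(query.split())
    return 1.0 / (1.0 + n_words / 10.0)


def large_model_fraction(router: HybridRouter, queries: List[str]) -> float:
    """Fraction of queries sent to the large model, a simple proxy for cost."""
    if not queries:
        return 0.0
    n_large = sum(1 for q in queries if router.route(q) == "large")
    return n_large / len(queries)


if __name__ == "__main__":
    queries = [
        "What is 2+2?",
        " ".join(["explain"] * 20),  # a long, 20-word query
    ]
    # Sweep the threshold to see how the share of large-model calls changes.
    for t in (0.0, 0.5, 1.1):
        router = HybridRouter(score_fn=toy_score, threshold=t)
        print(t, large_model_fraction(router, queries))
```

Because the threshold is an ordinary parameter, it can be changed at test time without retraining anything, which is the sense in which quality can be traded for cost per scenario.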