Abstract: Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks, but their deployment comes at a high computational and financial cost. On the other hand, smaller language models (SLMs), which can be deployed on lower-cost edge devices, struggle to match the performance of their larger counterparts. This paper presents a novel hybrid inference approach that leverages the strengths of both model types while minimizing reliance on costly cloud-based LLMs. Unlike existing methods that route entire queries to either an SLM or a cloud LLM, our approach introduces a reward-based mechanism to dynamically determine the involvement of the cloud LLM during token generation. Specifically, each token predicted by the SLM is evaluated against a reward score, and only when this score falls below a certain threshold is the cloud LLM consulted for assistance in the next token prediction. This method not only reduces the traffic to the cloud LLM, thereby lowering costs, but also allows for flexible control over response quality depending on the reward score threshold. Experimental results demonstrate that our approach significantly reduces cloud LLM usage with minimal impact on overall response quality, offering a cost-effective solution for deploying high-performance language models.

What problem does this paper attempt to address?

The paper addresses the following problem: large language models (LLMs) perform very well on natural language processing tasks but are expensive to deploy, while small language models (SLMs) are cheaper but struggle to match the performance of large models. Existing methods usually route an entire query to either the SLM or the cloud LLM, which can waste resources or yield insufficient quality. To solve this, the paper proposes a novel hybrid inference method that uses a reward-based mechanism to decide dynamically when the cloud LLM's assistance is required, reducing dependence on expensive cloud resources while maintaining high performance. Specifically, the method computes a reward score for each token generated by the SLM and consults the cloud LLM for the next-token prediction only when that score falls below a certain threshold. This reduces traffic to the cloud LLM, and therefore cost, and lets the reward-score threshold act as a flexible control on response quality.

### Main contributions of the paper include:

1. **Introduction of a reward-based modeling method**: Evaluating token alignment dynamically in real time gives fine-grained control over cloud LLM participation.
2. **Selective assistance mechanism**: The cloud LLM is invoked only when the SLM's output quality falls below a certain threshold, which reduces computational overhead.
3. **Extensive experimental verification**: Experiments show that the method reduces cloud usage with minimal impact on overall performance.

Through these improvements, the paper provides a flexible and cost-effective solution that lets developers and users make full use of LLM capabilities without incurring excessive costs.
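The token-level routing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `hybrid_generate`, `slm_next`, `llm_next`, `reward`, and the toy stand-ins are hypothetical names, tokens are plain strings, and the sketch assumes that a low-reward SLM token is replaced by the cloud LLM's prediction at the same position.

```python
from typing import Callable, List, Tuple

Token = str
NextTokenFn = Callable[[List[Token]], Token]
RewardFn = Callable[[List[Token], Token], float]


def hybrid_generate(
    prompt: List[Token],
    slm_next: NextTokenFn,   # hypothetical on-device SLM: context -> next token
    llm_next: NextTokenFn,   # hypothetical cloud LLM: context -> next token
    reward: RewardFn,        # hypothetical reward model: (context, token) -> score
    threshold: float,
    max_tokens: int = 32,
    eos: Token = "<eos>",
) -> Tuple[List[Token], int]:
    """Generate with the SLM and consult the cloud LLM only on low-reward tokens.

    Returns the generated tokens and the number of cloud LLM calls.
    """
    context = list(prompt)
    output: List[Token] = []
    cloud_calls = 0
    for _ in range(max_tokens):
        token = slm_next(context)
        # Assumption: a score below the threshold means the SLM token is
        # poorly aligned, so the cloud LLM predicts this position instead.
        if reward(context, token) < threshold:
            token = llm_next(context)
            cloud_calls += 1
        if token == eos:
            break
        context.append(token)
        output.append(token)
    return output, cloud_calls


# Toy stand-ins so the sketch runs without any model.
def toy_slm(context: List[Token]) -> Token:
    return "a" if len(context) % 2 == 0 else "b"


def toy_llm(context: List[Token]) -> Token:
    return "A"


def toy_reward(context: List[Token], token: Token) -> float:
    return 0.9 if token == "a" else 0.1
```

In this sketch the threshold is the only quality knob: a higher value sends more positions to the cloud LLM, and a threshold at or below the minimum reward keeps generation entirely on the SLM.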
### Key formula

The loss function used to train the reward model is defined as follows:

\[ L(\psi) = \log \sigma\big(r(x, y_w) - r(x, y_l)\big) \]

where \( r(x, y_w) \) and \( r(x, y_l) \) are the reward scores of tokens aligned with the cloud LLM (preferred) and SLM (non-preferred) distributions, respectively, and \( \sigma \) denotes the sigmoid function. As written, this quantity grows as the preferred token's score exceeds the non-preferred one's, so training maximizes it (equivalently, minimizes its negative). The objective pushes the model to distinguish the two distributions correctly, which enables accurate token-level routing decisions.

### Summary

This paper aims to significantly reduce the deployment cost of large language models through an innovative hybrid inference method while ensuring high-quality output, especially for applications on edge devices.
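The pairwise objective from the Key formula section can be evaluated numerically as follows. This is a hedged sketch rather than the paper's training code: the function names are hypothetical, rewards are plain floats, and `pairwise_loss` is the negated form, an assumption made here so that the value can be minimized in the usual way.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)) that avoids overflow in exp."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_objective(r_w: float, r_l: float) -> float:
    """log sigma(r_w - r_l), the quantity written in the formula."""
    return log_sigmoid(r_w - r_l)


def pairwise_loss(r_w: float, r_l: float) -> float:
    """Negated objective (assumption): small when the preferred token scores higher."""
    return -pairwise_objective(r_w, r_l)
```

When both rewards are equal the loss is log 2, and it falls toward zero as the preferred (cloud-LLM-aligned) token's score pulls ahead of the non-preferred one's.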

Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Networks: An Active Inference Approach

Efficient and Economic Large Language Model Inference with Attention Offloading

Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models

Enhancing On-Device LLM Inference with Historical Cloud-Based LLM Interactions

LLMCad: Fast and Scalable On-device Large Language Model Inference

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Efficiently Deploying LLMs with Controlled Risk

Cost-Effective Online Multi-LLM Selection with Versatile Reward Models

SMART: Automatically Scaling Down Language Models with Accuracy Guarantees for Reduced Processing Fees

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models

Cloud-Device Collaborative Learning for Multimodal Large Language Models

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

All Language Models Large and Small