Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications

Oren Sultan, Alex Khasin, Guy Shiran, Asnat Greenstein-Messica, Dafna Shahaf
2024-10-10
Abstract: We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in natural language ("golden hour"), using an LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate student LLMs. Both online and offline experiments show that our student models manage to match the performance of our teacher model (GPT-3.5-Turbo), significantly reducing costs and latency. Lastly, we show that fine-tuning was improved by 25% in low-data regimes using augmentation.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several problems in real-time visual editing applications, chiefly the cost and latency of using large language models (LLMs) for tool invocation. Specifically:

1. **High cost and high latency**: Although proprietary LLMs such as GPT-3.5-Turbo perform well in visual editing tasks, their high cost and long response times make them unsuitable for real-time applications.
2. **Interpretation of user intent and tool selection**: Users describe the desired visual effect in natural language (for example, "golden hour"), and the system must interpret that intent accurately and select the appropriate tools and their parameters to achieve the effect.
3. **Performance in low-data regimes**: In practical applications, especially on mobile devices, the amount of available data may be limited, so training the model effectively with little data is a challenge.
4. **Evaluation of student models**: Offline evaluation metrics are needed to measure student models and predict how they will perform online, avoiding frequent and expensive online A/B tests.

To address these problems, the authors propose a distillation framework: a smaller student LLM is fine-tuned with guidance from a larger teacher LLM and user behavioral signals, which makes low-cost, low-latency real-time visual editing feasible. They also use data augmentation to improve fine-tuning in low-data regimes and design offline evaluation metrics to assess student model performance.
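The tool-selection step and the idea of an offline student-versus-teacher comparison can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not taken from the paper: the JSON output format, the tool names (`color_temperature`, `exposure`, `vignette`), their parameters, and the Jaccard-overlap score, which stands in for the authors' offline metrics (not specified in this summary).

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in a tool chain: a tool name plus its parameters."""
    tool: str
    params: tuple  # sorted (name, value) pairs, kept as a tuple so the call is hashable


def parse_tool_chain(llm_output: str) -> list[ToolCall]:
    """Parse a JSON list such as [{"tool": ..., "params": {...}}, ...].

    The output schema is hypothetical; a real system would define its own.
    """
    steps = json.loads(llm_output)
    return [
        ToolCall(step["tool"], tuple(sorted(step.get("params", {}).items())))
        for step in steps
    ]


def tool_selection_agreement(student: list[ToolCall], teacher: list[ToolCall]) -> float:
    """Jaccard overlap between the sets of tool names chosen by student and teacher."""
    student_tools = {call.tool for call in student}
    teacher_tools = {call.tool for call in teacher}
    if not student_tools and not teacher_tools:
        return 1.0  # both chose no tools, so they agree trivially
    return len(student_tools & teacher_tools) / len(student_tools | teacher_tools)


# Hypothetical outputs for the request "golden hour".
teacher_out = (
    '[{"tool": "color_temperature", "params": {"warmth": 0.6}},'
    ' {"tool": "exposure", "params": {"value": 0.2}}]'
)
student_out = (
    '[{"tool": "color_temperature", "params": {"warmth": 0.5}},'
    ' {"tool": "vignette", "params": {"strength": 0.3}}]'
)

teacher_chain = parse_tool_chain(teacher_out)
student_chain = parse_tool_chain(student_out)
agreement = tool_selection_agreement(student_chain, teacher_chain)
print(f"tool-selection agreement: {agreement:.2f}")  # prints 0.33 (1 shared tool of 3 distinct)
```

A score like this only compares which tools were chosen; a fuller offline evaluation would presumably also compare the parameter values, which this sketch ignores.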
### Brief summary

This paper addresses the following questions:

- How to reduce the cost and latency of using LLMs for visual editing while preserving high-quality editing results.
- How to achieve good performance in low-data regimes by fine-tuning smaller student models.
- How to develop effective offline evaluation metrics that reduce the dependence on expensive online A/B tests.

With this approach, the authors aim to better meet the needs of real-time industry applications.