Abstract: We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in natural language ("golden hour"), using an LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate student LLMs. Both online and offline experiments show that our student models match the performance of our teacher model (GPT-3.5-Turbo) while significantly reducing costs and latency. Lastly, we show that fine-tuning was improved by 25% in low-data regimes using augmentation.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve several key problems in real-time visual editing applications, especially the cost and latency issues encountered when using large language models (LLMs) for tool invocation. Specifically:

1. **High cost and high latency**:
   - Although current proprietary large language models (such as GPT-3.5-Turbo) perform well in visual editing tasks, their high cost and long response time make them unsuitable for real-time applications.
2. **Interpretation of user intent and tool selection**:
   - Users describe the desired visual effects in natural language (for example, "golden hour"), and the system must accurately understand these intents and select appropriate tools and their parameters to achieve the desired effects.
3. **Performance improvement in the case of low data volume**:
   - In practical applications, especially on mobile devices, the amount of available data may be limited. Training the model effectively with little data is therefore a challenge.
4. **Evaluation of the performance of student models**:
   - Offline evaluation metrics are needed to measure the performance of student models, so that their online performance can be predicted without frequent and expensive online A/B tests.

To solve these problems, the authors propose a method based on a distillation framework. By fine-tuning a smaller student LLM, guided by a larger teacher LLM and user behavior signals, they achieve low-cost and low-latency real-time visual editing. They also introduce data augmentation techniques to improve fine-tuning in the case of low data volume, and design offline evaluation metrics to assess the performance of student models.

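
The tool-selection step and an offline comparison between student and teacher can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tool names, parameter ranges, the JSON output format, and the Jaccard overlap score are hypothetical and are not the paper's actual tools or metrics.

```python
import json

# Hypothetical tool schema: tool name -> {parameter name: (min, max)}.
# These names and ranges are illustrative, not taken from the paper.
TOOLS = {
    "color_filter": {"warmth": (-1.0, 1.0), "saturation": (-1.0, 1.0)},
    "vignette": {"strength": (0.0, 1.0)},
}


def parse_tool_calls(llm_output: str) -> list[dict]:
    """Parse an LLM response (assumed to be a JSON list) into validated tool calls.

    Unknown tools and unknown parameters are dropped, and numeric
    values are clamped to the allowed range for each parameter.
    """
    try:
        calls = json.loads(llm_output)
    except json.JSONDecodeError:
        return []  # malformed output: no usable tool calls
    if not isinstance(calls, list):
        return []

    valid = []
    for call in calls:
        if not isinstance(call, dict):
            continue
        tool = call.get("tool")
        if not isinstance(tool, str) or tool not in TOOLS:
            continue
        raw_params = call.get("params", {})
        if not isinstance(raw_params, dict):
            raw_params = {}
        params = {}
        for name, value in raw_params.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if name in TOOLS[tool] and is_number:
                low, high = TOOLS[tool][name]
                params[name] = min(max(float(value), low), high)
        valid.append({"tool": tool, "params": params})
    return valid


def tool_overlap(student: list[dict], teacher: list[dict]) -> float:
    """Jaccard overlap between the sets of tools chosen by two models."""
    s = {c["tool"] for c in student}
    t = {c["tool"] for c in teacher}
    if not s and not t:
        return 1.0  # both chose nothing: treat as full agreement
    return len(s & t) / len(s | t)
```

In this sketch, a student response is first passed through `parse_tool_calls`, and the result is compared with the teacher's validated calls using `tool_overlap`. A score like this only measures agreement on which tools were chosen; it ignores parameter values and says nothing about visual quality.
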
### Brief summary

This paper mainly solves the following problems:

- How to reduce the cost and latency of using LLMs for visual editing while ensuring high-quality editing effects.
- How to achieve good performance in the case of low data volume by fine-tuning smaller student models.
- How to develop effective offline evaluation metrics to reduce the dependence on expensive online A/B tests.

Through these methods, the authors hope that their solutions can better meet the needs of real-time industry applications.

Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications
