Tuning Language Models by Proxy

Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith
2024-08-23
Abstract: Despite the general capabilities of large pretrained language models, they consistently benefit from further adaptation to better achieve desired behaviors. However, tuning these models has become increasingly resource-intensive, or impossible when model weights are private. We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box LMs to achieve the same end as direct tuning, but by accessing only its predictions over the output vocabulary, not its parameters. Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning, while retaining the benefits of larger-scale pretraining. In experiments, when we apply proxy-tuning to Llama2-70B using proxies of only 7B size, we can close 88% of the gap between Llama2-70B and its truly-tuned chat version, when evaluated across knowledge, reasoning, and safety benchmarks. We then demonstrate the generality of proxy-tuning by applying it to domain adaptation on code, and task-specific finetuning on question-answering and math problems. Finally, we show how to proxy-tune a truly black-box LM, GPT-3.5, for temporal adaptation, increasing its knowledge about recent events. Our work demonstrates the promise of using small tuned LMs to efficiently customize large, potentially proprietary LMs through decoding-time guidance.
Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to efficiently customize large pre-trained language models (LMs) to achieve desired behaviors and better performance without accessing or modifying their internal weights. Specifically, the authors propose **proxy-tuning**, a lightweight decoding-time algorithm that adapts a large language model by accessing only its predictions over the output vocabulary, not its parameters.

### Main Problems

1. **Resource-Intensive Tuning**: Directly tuning large pre-trained language models is becoming increasingly resource-intensive and, in some cases (such as when model weights are private), impossible.
2. **How to Efficiently Customize Large Language Models**: As model scale increases, efficiently customizing these models for different users and application scenarios becomes a challenge.

### Solutions

The authors propose a method named **proxy-tuning**, which is implemented through the following steps:

- **Using a Small Tuned Model**: First, tune a small language model (referred to as the "expert"), then take the difference between its predictions and those of its untuned version (referred to as the "anti-expert").
- **Guiding the Large Model**: Apply this difference to the original predictions of the large pre-trained model, shifting its behavior in the direction of tuning during decoding while retaining the benefits of large-scale pre-training.

### Experimental Results

- **Instruction Tuning**: Proxy-tuning significantly narrows the gap between a large base model and its directly tuned version on knowledge, reasoning, and safety benchmarks. For example, applying proxy-tuning to Llama2-70B using proxies of only 7B size closes 88% of the gap between Llama2-70B and its truly-tuned chat version.
- **Domain Adaptation**: Applying proxy-tuning to the code domain significantly improves performance on programming tasks.
- **Task-Specific Fine-Tuning**: For specific tasks such as question-answering and math problems, proxy-tuning also greatly improves performance and enables the model to follow strict format requirements.
- **Temporal Adaptation of Black-Box Models**: Proxy-tuning increases GPT-3.5's knowledge about recent events, even with only limited access to the model's predictions.

### Formula Representation

The core formula of proxy-tuning is:

\[ p_{\tilde{M}}(X_t \mid x_{<t}) = \text{softmax}\left[ s_{M}(X_t \mid x_{<t}) + s_{M^+}(X_t \mid x_{<t}) - s_{M^-}(X_t \mid x_{<t}) \right] \]

where:

- \( s_{M} \) is the logit score of the large untuned base model,
- \( s_{M^+} \) is the logit score of the small tuned expert model,
- \( s_{M^-} \) is the logit score of the small untuned anti-expert model.

This method allows a large pre-trained model to be steered by a small tuned model during decoding without directly modifying the large model's parameters.
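The logit arithmetic in the formula above can be illustrated with a minimal NumPy sketch for a single decoding step. This is not the authors' implementation: the function name `proxy_tuned_probs`, the toy four-token vocabulary, and the logit values are all hypothetical, and the sketch assumes the three models share one vocabulary so that their logit vectors line up index by index.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Next-token distribution: softmax(s_M + s_{M+} - s_{M-}).

    All three inputs are logits over the same vocabulary (an assumption
    of this sketch), taken at the same decoding step.
    """
    base = np.asarray(base_logits, dtype=float)
    expert = np.asarray(expert_logits, dtype=float)
    anti = np.asarray(antiexpert_logits, dtype=float)
    # The expert/anti-expert difference is the offset applied to the base model.
    return softmax(base + (expert - anti))


# Toy logits over a hypothetical 4-token vocabulary.
base = np.array([2.0, 1.0, 0.5, 0.0])    # large untuned model prefers token 0
expert = np.array([0.0, 2.0, 0.0, 0.0])  # small tuned model prefers token 1
anti = np.array([1.0, 0.0, 0.0, 0.0])    # small untuned model prefers token 0

probs = proxy_tuned_probs(base, expert, anti)
print(probs, int(np.argmax(probs)))
```

In this toy case the most likely token moves from index 0 under the base model to index 1 after the offset is applied. Because softmax is invariant to adding a constant to all logits, the same distribution is obtained by multiplying the base probabilities by the ratio of expert to anti-expert probabilities and renormalizing.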