Training Task Experts through Retrieval Based Distillation

Jiaxin Ge,Xueying Jia,Vijay Viswanathan,Hongyin Luo,Graham Neubig
2024-07-08
Abstract:One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, often such datasets do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLMs output, and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks and results show that our method significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
Computation and Language
What problem does this paper attempt to address?
This paper proposes a solution to the problem of acquiring data for training task-specific expert models. The current method involves generating task-specific data from large language models (LLMs) and distilling the knowledge into smaller models. However, this approach is limited by the quality of the output from LLMs, often leading to repetitive or inaccurate data. The paper introduces a method called Retrieval-Based Distillation (ReBase), which first retrieves data from rich online resources and then transforms it into domain-specific data to enhance data diversity. ReBase also generates Chain-of-Thought reasoning and distills the reasoning capability of LLMs into smaller models. In four benchmark tests, ReBase achieved a 7.8% improvement on SQuAD, a 1.37% improvement on MNLI, and a 1.94% improvement on BigBench-Hard. The research found that by retrieving and transforming data from multiple sources, the performance of task-specific models can be effectively improved. ReBase overcomes the problem of insufficient information due to relying on a single or few datasets, enhancing the content diversity and quality of the data, especially for tasks that require reasoning.