I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses

Xuan Ren, Biao Wu, Lingqiao Liu
2024-10-11
Abstract: This paper explores an intriguing observation: fine-tuning a large language model (LLM) with responses generated by an LLM often yields better results than using responses generated by humans, particularly in reasoning tasks. We conduct an in-depth investigation to understand why this occurs. Contrary to the common belief that this is due to the more detailed nature of LLM-generated content, our study identifies another contributing factor: an LLM is inherently more "familiar" with LLM-generated responses. This familiarity is evidenced by lower perplexity before fine-tuning. We design a series of experiments to understand the impact of this "familiarity", and our conclusion is that it significantly affects learning performance. Training with LLM-generated responses not only enhances performance but also helps maintain the model's capabilities in other reasoning tasks after fine-tuning on a specific task.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper explores an interesting phenomenon: using responses generated by large language models (LLMs) to fine-tune another large language model usually gives better results than using human-generated responses, especially in reasoning tasks. Specifically, the paper aims to answer the following questions:

1. **Why does fine-tuning with LLM-generated responses work better than fine-tuning with human-generated responses?**
   - The common view is that LLM-generated content is more detailed and therefore more effective. However, the authors find that this is not the only reason.
   - The authors propose a new hypothesis: the target LLM is more "familiar" with responses generated by other LLMs, and this "familiarity" plays a key role in the fine-tuning process.
2. **How can this "familiarity" be measured and verified?**
   - The authors test the hypothesis through a series of experiments, including computing the perplexity of responses produced by different generation methods and designing data variants for comparative experiments.
3. **Can "familiarity" be improved by methods that do not rely on advanced LLMs?**
   - The authors try having the target LLM itself rewrite the training data so that its style is closer to that of LLM-generated responses, with the aim of improving the fine-tuning results.

### Main observations and hypotheses

1. **Observation 1: Data generated by LLMs is better than human-annotated data**
   - The experimental results show that fine-tuning with LLM-generated data works significantly better than fine-tuning with human-annotated data, especially in math-related tasks, with a performance improvement of more than 10%.
2. **Observation 2: Responses generated by LLMs have significantly lower perplexity**
   - The target LLM assigns significantly lower perplexity to responses generated by other LLMs than to human-generated responses, indicating that the LLM is more "familiar" with responses generated by other LLMs.
3. **Hypothesis 1: Given the same question, an LLM is more "familiar" with responses generated by other LLMs**
   - This "familiarity" shows up as lower perplexity, meaning that the LLM can understand and adapt to these responses more easily.
4. **Hypothesis 2: LLMs perform better during training because they are more "familiar" with the data**
   - The authors verify this through experiments and find that fine-tuning with low-perplexity data works significantly better than fine-tuning with high-perplexity data.

### Experimental design

1. **Experiment 1: Influence of detailed reasoning steps**
   - The authors design an experiment to test whether the detailed reasoning steps generated by LLMs are the main reason for the performance improvement. The results show that even without added detailed reasoning steps, LLM-generated data still performs better.
2. **Experiment 2: Influence of familiarity**
   - The authors generate responses with different perplexities and compare the fine-tuning results of high-perplexity and low-perplexity data. Low-perplexity data performs better in fine-tuning, further supporting the importance of "familiarity".
3. **Experiment 3: Methods that do not rely on advanced LLMs**
   - The authors use the target LLM itself to rewrite the training data to improve its "familiarity". The experimental results show that this method improves the fine-tuning results in most cases, but it still has limitations in some tasks.
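
Because perplexity is the summary's proxy for "familiarity", a small sketch of how it is commonly computed may help. It assumes the standard definition (the exponential of the average negative log-likelihood per token); the summary does not state the paper's exact formula. The function names and the log-probability values are hypothetical and are not data from the paper. In practice, the log-probabilities would come from the target model scoring each response given the question, a step omitted here.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one response from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def rank_by_familiarity(candidates: Dict[str, List[float]]) -> List[str]:
    """Return candidate names sorted from lowest to highest perplexity."""
    return sorted(candidates, key=lambda name: perplexity(candidates[name]))


# Made-up log-probabilities for illustration only (not results from the paper).
candidates = {
    "human_written": [-2.3, -1.9, -3.1, -2.7],
    "llm_generated": [-0.4, -0.7, -0.2, -0.5],
}

for name in rank_by_familiarity(candidates):
    print(name, round(perplexity(candidates[name]), 2))
```

Under this sketch, the response listed first is the one the scoring model finds most "familiar". The same ranking could be used to split a dataset into low-perplexity and high-perplexity variants for a comparison like the one in Experiment 2.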
### Conclusion

Through a series of experiments, the authors show that an LLM is more "familiar" with responses generated by other LLMs and that this "familiarity" plays a key role in the fine-tuning process. The authors also propose a method for improving "familiarity" that does not rely on advanced LLMs, offering a new direction for further optimizing fine-tuning.