Leveraging Web-Crawled Data for High-Quality Fine-Tuning

Jing Zhou, Chenglin Jiang, Wei Shen, Xiao Zhou, Xiaonan He
2024-08-15
Abstract: Most large language models are fine-tuned using either expensive human-annotated data or GPT-4-generated data, which cannot guarantee performance in certain domains. We argue that although web-crawled data often has formatting errors that cause semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we automatically create a paired training dataset by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert web data with irregular formats into high-quality data. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average of 9.4% on Chinese math problems. Additionally, our 7B model outperforms several open-source models larger than 32B and surpasses well-known closed-source models such as GPT-3.5, highlighting the efficacy of our approach.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the challenge of acquiring high-quality fine-tuning data for large language models (LLMs) in specific domains such as mathematical reasoning. It proposes a simple and effective method that uses web-crawled data for high-quality supervised fine-tuning without relying on advanced models like GPT-4. The main contributions of the paper are:
1. Proposing a method that converts web-crawled data into high-quality data, avoiding dependence on other advanced language models.
2. Showing experimentally that the method improves the performance of two representative open-source models (ChatGLM and Qwen) on Chinese math problems, with an average improvement of 9.4%.
3. Analyzing why formatting errors cause semantic inaccuracies and discussing the effectiveness of the method.
In this way, the paper eases the difficulty of obtaining high-quality data and demonstrates strong performance in specific domains such as mathematical reasoning.
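As a rough illustration of the pairing step described in the abstract, the sketch below matches each high-quality example to its most similar web-crawled counterpart and emits (noisy input, clean target) pairs that a rewriting model could be trained on. It is a minimal sketch under assumptions: this summary does not say how the alignment is performed, so the character-level similarity from Python's difflib, the 0.8 threshold, the field names, and the sample records are all hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed alignment metric)."""
    return SequenceMatcher(None, a, b).ratio()


def build_pairs(web_items, clean_items, threshold=0.8):
    """Pair each clean example with its most similar web-crawled example.

    Both inputs are lists of dicts with hypothetical "question" and
    "answer" keys. A pair is kept only if the question similarity
    reaches the threshold, so unmatched clean examples are skipped.
    """
    pairs = []
    for clean in clean_items:
        best, best_score = None, 0.0
        for web in web_items:
            score = similarity(web["question"], clean["question"])
            if score > best_score:
                best, best_score = web, score
        if best is not None and best_score >= threshold:
            pairs.append({
                # Noisy web text is the model input ...
                "input": best["question"] + "\n" + best["answer"],
                # ... and the curated text is the rewriting target.
                "target": clean["question"] + "\n" + clean["answer"],
            })
    return pairs


# Hypothetical records: the web copy lost its fraction slashes, which is
# the kind of formatting error that changes the meaning of a problem.
web_items = [
    {"question": "Compute 12 + 13 and simplify.", "answer": "56"},
    {"question": "What is the area of a circle with radius 3?", "answer": "9pi"},
]
clean_items = [
    {"question": "Compute 1/2 + 1/3 and simplify.", "answer": "5/6"},
    {"question": "Solve x^2 = 9 for x.", "answer": "x = 3 or x = -3"},
]

pairs = build_pairs(web_items, clean_items)
for pair in pairs:
    print(pair["input"], "->", pair["target"], sep="\n")
```

In this sketch only the fraction problem is paired, because the second clean example has no sufficiently similar web counterpart. A model trained on such pairs would then be applied to the remaining unpaired web data to produce cleaned examples for fine-tuning.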