The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe

2024-06-05

Abstract: How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have continually improved. In this paper, we present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data, even performing as well as oracle models finetuned on hard data. We demonstrate this kind of easy-to-hard generalization using simple finetuning methods like in-context learning, linear classifier heads, and QLoRA for seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect easy data rather than hard data for finetuning, since hard data is generally noisier and costlier to collect. Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd grade science questions to college level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied. Our code is available at: https://github.com/allenai/easy-to-hard-generalization

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper asks how to train language models to perform well on hard tasks when correct labels for hard data are difficult to obtain. Specifically, it studies how well models generalize from easy data to hard data. The authors find that current pretrained language models often generalize well from easy to hard data, with performance close to that of models fine-tuned on hard data.

#### Main Contributions

1. **Data Difficulty Measurement**: Measures data difficulty in several ways, including education (grade) level, expert ratings, and required cognitive skills. In total, the abstract counts six human hardness measures and one model-based (loss-based) measure.
2. **Generalization from Easy to Hard**: Shows experimentally that models fine-tuned on easy data perform comparably on hard test data to models fine-tuned directly on hard data.
3. **Cost-Benefit Analysis**: Examines the trade-off between collecting easy data and collecting hard data, finding that easy data can be the better choice because hard data is usually costlier to collect and noisier to label.

#### Experimental Results

- Experiments on multiple question-answering datasets (such as ARC, MMLU, StrategyQA, and GSM8k) show that fine-tuning on easy data can achieve results similar to fine-tuning on hard data.
- Even simple fine-tuning methods (in-context learning, linear classifier heads, QLoRA) generalize well from easy to hard data.
- In some cases, fine-tuning on easy data outperforms fine-tuning on hard data, especially when hard data is more difficult to obtain or has a higher annotation error rate.

In summary, this paper demonstrates that pretrained language models generalize strongly from easy to hard data for the tasks studied, and it argues that collecting easy training data can be the more cost-effective strategy.
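The comparison described above (train on easy data, train on hard data, evaluate both on hard test data) can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the record fields (`label`, `grade`), the thresholds, and the helper names (`split_by_hardness`, `fit_majority`, `easy_to_hard_report`) are all hypothetical, and the majority-label "model" is a stand-in for a real fine-tuning method such as a linear classifier head or QLoRA.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Dict[str, object]  # hypothetical record, e.g. {"label": "A", "grade": 3}
Predictor = Callable[[Example], object]


def split_by_hardness(
    examples: List[Example],
    hardness: Callable[[Example], float],
    easy_max: float,
    hard_min: float,
) -> Tuple[List[Example], List[Example]]:
    """Partition examples into easy and hard subsets by a hardness score.

    Examples scoring strictly between the two thresholds are dropped.
    """
    easy = [ex for ex in examples if hardness(ex) <= easy_max]
    hard = [ex for ex in examples if hardness(ex) >= hard_min]
    return easy, hard


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(predict(ex) == ex["label"] for ex in examples)
    return correct / len(examples)


def fit_majority(train: List[Example]) -> Predictor:
    """Stand-in 'model' (assumption): always predict the most common training label."""
    majority = Counter(ex["label"] for ex in train).most_common(1)[0][0]
    return lambda ex: majority


def easy_to_hard_report(
    easy_train: List[Example],
    hard_train: List[Example],
    hard_test: List[Example],
    fit: Callable[[List[Example]], Predictor],
) -> Dict[str, float]:
    """Evaluate an easy-trained and a hard-trained model on the same hard test set."""
    easy_model = fit(easy_train)
    hard_model = fit(hard_train)
    return {
        "easy_to_hard": accuracy(easy_model, hard_test),
        "hard_to_hard": accuracy(hard_model, hard_test),
    }
```

In practice, `fit` would be replaced by the chosen fine-tuning procedure and `hardness` by one of the difficulty measures (for example, grade level). The gap between `hard_to_hard` and `easy_to_hard` then indicates how much is lost by training only on easy data.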

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Does your data spark joy? Performance gains from domain upsampling at the end of training

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance

How to Train Data-Efficient LLMs

Larger and more instructable language models become less reliable

Pedagogical Alignment of Large Language Models

Text Difficulty Study: Do Machines Behave the Same as Humans Regarding Text Difficulty?

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning

Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?

Assessing Programming Task Difficulty for Efficient Evaluation of Large Language Models

Student Data Paradox and Curious Case of Single Student-Tutor Model: Regressive Side Effects of Training LLMs for Personalized Learning

Language models scale reliably with over-training and on downstream tasks

Data Factors for Better Compositional Generalization

Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior?

Learning to Learn to be Right for the Right Reasons

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs