Task Contamination: Language Models May Not Be Few-Shot Anymore

Changmao Li, Jeffrey Flanigan
DOI: https://doi.org/10.48550/arXiv.2312.16337
2023-12-27
Abstract: Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings.
Computation and Language
What problem does this paper attempt to address?
This paper examines whether the strong zero-shot and few-shot performance of large language models (LLMs) is inflated by task contamination. It tracks how LLM performance varies with dataset release date, comparing datasets released before a model's training data was collected with datasets released after. By analyzing 12 models on 16 classification tasks and 1 semantic parsing task, the authors find evidence of task contamination and show that it significantly affects zero-shot and few-shot evaluation results.

### Main Research Questions

1. **Existence of Task Contamination**: Verify whether LLMs were exposed during pre-training to examples of the tasks later used to evaluate them, which would cast doubt on the validity of zero-shot and few-shot evaluations.
2. **Temporal Trends in Performance**: Study how LLM performance differs between datasets released before and after the collection date of the model's training data.
3. **Detection Methods for Task Contamination**: Propose and apply several detection methods: training data inspection, task example extraction, membership inference attacks, and time-series analysis.

### Research Background

In recent years, large language models have performed well on zero-shot and few-shot tasks, especially through in-context learning. However, part of this performance may be attributable to task contamination: the model has already seen examples of the evaluation task during pre-training. This raises doubts about the models' true zero-shot and few-shot capabilities.
The paper explores the issue by systematically analyzing the performance of different models on different datasets.

### Methods

1. **Training Data Inspection**: Search the training data for examples of the evaluation tasks.
2. **Task Example Extraction**: Extract task examples from existing models, especially instruction-tuned models.
3. **Membership Inference Attacks**: Check whether content generated by the model exactly matches examples in the dataset. This method is used mainly for generation tasks.
4. **Time-Series Analysis**: Compare model performance on datasets released at different points in time to detect task contamination.

### Results

- **Time-Series Analysis**: On datasets released before the model's training data was collected, LLMs usually outperform the simple majority baseline. On datasets released afterward, this advantage is substantially weaker.
- **Evidence of Task Contamination**: Training data inspection and task example extraction reveal evidence of task contamination in multiple models, especially the GPT-3 series.
- **Membership Inference Attacks**: For one generation task (Spider), the content generated by the model closely matches examples in the dataset, which further supports the existence of task contamination.

### Conclusions

Using several complementary methods, the paper analyzes how task contamination affects LLM performance on zero-shot and few-shot tasks. It provides strong evidence that task contamination exists and significantly affects evaluation results. These findings matter for future model evaluation and training data management.
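
The time-series analysis listed under Methods can be illustrated with a small sketch. The Python below assumes each evaluation result carries a dataset release date, gold labels, and model predictions, and it reports how often the model beats the majority baseline before and after a training-data cutoff. The names `EvalResult` and `beats_baseline_rate`, the toy datasets, and the cutoff date are all hypothetical. The plain accuracy comparison is a simplification that stands in for the statistical significance testing mentioned in the abstract, so this is not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class EvalResult:
    """One model evaluated on one dataset (hypothetical container)."""
    dataset: str
    released: date           # dataset release date
    labels: List[str]        # gold labels of the evaluation set
    predictions: List[str]   # model predictions, aligned with labels


def majority_baseline(labels: List[str]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(labels: List[str], predictions: List[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    correct = sum(gold == pred for gold, pred in zip(labels, predictions))
    return correct / len(labels)


def beats_baseline_rate(
    results: List[EvalResult], cutoff: date
) -> Dict[str, Optional[float]]:
    """Fraction of datasets on which the model beats the majority baseline,
    split by whether the dataset was released before or after `cutoff`.

    Simplification: a plain `>` comparison, not a significance test.
    """
    buckets: Dict[str, List[bool]] = {"before": [], "after": []}
    for r in results:
        key = "before" if r.released < cutoff else "after"
        beats = accuracy(r.labels, r.predictions) > majority_baseline(r.labels)
        buckets[key].append(beats)
    # None marks a bucket with no datasets, to avoid dividing by zero.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Toy data (invented for illustration, not taken from the paper).
results = [
    EvalResult("old_task", date(2019, 1, 1),
               ["a", "a", "b", "b"], ["a", "a", "b", "b"]),
    EvalResult("new_task", date(2023, 1, 1),
               ["a", "a", "a", "b"], ["a", "a", "a", "a"]),
]
assumed_cutoff = date(2021, 9, 1)  # hypothetical training-data collection date
print(beats_baseline_rate(results, assumed_cutoff))
```

A gap between the two rates is only suggestive on its own. The paper additionally controls for dataset difficulty and pairs this analysis with training data inspection, task example extraction, and a membership inference attack.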