Elephants Never Forget: Testing Language Models for Memorization of Tabular Data

Sebastian Bordt,Harsha Nori,Rich Caruana
DOI: https://doi.org/10.48550/arXiv.2403.06644
2024-03-11
Abstract:While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the language model reproduces important statistics of the data, but fails to reproduce the dataset verbatim. On these datasets, although seen during training, good performance on downstream tasks might not be due to overfitting. Our findings underscore the need for ensuring data integrity in machine learning tasks with LLMs. To facilitate future research, we release an open-source tool that can perform various tests for memorization \url{<a class="link-external link-https" href="https://github.com/interpretml/LLM-Tabular-Memorization-Checker" rel="external noopener nofollow">this https URL</a>}.
Machine Learning,Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve This paper aims to address the issues of data contamination and memorization in large language models (LLMs) when handling tabular data. Specifically, the authors focus on the following aspects: 1. **Data Contamination**: Many popular datasets are used as pre-training data for LLMs, which may lead to invalid performance in downstream task evaluations because LLMs have already "seen" the data in the test set. 2. **Memorization**: Whether LLMs remember specific samples from a dataset rather than completing tasks by learning the true joint probability distribution of the data. 3. **Distinguishing Knowledge, Learning, and Memorization**: - **Knowledge**: Whether the model knows the metadata of the dataset, such as feature names, values, and formats. - **Learning**: Whether the model can learn the true joint probability distribution from the dataset to complete supervised learning tasks. - **Memorization**: Whether the model remembers specific samples from the dataset rather than just sampling from the true distribution. ### Main Contributions 1. **Emphasizing the Importance of Verifying Data Contamination**: The authors highlight the importance of verifying data contamination before applying LLMs and propose a series of practical methods to assess whether a given tabular dataset might be part of the model's training corpus. 2. **Open-Source Tool**: To facilitate future research, the authors released an open-source tool that can automatically perform these tests. 3. **Testing Methods**: - **Basic Metadata Test**: Evaluating the model's understanding of the dataset's metadata through interactive chat. - **Conditional Completion Test**: Providing partial feature values and requiring the model to complete the remaining feature values to assess whether the model can reproduce the statistical characteristics of the dataset. - **Zero-Knowledge Prompt**: Generating new data samples without providing any dataset information to evaluate whether the model can accurately model the conditional distribution of the dataset. - **Memorization Tests**: Including header tests, row completion tests, feature completion tests, and first token tests to distinguish memorization from learning. 4. **Experimental Results**: The authors validated the effectiveness of these testing methods on multiple datasets, including publicly available and non-public datasets. The results show that LLMs exhibit high memorization on some datasets, while there is no evidence of memorization on other datasets. ### Conclusion This study emphasizes the importance of ensuring data integrity and avoiding data contamination when using LLMs for tabular data tasks. The proposed testing methods allow researchers to better understand the behavior of LLMs when handling specific datasets, thereby improving the accuracy of performance evaluations in downstream tasks.