Can Foundation Models Wrangle Your Data?

Avanika Narayan, Ines Chami, Laurel Orr, Simran Arora, Christopher Ré
DOI: https://doi.org/10.48550/arXiv.2205.09911
2022-12-24
Abstract: Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast five data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs on these tasks. We find that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks, even though they are not trained for these data tasks. We identify specific research challenges and opportunities that these models present, including challenges with private and domain specific data, and opportunities to make data management systems more accessible to non-experts. We make our code and experiments publicly available at: https://github.com/HazyResearch/fm_data_tasks
Machine Learning, Artificial Intelligence, Databases
What problem does this paper attempt to address?
This paper explores whether foundation models (FMs) can be applied to classical data tasks, particularly data cleaning and integration. Specifically, it focuses on the following aspects:

1. **Transferability**: Whether large-scale FMs transfer to data tasks in zero-shot and few-shot settings. The tasks studied are entity matching, error detection, schema matching, data transformation, and data imputation. These tasks usually rely on specialized data processing techniques, while FMs are trained mainly on natural-language text, so it is not obvious in advance that FMs would perform well on them.
2. **Application challenges**: The specific difficulties of applying FMs to data tasks, such as handling private and domain-specific data, and how to improve performance through fine-tuning or prompt tuning.
3. **Opportunities and challenges**: The potential of FMs in data management, including making data management systems more accessible to non-experts, together with open research challenges such as updating the knowledge stored in FMs and handling time-series and local data.

### Main contributions

- **Experimental verification**: Through a series of experiments, the paper measures the performance of large-scale FMs (such as GPT-3) on data cleaning and integration tasks, especially in zero-shot and few-shot settings.
- **Performance comparison**: Compared with existing state-of-the-art methods (such as deep-learning-based methods), FMs match or exceed their performance on some tasks.
- **Prompt-tuning analysis**: The paper analyzes in detail how choices made during prompt tuning (attribute selection, prompt format, and task example selection) affect model performance, and provides strategies for optimizing prompts.

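
To make the idea of casting a data task as a prompting task concrete, the following is a minimal sketch of how an entity matching question could be serialized into a zero-shot or few-shot prompt. The function names (`serialize_row`, `entity_matching_prompt`), the `attr: value.` serialization, and the question wording are assumptions made for illustration; they are not taken from the paper's released code. The `attributes` argument corresponds to attribute selection, and `demonstrations` to the task examples mentioned above.

```python
from typing import Mapping, Optional, Sequence, Tuple

Row = Mapping[str, object]
# A demonstration is a labeled pair: (left row, right row, is_match).
Demonstration = Tuple[Row, Row, bool]

# Hypothetical question template; the exact wording is an assumption.
QUESTION = "Are Product A and Product B the same? Yes or No?"


def serialize_row(row: Row, attributes: Optional[Sequence[str]] = None) -> str:
    """Render a row as 'attr: value.' text, skipping missing or empty values.

    If `attributes` is given, only those attributes are included, in that
    order (a simple form of attribute selection).
    """
    keys = list(attributes) if attributes is not None else list(row.keys())
    return " ".join(
        f"{key}: {row[key]}." for key in keys if row.get(key) not in (None, "")
    )


def entity_matching_prompt(
    left: Row,
    right: Row,
    demonstrations: Sequence[Demonstration] = (),
    attributes: Optional[Sequence[str]] = None,
) -> str:
    """Build a prompt: labeled demonstrations first, then the open query.

    With no demonstrations this is a zero-shot prompt; with some, few-shot.
    """
    blocks = []
    for demo_left, demo_right, is_match in demonstrations:
        blocks.append(
            f"Product A is {serialize_row(demo_left, attributes)}\n"
            f"Product B is {serialize_row(demo_right, attributes)}\n"
            f"{QUESTION} {'Yes' if is_match else 'No'}"
        )
    # The final block leaves the answer open for the model to complete.
    blocks.append(
        f"Product A is {serialize_row(left, attributes)}\n"
        f"Product B is {serialize_row(right, attributes)}\n"
        f"{QUESTION}"
    )
    return "\n\n".join(blocks)
```

The resulting string would be sent to a language model, and the completion ("Yes" or "No") read back as the match decision. Keeping serialization, attribute selection, and demonstration choice as separate arguments makes it easy to vary each one independently, which is the kind of comparison the prompt-tuning analysis describes.
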
### Experimental results

- **Few-shot performance**: In the few-shot setting, with manually curated task examples, GPT-3-175B reaches state-of-the-art performance on multiple data tasks, including entity matching, data imputation, data transformation, error detection, and schema matching.
- **Zero-shot performance**: In the zero-shot setting, FMs outperform statistical methods and standard data repair engines on data imputation, but perform poorly on entity matching, indicating that task examples have a significant impact on performance.
- **Prompt tuning**: The choice of attributes, prompt format, and task examples each has an important effect on FM performance. Manually curated task examples can significantly improve performance compared with randomly selected examples.

### Conclusion

The paper demonstrates experimentally the potential of large-scale FMs for data cleaning and integration tasks, and points out the challenges and opportunities that matter in practical applications. Future research directions may include further optimizing prompt-tuning methods, improving the handling of domain-specific data, and developing more efficient data management systems.