Jellyfish: A Large Language Model for Data Preprocessing

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
2024-06-21
Abstract: This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7B to 13B models) as universal DP task solvers that operate on a local, single, and low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction tuning data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, namely, Jellyfish-7B/8B/13B, deliver competitiveness compared to GPT-3.5/4 models and strong generalizability to unseen tasks while barely compromising the base models' abilities in NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at https://huggingface.co/NECOUDBFM/Jellyfish and our instruction dataset is available at https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct
Artificial Intelligence, Computation and Language, Databases, Machine Learning
What problem does this paper attempt to address?
The paper addresses challenges in data preprocessing (DP) by leveraging large language models (LLMs). Specifically, the research team developed an LLM named "Jellyfish," designed to serve as a general-purpose solver for data preprocessing tasks. The key issues the paper attempts to address are as follows:

1. **Data Security and Privacy Issues**: Many current LLM-based data preprocessing solutions rely on services like the GPT API, raising concerns about data leakage. The Jellyfish model proposed in the paper runs locally and can operate on a single, cost-effective GPU, thereby ensuring data security and privacy.
2. **Improving Data Preprocessing Efficiency and Flexibility**: Existing data preprocessing solutions often require programming skills or specialized tools. Jellyfish allows users to perform data preprocessing tasks more intuitively through a natural language interface. Additionally, the model can be customized according to user instructions to meet specific task requirements.
3. **Generality and Scalability**: Jellyfish is designed as a versatile data preprocessing solution capable of handling various types of data preprocessing tasks, including Error Detection (ED), Data Imputation (DI), Schema Matching (SM), and Entity Matching (EM). It handles these known tasks and also generalizes to unseen tasks.
4. **Reasoning Ability and Interpretability**: Jellyfish possesses strong reasoning capabilities, providing not only the results of data preprocessing but also natural language explanations for these results, making the output easier to understand.
5. **Avoiding Limitations of Existing LLMs**: To address issues that arise when existing LLMs are applied to data preprocessing, such as high resource consumption, input length limitations leading to consistency issues, and factual errors (hallucinations), Jellyfish has been optimized through various techniques, such as knowledge injection to reduce factual errors.

In summary, the paper aims to develop an efficient, flexible, secure, and highly capable LLM, Jellyfish, to tackle various challenges in data preprocessing. According to the paper's experiments, Jellyfish is competitive with GPT-3.5/4 models on data preprocessing tasks and generalizes well to unseen tasks.
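
To make the natural language interface described above more concrete, the following is a minimal sketch of how two of the named tasks, Entity Matching (EM) and Data Imputation (DI), might be phrased as text prompts for a locally hosted model. The prompt wording, the `NULL` marker, and all function names here are illustrative assumptions, not the templates used to build the Jellyfish instruction data. The sketch stops at building prompts and parsing a reply, so it runs without a model or GPU.

```python
from typing import Dict, Optional

# Hypothetical record type: attribute name -> value (None marks a missing value).
Record = Dict[str, Optional[object]]


def serialize_record(record: Record) -> str:
    """Render a record as 'attr: value' pairs, writing NULL for missing values."""
    return ", ".join(
        f"{key}: {'NULL' if value is None else value}"
        for key, value in record.items()
    )


def entity_matching_prompt(a: Record, b: Record) -> str:
    """Build an EM prompt asking whether two records describe the same entity."""
    return (
        "Task: entity matching. Decide whether the two records refer to "
        "the same real-world entity.\n"
        f"Record A: [{serialize_record(a)}]\n"
        f"Record B: [{serialize_record(b)}]\n"
        "Answer with Yes or No."
    )


def data_imputation_prompt(record: Record, attribute: str) -> str:
    """Build a DI prompt asking for the value of one attribute of a record."""
    if attribute not in record:
        raise KeyError(f"unknown attribute: {attribute}")
    # Show only the other attributes; the target attribute is what we ask for.
    known = {k: v for k, v in record.items() if k != attribute}
    return (
        "Task: data imputation. Infer the missing value from the other "
        "attributes.\n"
        f"Record: [{serialize_record(known)}]\n"
        f"What is the value of '{attribute}'? Answer with the value only."
    )


def parse_yes_no(response: str) -> bool:
    """Map a free-text model reply to a boolean; reject anything ambiguous."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    raise ValueError(f"cannot interpret reply: {response!r}")


if __name__ == "__main__":
    left = {"name": "iPhone 13", "price": 799}
    right = {"name": "Apple iPhone 13", "price": None}
    print(entity_matching_prompt(left, right))
    print(data_imputation_prompt({"city": "Tokyo", "country": None}, "country"))
```

In this sketch the resulting prompt string would be sent to whichever local model is in use, and `parse_yes_no` would turn its reply into a label. Keeping serialization separate from the task wording means Error Detection or Schema Matching prompts could reuse `serialize_record` with a different instruction line.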