Abstract: This paper explores the use of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. While LLMs have sparked interest in universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, which raises unavoidable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7B to 13B models) as universal DP task solvers that run on a single, local, low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction-tuning data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, we obtain our models, Jellyfish-7B/8B/13B, which are competitive with GPT-3.5/4 models and generalize strongly to unseen tasks while barely compromising the base models' abilities in NLP tasks. Jellyfish also offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at: https://huggingface.co/NECOUDBFM/Jellyfish . Our instruction dataset is available at: https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct .

What problem does this paper attempt to address?

The paper addresses challenges in data preprocessing (DP) by applying large language models (LLMs). Specifically, the research team developed a family of LLMs named "Jellyfish," designed to serve as general-purpose solvers for DP tasks. The key issues the paper attempts to address are as follows:

1. **Data Security and Privacy Issues**: Many current LLM-based DP solutions rely on services like the GPT API, raising concerns about data leakage. The Jellyfish models run locally on a single, cost-effective GPU, so the data never has to leave the user's machine.
2. **Improving Data Preprocessing Efficiency and Flexibility**: Existing DP tools often require specific programming skills or specialized tooling. Jellyfish lets users perform DP tasks through a natural language interface, and the model can be customized through user instructions to meet specific task requirements.
3. **Generality and Scalability**: Jellyfish is designed as a versatile DP solution that handles several task types, including Error Detection (ED), Data Imputation (DI), Schema Matching (SM), and Entity Matching (EM). Beyond the tasks it was tuned on, it also generalizes to unseen tasks.
4. **Reasoning Ability and Interpretability**: Jellyfish provides not only the result of a DP task but also a natural language explanation for that result, making the output easier to understand.
5. **Avoiding Limitations of Existing LLMs**: Existing LLMs applied to DP suffer from high resource consumption, input length limitations that lead to consistency issues, and factual errors (hallucinations). Jellyfish addresses these through techniques such as knowledge injection, which reduces factual errors.

In summary, the paper aims to develop an efficient, flexible, and secure LLM, Jellyfish, to tackle these challenges in data preprocessing. Experiments on multiple DP tasks show that Jellyfish is competitive with GPT-3.5/4.
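
To make the natural language interface concrete, the following is a minimal sketch of how an Entity Matching (EM) request could be posed to a locally hosted model. It is an illustration only: the prompt wording, the record serialization, and the helper names `serialize_record`, `build_em_prompt`, and `parse_yes_no` are assumptions made for this example and are not taken from the paper or from the Jellyfish model card. The model call itself is left out; the sketch covers only building the prompt and interpreting a yes/no reply.

```python
from typing import Mapping, Optional


def serialize_record(record: Mapping[str, object]) -> str:
    """Render a record as 'attribute: value' pairs; missing values become 'N/A'.

    Hypothetical serialization, chosen for readability in this sketch.
    """
    parts = []
    for key, value in record.items():
        text = "N/A" if value is None or value == "" else str(value)
        parts.append(f"{key}: {text}")
    return "[" + ", ".join(parts) + "]"


def build_em_prompt(left: Mapping[str, object], right: Mapping[str, object]) -> str:
    """Build an entity-matching instruction (assumed wording, not the paper's)."""
    return (
        "You are given two records. Decide whether they refer to the same "
        "real-world entity.\n"
        f"Record A: {serialize_record(left)}\n"
        f"Record B: {serialize_record(right)}\n"
        "Answer with 'Yes' or 'No'."
    )


def parse_yes_no(answer: str) -> Optional[bool]:
    """Map the first word of a model reply to True/False, or None if unclear."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,!:;'\"") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None
```

The string returned by `build_em_prompt` would be passed to whichever local inference stack hosts the model, and `parse_yes_no` turns the reply into a label. Returning `None` for anything other than a clear "Yes" or "No" keeps ambiguous replies from being silently counted as matches or non-matches.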

Jellyfish: A Large Language Model for Data Preprocessing

Large Language Models as Data Preprocessors

Petals: Collaborative Inference and Fine-tuning of Large Models

TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise

AlpaGasus: Training A Better Alpaca with Fewer Data

Data-Prep-Kit: getting your data ready for LLM application development

OceanGPT: A Large Language Model for Ocean Science Tasks

CoLLiE: Collaborative Training of Large Language Models in an Efficient Way

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Platypus: Quick, Cheap, and Powerful Refinement of LLMs

OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

Pedagogical Alignment of Large Language Models

Goldfish: Monolingual Language Models for 350 Languages

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Does your data spark joy? Performance gains from domain upsampling at the end of training

PolyLM: An Open Source Polyglot Large Language Model

EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models