Solving Data-centric Tasks using Large Language Models

Shraddha Barke, Christian Poelitz, Carina Suzana Negreanu, Benjamin Zorn, José Cambronero, Andrew D. Gordon, Vu Le, Elnaz Nouri, Nadia Polikarpova, Advait Sarkar, Brian Slininger, Neil Toronto, Jack Williams
2024-03-25
Abstract: Large language models (LLMs) are rapidly replacing help forums like StackOverflow, and are especially helpful for non-professional programmers and end users. These users are often interested in data-centric tasks, such as spreadsheet manipulation and data wrangling, which are hard to solve if the intent is only communicated using a natural-language description, without including the data. But how do we decide how much data and which data to include in the prompt? This paper makes two contributions towards answering this question. First, we create a dataset of real-world NL-to-code tasks manipulating tabular data, mined from StackOverflow posts. Second, we introduce a cluster-then-select prompting technique, which adds the most representative rows from the input data to the LLM prompt. Our experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table, our cluster-then-select technique outperforms a random selection baseline.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how to use large language models (LLMs) effectively for data-centric tasks, where the structure and content of the input data are crucial to solving the task. Specifically, it focuses on two questions:

1. **How much data, and which data, to include in the prompt**: For many data-processing tasks, describing the intent in natural language is not enough. Example data must also be provided so that the model can understand what the task requires. However, providing too much data can degrade performance or increase cost, so a balance is needed.
2. **How to select representative rows from a large dataset**: Real-world datasets are often too large to pass to the model in full. An effective method is therefore needed to select a small number of rows that represent the characteristics of the whole dataset.

To address these questions, the paper makes the following contributions:

- **A new dataset, SOFSET**: The dataset contains real-world NL-to-code tasks mined from StackOverflow, specifically tasks that manipulate tabular data.
- **A cluster-then-select prompting technique**: The technique first clusters rows according to the syntactic structure of the input data, then adds the most representative rows from each cluster to the prompt. Experiments show that it outperforms a random-selection baseline on tasks whose input tables have a lot of syntactic variation.
- **An analysis of LLM sensitivity to the amount, selection, and position of data in the prompt**: A series of experiments studies how different amounts and kinds of input data affect model performance, and shows that the data included in the prompt has a substantial effect on the quality of the generated solution.
These contributions help improve the performance of large language models on data-centric tasks, especially those involving complex multi-step computations and data manipulation.
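The cluster-then-select idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not specify how rows are clustered or how representatives are chosen, so the shape signature (`cell_shape`), the choice of the first row of each cluster as its representative, the largest-cluster-first ordering, and the prompt layout in `build_prompt` are all assumptions made for illustration.

```python
import re

# Hypothetical shape signature: runs of digits become "d", runs of ASCII
# letters become "a", runs of whitespace become a single space, and all
# other characters (punctuation, separators) are kept as-is.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+|\s+")


def cell_shape(value):
    """Map a cell to a coarse syntactic shape, e.g. '2024-03-25' -> 'd-d-d'."""
    def replace(match):
        first = match.group(0)[0]
        if first.isdigit():
            return "d"
        if first.isalpha():
            return "a"
        return " "
    return _TOKEN.sub(replace, str(value))


def cluster_then_select(rows, k):
    """Group rows by the tuple of their cell shapes, then pick one row per group.

    Assumption: the representative of a cluster is its first row, and larger
    clusters are preferred when there are more clusters than the budget k.
    """
    clusters = {}
    for row in rows:
        key = tuple(cell_shape(cell) for cell in row)
        clusters.setdefault(key, []).append(row)
    # sorted() is stable, so clusters of equal size keep first-seen order.
    ordered = sorted(clusters.values(), key=len, reverse=True)
    return [members[0] for members in ordered[:k]]


def build_prompt(query, header, rows, k):
    """Assemble a prompt from the query plus k representative rows (layout is illustrative)."""
    lines = [", ".join(header)]
    lines += [", ".join(str(cell) for cell in row) for row in cluster_then_select(rows, k)]
    return "Task: " + query + "\nSample rows:\n" + "\n".join(lines)


# Toy table with three syntactic variants across the two columns.
rows = [
    ("John Smith", "2021-01-05"),
    ("Ann Lee", "2020-11-30"),
    ("Smith, John", "2021-01-05"),
    ("Bob Ray", "05/01/2021"),
    ("Eve Poe", "2019-07-14"),
]
selected = cluster_then_select(rows, 3)
prompt = build_prompt("Extract the last name", ("name", "date"), rows, 3)
```

On this toy table, a budget of three rows covers all three syntactic variants (the plain "First Last" rows, the "Last, First" row, and the slash-separated date), whereas a random sample of three rows could easily miss the rarer formats.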