LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Nathalia Nascimento, Everton Guimaraes, Sai Sanjna Chintakunta, Santhosh Anitha Boominathan
2024-11-17
Abstract: The adoption of Large Language Models (LLMs) for code generation in data science offers substantial potential for enhancing tasks such as data manipulation, statistical analysis, and visualization. However, the effectiveness of these models in the data science domain remains underexplored. This paper presents a controlled experiment that empirically assesses the performance of four leading LLM-based AI assistants, namely Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct), on a diverse set of data science coding challenges sourced from the StrataScratch platform. Using the Goal-Question-Metric (GQM) approach, we evaluated each model's effectiveness across task types (Analytical, Algorithm, Visualization) and varying difficulty levels. Our findings reveal that all models exceeded a 50% baseline success rate, confirming their capability beyond random chance. Notably, only ChatGPT and Claude achieved success rates significantly above a 60% baseline, though none of the models reached a 70% threshold, indicating limitations at higher standards. ChatGPT demonstrated consistent performance across varying difficulty levels, while Claude's success rate fluctuated with task complexity. Hypothesis testing indicates that task type does not significantly impact success rate overall. For analytical tasks, efficiency analysis shows no significant differences in execution times, though ChatGPT tended to be slower and less predictable despite high success rates. This study provides a structured, empirical evaluation of LLMs in data science, delivering insights that support informed model selection tailored to specific task demands. Our findings establish a framework for future AI assessments, emphasizing the value of rigorous evaluation beyond basic accuracy measures.
Software Engineering, Artificial Intelligence, Emerging Technologies
What problem does this paper attempt to address?
This paper evaluates how effectively large language models (LLMs) generate data science code. In a controlled experiment, the researchers systematically assessed four leading LLM-based AI assistants, namely Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct), on data science coding challenges from the StrataScratch platform. The challenges span three task types (analytical, algorithm, visualization) and three difficulty levels (easy, medium, hard). The main objectives of the study are:

1. **Evaluate the overall performance of LLMs in data science coding tasks**: Measure each model's effectiveness as the proportion of challenges for which it generates a correct solution.
2. **Compare the performance differences between different LLMs**: Analyze how the models perform across task types and difficulty levels, and determine which models do better on specific tasks.
3. **Explore the influence of task types and difficulty levels on model performance**: Test whether task type (analytical, algorithm, visualization) and difficulty level (easy, medium, hard) significantly affect a model's success rate.
4. **Analyze the efficiency and quality of the generated code**: For analytical tasks, evaluate the execution time of the generated code; for visualization tasks, evaluate the similarity between the generated chart and the expected result.

Through these evaluations, the researchers aim to provide a structured, empirical framework to support future improvements and research in AI-assisted data science.
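The comparison of success rates against the 50%, 60%, and 70% baselines can be illustrated with a one-sided test on a proportion. The sketch below is a minimal illustration, not the authors' procedure: the text above does not say which statistical test was used, so the exact binomial test here is an assumption, the function names `binom_pvalue_greater` and `exceeds_baseline` are hypothetical, and the counts are made up for demonstration rather than taken from the paper.

```python
from math import comb


def binom_pvalue_greater(successes: int, trials: int, baseline: float) -> float:
    """One-sided exact binomial p-value: P(X >= successes) under H0: p = baseline."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    if not 0.0 <= baseline <= 1.0:
        raise ValueError("baseline must lie in [0, 1]")
    # Sum the binomial probability mass from the observed count up to `trials`.
    return sum(
        comb(trials, k) * baseline**k * (1.0 - baseline) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def exceeds_baseline(successes: int, trials: int, baseline: float,
                     alpha: float = 0.05) -> bool:
    """True if the success rate is significantly above `baseline` at level `alpha`."""
    return binom_pvalue_greater(successes, trials, baseline) < alpha


if __name__ == "__main__":
    # Hypothetical counts for illustration only; NOT results from the paper.
    successes, trials = 70, 100
    for baseline in (0.5, 0.6, 0.7):
        p_value = binom_pvalue_greater(successes, trials, baseline)
        verdict = "above" if p_value < 0.05 else "not shown to be above"
        print(f"baseline {baseline:.0%}: p = {p_value:.4f} -> {verdict} baseline")
```

Under this setup, a model can clear a lower baseline and still fail a higher one, which is the pattern the study reports: every model passed the 50% test, while none passed the 70% test. The 0.05 significance level is likewise an assumption of this sketch.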