Abstract: The adoption of Large Language Models (LLMs) for code generation in data science offers substantial potential for enhancing tasks such as data manipulation, statistical analysis, and visualization. However, the effectiveness of these models in the data science domain remains underexplored. This paper presents a controlled experiment that empirically assesses the performance of four leading LLM-based AI assistants, namely Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct), on a diverse set of data science coding challenges sourced from the StrataScratch platform. Using the Goal-Question-Metric (GQM) approach, we evaluated each model's effectiveness across task types (Analytical, Algorithm, Visualization) and varying difficulty levels. Our findings reveal that all models exceeded a 50% baseline success rate, confirming their capability beyond random chance. Notably, only ChatGPT and Claude achieved success rates significantly above a 60% baseline, though none of the models reached a 70% threshold, indicating limitations in meeting higher standards. ChatGPT demonstrated consistent performance across varying difficulty levels, while Claude's success rate fluctuated with task complexity. Hypothesis testing indicates that task type does not significantly impact success rate overall. For analytical tasks, efficiency analysis shows no significant differences in execution times, though ChatGPT tended to be slower and less predictable despite high success rates. This study provides a structured, empirical evaluation of LLMs in data science, delivering insights that support informed model selection tailored to specific task demands. Our findings establish a framework for future AI assessments, emphasizing the value of rigorous evaluation beyond basic accuracy measures.
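The baseline comparisons in the abstract (50%, 60%, 70%) can be illustrated with a one-sided exact binomial test. The abstract does not say which test the authors used, so this is only a minimal sketch under that assumption. The function name `binomial_p_value_greater` and the counts `solved` and `attempted` are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_p_value_greater(successes: int, trials: int, baseline: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, baseline).

    A small value suggests the observed success rate is above the baseline.
    """
    return sum(
        comb(trials, k) * baseline**k * (1 - baseline) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts for illustration only; NOT results from the paper.
solved, attempted = 70, 100

for baseline in (0.5, 0.6, 0.7):
    p = binomial_p_value_greater(solved, attempted, baseline)
    verdict = "significantly above" if p < 0.05 else "not significantly above"
    print(f"baseline {baseline:.0%}: p = {p:.4f} ({verdict} at the 5% level)")
```

With counts like these, a model can clear a lower baseline convincingly while failing to clear a higher one, which is the pattern the abstract describes for the 60% and 70% thresholds.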

What problem does this paper attempt to address?

This paper evaluates how effectively large language models (LLMs) generate code for data science tasks. In a controlled experiment, the researchers systematically assessed four leading LLM-based AI assistants, namely Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct), on data science coding challenges from the StrataScratch platform. The challenges span three task types (Analytical, Algorithm, Visualization) and three difficulty levels (easy, medium, hard). The main objectives of the study are:

1. **Evaluate the overall performance of LLMs in data science coding tasks**: Measure each model's effectiveness as the proportion of challenges for which it generates a correct code solution.
2. **Compare the performance differences between different LLMs**: Analyze how the models perform across task types and difficulty levels, and determine which models perform better on specific tasks.
3. **Explore the influence of task types and difficulty levels on model performance**: Test whether task type (Analytical, Algorithm, Visualization) and difficulty level (easy, medium, hard) significantly affect a model's success rate.
4. **Analyze the efficiency and quality of the generated code**: For Analytical tasks, evaluate the execution time of the generated code; for Visualization tasks, evaluate the similarity between the generated chart and the expected result.

Through these evaluations, the researchers aim to provide a structured, empirical framework that supports future improvements and research in AI-assisted data science.
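The question of whether task type affects success rate can be sketched as a chi-square test of independence on a 3x2 table (three task types by success/failure). The summary does not name the test the authors used, so the choice of test here is an assumption. The function name `chi_square_independence_3x2` and all counts are hypothetical. With three rows and two columns the test has 2 degrees of freedom, for which the chi-square upper-tail probability has the closed form exp(-x/2), so no external library is needed.

```python
from math import exp


def chi_square_independence_3x2(table):
    """Chi-square test of independence for three (successes, failures) pairs.

    Returns (statistic, p_value). The p-value uses the closed-form upper tail
    of the chi-square distribution with 2 degrees of freedom: exp(-x / 2).
    Assumes every row and column total is non-zero.
    """
    if len(table) != 3:
        raise ValueError("expected exactly three (successes, failures) rows")
    row_totals = [s + f for s, f in table]
    col_totals = [sum(s for s, _ in table), sum(f for _, f in table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for (s, f), row_total in zip(table, row_totals):
        for observed, col_total in zip((s, f), col_totals):
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, exp(-statistic / 2)


# Hypothetical (successes, failures) counts per task type; NOT from the paper.
counts = {
    "Analytical": (30, 20),
    "Algorithm": (28, 22),
    "Visualization": (32, 18),
}
stat, p = chi_square_independence_3x2(list(counts.values()))
print(f"chi-square = {stat:.3f}, p = {p:.3f}")
```

A large p-value would be consistent with the reported finding that task type does not significantly affect success rate overall; a small one would suggest the opposite.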

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Large Language Model Evaluation Via Multi AI Agents: Preliminary results

Comparison of Large Language Models in Generating Machine Learning Curricula in High Schools

"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators

Evaluation of the Programming Skills of Large Language Models

Are Large Language Models Good Statisticians?

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

Efficiently Measuring the Cognitive Ability of LLMs: an Adaptive Testing Perspective

Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

Large Language Models as Code Executors: An Exploratory Study

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

Large Language Models as Data Preprocessors

Performance of Large Language Models in a Computer Science Degree Program

ChatGPT Alternative Solutions: Large Language Models Survey

The Battle of LLMs: A Comparative Study in Conversational QA Tasks

Large Language Models: Their Success and Impact

Large Language Models in Computer Science Education: A Systematic Literature Review

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

An evaluation of LLM code generation capabilities through graded exercises