Abstract: The adoption of Large Language Models (LLMs) for code generation in data science offers substantial potential for enhancing tasks such as data manipulation, statistical analysis, and visualization. However, the effectiveness of these models in the data science domain remains underexplored. This paper presents a controlled experiment that empirically assesses the performance of four leading LLM-based AI assistants, namely Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct), on a diverse set of data science coding challenges sourced from the StrataScratch platform. Using the Goal-Question-Metric (GQM) approach, we evaluated each model's effectiveness across task types (Analytical, Algorithm, Visualization) and difficulty levels. Our findings reveal that all models exceeded a 50% baseline success rate, confirming their capability beyond random chance. Notably, only ChatGPT and Claude achieved success rates significantly above a 60% baseline, and none of the models reached a 70% threshold, indicating limitations in meeting higher standards. ChatGPT performed consistently across difficulty levels, while Claude's success rate fluctuated with task complexity. Hypothesis testing indicates that task type does not significantly affect the overall success rate. For analytical tasks, efficiency analysis shows no significant differences in execution times, though ChatGPT tended to be slower and less predictable despite its high success rates. This study provides a structured, empirical evaluation of LLMs in data science, delivering insights that support informed model selection tailored to specific task demands. Our findings establish a framework for future AI assessments, emphasizing the value of rigorous evaluation beyond basic accuracy measures.
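The baseline comparisons described in the abstract (50%, 60%, 70%) can be illustrated with a one-sided exact binomial test. The abstract does not say which statistical test the authors used, so the sketch below is an assumption, and the counts in it (70 solved out of 100 attempted) are hypothetical placeholders, not results from the study. The function names `binomial_tail` and `exceeds_baseline` are likewise illustrative.

```python
from math import comb


def binomial_tail(successes: int, trials: int, baseline: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, baseline).

    This is the one-sided p-value for the null hypothesis that the true
    success rate equals `baseline`, against the alternative that it is higher.
    """
    return sum(
        comb(trials, k) * baseline**k * (1 - baseline) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def exceeds_baseline(successes: int, trials: int, baseline: float,
                     alpha: float = 0.05) -> bool:
    """True if the observed success rate is significantly above `baseline`."""
    return binomial_tail(successes, trials, baseline) < alpha


if __name__ == "__main__":
    # Hypothetical counts for illustration only; NOT data from the paper.
    trials, successes = 100, 70
    for baseline in (0.5, 0.6, 0.7):
        p_value = binomial_tail(successes, trials, baseline)
        verdict = "above" if p_value < 0.05 else "not shown to be above"
        print(f"baseline {baseline:.0%}: p = {p_value:.4f} ({verdict})")
```

A model can clear a lower baseline convincingly and still fail a higher one: the same observed rate yields a small p-value against 50% but a large one against 70%, which is the pattern the abstract reports.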

LLM4DS: Evaluating Large Language Models for Data Science Code Generation
