Abstract: The adoption of Large Language Models (LLMs) for code generation in data science offers substantial potential for enhancing tasks such as data manipulation, statistical analysis, and visualization. However, the effectiveness of these models in the data science domain remains underexplored. This paper presents a controlled experiment that empirically assesses the performance of four leading LLM-based AI assistants, namely Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct), on a diverse set of data science coding challenges sourced from the StrataScratch platform. Using the Goal-Question-Metric (GQM) approach, we evaluated each model's effectiveness across task types (Analytical, Algorithm, Visualization) and difficulty levels. Our findings reveal that all models exceeded a 50% baseline success rate, confirming their capability beyond random chance. Only ChatGPT and Claude achieved success rates significantly above a 60% baseline, and none of the models reached a 70% threshold, which indicates that current assistants fall short of stricter standards. ChatGPT performed consistently across difficulty levels, while Claude's success rate fluctuated with task complexity. Hypothesis testing indicates that task type does not significantly affect overall success rate. For analytical tasks, efficiency analysis shows no significant differences in execution times among the models, though ChatGPT's solutions tended to be slower and less predictable despite their high success rates. This study provides a structured, empirical evaluation of LLMs in data science, delivering insights that support informed model selection tailored to specific task demands. Our findings establish a framework for future AI assessments, emphasizing the value of rigorous evaluation beyond basic accuracy measures.
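
The baseline comparisons described in the abstract (50%, 60%, 70%) can be illustrated with a one-sided exact binomial test. The abstract does not state which statistical test the authors used, so this is only a minimal sketch of one common approach. The function names, the 0.05 significance level, and the counts in the usage lines are hypothetical and are not taken from the study.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, baseline: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, baseline)."""
    return sum(
        comb(trials, k) * baseline**k * (1 - baseline) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def exceeds_baseline(successes: int, trials: int, baseline: float,
                     alpha: float = 0.05) -> bool:
    """One-sided test: is the observed success rate significantly above baseline?

    alpha=0.05 is an assumed significance level, not a value from the paper.
    """
    return binomial_upper_tail(successes, trials, baseline) < alpha


if __name__ == "__main__":
    # Hypothetical counts for illustration only: 70 solved out of 100 problems.
    solved, attempted = 70, 100
    for baseline in (0.5, 0.6, 0.7):
        p_value = binomial_upper_tail(solved, attempted, baseline)
        print(f"baseline={baseline:.0%}  p-value={p_value:.4f}  "
              f"significant={exceeds_baseline(solved, attempted, baseline)}")
```

A model can clear a lower baseline and still fail a higher one with the same counts, which is the pattern the abstract reports for the 60% and 70% thresholds.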

Exploring Large Language Models for Product Attribute Value Identification

Using LLMs for the Extraction and Normalization of Product Attribute Values

ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction

LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction

Apprentices to Research Assistants: Advancing Research with Large Language Models

Learn by Selling: Equipping Large Language Models with Product Knowledge for Context-Driven Recommendations

Large Language Models for Relevance Judgment in Product Search

Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation

Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs

Capture the Flag: Uncovering Data Insights with Large Language Models

An Empirical Comparison of Generative Approaches for Product Attribute-Value Identification

Attribute or Abstain: Large Language Models as Long Document Assistants

Augmented Large Language Models with Parametric Knowledge Guiding

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

LLM-Select: Feature Selection with Large Language Models

Large language models for aspect-based sentiment analysis

Large Language Models: A Survey

Making Large Language Models Interactive: A Pioneer Study on Supporting Complex Information-Seeking Tasks with Implicit Constraints

Aligning Large Language Models with Recommendation Knowledge

Large Language Models in Consumer Electronic Retail Industry: An AI Product Advisor