Abstract: OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach to human-model interaction in artificial intelligence. Several publications evaluating ChatGPT test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and conducted on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. The remaining tasks require more objective reasoning, such as word sense disambiguation, linguistic acceptability, and question answering. We also evaluated the GPT-4 model on five selected subsets of NLP tasks. We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of ChatGPT's results with available state-of-the-art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For the GPT-4 model, the loss on semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (the lower the SOTA performance), the higher the ChatGPT loss. This applies especially to pragmatic NLP problems such as emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
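
The abstract reports an average "loss in quality" relative to SOTA without stating a formula. One plausible reading, assumed here purely for illustration, is the relative gap between the SOTA score and the model score, expressed as a percentage of the SOTA score and averaged over tasks. The function names and the task scores below are hypothetical placeholders, not values or definitions taken from the paper.

```python
def relative_loss(sota: float, model: float) -> float:
    """Percentage by which the model score falls short of the SOTA score.

    Assumed definition: 100 * (sota - model) / sota.
    """
    if sota <= 0:
        raise ValueError("sota score must be positive")
    return 100.0 * (sota - model) / sota


def average_loss(scores: dict[str, tuple[float, float]]) -> float:
    """Mean relative loss over tasks; each value is (sota_score, model_score)."""
    if not scores:
        raise ValueError("at least one task is required")
    losses = [relative_loss(sota, model) for sota, model in scores.values()]
    return sum(losses) / len(losses)


# Hypothetical placeholder scores, NOT results from the paper.
example_scores = {
    "task_a": (80.0, 60.0),  # relative loss: 25%
    "task_b": (50.0, 40.0),  # relative loss: 20%
}

if __name__ == "__main__":
    print(f"average loss: {average_loss(example_scores):.1f}%")
```

Under this assumed definition, a task with a low SOTA score yields a large percentage loss even for a modest absolute gap, which is worth keeping in mind when reading the claim that harder tasks show higher loss.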

WildChat: 1M ChatGPT Interaction Logs in the Wild

Pchatbot: A Large-Scale Dataset for Personalized Chatbot

WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild

Early ChatGPT User Portrait through the Lens of Data

ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time

ChatGPT Role-play Dataset: Analysis of User Motives and Model Naturalness

LiveChat: A Large-Scale Personalized Dialogue Dataset Automatically Constructed from Live Streaming

Unveiling Security, Privacy, and Ethical Concerns of ChatGPT

The public attitude towards ChatGPT on reddit: A study based on unsupervised learning from sentiment analysis and topic modeling

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

ChatGPT: perspectives from human-computer interaction and psychology

Can ChatGPT provide a better support: a comparative analysis of ChatGPT and dataset responses in mental health dialogues

Public Attitudes Toward ChatGPT on Twitter: Sentiments, Topics, and Occupations

ChatGPT in healthcare: A taxonomy and systematic review

Exploring the Impact of ChatGPT on Wikipedia Engagement

ChatGPT: Jack of all trades, master of none

ChatGPT in Data Visualization Education: A Student Perspective

From Unstructured Data to Insights: Understanding the Role of ChatGPT in the Rising Trend of AI Chatbots in Web Publications

"The ChatGPT bot is causing panic now – but it'll soon be as mundane a tool as Excel": analysing topics, sentiment and emotions relating to ChatGPT on Twitter

Deceptive AI Ecosystems: The Case of ChatGPT

How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection