Abstract: User feedback has become a meaningful way to improve applications during software evolution. Recent studies show a significant increase in the analysis of end-user feedback from various social media platforms for software evolution. However, less attention has been given to end-user feedback on low-rated software applications. Moreover, such approaches are developed mainly on the understanding of human annotators, who might have subconsciously second-guessed their labels, which calls the validity of the methods into question. To address this, we propose an approach that analyzes end-user feedback for low-rated applications to identify the end-user opinion types associated with negative reviews (an issue or bug). We also use generative artificial intelligence (AI), i.e., ChatGPT, as an annotator and negotiator when preparing a truth set for the deep learning (DL) classifiers that identify end-user emotions. First, we used the ChatGPT Application Programming Interface (API) to identify negative end-user feedback by processing 71,853 reviews collected from 45 apps in the Amazon store. Next, we developed a novel grounded theory by manually processing negative end-user feedback to identify frequently associated emotion types, including anger, confusion, disgust, distrust, disappointment, fear, frustration, and sadness. We then developed two datasets with the identified emotion types, one labeled by human annotators using a content analysis approach and the other by the ChatGPT API. A further round with ChatGPT negotiated over the conflicts with the human-annotated dataset, resulting in a conflict-free emotion detection dataset. Finally, we evaluated the efficacy of various DL classifiers, including LSTM, BiLSTM, CNN, RNN, GRU, BiGRU, and BiRNN, in detecting end-users' emotions by preprocessing the input data, applying feature engineering, balancing the dataset, and then training and testing the classifiers using a cross-validation approach. We obtained average accuracies of 94%, 94%, 93%, 92%, 91%, 91%, and 85% with LSTM, BiLSTM, RNN, CNN, GRU, BiGRU, and BiRNN, respectively, showing improved results with the truth set curated by humans and ChatGPT. Using ChatGPT as an annotator and negotiator can help automate and validate the annotation process, resulting in better DL performance.
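
The negotiation step between the human-annotated and ChatGPT-annotated datasets can be illustrated with a minimal sketch. The abstract does not give the prompts, API calls, or resolution rules used, so everything below is an assumption for illustration: the function names (`find_conflicts`, `agreement_rate`, `resolve_conflicts`) and the `negotiate` callback, which stands in for the additional ChatGPT round, are hypothetical. Only the eight emotion labels come from the abstract.

```python
from typing import Callable, List

# Emotion types named in the abstract.
EMOTIONS = [
    "anger", "confusion", "disgust", "distrust",
    "disappointment", "fear", "frustration", "sadness",
]


def find_conflicts(human: List[str], model: List[str]) -> List[int]:
    """Return the indices of reviews where the two annotations disagree."""
    if len(human) != len(model):
        raise ValueError("both annotation lists must cover the same reviews")
    return [i for i, (h, m) in enumerate(zip(human, model)) if h != m]


def agreement_rate(human: List[str], model: List[str]) -> float:
    """Fraction of reviews on which the two annotations agree."""
    if not human:
        raise ValueError("annotation lists must not be empty")
    return 1.0 - len(find_conflicts(human, model)) / len(human)


def resolve_conflicts(
    human: List[str],
    model: List[str],
    negotiate: Callable[[int, str, str], str],
) -> List[str]:
    """Build a conflict-free label list.

    `negotiate` is a hypothetical callback (an assumption, not the paper's
    procedure) that receives the review index, the human label, and the
    model label, and returns the label to keep.
    """
    final = list(human)
    for i in find_conflicts(human, model):
        label = negotiate(i, human[i], model[i])
        if label not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {label!r}")
        final[i] = label
    return final


if __name__ == "__main__":
    human_labels = ["anger", "fear", "sadness", "anger"]
    model_labels = ["anger", "frustration", "sadness", "disgust"]

    # Placeholder policy: keep the human label on every conflict.
    keep_human = lambda i, h, m: h

    print(find_conflicts(human_labels, model_labels))
    print(agreement_rate(human_labels, model_labels))
    print(resolve_conflicts(human_labels, model_labels, keep_human))
```

In practice the `negotiate` callback would wrap whatever exchange with ChatGPT is used to settle a disagreement; the placeholder policy here only shows where that decision plugs in.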
