ESG Sentiment Analysis: comparing human and language model performance including GPT

Karim Derrick

2024-02-26

Abstract: In this paper we explore the challenges of measuring sentiment in relation to Environmental, Social and Governance (ESG) social media. ESG has grown in importance in recent years, with a surge in interest from the financial sector, and the performance of many businesses has come to depend in part on their ESG-related reputations. The use of sentiment analysis to measure ESG-related reputation has developed, and with it interest in using machines to do so. The era of digital media has created an explosion of new media sources, driven by the growth of social media platforms. This growing data environment has become an excellent source for behavioural insight studies across many disciplines, including politics, healthcare and market research. Our study seeks to compare human performance with the cutting edge in machine performance in the measurement of ESG-related sentiment. To this end, researchers classify the sentiment of 150 tweets and a reliability measure is made. A gold standard data set is then established based on the consensus of three researchers, and this data set is used to measure the performance of different machine approaches: one based on the VADER dictionary approach to sentiment classification, and multiple language model approaches, including Llama2, T5, Mistral, Mixtral, FinBERT, GPT3.5 and GPT4.
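
The annotation step described in the abstract (several researchers labelling each tweet, a reliability measure, then a consensus gold standard) can be illustrated with a minimal sketch. The abstract does not name the reliability statistic or the tie-breaking rule, so the choices here are assumptions: Fleiss' kappa stands in for the reliability measure, consensus is taken as a simple majority vote, and the three-label scheme and the function names `majority_label` and `fleiss_kappa` are hypothetical.

```python
from collections import Counter

# Assumed label scheme; the abstract does not list the classes used.
LABELS = ("negative", "neutral", "positive")


def majority_label(votes):
    """Return the label chosen by most annotators, or None when the top vote is tied."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no consensus; such items would need separate adjudication
    return counts[0][0]


def fleiss_kappa(ratings, labels=LABELS):
    """Fleiss' kappa for `ratings`: one list of labels per item, one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who gave item i the label labels[j]
    counts = [[item.count(lab) for lab in labels] for item in ratings]
    # Observed agreement for each item, then averaged over items
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Agreement expected by chance from the overall label proportions
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(labels))
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        return 1.0  # every rater used the same single label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data: four tweets, three annotators each (invented for illustration only)
ratings = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "neutral"],
    ["neutral", "positive", "negative"],
    ["negative", "negative", "negative"],
]
gold = [majority_label(item) for item in ratings]
kappa = fleiss_kappa(ratings)
```

With three annotators and three labels, the only possible tie is a three-way split, which this sketch leaves unlabelled rather than guessing.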

Computational Engineering, Finance, and Science; Computation and Language; Computers and Society

What problem does this paper attempt to address?

This paper discusses the challenges of sentiment analysis on environmental, social, and governance (ESG) related social media. The study compares human judgment with machine methods (including advanced language models such as GPT) in measuring ESG sentiment. With the development of digital media, social media has generated a large amount of data, which has become an important source of behavioral insight in multiple disciplines. Researchers classify 150 tweets by sentiment, and a gold standard dataset is established from the consensus of three researchers. The paper then uses this dataset to evaluate different machine methods, including a dictionary-based method (VADER) and various language models (such as GPT3.5 and GPT4). The study focuses on two questions: how consistent human sentiment analysis is, and whether large language models can improve performance.

The author points out that sentiment analysis aims to identify the positive or negative attitude of a text towards a specific topic, but that measuring it is not straightforward because expressions of sentiment can be subjective and ambiguous. Traditional dictionary-based methods often perform poorly, while machine learning methods, especially pre-trained language models such as the GPT series, perform better on certain tasks. However, using a gold standard (i.e., human judgment) as the benchmark for machine methods is itself problematic, because the reliability of human judgment is often low.

The study finds that large language models perform strongly at sentiment analysis, whereas the dictionary-based method gives poorer results. GPT4 excels in accuracy, recall, and F1 score, while other models such as FinBERT struggle to identify negative or neutral sentiment. The paper discusses these results, suggests that language models may surpass average human performance as they improve, and questions whether human judgment should continue to be treated as the benchmark.
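
To make the reported metrics concrete, the following is a minimal sketch of how accuracy, recall, and F1 score could be computed for one model's predictions against a gold standard. It is not the paper's evaluation code: the three-label scheme, the macro averaging, the function name `score_predictions`, and the toy data are all assumptions introduced for illustration.

```python
def score_predictions(gold, predicted, labels=("negative", "neutral", "positive")):
    """Compute accuracy plus per-class precision, recall and F1 against gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    per_class = {}
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in pairs)  # correctly predicted lab
        fp = sum(g != lab and p == lab for g, p in pairs)  # predicted lab wrongly
        fn = sum(g == lab and p != lab for g, p in pairs)  # missed a gold lab
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[lab] = {"precision": precision, "recall": recall, "f1": f1}

    # Macro average: every class counts equally, however rare it is (an assumption)
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return {"accuracy": accuracy, "per_class": per_class, "macro_f1": macro_f1}


# Toy example: six tweets with invented gold labels and model outputs
gold_labels = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
model_output = ["positive", "negative", "positive", "negative", "neutral", "neutral"]
report = score_predictions(gold_labels, model_output)
```

Reporting per-class recall alongside overall accuracy is what exposes the pattern described above, where a model scores acceptably overall yet misses most negative or neutral tweets.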
