ESG Sentiment Analysis: comparing human and language model performance including GPT

Karim Derrick
2024-02-26
Abstract: In this paper we explore the challenges of measuring sentiment in relation to Environmental, Social and Governance (ESG) social media. ESG has grown in importance in recent years with a surge in interest from the financial sector, and the performance of many businesses has become based in part on their ESG-related reputations. The use of sentiment analysis to measure ESG-related reputation has developed, and with it interest in the use of machines to do so. The era of digital media has created an explosion of new media sources, driven by the growth of social media platforms. This growing data environment has become an excellent source for behavioural insight studies across many disciplines, including politics, healthcare and market research. Our study seeks to compare human performance with the cutting edge in machine performance in the measurement of ESG-related sentiment. To this end, researchers classify the sentiment of 150 tweets and a reliability measure is made. A gold standard data set is then established based on the consensus of 3 researchers, and this data set is then used to measure the performance of different machine approaches: one based on the VADER dictionary approach to sentiment classification and then multiple language model approaches, including Llama2, T5, Mistral, Mixtral, FinBERT, GPT3.5 and GPT4.
Computational Engineering, Finance, and Science; Computation and Language; Computers and Society
What problem does this paper attempt to address?
This paper discusses the challenges of sentiment analysis on social media related to environmental, social, and governance (ESG) topics. The study compares human judgment with machine methods, including advanced language models such as GPT, in measuring ESG sentiment. With the development of digital media, social media has generated a large amount of data, which has become an important source of behavioral insight across multiple disciplines. Researchers classify the sentiment of 150 tweets, and a gold standard dataset is established from the consensus of three researchers. That dataset is then used to evaluate different machine methods: a dictionary-based method (VADER) and various language models (such as GPT3.5 and GPT4). The study focuses on two questions: how consistent human sentiment analysis is, and whether large language models can improve performance.

The author points out that sentiment analysis aims to identify whether a text expresses a positive or negative attitude towards a specific topic, but that measuring it is not straightforward because expressions of sentiment can be subjective and ambiguous. Traditional dictionary-based methods often perform poorly, while machine learning methods, especially pre-trained language models such as the GPT series, perform better on certain tasks. However, using a gold standard (i.e., human judgment) as the benchmark for machine methods is itself problematic, because the reliability of human judgment is often low.

The study finds that large language models perform strongly at sentiment analysis, whereas the dictionary-based method produces poorer results. GPT4 excels in accuracy, recall, and F1 score, while other models such as FinBERT struggle to identify negative or neutral sentiment. The paper discusses these results and suggests that, as language model performance improves, the models may surpass average human performance, which raises the question of whether human judgment should continue to be treated as the benchmark.
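
The evaluation pipeline described above (several annotators, a reliability measure, a consensus gold standard, then per-class scoring of a model) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say which reliability statistic was used or how consensus was reached, so Fleiss' kappa and a simple majority vote are stand-ins chosen here. The three-way label set, the function names, and the toy ratings and predictions are all hypothetical.

```python
from collections import Counter

# Assumed label set; the summary mentions positive, negative and neutral sentiment.
LABELS = ("negative", "neutral", "positive")


def majority_label(votes):
    """Return the label most annotators chose, or None if the top count is tied."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def fleiss_kappa(ratings, labels=LABELS):
    """Fleiss' kappa for items that were each rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c[label] ** 2 for label in labels) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each label.
    shares = [sum(c[label] for c in counts) / (n_items * n_raters) for label in labels]
    p_expected = sum(s * s for s in shares)
    if p_expected == 1.0:  # every rating is the same label; kappa is undefined
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def per_class_scores(gold, predicted, labels=LABELS):
    """Return {label: (precision, recall, f1)} for predictions against gold labels."""
    scores = {}
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    # Invented toy data: one tuple of three annotator labels per tweet.
    ratings = [
        ("positive", "positive", "positive"),
        ("negative", "negative", "neutral"),
        ("neutral", "neutral", "neutral"),
        ("positive", "neutral", "negative"),  # three-way tie, no consensus
        ("negative", "negative", "negative"),
    ]
    # Invented model output, one label per tweet in the same order.
    model_output = ["positive", "neutral", "neutral", "negative", "negative"]

    print("Fleiss' kappa:", round(fleiss_kappa(ratings), 3))

    # Keep only tweets where the annotators reached a majority.
    kept = [
        (majority_label(votes), pred)
        for votes, pred in zip(ratings, model_output)
        if majority_label(votes) is not None
    ]
    gold = [g for g, _ in kept]
    predicted = [p for _, p in kept]
    for label, (p, r, f) in per_class_scores(gold, predicted).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Dropping tied items is one possible design choice; a study could instead resolve ties through discussion among annotators. Reporting scores per class rather than as a single average makes it easier to spot a model that handles positive tweets well but struggles with negative or neutral ones, which is the pattern the summary reports for FinBERT.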