Do LLMs Think Fast and Slow? A Causal Study on Sentiment Analysis

Zhiheng Lyu, Zhijing Jin, Fernando Gonzalez, Rada Mihalcea, Bernhard Schölkopf, Mrinmaya Sachan
2024-10-28
Abstract: Sentiment analysis (SA) aims to identify the sentiment expressed in a text, such as a product review. Given a review and the sentiment associated with it, this work formulates SA as a combination of two tasks: (1) a causal discovery task that distinguishes whether a review "primes" the sentiment (Causal Hypothesis C1), or the sentiment "primes" the review (Causal Hypothesis C2); and (2) the traditional prediction task to model the sentiment using the review as input. Using the peak-end rule in psychology, we classify a sample as C1 if its overall sentiment score approximates an average of all the sentence-level sentiments in the review, and C2 if the overall sentiment score approximates an average of the peak and end sentiments. For the prediction task, we use the discovered causal mechanisms behind the samples to improve LLM performance by proposing causal prompts that give the models an inductive bias of the underlying causal graph, leading to substantial improvements by up to 32.13 F1 points on zero-shot five-class SA. Our code is at https://github.com/cogito233/causal-sa
Computation and Language
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) can improve their performance on sentiment analysis (SA) through causal alignment. Specifically, the authors propose a dual-task framework that divides sentiment analysis into two parts:

1. **Causal Discovery Task**: Identify the causal relationship between reviews and sentiment, that is, whether the review "causes" the sentiment (Causal Hypothesis C1) or the sentiment "causes" the review (Causal Hypothesis C2). The authors use the peak-end rule from psychology to classify samples. If the overall sentiment score is close to the average of all sentence-level sentiments, the sample is classified as C1; if it is close to the average of the peak and end sentiments, the sample is classified as C2.
2. **Prediction Task**: Model the sentiment using the review as input. The authors propose causal prompts that give the model an inductive bias toward the underlying causal graph, improving LLM performance on zero-shot five-class sentiment analysis.

### Main contributions of the paper:

1. **Propose the dual nature of sentiment analysis**: Decompose the sentiment analysis task into a causal discovery task and a prediction task.
2. **Conduct causal discovery based on psychological theory**: Use the peak-end rule to identify two possible causal processes.
3. **Design causal prompts to improve model performance**: Experiments show that causal prompts can substantially improve LLM performance on sentiment analysis, with a maximum increase of 32.13 F1 points.

### Experimental results:

- **Performance under standard prompts**: Existing LLMs perform better on the C2 dataset, indicating that their decision-making pattern is closer to the fast-thinking system, that is, judging the overall sentiment from the peak and end of the sentiment arc.
- **Effect of causal prompts**: Models perform best when the C2 prompt is used on the C2 dataset, with significant improvements in both F1 score and accuracy. For example, the F1 score of GPT-2 increases by 32.13 points, and the F1 score of GPT-4 increases by 14.23 points.
- **Model's understanding of causal stories**: Although the C2 prompt performs well, in some cases the C1 prompt does not significantly improve performance on the C1 dataset, which raises the question of whether LLMs truly understand the mechanisms behind these causal prompts.

### Conclusion:

The paper improves the performance of LLMs on sentiment analysis through causal alignment, especially on the C2 dataset. On the C1 dataset, however, there is still room for improvement, and future research can further explore how to make LLMs better understand and apply causal prompts.
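
The C1/C2 assignment rule described above can be illustrated with a short sketch. It is not the authors' implementation (their code is in the linked repository), and it makes several assumptions that the text does not state: sentence-level sentiments are numeric scores on a 1 to 5 scale, the "peak" is the sentence score farthest from an assumed neutral midpoint of 3, and a sample is assigned to whichever average lies closer to its overall score. The names `peak_end_score` and `classify_sample` are hypothetical.

```python
from statistics import mean

NEUTRAL = 3.0  # assumption: midpoint of a 1-5 sentiment scale


def peak_end_score(sentence_scores, neutral=NEUTRAL):
    """Average of the peak score and the final sentence score.

    Assumption: the peak is the score farthest from the neutral midpoint.
    """
    peak = max(sentence_scores, key=lambda s: abs(s - neutral))
    return (peak + sentence_scores[-1]) / 2


def classify_sample(overall, sentence_scores, neutral=NEUTRAL):
    """Return "C1", "C2", or "tie" for one review.

    C1: overall score is closer to the mean of all sentence scores.
    C2: overall score is closer to the peak-end average.
    """
    if not sentence_scores:
        raise ValueError("need at least one sentence-level score")
    dist_c1 = abs(overall - mean(sentence_scores))
    dist_c2 = abs(overall - peak_end_score(sentence_scores, neutral))
    if dist_c1 < dist_c2:
        return "C1"
    if dist_c2 < dist_c1:
        return "C2"
    return "tie"


if __name__ == "__main__":
    scores = [3, 3, 3, 1, 3]  # mostly neutral review with one very negative sentence
    print(classify_sample(3, scores))  # overall matches the plain average more closely
    print(classify_sample(2, scores))  # overall matches the peak-end average
```

A real pipeline would also need a sentence-level sentiment scorer and some tolerance for what counts as "approximates"; both are left out here because the summary does not specify them.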