Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H. Gilpin
2023-10-17
Abstract: Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper asks how good the self-explanations produced by large language models (LLMs) such as ChatGPT are, focusing on feature attribution explanations for sentiment analysis. The central question is whether these explanations faithfully reflect the model's prediction, for example whether the words the model cites as driving the sentiment of a review are the ones that actually influence its output. The authors also examine how self-explanations agree or disagree with traditional explanation methods such as occlusion and Local Interpretable Model-agnostic Explanations (LIME).

Specifically, the research covers the following aspects:

1. **Experimental Design**: Self-explanations are elicited in two ways: Explain-then-Predict (E-P) and Predict-then-Explain (P-E). In both, a specific prompt asks the model to produce an explanation together with its prediction, differing only in the order of the two.
2. **Baselines**: In addition to the model's self-explanations, the traditional explanation methods occlusion and LIME are used as baselines for comparison.
3. **Evaluation Metrics**: To evaluate faithfulness quantitatively, the paper uses several metrics, including comprehensiveness, sufficiency, and the decision flip rate after removing important words, and compares self-explanations with the traditional methods on each.
4. **Findings and Conclusions**: ChatGPT's self-explanations perform comparably to the traditional methods on the faithfulness metrics, yet differ substantially from them according to agreement metrics. They are also much cheaper to produce, since they are generated along with the prediction.

The authors further argue that current model interpretability practices may need rethinking for ChatGPT-like language models with human-like reasoning capabilities. In summary, the paper examines how reliable LLM-generated self-explanations are and how they relate to traditional explanation techniques, with the aim of informing future work on model transparency and interpretability.
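To make the occlusion baseline and the faithfulness metrics named above more concrete, here is a minimal Python sketch. It is not the paper's code or setup: the lexicon-based `toy_positive_prob` classifier, the word lists, and all function names are hypothetical stand-ins for a real model, and the metric definitions follow the common removal-based formulation (change in the predicted class's probability after deleting or keeping the highlighted words), which may differ in detail from the paper's.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical toy lexicon standing in for a real sentiment model (assumption).
POSITIVE = {"fantastic", "memorable", "great"}
NEGATIVE = {"boring", "awful", "dull"}

Predictor = Callable[[Sequence[str]], float]


def toy_positive_prob(words: Sequence[str]) -> float:
    """Toy classifier: probability that the text is positive."""
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


def predicted_label(words: Sequence[str], predict: Predictor) -> int:
    """1 for positive, 0 for negative."""
    return 1 if predict(words) >= 0.5 else 0


def class_prob(words: Sequence[str], label: int, predict: Predictor) -> float:
    """Probability the model assigns to the given label."""
    p = predict(words)
    return p if label == 1 else 1.0 - p


def occlusion_saliency(words: Sequence[str], predict: Predictor) -> List[float]:
    """Saliency of word i = drop in predicted-class probability when it is removed."""
    label = predicted_label(words, predict)
    base = class_prob(words, label, predict)
    return [
        base - class_prob(list(words[:i]) + list(words[i + 1:]), label, predict)
        for i in range(len(words))
    ]


def top_k(saliency: Sequence[float], k: int) -> List[int]:
    """Indices of the k most salient words, returned in text order."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(order[:k])


def comprehensiveness(words: Sequence[str], important: Sequence[int], predict: Predictor) -> float:
    """Probability drop after removing the important words (higher = more faithful)."""
    label = predicted_label(words, predict)
    rest = [w for i, w in enumerate(words) if i not in important]
    return class_prob(words, label, predict) - class_prob(rest, label, predict)


def sufficiency(words: Sequence[str], important: Sequence[int], predict: Predictor) -> float:
    """Probability drop when keeping only the important words (lower = more faithful)."""
    label = predicted_label(words, predict)
    kept = [w for i, w in enumerate(words) if i in important]
    return class_prob(words, label, predict) - class_prob(kept, label, predict)


def decision_flipped(words: Sequence[str], important: Sequence[int], predict: Predictor) -> bool:
    """True if removing the important words changes the predicted label."""
    rest = [w for i, w in enumerate(words) if i not in important]
    return predicted_label(rest, predict) != predicted_label(words, predict)


if __name__ == "__main__":
    review = "a fantastic and memorable film".split()
    saliency = occlusion_saliency(review, toy_positive_prob)
    important = top_k(saliency, k=2)
    print([review[i] for i in important])
    print(comprehensiveness(review, important, toy_positive_prob))
    print(sufficiency(review, important, toy_positive_prob))
```

A self-explanation can be scored with the same three functions by passing the indices of the words the model itself lists as important in place of the occlusion-derived `important`, which is roughly how the two kinds of explanation become comparable on the same metrics.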