Abstract: Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.

What problem does this paper attempt to address?

The paper evaluates the quality of the explanations that large language models (LLMs) such as ChatGPT generate alongside their predictions, focusing on feature attribution explanations for sentiment analysis. The central question is whether these self-explanations are faithful: whether the words the model cites as important for the sentiment of a text actually drive its prediction. The authors also examine how self-explanations agree with or differ from traditional explanation methods such as occlusion and Local Interpretable Model-agnostic Explanations (LIME).

Specifically, the research covers the following aspects:

1. **Experimental Design**: The paper elicits self-explanations in two ways: Explain-then-Predict (E-P) and Predict-then-Explain (P-E). In both, specific prompts guide the model to produce an explanation and a prediction, differing in which comes first.
2. **Comparison Objects**: Traditional explanation methods, namely occlusion and LIME, serve as baselines against which the model's self-explanations are compared.
3. **Evaluation Metrics**: To evaluate the explanations quantitatively, the paper uses several metrics, including comprehensiveness, sufficiency, and the decision flip rate after removing important words, and compares self-explanations with the traditional methods on each.
4. **Findings and Conclusions**: ChatGPT's self-explanations perform comparably to traditional methods on the evaluation metrics, yet the two kinds of explanation differ substantially according to agreement metrics. Self-explanations are also much cheaper to produce, since they are generated along with the prediction.

Furthermore, the authors point out that traditional model interpretability practices may not be well suited to language models like ChatGPT, which exhibit human-like reasoning capabilities.

In summary, the paper aims to understand how reliable and effective LLM-generated self-explanations are and how they relate to traditional explanation techniques, providing guidance for improving the transparency and interpretability of models.
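
The occlusion baseline and the removal-based faithfulness metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `toy_model` is a hypothetical lexicon-based classifier standing in for ChatGPT, and the comprehensiveness and sufficiency functions follow the commonly used ERASER-style definitions (change in the predicted-class probability after removing, or keeping only, the top-k words), which may differ in detail from the variants used in the paper.

```python
import math
from typing import Callable, List

# Hypothetical stand-in for the real model: returns P(positive) for a token list.
Model = Callable[[List[str]], float]

LEXICON = {"fantastic": 2.0, "memorable": 1.5, "good": 1.0,
           "boring": -2.0, "awful": -2.0, "bad": -1.0}


def toy_model(tokens: List[str]) -> float:
    """Toy sentiment classifier (assumption): sigmoid of summed lexicon scores."""
    score = sum(LEXICON.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(model: Model, tokens: List[str]) -> int:
    """1 = positive, 0 = negative."""
    return int(model(tokens) >= 0.5)


def label_prob(model: Model, tokens: List[str], label: int) -> float:
    """Probability the model assigns to the given label."""
    p = model(tokens)
    return p if label == 1 else 1.0 - p


def occlusion_saliency(model: Model, tokens: List[str]) -> List[float]:
    """Saliency of word i = drop in predicted-class probability when word i is removed."""
    label = predict_label(model, tokens)
    base = label_prob(model, tokens, label)
    return [base - label_prob(model, tokens[:i] + tokens[i + 1:], label)
            for i in range(len(tokens))]


def top_k(saliency: List[float], k: int) -> set:
    """Indices of the k highest-saliency words."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return set(order[:k])


def comprehensiveness(model: Model, tokens: List[str],
                      saliency: List[float], k: int) -> float:
    """Probability drop after REMOVING the top-k words (higher = more faithful)."""
    label = predict_label(model, tokens)
    top = top_k(saliency, k)
    rest = [t for i, t in enumerate(tokens) if i not in top]
    return label_prob(model, tokens, label) - label_prob(model, rest, label)


def sufficiency(model: Model, tokens: List[str],
                saliency: List[float], k: int) -> float:
    """Probability drop when KEEPING ONLY the top-k words (lower = more faithful)."""
    label = predict_label(model, tokens)
    top = top_k(saliency, k)
    only = [t for i, t in enumerate(tokens) if i in top]
    return label_prob(model, tokens, label) - label_prob(model, only, label)


def decision_flips(model: Model, tokens: List[str],
                   saliency: List[float], k: int = 1) -> bool:
    """True if removing the top-k words changes the predicted label."""
    top = top_k(saliency, k)
    rest = [t for i, t in enumerate(tokens) if i not in top]
    return predict_label(model, rest) != predict_label(model, tokens)


if __name__ == "__main__":
    review = "a fantastic and memorable film".split()
    sal = occlusion_saliency(toy_model, review)
    print("saliency:", dict(zip(review, sal)))
    print("comprehensiveness@2:", comprehensiveness(toy_model, review, sal, 2))
    print("sufficiency@2:", sufficiency(toy_model, review, sal, 2))
    print("decision flip@1:", decision_flips(toy_model, review, sal, 1))
```

The same metric functions accept any per-word saliency list, so a self-explanation (for example, importance scores parsed from the model's own output) could be scored with them and compared directly against the occlusion scores. Averaging `decision_flips` over a dataset would give a decision flip rate.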

Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

Evaluating the Reliability of Self-Explanations in Large Language Models

Are self-explanations from Large Language Models faithful?

Properties and Challenges of LLM-Generated Explanations

Large Language Models as Evaluators for Recommendation Explanations

Large Language Models Cannot Explain Themselves

Explaining Explanation: An Empirical Study on Explanation in Code Reviews

Towards Interpretable Mental Health Analysis with Large Language Models

"Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline

Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations

Towards Uncovering How Large Language Model Works: An Explainability Perspective

Explainability for Large Language Models: A Survey

Explingo: Explaining AI Predictions using Large Language Models

LMExplainer: Grounding Knowledge and Explaining Language Models

CELL your Model: Contrastive Explanations for Large Language Models

Inference to the Best Explanation in Large Language Models

XplainLLM: A Knowledge-Augmented Dataset for Reliable Grounded Explanations in LLMs

The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models

Comparing zero-shot self-explanations with human rationales in multilingual text classification