Are self-explanations from Large Language Models faithful?

Andreas Madsen, Sarath Chandar, Siva Reddy

2024-05-17

Abstract: Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper asks whether the self-explanations provided by large language models (LLMs) truly reflect the models' behavior, that is, whether these explanations are interpretability-faithful. Specifically, the authors focus on how to evaluate the faithfulness of self-explanations generated by LLMs, because convincing but wrong self-explanations can lead to unsupported confidence in the model and thereby increase risk. To address this, the paper proposes using self-consistency checks to measure faithfulness. The method is applied to counterfactual, feature attribution, and redaction explanations, and it needs only the model's inference API, with no access to the model's internal structure or parameters. With this method, the authors aim to provide a general framework for evaluating the faithfulness of explanations across tasks and models, in support of more transparent and trustworthy LLMs.
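The self-consistency idea can be illustrated with a small sketch. It is not the authors' implementation: the keyword classifier `toy_predict`, the two explainer functions, and the `[REDACTED]` mask token are hypothetical stand-ins for calls to an LLM inference API, and the paper's actual prompts and pass criteria may differ. The check follows the example from the abstract: if the words a model names as important are removed, the model should no longer be able to make the same prediction.

```python
from typing import Callable, List

MASK = "[REDACTED]"  # assumed mask token, chosen for illustration only
PUNCT = ".,!?"


def redact(text: str, words: List[str], mask: str = MASK) -> str:
    """Replace every token that matches one of `words` with the mask."""
    targets = {w.lower() for w in words}
    return " ".join(
        mask if tok.lower().strip(PUNCT) in targets else tok
        for tok in text.split()
    )


def is_self_consistent(
    predict: Callable[[str], str],
    explain: Callable[[str], List[str]],
    text: str,
) -> bool:
    """Return True if removing the words named as important changes the prediction."""
    original = predict(text)
    important = explain(text)
    return predict(redact(text, important)) != original


# Hypothetical stand-ins for an LLM reached through an inference API.
POSITIVE = {"great", "wonderful"}
NEGATIVE = {"awful", "boring"}


def toy_predict(text: str) -> str:
    """Toy sentiment classifier that looks only at a few keywords."""
    tokens = {t.lower().strip(PUNCT) for t in text.split()}
    if tokens & POSITIVE:
        return "positive"
    if tokens & NEGATIVE:
        return "negative"
    return "unknown"


def faithful_explain(text: str) -> List[str]:
    """Names the keywords the toy classifier actually relies on."""
    return [
        t.strip(PUNCT)
        for t in text.split()
        if t.lower().strip(PUNCT) in POSITIVE | NEGATIVE
    ]


def unfaithful_explain(text: str) -> List[str]:
    """Names a plausible-sounding word the toy classifier ignores."""
    return ["movie"]


review = "The movie was great!"
print(is_self_consistent(toy_predict, faithful_explain, review))
print(is_self_consistent(toy_predict, unfaithful_explain, review))
```

In this toy setting the first explainer passes the check and the second fails it, because redacting "movie" leaves the prediction unchanged. Averaging the boolean result over a dataset would give a simple faithfulness rate per model and explanation type.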

Evaluating the Reliability of Self-Explanations in Large Language Models

Evaluating Human Alignment and Model Faithfulness of LLM Rationale

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models

FaithLM: Towards Faithful Explanations for Large Language Models

Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

On Measuring Faithfulness or Self-consistency of Natural Language Explanations

Properties and Challenges of LLM-Generated Explanations

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

Evaluating Consistency and Reasoning Capabilities of Large Language Models

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models

Faithfulness Tests for Natural Language Explanations

New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing

Why Would You Suggest That? Human Trust in Language Model Responses

Large Language Models are reasoners with Self-Verification

Large Language Models Cannot Self-Correct Reasoning Yet