Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

Christos Fragkathoulas, Odysseas S. Chlapanis
DOI: https://doi.org/10.1145/3688671.3688775
2024-09-18
Abstract: This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. Many LLMs often require additional context to answer certain questions correctly. For this purpose, we propose a new efficient alternative explainability technique, inspired by the commonly used leave-one-out approach. Using this approach, we identify the sufficient and necessary parts for the LLM to generate correct answers, serving as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. Using the Natural Questions dataset, we validate our approach, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to evaluate the faithfulness of the self-explanations that large language models (LLMs) provide. Specifically, it proposes a new task for assessing the faithfulness of black-box LLMs using local perturbations and self-explanations. Many LLMs need additional context to answer certain questions correctly, and many are proprietary and accessible only via APIs, which makes their decision-making processes difficult to inspect. The paper therefore proposes an efficient explainability technique, inspired by the leave-one-out approach, that identifies the sufficient and necessary parts the model needs to generate correct answers and uses them as explanations. It also proposes a faithfulness metric that compares these crucial parts with the model's self-explanations. The approach is validated on the Natural Questions dataset, where the authors report that it is effective both for explaining model decisions and for assessing faithfulness.
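To make the idea concrete, here is a minimal Python sketch of the plain leave-one-out baseline that the paper says inspired its technique, together with one possible way to compare the result with a self-explanation. It is not the authors' method: the paper's efficient alternative and its exact metric are not described in this summary. The names `necessary_sentences`, `is_correct`, `overlap_score`, and `toy_model` are hypothetical. Sentence-level perturbation, substring matching for correctness, and Jaccard overlap as the comparison score are all assumptions made for illustration.

```python
from typing import Callable, List, Set


def is_correct(prediction: str, gold: str) -> bool:
    # Assumption: an answer counts as correct if it contains the gold string.
    return gold.lower() in prediction.lower()


def necessary_sentences(
    question: str,
    sentences: List[str],
    answer_fn: Callable[[str, str], str],
    gold: str,
) -> Set[int]:
    """Plain leave-one-out: sentence i is marked necessary if removing it
    from the context makes the model's answer incorrect."""
    necessary = set()
    for i in range(len(sentences)):
        reduced = sentences[:i] + sentences[i + 1:]
        prediction = answer_fn(question, " ".join(reduced))
        if not is_correct(prediction, gold):
            necessary.add(i)
    return necessary


def overlap_score(perturbation_set: Set[int], self_explanation_set: Set[int]) -> float:
    # Assumption: Jaccard overlap stands in for the paper's faithfulness metric.
    if not perturbation_set and not self_explanation_set:
        return 1.0
    union = perturbation_set | self_explanation_set
    return len(perturbation_set & self_explanation_set) / len(union)


def toy_model(question: str, context: str) -> str:
    # Hypothetical stand-in for a black-box LLM API call.
    return "Paris" if "Paris" in context else "unknown"


context = [
    "France is in Europe.",
    "The capital of France is Paris.",
    "It has many museums.",
]
found = necessary_sentences("What is the capital of France?", context, toy_model, "Paris")
# Suppose the model's self-explanation cited sentences 1 and 2 (hypothetical).
score = overlap_score(found, {1, 2})
print(found, score)
```

In practice `toy_model` would be replaced by a call to the black-box model under study. Plain leave-one-out costs one model call per removed unit, which is presumably the expense that motivates the more efficient alternative the paper proposes.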