How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives

Timour Ichmoukhamedov, James Hinns, David Martens
2024-12-13
Abstract: A rapidly developing application of LLMs in XAI is to convert quantitative explanations such as SHAP into user-friendly narratives to explain the decisions made by smaller prediction models. Evaluating the narratives without relying on human preference studies or surveys is becoming increasingly important in this field. In this work we propose a framework and explore several automated metrics to evaluate LLM-generated narratives for explanations of tabular classification tasks. We apply our approach to compare several state-of-the-art LLMs across different datasets and prompt types. As a demonstration of their utility, these metrics allow us to identify new challenges related to LLM hallucinations for XAI narratives.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how to quantitatively evaluate the quality of XAI (Explainable Artificial Intelligence) narratives generated by large language models (LLMs) without relying on human preference studies or surveys. Specifically, the authors propose a framework and explore several automated metrics for evaluating LLM-generated narratives that explain tabular classification tasks.

### Core Problems of the Paper

1. **Evaluating the Quality of XAI Narratives**
   - Traditionally, evaluating the quality of XAI narratives depends on user surveys or preference studies. This approach is difficult to scale and cannot be run automatically or in real time.
   - This paper aims to develop a fully automated evaluation framework that uses multiple quantitative metrics to measure the quality of LLM-generated XAI narratives across different datasets and prompt types.
2. **Identifying Challenges in LLM-Generated Narratives**
   - In particular, identify new challenges related to LLM hallucinations, that is, cases where the content generated by the LLM does not match the actual data.

### Specific Objectives

- **Propose an Automated Evaluation Framework** covering the following aspects:
  - **Faithfulness**: Evaluate whether the narrative accurately reflects the original SHAP table and other provided data.
  - **Human Similarity**: Evaluate the similarity between the generated narrative and human-written narratives.
  - **Assumptions**: Evaluate whether the general-knowledge assumptions included in the narrative are reasonable.
- **Verify the Behavior of These Evaluation Metrics** in multiple proof-of-concept experiments to ensure their reliability and validity.
- **Compare Narrative Generation Across Different LLMs and Datasets** to show how these metrics can help identify new challenges for LLMs in XAI narrative generation.

### Method Overview

1. **Generate Narratives**
   - Use a zero-shot prompt that provides the task description, dataset background, SHAP table, and prediction score to generate narratives that explain binary classification models.
2. **Extract Information**
   - Use another LLM as an extraction model to extract key information (such as feature rankings, signs, values, and assumptions) from the generated narrative. The extracted information is then checked by the downstream metrics.
3. **Evaluation Metrics**
   - **Faithfulness**: Measured by Rank Agreement (RA), Sign Agreement (SA), and Value Agreement (VA).
   - **Assumptions**: Use metrics such as perplexity to evaluate the plausibility of assumptions.
   - **Human Similarity**: Embed the generated and human-written narratives with an embedding model and compare the embeddings, for example with cosine similarity.

Through these methods, the paper aims to provide a new tool for the automated evaluation of XAI narratives and to reveal the challenges LLMs currently face in this field.
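
The faithfulness and human-similarity metrics named in the method overview can be illustrated with a small sketch. It is not the authors' implementation: it assumes that each agreement score is simply the fraction of features mentioned in the narrative whose extracted rank, sign, or value matches the SHAP table, and the paper's exact definitions may differ. The function names (`build_reference`, `agreement`, `faithfulness`, `cosine_similarity`), the dictionary layout, and the toy data are all hypothetical.

```python
import math


def build_reference(shap_values, feature_values):
    """Build a per-feature reference (rank, sign, value) from a SHAP table.

    Assumption: rank 0 is the feature with the largest absolute SHAP value,
    and a SHAP value of exactly zero is treated as a negative sign.
    """
    ordered = sorted(shap_values, key=lambda f: abs(shap_values[f]), reverse=True)
    return {
        name: {
            "rank": rank,
            "sign": 1 if shap_values[name] > 0 else -1,
            "value": feature_values[name],
        }
        for rank, name in enumerate(ordered)
    }


def _same(a, b):
    """Compare numbers with a small tolerance, everything else by equality."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=1e-6)
    return a == b


def agreement(extracted, reference, key):
    """Fraction of narrative features whose `key` matches the reference."""
    if not extracted:
        return float("nan")  # nothing was extracted, so the score is undefined
    hits = 0
    for name, attrs in extracted.items():
        ref = reference.get(name)
        # A feature unknown to the SHAP table, or a missing attribute, is a miss.
        if ref is not None and key in attrs and _same(attrs[key], ref[key]):
            hits += 1
    return hits / len(extracted)


def faithfulness(extracted, reference):
    """Rank, sign, and value agreement under the assumptions stated above."""
    return {
        "RA": agreement(extracted, reference, "rank"),
        "SA": agreement(extracted, reference, "sign"),
        "VA": agreement(extracted, reference, "value"),
    }


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy SHAP table and the (hypothetical) output of an extraction model.
shap_values = {"age": 0.30, "income": -0.20, "tenure": 0.05}
feature_values = {"age": 42, "income": 30000, "tenure": 3}
reference = build_reference(shap_values, feature_values)

extracted = {
    "age": {"rank": 0, "sign": 1, "value": 42},
    "income": {"rank": 1, "sign": 1, "value": 30000},  # narrative got the sign wrong
}

scores = faithfulness(extracted, reference)
print(scores)  # {'RA': 1.0, 'SA': 0.5, 'VA': 1.0}
```

In this toy case the narrative mentions two features with the correct rank and value but flips the sign of `income`, so only the sign agreement drops. For human similarity, `cosine_similarity` would be applied to the embedding vectors of the generated and human-written narratives, with the embedding model itself left out of the sketch.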