Abstract: A rapidly developing application of LLMs in XAI is to convert quantitative explanations such as SHAP into user-friendly narratives to explain the decisions made by smaller prediction models. Evaluating the narratives without relying on human preference studies or surveys is becoming increasingly important in this field. In this work we propose a framework and explore several automated metrics to evaluate LLM-generated narratives for explanations of tabular classification tasks. We apply our approach to compare several state-of-the-art LLMs across different datasets and prompt types. As a demonstration of their utility, these metrics allow us to identify new challenges related to LLM hallucinations for XAI narratives.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is how to quantitatively evaluate the quality of XAI (Explainable Artificial Intelligence) narratives generated by large language models (LLMs) without relying on human preference studies or surveys. Specifically, the authors propose a framework and explore several automated metrics for evaluating LLM-generated narratives that explain tabular classification tasks.

### Core Problems of the Paper

1. **Evaluating the Quality of XAI Narratives**
   - Traditionally, evaluating the quality of XAI narratives depends on user surveys or preference studies. This approach is difficult to scale and cannot be run automatically or in real time.
   - The paper aims to develop a fully automated evaluation framework that measures, through multiple quantitative metrics, how well LLM-generated XAI narratives perform across different datasets and prompt types.
2. **Identifying Challenges in LLM-Generated Narratives**
   - In particular, the paper identifies new challenges related to LLM hallucinations, that is, cases where the content generated by the LLM does not match the actual data.

### Specific Objectives

- **Propose an Automated Evaluation Framework**, covering the following aspects:
  - **Faithfulness**: Evaluate whether the narrative accurately reflects the original SHAP table and other provided data.
  - **Human Similarity**: Evaluate the similarity between the generated narrative and a human-written narrative.
  - **Assumptions**: Evaluate whether the general-knowledge assumptions included in the narrative are reasonable.
- **Verify the Behavior of These Evaluation Metrics** in multiple proof-of-concept experiments to ensure their reliability and validity.
- **Compare Narrative Generation Across Different LLMs and Datasets** to show how these metrics can help identify new challenges for LLMs in XAI narrative generation.

### Method Overview

1. **Generate Narratives**
   - Use a zero-shot prompt that provides the task description, dataset background, SHAP table, and prediction score to generate narratives explaining binary classification models.
2. **Extract Information**
   - Use another LLM as an extraction model to pull key information (such as feature rankings, signs, values, and assumptions) out of the generated narrative, so that it can be verified by the downstream metrics.
3. **Evaluation Metrics**
   - **Faithfulness**: Measured by Rank Agreement (RA), Sign Agreement (SA), and Value Agreement (VA).
   - **Assumptions**: Use metrics such as perplexity to evaluate the plausibility of assumptions.
   - **Human Similarity**: Use embedding models, with a measure such as cosine similarity, to compare the generated narrative with the human-written narrative.

Through these methods, the paper aims to provide a new tool for the automated evaluation of XAI narratives and to reveal the challenges LLMs currently face in this field.
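The faithfulness metrics named in the method overview can be illustrated with a small sketch. The summary does not give the paper's exact definitions, so the formulas here are assumptions: each score is the fraction of extracted features whose rank, sign, or value matches the SHAP table, with the true rank taken from the absolute SHAP value. The data layout, the function names `true_ranks` and `faithfulness`, and the example features are all hypothetical.

```python
import math

def true_ranks(shap_table):
    """Rank features by absolute SHAP value, largest first (0-based)."""
    ordered = sorted(shap_table, key=lambda f: abs(shap_table[f]["shap"]), reverse=True)
    return {name: i for i, name in enumerate(ordered)}

def _fraction(matches):
    """Share of True entries; NaN when nothing was extracted for this metric."""
    matches = list(matches)
    return sum(matches) / len(matches) if matches else float("nan")

def faithfulness(shap_table, extracted, tol=1e-6):
    """Assumed definitions of Rank, Sign, and Value Agreement (RA, SA, VA).

    shap_table: {feature: {"shap": float, "value": float}}
    extracted:  list of dicts with "name" and optional "rank", "sign", "value",
                as an extraction model might return them from a narrative.
    """
    ranks = true_ranks(shap_table)
    # Ignore extracted features that do not appear in the SHAP table.
    known = [e for e in extracted if e["name"] in shap_table]

    ra = _fraction(
        e["rank"] == ranks[e["name"]]
        for e in known if e.get("rank") is not None
    )
    sa = _fraction(
        # A SHAP value of exactly zero is treated as negative in this sketch.
        e["sign"] == (1 if shap_table[e["name"]]["shap"] > 0 else -1)
        for e in known if e.get("sign") is not None
    )
    va = _fraction(
        math.isclose(e["value"], shap_table[e["name"]]["value"], abs_tol=tol)
        for e in known if e.get("value") is not None
    )
    return {"RA": ra, "SA": sa, "VA": va}

# Hypothetical SHAP table for one prediction.
shap_table = {
    "age": {"shap": 0.30, "value": 52},
    "income": {"shap": -0.20, "value": 41000},
    "tenure": {"shap": 0.05, "value": 3},
}

# Hypothetical extraction result: the narrative gets the sign of "income" wrong
# and does not mention a value for "tenure".
extracted = [
    {"name": "age", "rank": 0, "sign": 1, "value": 52},
    {"name": "income", "rank": 1, "sign": 1, "value": 41000},
    {"name": "tenure", "rank": 2, "sign": 1, "value": None},
]

scores = faithfulness(shap_table, extracted)
print(scores)
```

In this example all three ranks match, two of three signs match, and both mentioned values match, so a narrative that flips the direction of one feature's contribution is penalized only in SA. Features the narrative leaves out of a given field are skipped rather than counted as errors, which is a design choice of this sketch and not something the summary specifies.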

How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives

Explingo: Explaining AI Predictions using Large Language Models

Tell Me a Story! Narrative-Driven XAI with Large Language Models

LLMs for XAI: Future Directions for Explaining Explanations

Interpretable Narrative Explanation for ML Predictors with LP: A Case Study for XAI

Evaluating Explanations Through LLMs: Beyond Traditional User Studies

Who's Thinking? A Push for Human-Centered Evaluation of LLMs using the XAI Playbook

Gamifying XAI: Enhancing AI Explainability for Non-technical Users through LLM-Powered Narrative Gamifications

Are Large Language Models Capable of Generating Human-Level Narratives?

Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era

"It Explains What I am Currently Going Through Perfectly to a Tee": Understanding User Perceptions on LLM-Enhanced Narrative Interventions

Reasoning before Comparison: LLM-Enhanced Semantic Similarity Metrics for Domain Specialized Text Analysis

Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives

Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs

Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs

Leveraging LLMs for Dialogue Quality Measurement

eXplainable AI with GPT4 for story analysis and generation: A novel framework for diachronic sentiment analysis

Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios

LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models

Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions