Benchmarking XAI Explanations with Human-Aligned Evaluations

Rémi Kazmierczak, Steve Azzolin, Eloïse Berthier, Anna Hedström, Patricia Delhomme, Nicolas Bousquet, Goran Frehse, Massimiliano Mancini, Baptiste Caramiaux, Andrea Passerini, Gianni Franchi
2024-11-04
Abstract: In this paper, we introduce PASTA (Perceptual Assessment System for explanaTion of Artificial intelligence), a novel framework for a human-centric evaluation of XAI techniques in computer vision. Our first key contribution is a human evaluation of XAI explanations on four diverse datasets (COCO, Pascal Parts, Cats Dogs Cars, and MonumAI) which constitutes the first large-scale benchmark dataset for XAI, with annotations at both the image and concept levels. This dataset allows for robust evaluation and comparison across various XAI methods. Our second major contribution is a data-based metric for assessing the interpretability of explanations. It mimics human preferences, based on a database of human evaluations of explanations in the PASTA-dataset. With its dataset and metric, the PASTA framework provides consistent and reliable comparisons between XAI techniques, in a way that is scalable but still aligned with human evaluations. Additionally, our benchmark allows for comparisons between explanations across different modalities, an aspect previously unaddressed. Our findings indicate that humans tend to prefer saliency maps over other explanation types. Moreover, we provide evidence that human assessments show a low correlation with existing XAI metrics that are numerically simulated by probing the model.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses shortcomings in how Explainable Artificial Intelligence (XAI) techniques are currently evaluated, in particular the absence of a standardized evaluation framework grounded in human perception. Specifically:

1. **Lack of a standard evaluation framework grounded in human perception**: Existing XAI evaluations rely mainly on computational, simulation-based metrics and ignore how human users understand and prefer explanations. As a result, evaluation results can diverge from user experience in practical applications.
2. **Costly and unreliable evaluation**: Evaluation methods that rely on manual annotation are costly and time-consuming, and they are easily affected by annotator-specific factors such as fatigue and available time, which leads to inconsistent and unreliable results.

To address these problems, the authors propose PASTA (Perceptual Assessment System for explanaTion of Artificial intelligence), a framework that aims to simulate human evaluations of XAI explanations automatically and thereby provide an evaluation standard that is closer to human cognition. Its main contributions are:

- **A large-scale benchmark dataset**: It covers four image datasets (COCO, Pascal Parts, Cats Dogs Cars, and MonumAI) with annotations at both the image level and the concept level, and it can be used to evaluate multiple XAI methods.
- **A data-based metric**: The metric provides automated scoring by imitating human preferences for explanations. It is built on the human evaluations collected in the PASTA dataset, which is intended to make its scores consistent and reliable.
- **A comparison of XAI methods**: A large-scale evaluation of 21 XAI methods finds that humans tend to prefer saliency maps over other types of explanation.
- **Evidence of low correlation between existing XAI metrics and human evaluations**: Existing numerically simulated metrics (such as ROAD) correlate only weakly with human evaluations, which suggests that these metrics may miss important aspects of human perception.

In conclusion, by introducing the PASTA framework, the paper brings the perspective of human perception into XAI evaluation and provides an evaluation tool that is more comprehensive, more reliable, and better aligned with human cognition.
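As an illustration of what a "low correlation" between an automated XAI metric and human evaluations means, the sketch below computes a Spearman rank correlation between two score lists. This is an assumption made for illustration only: the summary does not state which correlation statistic the paper uses, the function names are hypothetical, and the numbers are invented toy values, not results from PASTA.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Toy values, invented for illustration (NOT taken from the paper):
# one human rating and one automated metric score per explanation.
human_ratings = [4.0, 2.5, 3.0, 1.0, 5.0]
metric_scores = [0.2, 0.9, 0.4, 0.7, 0.3]

print(f"Spearman correlation: {spearman(human_ratings, metric_scores):.2f}")
```

A coefficient near +1 would mean the automated metric ranks explanations the same way humans do, while a value near 0 would mean the metric's ranking carries little information about human preference.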