Abstract: Zero-shot learning (ZSL) has attracted significant attention for its ability to classify images from unseen classes. Learning visual and semantic embeddings has been the main approach to ZSL classification in the existing literature, whereas generating complementary explanations to justify the classification decision remains largely unexplored. In this paper, we address a new and challenging task, explainable zero-shot learning (XZSL), which aims to generate visual and textual explanations that support the classification decision. To accomplish this task, we build a novel Deep Multi-modal Explanation (DME) model that incorporates a joint visual-attribute embedding module and a multi-channel explanation module in an end-to-end fashion. In contrast to existing ZSL approaches, our visual-attribute embedding is associated not only with the decision but also with the new visual and textual explanations. For visual explanations, we first capture several attribute activation maps (AAMs) and then merge them into a class activation map (CAM) that indicates which regions of an image are relevant to the class. Textual explanations are generated by the multi-channel explanation module, which jointly integrates three long short-term memory (LSTM) models, each conditioned on a different feature representation. Additionally, we suggest that the DME model can retain explanatory consistency for similar instances and explanatory diversity for diverse instances. We conduct qualitative and quantitative experiments to assess the model on ZSL classification and explanation. Specifically, ablation studies verify the effectiveness of the components in our model. Our results on three well-known datasets are competitive with prior approaches. More importantly, joint training of the embedding and explanation modules yields mutual performance improvements between ZSL classification and explanation. We further analyze DME to diagnose its advantages and limitations.
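The abstract says that several attribute activation maps are merged into one class activation map but does not state the merge rule. The sketch below is only an illustration under an assumed rule: a sum of the attribute maps weighted by the class's attribute vector, followed by clipping and normalization. The function name `merge_attribute_maps`, the array shapes, and the weighting scheme are all hypothetical and are not taken from the paper.

```python
import numpy as np


def merge_attribute_maps(aams, class_attributes):
    """Merge K attribute activation maps into one class activation map.

    ASSUMPTION: the merge is a weighted sum, with one weight per attribute
    taken from the class's attribute vector. The paper's actual rule may differ.

    aams: array of shape (K, H, W), one activation map per attribute.
    class_attributes: array of shape (K,), attribute strengths for the class.
    Returns an (H, W) map scaled to [0, 1] (all zeros if nothing is active).
    """
    aams = np.asarray(aams, dtype=float)
    weights = np.asarray(class_attributes, dtype=float)

    # Weighted sum over the attribute axis: (K,) x (K, H, W) -> (H, W).
    cam = np.tensordot(weights, aams, axes=1)

    # Keep only positive evidence, then normalize by the peak value.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

Under this assumed rule, an attribute that is irrelevant to the class (weight near zero) contributes almost nothing to the merged map, so the highlighted regions are driven by the attributes that define the class.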

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)

A Study on Multimodal and Interactive Explanations for Visual Question Answering

MEGL: Multimodal Explanation-Guided Learning

Multimodal Contrastive Transformer for Explainable Recommendation

Generation of Multimodal Justification Using Visual Word Constraint Model for Explainable Computer-Aided Diagnosis

A Deep Multi-modal Explanation Model for Zero-shot Learning

DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Sim2Word: Explaining Similarity with Representative Attribute Words via Counterfactual Explanations

A Concept-Based Explainability Framework for Large Multimodal Models

M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis

Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models

MultiViz: Towards Visualizing and Understanding Multimodal Models

Improving VQA and its Explanations by Comparing Competing Explanations

Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Generating Post-Hoc Explanation from Deep Neural Networks for Multi-Modal Medical Image Analysis Tasks

Explanation as a process: user-centric construction of multi-level and multi-modal explanations

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction