FastRM: An efficient and automatic explainability framework for multimodal generative models

Gabriela Ben-Melech Stan, Estelle Aflalo, Man Luo, Shachar Rosenman, Tiep Le, Sayak Paul, Shao-Yen Tseng, Vasudev Lal

2024-12-02

Abstract: While Large Vision Language Models (LVLMs) have become highly capable of reasoning over human prompts and visual inputs, they are still prone to producing responses that contain misinformation. Identifying incorrect responses that are not grounded in evidence has become a crucial task in building trustworthy AI. Explainability methods such as gradient-based relevancy maps on LVLM outputs can provide insight into the decision process of models; however, these methods are often computationally expensive and not suited for on-the-fly validation of outputs. In this work, we propose FastRM, an effective way of predicting the explainable relevancy maps of LVLMs. Experimental results show that employing FastRM leads to a 99.8% reduction in compute time for relevancy map generation and a 44.4% reduction in memory footprint for the evaluated LVLM, making explainable AI more efficient and practical, thereby facilitating its deployment in real-world applications.

Artificial Intelligence

What problem does this paper attempt to address?

The problem this paper addresses is that large vision-language models (LVLMs), although they perform well on human prompts and visual inputs, are prone to generating responses that contain misinformation. Identifying such ungrounded, incorrect responses is crucial for building trustworthy artificial intelligence. Existing explainability methods, such as gradient-based relevancy maps, can provide insight into a model's decision-making process, but they are often computationally expensive and unsuitable for real-time verification of outputs. The paper proposes the FastRM framework, which predicts the explainable relevancy maps of LVLMs efficiently and thereby reduces computation time and memory usage. Experimental results show that FastRM reduces the computation time for generating relevancy maps by 99.8% and the memory footprint by 44.4% for the evaluated LVLM. In summary, the paper addresses how to make interpretability for LVLMs efficient enough for practical use, especially in high-risk or interactive scenarios (such as medicine and self-driving), so that explainable AI is more practical and easier to deploy.
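To put the reported percentages in more familiar terms, the short sketch below converts a percentage reduction into a speedup factor and a remaining fraction. It is only arithmetic on the two figures quoted above (99.8% and 44.4%); the helper names `speedup_from_reduction` and `remaining_fraction` are hypothetical and are not part of FastRM.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the original cost that remains after a percentage reduction."""
    return 1.0 - reduction_pct / 100.0


def speedup_from_reduction(reduction_pct: float) -> float:
    """Speedup factor implied by a percentage reduction in time."""
    return 1.0 / remaining_fraction(reduction_pct)


# Figures quoted in the abstract; everything else here is derived arithmetic.
time_speedup = speedup_from_reduction(99.8)
memory_left = remaining_fraction(44.4)

print(round(time_speedup))    # 500
print(round(memory_left, 3))  # 0.556
```

Read this way, a 99.8% reduction in compute time corresponds to roughly a 500-fold speedup in relevancy map generation, and a 44.4% reduction in memory footprint leaves about 55.6% of the original footprint.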

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Towards Explainability in Retrieval-Augmented LLMs

Unlocking the Potential of Large Language Models for Explainable Recommendations

LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models

V-RECS, a Low-Cost LLM4VIS Recommender with Explanations, Captioning and Suggestions

RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model

From Feature Importance to Natural Language Explanations Using LLMs with RAG

LLM4Vis: Explainable Visualization Recommendation using ChatGPT

From Pixels to Words: Leveraging Explainability in Face Recognition through Interactive Natural Language Processing

Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales

Interpreting Language Reward Models via Contrastive Explanations

Fast Explainability via Feasible Concept Sets Generator

Language Model as Visual Explainer

Large Language Models as Evaluators for Recommendation Explanations

Explainable Concept Generation through Vision-Language Preference Learning

From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks

Uncertainty-Aware Explainable Recommendation with Large Language Models

A Concept-Based Explainability Framework for Large Multimodal Models

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification