LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, Vasudev Lal

2024-06-25

Abstract: In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer, and assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the difficulty of understanding the internal mechanisms of large vision-language models (LVLMs). Although these models perform well when combining multiple forms of data input, how they arrive at their outputs remains hard to inspect. Specifically, the paper focuses on making the image patches that are instrumental in generating an answer more interpretable, and on assessing how well the language model grounds its output in the image. It does so through an interactive application that lets users systematically investigate the model and uncover the system's limitations, paving the way for improvements in system capabilities. The main contributions of the paper include:

- Proposing an interactive tool for interpreting the internal attention mechanisms of large vision-language models.
- Revealing possible reasons for some failure cases of LVLMs through case studies.
- Speculating, based on a study of causal explanations, that large vision-language models (such as LLaVA) implicitly learn to represent causal structures.

Through these contributions, the paper provides new methods for understanding and analyzing LVLMs and offers insights for improving the performance and reliability of these models.
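To make the idea of inspecting attention over image patches concrete, here is a minimal sketch. It is not the tool's implementation; it only illustrates one common way to turn the attention that a single generated token places on image-patch positions into a small heatmap grid. The function name `patch_relevance`, the toy attention values, the position of the image tokens in the sequence, and the 2x2 patch grid are all assumptions made for illustration.

```python
def patch_relevance(attn_heads, image_token_span, grid_size):
    """Summarize attention from one generated token to image patches.

    attn_heads:       list of per-head attention rows; each row holds the
                      weights one generated token assigns to every input
                      position (hypothetical toy data, not real model output).
    image_token_span: (start, end) positions of the image-patch tokens,
                      end exclusive (assumed layout).
    grid_size:        (rows, cols) of the patch grid.

    Returns (grid, image_mass): the head-averaged attention over patches,
    renormalized to sum to 1 and reshaped to the grid, plus the share of
    total attention that fell on image patches before renormalizing.
    """
    start, end = image_token_span
    rows, cols = grid_size
    if end - start != rows * cols:
        raise ValueError("image token span does not match grid size")

    num_heads = len(attn_heads)
    # Average over heads the weight placed on each image-patch position.
    mean = [sum(head[i] for head in attn_heads) / num_heads
            for i in range(start, end)]

    # Fraction of attention spent on the image rather than on text tokens.
    image_mass = sum(mean)
    if image_mass > 0:
        mean = [m / image_mass for m in mean]

    # Reshape the flat patch list into rows of the patch grid.
    grid = [mean[r * cols:(r + 1) * cols] for r in range(rows)]
    return grid, image_mass


# Toy example: 6 input positions, where positions 1..4 are a 2x2 patch grid.
head_a = [0.1, 0.4, 0.1, 0.1, 0.2, 0.1]
head_b = [0.3, 0.2, 0.1, 0.1, 0.2, 0.1]
grid, image_mass = patch_relevance([head_a, head_b], (1, 5), (2, 2))
for row in grid:
    print([round(v, 3) for v in row])
print("share of attention on image patches:", round(image_mass, 3))
```

A low share of attention on the image patches is only a rough hint that an answer may rely more on the text prompt than on the image; raw attention is not by itself a faithful explanation of model behavior.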

Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

Towards Interpreting Visual Information Processing in Vision-Language Models

Language Model as Visual Explainer

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

An Introduction to Vision-Language Modeling

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Rethinking Interpretability in the Era of Large Language Models

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

Effectiveness Assessment of Recent Large Vision-Language Models

Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models

EVLM: An Efficient Vision-Language Model for Visual Understanding

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images

InfMLLM: A Unified Framework for Visual-Language Tasks

Valley: Video Assistant with Large Language model Enhanced abilitY