LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, Vasudev Lal
2024-06-25
Abstract: In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer, and assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty of understanding the internal mechanisms of large vision-language models (LVLMs). Although these models handle combined image and text inputs well, how they arrive at an answer remains hard to inspect. The paper focuses on two questions: which image patches are instrumental in generating an answer, and how well the language model grounds its output in the image. To that end, it presents an interactive application with which users can systematically investigate a model and uncover its limitations, paving the way for improvements in system capabilities.

The main contributions of the paper include:
- Proposing an interactive tool for interpreting the internal attention mechanisms of LVLMs.
- Revealing possible reasons for some LVLM failure cases through a case study on LLaVA.
- Speculating, based on a study of causal explanations, that LVLMs such as LLaVA implicitly learn to represent causal structures.

Through these contributions, the paper offers a way to analyze LVLMs and insights that may help improve their performance and reliability.
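
To make the idea of inspecting attention over image patches more concrete, the following is a minimal sketch and not the tool's actual implementation. It assumes the attention weights from one generated token to all input positions are already available as a (heads, sequence length) array, and that the image patch tokens occupy one contiguous block of the sequence. The function names, the `image_start` offset, and the grid size are hypothetical choices made for illustration.

```python
import numpy as np


def patch_attention_heatmap(attn_row, image_start, grid_h, grid_w):
    """Turn one generated token's attention into a patch-level heatmap.

    attn_row: array of shape (num_heads, seq_len) holding the attention
        weights from a single generated token to every input position.
    image_start: index of the first image-patch token (assumed layout:
        patches are contiguous and stored in row-major order).
    grid_h, grid_w: patch grid dimensions (assumed to be known).
    """
    attn_row = np.asarray(attn_row, dtype=float)
    n_patches = grid_h * grid_w
    # Average over heads to get one weight per input position.
    mean_attn = attn_row.mean(axis=0)
    patch_attn = mean_attn[image_start:image_start + n_patches]
    if patch_attn.shape[0] != n_patches:
        raise ValueError("sequence is too short for the requested patch grid")
    heatmap = patch_attn.reshape(grid_h, grid_w)
    total = heatmap.sum()
    # Normalize so the heatmap sums to 1 (skip if there is no attention mass).
    return heatmap / total if total > 0 else heatmap


def image_attention_share(attn_row, image_start, n_patches):
    """Fraction of head-averaged attention mass that falls on image patches."""
    mean_attn = np.asarray(attn_row, dtype=float).mean(axis=0)
    image_mass = mean_attn[image_start:image_start + n_patches].sum()
    return float(image_mass / mean_attn.sum())


# Toy input: 2 heads, 8 positions, a 2x2 patch grid starting at index 1.
toy_attn = np.array([
    [0.1, 0.2, 0.2, 0.2, 0.2, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.2, 0.2, 0.2, 0.1, 0.0, 0.0],
])
toy_heatmap = patch_attention_heatmap(toy_attn, image_start=1, grid_h=2, grid_w=2)
toy_share = image_attention_share(toy_attn, image_start=1, n_patches=4)
```

A heatmap like `toy_heatmap` could be upsampled and overlaid on the input image, while a low value of `toy_share` for a given output token would hint that the token relies more on the text prompt than on the image. Raw attention is only one possible relevancy signal and should be read as suggestive rather than conclusive.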