Abstract: Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are well semantically grounded in both vision and text. Thus we refer to these as "multi-modal concepts". We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. We will publicly release our code.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to understand and interpret the internal representations of large multimodal models (LMMs)**.

### Problem Background

In recent years, large multimodal models (LMMs), which combine unimodal encoders with large language models (LLMs), have made remarkable progress on multimodal tasks. Despite this strong performance, their internal representations remain difficult to understand. This limits the interpretability of the models and, in turn, how far their outputs can be trusted.

### Paper Objectives

To fill this gap, the paper proposes a new concept-based interpretability framework that aims to help researchers better understand the internal representations of LMMs. Specifically, the objectives of the paper are:

1. **Propose a novel concept extraction method**: The method is based on dictionary learning and is applied to the representations of tokens in LMMs. It learns a dictionary of multimodal concepts, each of which can be semantically grounded in both the visual and the textual domain.
2. **Verify the effectiveness of multimodal concepts**: Through qualitative and quantitative evaluations, show that the extracted multimodal concepts are useful for interpreting the representations of test samples, are well disentangled from one another, and are well grounded visually and textually.
3. **Provide public code**: To support follow-up research, the authors will publicly release the code implementing this method.

### Main Contributions

- **Propose a concept-based interpretation framework for large multimodal models**: To the authors' knowledge, this is the first concept-based framework for interpreting such models.
- **Introduce a Semi-Non-negative Matrix Factorization (Semi-NMF) optimization strategy**: This extends previous concept dictionary learning strategies with a new optimization method.
- **Verify the effectiveness of multimodal concepts through experiments**: Qualitative and quantitative evaluations show that the learned concept dictionary has meaningful multimodal grounding and is useful for interpreting the representations of test samples.

### Method Overview

The method consists of three main steps:

1. **Select relevant images**: Select images related to the target token from the dataset and extract the model's internal representations for these images.
2. **Linearly decompose the representation matrix**: Use dictionary learning to decompose the representation matrix into a concept dictionary and a matrix of activation coefficients.
3. **Semantic grounding**: Ground the learned concepts in the visual and textual domains so that each concept has an interpretable meaning.

Through these steps, the paper reveals multimodal structure in the internal representations of LMMs and offers a new perspective for understanding these complex models.
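The linear decomposition in step 2 can be illustrated with a minimal NumPy sketch of Semi-NMF. This is not the authors' implementation. It assumes the representation matrix `Z` stores one token representation per column, and it uses the standard multiplicative update rule from the Semi-NMF literature, which may differ from the optimizer used in the paper. The names `semi_nmf`, `top_activating`, `Z`, `U`, and `V` are hypothetical, and the data is synthetic.

```python
import numpy as np


def semi_nmf(Z, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate Z (d x n) as U @ V, with U (d x k) unconstrained
    and V (k x n) non-negative. Columns of U play the role of the
    concept dictionary; V holds the activation coefficients."""
    rng = np.random.default_rng(seed)
    n = Z.shape[1]
    G = rng.random((n, k)) + 0.1  # G = V.T, strictly positive start

    def pos(A):
        # Element-wise positive part of A.
        return (np.abs(A) + A) / 2.0

    def neg(A):
        # Element-wise magnitude of the negative part of A.
        return (np.abs(A) - A) / 2.0

    for _ in range(n_iter):
        # Dictionary update: least-squares solution for fixed activations.
        U = Z @ G @ np.linalg.pinv(G.T @ G)
        # Activation update: multiplicative rule that keeps G >= 0.
        A = Z.T @ U
        B = U.T @ U
        G *= np.sqrt((pos(A) + G @ neg(B)) / (neg(A) + G @ pos(B) + eps))

    # Final least-squares refit of the dictionary for the last activations.
    U = Z @ G @ np.linalg.pinv(G.T @ G)
    return U, G.T


def top_activating(V, concept, top=5):
    """Indices of the samples that activate a given concept most strongly."""
    return np.argsort(V[concept])[::-1][:top]


# Synthetic stand-in for token representations (assumed shapes only).
rng = np.random.default_rng(42)
Z = rng.normal(size=(16, 3)) @ rng.random((3, 200))
U, V = semi_nmf(Z, k=3)
rel_err = np.linalg.norm(Z - U @ V) / np.linalg.norm(Z)
```

Under these assumptions, one simple way to inspect a concept visually is to look at the samples returned by `top_activating(V, concept)`, that is, the inputs whose coefficients for that concept are largest. How the paper actually grounds concepts in images and text is not specified in this summary.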

A Concept-Based Explainability Framework for Large Multimodal Models
