Abstract: Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks without any multimodal finetuning. They are the building block for Large Multimodal Models, yet we still lack a proper understanding of their success. In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representations, aiming to understand their generalization beyond textual inputs. Findings. Perceptual tokens (1) are easily distinguishable from textual ones inside LLMs, with significantly different representations, and complete translation to textual tokens does not exist. Yet, (2) both perceptual and textual tokens activate similar LLM weights. Despite being different, (3) perceptual and textual tokens are implicitly aligned inside LLMs; we call this the implicit multimodal alignment (IMA), and argue that this is linked to architectural design, helping LLMs to generalize. This provides more evidence that the generalization of LLMs to multimodal inputs is mainly due to their architecture. Implications. (1) We find a positive correlation between the implicit alignment score and the task performance, suggesting that this could act as a proxy metric for model evaluation and selection. (2) A negative correlation exists regarding hallucinations, revealing that this problem is mainly due to misalignment between the internal perceptual and textual representations. (3) Perceptual tokens change slightly throughout the model; thus, we propose different approaches to skip computations (e.g. in FFN layers) and significantly reduce the inference cost. (4) Due to the slowly changing embeddings across layers, and the high overlap between textual and multimodal activated weights, we compress LLMs by keeping only 1 subnetwork that works well across a wide range of multimodal tasks. Paper code: https://github.com/mshukor/ima-lmms
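To make the notion of an implicit alignment score more concrete, here is a minimal sketch of one plausible way to measure how close perceptual tokens are to textual tokens at a given layer. The abstract does not give the formula, so everything here is an assumption: the use of cosine similarity, the max-then-mean aggregation, and the names `cosine` and `alignment_score` are illustrative and not taken from the paper's code.

```python
import math
from statistics import mean
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_score(perceptual: List[Vector], textual: List[Vector]) -> float:
    """Hypothetical alignment score (an assumption, not the paper's definition).

    For each perceptual token embedding, take its highest cosine similarity
    to any textual token embedding, then average over perceptual tokens.
    """
    if not perceptual or not textual:
        raise ValueError("both token lists must be non-empty")
    best_matches = [max(cosine(p, t) for t in textual) for p in perceptual]
    return mean(best_matches)


# Toy hidden states from one layer: two perceptual tokens, two textual tokens.
perceptual_tokens = [[1.0, 0.0], [0.5, 0.5]]
textual_tokens = [[1.0, 0.0], [0.0, 1.0]]
score = alignment_score(perceptual_tokens, textual_tokens)
```

Under this sketch, a higher score means perceptual embeddings point in directions closer to textual ones. Whether such a score tracks task performance or hallucination rates in the way the abstract reports depends on the paper's actual definition.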

What problem does this paper attempt to address?

This paper explores how large language models (LLMs) can handle multimodal inputs such as images, video, audio, and text without undergoing multimodal fine-tuning, and analyzes their internal representations to understand how these models generalize to non-text inputs. Specifically, the researchers aim to reveal the characteristics of the internal representations of frozen LLMs under inputs from different modalities, and how these characteristics affect the model's performance, safety, and efficiency.

### Main research content:

1. **Differences and similarities in multimodal representations**:
   - Perceptual tokens (from images, video, and audio) and text tokens occupy significantly different representation spaces in LLMs, but they activate similar weights.
   - Despite the different representations, there is an implicit multimodal alignment (IMA) between perceptual tokens and text tokens in LLMs, which is related to the model's architectural design.
2. **Implicit Multimodal Alignment (IMA)**:
   - During training and inference, the similarity between perceptual tokens and text tokens gradually increases, indicating the presence of an implicit multimodal alignment effect.
   - This alignment effect is mainly driven by the model's architectural design, particularly the role of residual streams and steering blocks.
3. **Practical applications and impact**:
   - **Performance evaluation**: The implicit alignment score is positively correlated with task performance and can serve as a proxy metric for model evaluation and selection.
   - **Hallucination problem**: The alignment score is negatively correlated with hallucinations, suggesting that hallucinations are mainly caused by misalignment between internal perceptual and text representations.
   - **Computational efficiency**: Since perceptual tokens change little within the model, inference efficiency can be improved by skipping certain computations (such as FFN layers).
   - **Model compression**: Due to the high overlap of weights activated by different modalities, LLMs can be compressed by retaining a single sub-network (α-SubNet) that works across various multimodal tasks.

### Conclusion:

By analyzing the internal representations of LLMs under multimodal inputs, the researchers revealed the existence of the implicit multimodal alignment effect and its impact on model performance, safety, and efficiency. These findings deepen the understanding of the generalization capabilities of LLMs and provide new ideas for the design and optimization of multimodal models.
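
The computational-efficiency point above (skipping FFN computation for perceptual tokens) can be illustrated with a toy residual block. This is a minimal sketch under stated assumptions, not the paper's implementation: the block layout, the per-token skipping rule, and the names `block_forward`, `zero_attention`, and `toy_ffn` are all hypothetical, and hidden states are plain Python lists rather than tensors.

```python
from typing import Callable, List

Vector = List[float]
Sequence2D = List[Vector]


def block_forward(
    hidden: Sequence2D,
    is_perceptual: List[bool],
    attn: Callable[[Sequence2D], Sequence2D],
    ffn: Callable[[Vector], Vector],
    skip_ffn_for_perceptual: bool = True,
) -> Sequence2D:
    """One toy residual block: h <- h + attn(h), then h <- h + ffn(h).

    When skipping is enabled, perceptual positions bypass the FFN and keep
    only the residual stream value (assumed skipping rule, for illustration).
    """
    attn_out = attn(hidden)
    after_attn = [[a + b for a, b in zip(h, d)] for h, d in zip(hidden, attn_out)]

    output: Sequence2D = []
    for h, perceptual in zip(after_attn, is_perceptual):
        if skip_ffn_for_perceptual and perceptual:
            output.append(list(h))  # FFN skipped: residual passes through
        else:
            delta = ffn(h)
            output.append([a + b for a, b in zip(h, delta)])
    return output


def zero_attention(seq: Sequence2D) -> Sequence2D:
    """Stand-in attention that contributes nothing to the residual stream."""
    return [[0.0 for _ in h] for h in seq]


def toy_ffn(h: Vector) -> Vector:
    """Stand-in FFN that adds 1.0 to every dimension."""
    return [1.0 for _ in h]


# Position 0 is a perceptual token, position 1 is a text token.
states = [[1.0, 2.0], [3.0, 4.0]]
result = block_forward(states, [True, False], zero_attention, toy_ffn)
```

In this toy setup the FFN runs only for the text position, so its cost scales with the number of text tokens rather than the full sequence length. How much quality is lost by skipping depends on how little the perceptual tokens actually change across layers, which is the empirical claim the summary attributes to the paper.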

Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs

Skipping Computations in Multimodal LLMs

Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

EMMA: Efficient Visual Alignment in Multi-Modal LLMs

InfMLLM: A Unified Framework for Visual-Language Tasks

OneLLM: One Framework to Align All Modalities with Language

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

eP-ALM: Efficient Perceptual Augmentation of Language Models

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

F-LMM: Grounding Frozen Large Multimodal Models

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex

NoteLLM-2: Multimodal Large Representation Models for Recommendation

Aligning Large Multimodal Models with Factually Augmented RLHF

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Matryoshka Multimodal Models

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs