Abstract: We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process which is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar-based and instruction-based task vectors produces better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications. Project page: https://task-vectors-are-cross-modal.github.io

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to explore how Vision-and-Language Models (VLMs) encode task representations and investigate the generality and transferability of these representations across different modalities and task specifications. Specifically, the paper focuses on the following core issues:

1. **Cross-modal Consistency of Task Representations**: The paper finds that conceptually similar tasks are mapped to similar task vector representations across different modalities (text or image) and specifications. This means that the model can generate consistent task representations regardless of whether the input is text or image.
2. **Evolution Process of Task Representations**: The paper reveals that when VLMs generate answers, token representations go through three stages: input, task, and answer. This process is consistent across different modalities and specifications.
3. **Cross-modal Transfer of Task Vectors**: The paper explores whether task vectors can be transferred from one modality (e.g., text) to another modality (e.g., image) and evaluates the effectiveness of such transfer. Experimental results show that cross-modal transfer can significantly improve task performance, especially in low-data scenarios.
4. **Combination of Instructions and Examples**: The paper also investigates how to combine instruction-based task vectors with example-based task vectors to improve the quality and sample efficiency of task representations.

### Main Contributions

1. **Classification of Task Vectors**: The paper proposes a method for classifying task vectors, which can specify tasks not only through examples but also through instructions.
2. **Evolution Pattern of Task Representations**: The paper demonstrates the evolution pattern of task representations across model layers, showing that this pattern is consistent regardless of input modality or specification format.
3. **Cross-modal Transfer**: The paper explores the cross-modal transfer of task vectors, which is a useful metric for measuring the interchangeability of different task representations and can enhance the expressiveness of task definitions.

### Experimental Results

- **Transfer from Text ICL to Image Query**: Experimental results show that cross-modal patching performs best among all VLMs, improving performance by 14-33% compared to providing examples within the same context window.
- **Transfer from LLM to VLM**: Many VLMs are initialized from pre-trained LLMs. Experiments find that task vectors generated by LLMs and VLMs under the same text ICL examples are highly similar, and LLM task vectors can be successfully applied to VLM image queries.
- **Combination of Instruction Vectors**: Combining instruction vectors with example vectors can significantly improve sample efficiency and reduce the variance of example vectors.

### Conclusion

By deeply analyzing the task representation mechanisms of VLMs, the paper reveals the generality and transferability of task vectors across different modalities and specifications, providing new perspectives and methods for handling multimodal tasks.
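The cross-modal patching and ensembling ideas summarized above can be illustrated with a small toy sketch. It is not the paper's implementation: random NumPy arrays stand in for hidden states read from one layer of a real VLM, and every function name here (`mean_task_vector`, `ensemble`, `patch_last_token`) is hypothetical. The equal-weight average used for ensembling is also an assumption, since the summary does not say how the two kinds of vectors are combined.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mean_task_vector(last_token_states):
    """Average last-token hidden states (one row per ICL prompt) into one vector."""
    return np.asarray(last_token_states, dtype=float).mean(axis=0)


def ensemble(exemplar_vec, instruction_vec, weight=0.5):
    """Weighted average of two task vectors (the weighting scheme is an assumption)."""
    return weight * exemplar_vec + (1.0 - weight) * instruction_vec


def patch_last_token(hidden, task_vec):
    """Return a copy of `hidden` whose last-token state is replaced by `task_vec`."""
    patched = hidden.copy()
    patched[-1] = task_vec
    return patched


rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Hypothetical stand-ins for activations taken at a single layer.
true_task = rng.normal(size=d)
text_icl_states = true_task + 0.5 * rng.normal(size=(4, d))  # 4 text ICL prompts
instruction_state = true_task + 0.5 * rng.normal(size=d)     # 1 instruction prompt
image_query_hidden = rng.normal(size=(5, d))                 # 5 tokens of an image query

exemplar_vec = mean_task_vector(text_icl_states)
combined = ensemble(exemplar_vec, instruction_state)

# Cross-modal patching: a vector derived from text overwrites the last-token
# state of the image query before the remaining layers would run.
patched = patch_last_token(image_query_hidden, combined)

print("exemplar vs. instruction:", cosine_similarity(exemplar_vec, instruction_state))
print("combined vs. toy ground truth:", cosine_similarity(combined, true_task))
```

In a real model, the hidden states would be captured at a chosen layer and the patched state would be fed through the remaining layers to produce an answer; which layer and token position to use is something the paper determines empirically and is not reproduced here.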

Task Vectors are Cross-Modal
