Task Vectors are Cross-Modal

Grace Luo, Trevor Darrell, Amir Bar
2024-10-30
Abstract: We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process which is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar- and instruction-based task vectors produces better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications. Project page: https://task-vectors-are-cross-modal.github.io
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores how vision-and-language models (VLMs) encode task representations, and how general and transferable those representations are across modalities and task specifications. It focuses on four core issues:

1. **Cross-modal Consistency of Task Representations**: The paper finds that conceptually similar tasks are mapped to similar task vector representations across modalities (text or image) and specifications. In other words, the model produces consistent task representations whether the input is text or an image.
2. **Evolution Process of Task Representations**: The paper reveals that when VLMs generate answers, token representations pass through three stages: input, task, and answer. This process is consistent across modalities and specifications.
3. **Cross-modal Transfer of Task Vectors**: The paper explores whether task vectors can be transferred from one modality (e.g., text) to another (e.g., image) and evaluates how effective such transfer is. Experimental results show that cross-modal transfer can significantly improve task performance, especially in low-data scenarios.
4. **Combination of Instructions and Examples**: The paper also investigates how to combine instruction-based task vectors with example-based task vectors to improve the quality and sample efficiency of task representations.

### Main Contributions

1. **Classification of Task Vectors**: The paper proposes a classification of task vectors in which tasks can be specified not only through examples but also through instructions.
2. **Evolution Pattern of Task Representations**: The paper demonstrates how task representations evolve across model layers, showing that the pattern is consistent regardless of input modality or specification format.
3. **Cross-modal Transfer**: The paper explores cross-modal transfer of task vectors, which serves as a useful measure of how interchangeable different task representations are and can make task definitions more expressive.

### Experimental Results

- **Transfer from Text ICL to Image Query**: Across all VLMs evaluated, cross-modal patching performs best, improving performance by 14-33% over providing the examples within the same context window.
- **Transfer from LLM to VLM**: Many VLMs are initialized from pre-trained LLMs. Experiments find that the task vectors an LLM and a VLM produce from the same text ICL examples are highly similar, and that LLM task vectors can be successfully applied to VLM image queries.
- **Combination of Instruction Vectors**: Combining instruction vectors with example vectors can significantly improve sample efficiency and reduce the variance of example vectors.

### Conclusion

By analyzing the task representation mechanisms of VLMs in depth, the paper reveals the generality and transferability of task vectors across modalities and specifications, providing new perspectives and methods for handling multimodal tasks.
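
The three operations this summary keeps returning to (comparing task vectors, ensembling them, and patching one into a query) can be illustrated with a toy sketch. This is not the paper's implementation: hidden states are plain nested lists rather than activations from a real VLM, the helper names (`cosine_similarity`, `ensemble_task_vectors`, `patch_hidden_state`) are hypothetical, and an element-wise mean is assumed as the ensembling rule.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ensemble_task_vectors(vectors):
    """Element-wise mean of several task vectors.

    Assumption: averaging is used here only as the simplest way to
    combine exemplar-based and instruction-based vectors.
    """
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def patch_hidden_state(hidden_states, layer, position, task_vector):
    """Return a copy of hidden_states with one token state replaced.

    hidden_states is indexed as [layer][token position][feature].
    The input is left unmodified so the unpatched run stays available.
    """
    patched = [[list(token) for token in layer_states]
               for layer_states in hidden_states]
    patched[layer][position] = list(task_vector)
    return patched


# Toy data: a task vector derived from text examples, one derived from
# an instruction, and a 2-layer, 2-token "image query" forward pass.
exemplar_vector = [1.0, 3.0]
instruction_vector = [3.0, 1.0]
query_states = [[[0.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 0.0]]]

combined = ensemble_task_vectors([exemplar_vector, instruction_vector])
patched_states = patch_hidden_state(query_states, layer=1, position=-1,
                                    task_vector=combined)
similarity = cosine_similarity(exemplar_vector, instruction_vector)
```

In a real model the patch would be applied to a chosen layer's activation during the forward pass of the image query, and the layer and token position would be selected empirically; both are fixed arbitrarily here.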