Memory-Inspired Temporal Prompt Interaction for Text-Image Classification

Xinyao Yu, Hao Sun, Ziwei Niu, Rui Qin, Zhenjia Bai, Yen-Wei Chen, Lanfen Lin
2024-01-26
Abstract: In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in various natural language processing and computer vision tasks. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models on downstream tasks. Hence, prompt-based interaction strategies have been studied to align modalities more efficiently. In this context, we propose a novel prompt-based multimodal interaction strategy inspired by human memory strategy, namely Memory-Inspired Temporal Prompt Interaction (MITP). Our proposed method involves two stages, as in human memory strategy: the acquiring stage, and the consolidation and activation stage. We utilize temporal prompts on intermediate layers to imitate the acquiring stage, leverage similarity-based prompt interaction to imitate memory consolidation, and employ a prompt generation strategy to imitate memory activation. The main strength of our method is that the prompt vectors interact on intermediate layers, which enables sufficient information exchange between modalities while keeping the number of trainable parameters and the memory usage low. We achieve competitive results on several datasets with relatively small memory usage and 2.0M trainable parameters (about 1% of the pre-trained foundation model).
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper presents a new multimodal interaction strategy called Memory-Inspired Temporal Prompt Interaction (MITP), which aims to address the limitations of existing methods in cross-modal information exchange. Current approaches typically extract features from each modality separately and then mix these features through fusion modules. However, this late fusion may not fully capture the relationships between and within modalities. Inspired by human memory mechanisms, the researchers use temporal prompts on intermediate layers for information retrieval and storage, and let the modalities exchange information through similarity-based prompt interaction.
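To make the idea of similarity-based prompt interaction more concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not give the exact formulation, so the cosine similarity, the softmax weighting, the residual update, the function name `similarity_prompt_interaction`, and the mixing weight `alpha` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def similarity_prompt_interaction(text_prompts, image_prompts, alpha=0.5):
    """Exchange information between two sets of prompt vectors.

    Hypothetical formulation (an assumption, not taken from the paper):
    each prompt receives a similarity-weighted average of the prompts of
    the other modality, added as a residual scaled by `alpha`.

    text_prompts:  array of shape (n_text, dim)
    image_prompts: array of shape (n_image, dim)
    """
    # Cosine similarity between every text prompt and every image prompt.
    t_norm = text_prompts / np.linalg.norm(text_prompts, axis=1, keepdims=True)
    v_norm = image_prompts / np.linalg.norm(image_prompts, axis=1, keepdims=True)
    sim = t_norm @ v_norm.T  # shape (n_text, n_image)

    # Each row of the softmax sums to 1, so the update is a weighted average.
    text_out = text_prompts + alpha * softmax(sim, axis=1) @ image_prompts
    image_out = image_prompts + alpha * softmax(sim.T, axis=1) @ text_prompts
    return text_out, image_out, sim


# Toy example: 4 text prompts and 6 image prompts of dimension 8.
rng = np.random.default_rng(0)
text_prompts = rng.normal(size=(4, 8))
image_prompts = rng.normal(size=(6, 8))
text_out, image_out, sim = similarity_prompt_interaction(text_prompts, image_prompts)
```

In this sketch only the prompt vectors would be trainable while the backbone stays frozen, which is consistent with the small trainable-parameter budget reported in the abstract. How MITP actually computes the similarity and at which layers the interaction takes place is described in the paper itself.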