MutualFormer: Multi-modal Representation Learning via Cross-Diffusion Attention

DOI: https://doi.org/10.1007/s11263-024-02067-x
IF: 13.369
2024-04-24
International Journal of Computer Vision
Abstract:Aggregating multi-modal data to obtain reliable data representation attracts more and more attention. Recent studies demonstrate that Transformer models usually work well for multi-modal tasks. Existing Transformers generally either adopt the cross-attention (CA) mechanism or simple concatenation to achieve the information interaction among different modalities which generally ignore the issue of modality gap. In this work, we re-think Transformer and extend it to MutualFormer for multi-modal data representation. Rather than CA in Transformer, MutualFormer employs our new design of cross-diffusion attention (CDA) to conduct the information communication among different modalities. Comparing with CA, the main advantages of the proposed CDA are three aspects. First, the cross-affinities in CDA are defined based on the individual modal affinities (token metrics) which thus can naturally alleviate the issue of modality/domain gap existed in traditional token feature based CA definition. Second, CDA provides a general scheme which can either be used for multi-modal representation or serve as the post-optimization for existing CA models. Third, CDA is implemented efficiently. We successfully apply the MutualFormer on several multi-modal learning tasks. Extensive experiments demonstrate the effectiveness of the proposed MutualFormer.
computer science, artificial intelligence
What problem does this paper attempt to address?