Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Weikuo Guo, Xiangwei Kong, Huaibo Huang
DOI: https://doi.org/10.1016/j.neucom.2024.129003
IF: 6
2024-01-01
Neurocomputing
Abstract: Extracting semantically consistent representations from multi-modal data helps computers understand the human world more comprehensively. Visual-semantic matching, as one of the fundamental tasks for multi-modal learning, attracts continuous attention. Recent research makes unflagging endeavors to enhance the matching performance, but sometimes at the expense of overlooking the delicate balance between efficiency and effectiveness. In this paper, we aim to address this dilemma through a newly proposed attention-mechanism-based architecture. To ensure optimal effectiveness, we adopt a more advanced Transformer Encoder (TE) as our basic model and introduce two significant ameliorations to tailor it for the visual-semantic matching task. Initially, we incorporate fine-grained supervision into the classic TE, allowing our model to capture sophisticated correspondences between different modalities. Subsequently, we employ a dynamic attention-evolving strategy to selectively pass useful information and strengthen the attention pattern consistency between adjacent TE blocks. To maintain efficiency, we propose a novel Select & Re-rank strategy that enables the model to ignore redundant information. This approach significantly reduces the computational cost and increases the matching speed with relatively minimal performance degradation. The proposed architecture can gradually capture and reorganize useful information from inter-modality as well as intra-modality under the supervision of both fine-grained and global similarity, which leads to more comprehensive and discriminative embeddings. Experiments on two benchmark datasets show that the proposed method achieves competitive results in terms of both efficiency and effectiveness.
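The abstract does not spell out how Select & Re-rank is implemented, so the following is only a minimal sketch of the general two-stage idea it points to: a cheap global similarity selects a short list of candidates, and a costlier fine-grained score re-ranks just that list. Everything here is an assumption made for illustration, not the authors' method: the function names (`select_and_rerank`, `fine_grained_score`, `mean_pool`), the mean-pooled global embedding, the max-over-regions word score, and the parameter `k`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pool(vectors):
    """Average a list of vectors into one global embedding (assumed pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fine_grained_score(words, regions):
    """Hypothetical fine-grained score: each word takes its best-matching
    region, and the per-word maxima are averaged."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)


def select_and_rerank(words, gallery, k):
    """Return the names of the k selected images, ordered by the fine score.

    words   : list of word feature vectors for one text query
    gallery : dict mapping image name -> list of region feature vectors
    k       : number of candidates kept after the cheap selection stage
    """
    query_global = mean_pool(words)

    # Stage 1 (select): rank the whole gallery with the cheap global score.
    coarse = sorted(
        gallery,
        key=lambda name: cosine(query_global, mean_pool(gallery[name])),
        reverse=True,
    )
    selected = coarse[:k]

    # Stage 2 (re-rank): apply the expensive score to the k survivors only.
    return sorted(
        selected,
        key=lambda name: fine_grained_score(words, gallery[name]),
        reverse=True,
    )
```

In this sketch the expensive score is evaluated k times per query instead of once per gallery item, which is where the saving would come from; the trade-off is that a correct match discarded in the first stage cannot be recovered by the second.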