Enhanced Semantic Similarity Learning Framework for Image-Text Matching
Kun Zhang, Bo Hu, Huatian Zhang, Zhe Li, Zhendong Mao
DOI: https://doi.org/10.1109/tcsvt.2023.3307554
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Image-text matching is a fundamental task for bridging vision and language. The critical challenge lies in accurately learning the semantic similarity between these two heterogeneous modalities. To examine the semantic similarity between visual and textual features, existing methods typically default to a static dimensional correspondence mechanism, i.e., they use a single dimension as the measure-unit and perform one-to-one correspondence, as in the cosine/Euclidean distance or the weighted similarity. In this paper, in contrast to single-dimensional correspondence with its limited semantic expressive capability, we propose a novel enhanced semantic similarity learning (ESL) framework, which generalizes both the measure-units and their correspondences into a dynamic learnable framework to examine the multi-dimensional enhanced correspondence between visual and textual features. Specifically, we first devise intra-modal multi-dimensional aggregators with an iterative enhancing mechanism, which dynamically capture new measure-units integrated by hierarchical multi-dimensions, producing diverse semantic combinatorial expressive capabilities that provide richer and more discriminative information for similarity examination. Then, we devise inter-modal enhanced correspondence learning with sparse contribution degrees, which comprehensively and efficiently determines the cross-modal semantic similarity. Extensive experiments verify its superiority in achieving state-of-the-art performance. Code will be released.
engineering, electrical & electronic
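
The static single-dimension correspondence that the abstract attributes to existing methods can be illustrated with a minimal sketch. It covers only the baseline measures named in the abstract (cosine similarity and a weighted similarity) and is not the ESL method itself. The function names, the plain-list feature representation, and the fixed per-dimension weights are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def per_dimension_scores(visual, textual):
    """One-to-one correspondence: dimension i of the visual feature is
    compared only with dimension i of the textual feature."""
    if len(visual) != len(textual):
        raise ValueError("features must have the same dimensionality")
    v = l2_normalize(visual)
    t = l2_normalize(textual)
    return [a * b for a, b in zip(v, t)]


def cosine_similarity(visual, textual):
    """Cosine similarity: every single dimension contributes equally."""
    return sum(per_dimension_scores(visual, textual))


def weighted_similarity(visual, textual, weights):
    """Weighted similarity: each single dimension gets its own weight
    (hypothetical fixed weights here; in practice they would be learned)."""
    scores = per_dimension_scores(visual, textual)
    if len(weights) != len(scores):
        raise ValueError("need exactly one weight per dimension")
    return sum(w * s for w, s in zip(weights, scores))
```

In both functions the measure-unit is a single dimension and the pairing between dimensions is fixed in advance, which is the limitation the abstract says ESL generalizes by learning multi-dimensional measure-units and their correspondences.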