DEMO: A Statistical Perspective for Efficient Image-Text Matching

Fan Zhang, Xian-Sheng Hua, Chong Chen, Xiao Luo

2024-05-19

Abstract: Image-text matching has been a long-standing problem, which seeks to connect vision and language through semantic understanding. Due to the capability to manage large-scale raw data, unsupervised hashing-based approaches have gained prominence recently. They typically construct a semantic similarity structure using the natural distance, which subsequently provides guidance to the model optimization process. However, the similarity structure could be biased at the boundaries of semantic distributions, causing error accumulation during sequential optimization. To tackle this, we introduce a novel hashing approach termed Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching. From a statistical view, DEMO characterizes each image using multiple augmented views, which are considered as samples drawn from its intrinsic semantic distribution. Then, we employ a non-parametric distribution divergence to ensure a robust and precise similarity structure. In addition, we introduce collaborative consistency learning which not only preserves the similarity structure in the Hamming space but also encourages consistency between retrieval distribution from different directions in a self-supervised manner. Through extensive experiments on three benchmark image-text matching datasets, we demonstrate that DEMO achieves superior performance compared with many state-of-the-art methods.

Computer Vision and Pattern Recognition, Information Retrieval

What problem does this paper attempt to address?

The paper aims to address the problem of Image-Text Matching, an important task that connects computer vision and natural language processing. Specifically, the paper proposes a new hashing method called DEMO, which improves unsupervised cross-modal hashing techniques from a statistical perspective. The main objectives of DEMO are:

1. **Addressing the biased similarity structure issue**: Traditional methods use natural distances (such as cosine distance) to construct semantic similarity structures, but they are prone to bias at the boundaries of semantic distributions, leading to error accumulation during optimization. DEMO introduces data augmentation to estimate potential semantic distributions and employs a non-parametric distribution discrepancy measure (energy distance) to construct more accurate and robust semantic structures.
2. **Reducing distribution differences between modalities**: Data from different modalities (images and texts) may follow different distributions when generating binary codes in the network, which can weaken the effectiveness of cross-modal retrieval. DEMO promotes distribution consistency for cross-modal retrieval based on image and text queries through a self-supervised approach, thereby obtaining modality-invariant binary descriptors.

In summary, DEMO aims to improve the performance of hashing methods in the image-text matching task through innovative distribution mining and consistency learning strategies.
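To make the energy distance mentioned above more concrete, the following is a minimal sketch of the standard sample-based energy distance between two sets of feature vectors, where each set stands in for the augmented views of one image. It is not taken from the paper: the function names, the toy 2-D "features", and the choice of the V-statistic estimator with Euclidean distance are all assumptions made for illustration, and DEMO's exact formulation may differ.

```python
import math
from itertools import product


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y), x from a and y from b."""
    total = sum(math.dist(x, y) for x, y in product(a, b))
    return total / (len(a) * len(b))


def energy_distance(a, b):
    """Squared energy distance between two sample sets (V-statistic form):
    2 * E||X - Y|| - E||X - X'|| - E||Y - Y'||.
    """
    return (
        2.0 * mean_pairwise_distance(a, b)
        - mean_pairwise_distance(a, a)
        - mean_pairwise_distance(b, b)
    )


# Hypothetical features: three augmented views per image (toy 2-D vectors).
views_img1 = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
views_img2 = [(0.05, 0.05), (0.15, 0.05), (0.05, 0.15)]  # close to image 1
views_img3 = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]        # far from image 1

# A smaller value suggests the two view sets come from more similar distributions.
print(energy_distance(views_img1, views_img2))
print(energy_distance(views_img1, views_img3))
```

Because the measure compares whole sets of views rather than a single pair of points, one outlying view has less influence on the resulting similarity than it would under a point-to-point distance such as cosine distance.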
