Adaptive Marginalized Semantic Hashing for Unpaired Cross-Modal Retrieval

Kaiyi Luo, Chao Zhang, Huaxiong Li, Xiuyi Jia, Chunlin Chen
DOI: https://doi.org/10.1109/TMM.2023.3245400
2022-07-25
Abstract: In recent years, Cross-Modal Hashing (CMH) has attracted much attention due to its fast query speed and efficient storage. Previous work has achieved promising results for Cross-Modal Retrieval (CMR) by discovering discriminative hash codes and modality-specific hash functions. Nonetheless, most existing CMR works are subject to two restrictions: 1) data of different modalities are assumed to be fully paired, which is impractical in real applications due to missing samples and false data alignment, and 2) binary regression targets, including the label matrix and binary codes, are too rigid to effectively learn semantic-preserving hash codes and hash functions. To address these problems, this paper proposes an Adaptive Marginalized Semantic Hashing (AMSH) method, which not only enhances the discrimination of latent representations and hash codes through adaptive margins, but can also be used for both paired and unpaired CMR. AMSH is a two-step method. In the first step, it generates semantic-aware modality-specific latent representations with adaptively marginalized labels, which enlarges the distances between different classes, and exploits the labels to preserve the inter-modal and intra-modal semantic similarities in the latent representations and hash codes. In the second step, adaptive margin matrices are embedded into the hash codes to enlarge the gaps between positive and negative bits, which improves the discrimination and robustness of the hash functions. On this basis, AMSH generates similarity-preserving hash codes and robust hash functions without requiring strict one-to-one data correspondence. Experiments on several benchmark datasets demonstrate the superiority and flexibility of AMSH over some state-of-the-art CMR methods.
Multimedia
What problem does this paper attempt to address?
This paper addresses two key challenges in cross-modal retrieval (CMR):

1. **Unpaired Cross-Modal Retrieval (UCMR) problem**:
   - Most existing CMR methods assume that data in different modalities are fully paired, that is, each image has a corresponding text description and vice versa. In practice, sample loss and incorrect data alignment often make this assumption unrealistic, so effective cross-modal retrieval on unpaired data remains an open problem.
2. **The problem of learning semantic-preserving hash codes and hash functions**:
   - Existing methods usually use strict binary regression targets (such as label matrices and binary codes). These targets make the learned hash codes and hash functions too rigid to preserve semantic information effectively, which reduces the robustness and discriminative ability of the model. Specifically, these methods ignore the distances between different classes, which degrades classification performance.

To solve these problems, the authors propose a new method, Adaptive Marginalized Semantic Hashing (AMSH). AMSH enhances the discrimination of latent representations and hash codes by introducing adaptive margin matrices, and it can perform cross-modal retrieval on unpaired data. As a two-step method, AMSH first generates semantic-aware modality-specific latent representations, then embeds adaptive margin matrices into the hash codes to widen the gap between positive and negative bits, thereby improving the discrimination and robustness of the hash functions.

### Specific solutions

1. **Adaptive Marginalized Regression**:
   - Introduce an adaptive margin matrix \( E^{(i)} \) to adjust the regression target, increasing the distances between different classes and thus improving the discrimination of features.
   The formula is as follows:

   \[ \min_{P^{(i)}, V^{(i)}, E^{(i)}} \sum_{i = 1}^m \| L^{(i)}+R^{(i)} \odot E^{(i)}-P^{(i)} V^{(i)} \|_F^2 \quad \text{s.t.} \quad E^{(i)} \geq 0 \]

   where \( R^{(i)} \) is an index matrix that marks the positive and negative directions, and \( E^{(i)} \) is a margin matrix that adjusts the value of each element.

2. **Semantic Similarity Embedding**:
   - Enhance semantic information by preserving the similarity between modalities and the compactness within modalities. The formula is as follows:

   \[ \min_{V^{(i)}} \sum_{i \neq j} \| V^{(j)T} V^{(i)}-r S^{(ji)} \|_F^2+\sum_{i = 1}^m \| V^{(i)T} V^{(i)}-r S^{(ii)} \|_F^2 \]

   where \( S^{(ij)} \) is a similarity matrix between modalities (\( i \neq j \)) or within a modality (\( i = j \)).

3. **Overall optimization objective**:
   - Combine the terms above; the final objective function is:

   \[ \begin{aligned} &\min \sum_{i = 1}^m \left( \| L^{(i)}+R^{(i)} \odot E^{(i)}-P^{(i)} V^{(i)} \|_F^2+\eta \| B^{(i)}-V^{(i)} \|_F^2+\lambda \| B^{(i)T} V^{(i)}-r S^{(ii)} \|_F^2 \right) \\ &+ \beta \ \end{aligned} \]
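
To make the margin-regression and similarity-embedding terms concrete, the following is a minimal plain-Python sketch for a single modality. It is an illustration under stated assumptions, not the authors' implementation. The direction matrix is assumed to be \( R = +1 \) where a 0/1 label equals 1 and \( -1 \) elsewhere; the summary only says that \( R^{(i)} \) marks positive and negative directions. The margin update is the per-entry minimizer of the regression term with \( P \) and \( V \) held fixed, which follows from \( (l + r e - z)^2 = (e - r(z - l))^2 \) when \( r = \pm 1 \); whether AMSH uses this exact update is not stated here. All function names and toy values are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def direction_matrix(labels):
    """Assumed form of R: +1 where the 0/1 label is 1, -1 elsewhere."""
    return [[1.0 if l == 1 else -1.0 for l in row] for row in labels]


def margin_update(labels, directions, pred):
    """Per-entry minimizer of (l + r*e - z)^2 over e >= 0, valid for r in {+1, -1}."""
    return [[max(r * (z - l), 0.0) for l, r, z in zip(l_row, r_row, z_row)]
            for l_row, r_row, z_row in zip(labels, directions, pred)]


def regression_loss(labels, directions, margins, pred):
    """Squared Frobenius norm of L + R (.) E - P V, with pred = P V."""
    return sum((l + r * e - z) ** 2
               for l_row, r_row, e_row, z_row in zip(labels, directions, margins, pred)
               for l, r, e, z in zip(l_row, r_row, e_row, z_row))


def similarity_loss(v_j, v_i, s_ji, r):
    """Squared Frobenius norm of V_j^T V_i - r * S_ji."""
    gram = matmul(transpose(v_j), v_i)
    return sum((g - r * s) ** 2
               for g_row, s_row in zip(gram, s_ji)
               for g, s in zip(g_row, s_row))


# Toy data (hypothetical): 2 classes, 2 samples, 2 latent dimensions.
L = [[1, 0], [0, 1]]              # label matrix, classes x samples
P = [[1.0, 0.0], [0.0, 1.0]]      # projection, classes x latent dims
V = [[1.5, -0.4], [0.2, 0.7]]     # latent representation, latent dims x samples

pred = matmul(P, V)
R = direction_matrix(L)
E = margin_update(L, R, pred)

zero_margin = [[0.0, 0.0], [0.0, 0.0]]
loss_rigid = regression_loss(L, R, zero_margin, pred)   # rigid target L
loss_margin = regression_loss(L, R, E, pred)            # relaxed target L + R (.) E

# Orthogonal latent columns reproduce an identity similarity matrix exactly.
V_sim = [[1.0, -1.0], [1.0, 1.0]]
S_identity = [[1.0, 0.0], [0.0, 1.0]]
loss_sim = similarity_loss(V_sim, V_sim, S_identity, 2)

print(E, loss_rigid, loss_margin, loss_sim)
```

In this toy case the margins absorb the two predictions that overshoot their targets in the correct direction, so the relaxed target gives a smaller regression loss than the rigid label matrix, while entries that fall short of their targets keep a zero margin and are still penalized.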