Learning Controlled Semantic Embedding for Cross-Modal Retrieval.

Rong Yang, Min Meng, Jun Yu, Jigang Wu
DOI: https://doi.org/10.1109/ICME51207.2021.9428280
2021-01-01
Abstract: Cross-modal retrieval has attracted considerable attention because it supports querying across different modalities. However, most existing methods focus on directly mapping heterogeneous features into a common subspace, which inevitably results in highly entangled representations and thereby prevents them from bridging the modality gap. This paper presents a novel deep framework called Controlled Semantic Embedding (CSE), which is the first attempt to learn disentangled representations with a controlled semantic structure for cross-modal retrieval. Specifically, we design two generative networks based on variational autoencoders, which incorporate semantic discriminators for effective prediction of structured semantics. Meanwhile, a self-supervised semantic network is seamlessly integrated into the generative networks to supervise the semantic embedding process, and is further coupled with a quantizer that controls the quantizability of the semantic representations. Extensive experiments show that CSE outperforms other state-of-the-art methods in cross-modal retrieval.