Coordinated and Specific Restricted Boltzmann Machine for Cross-Modal Retrieval

Menghan Xu, Bo Sun, Jing Jiang, Fangxiang Feng
DOI: https://doi.org/10.1117/12.2639886
2022-01-01
Abstract: With the rapid growth of multimodal web data, cross-modal retrieval, i.e., using a text query to search for images or vice versa, has attracted considerable research attention. Existing approaches usually learn a common representation space in which different modalities can be compared directly. However, little work has been done to verify that the learned common representation space contains only the part shared between modalities. In this paper, we present a coordinated and specific restricted Boltzmann machine (CSRBM) that can distinguish the common part from the modality-specific part of different modalities. The proposed CSRBM consists of two RBMs, each with two hidden layers: a common hidden layer that learns the patterns shared across modalities, and a modality-specific hidden layer that learns the patterns owned by an individual modality. To verify that our model separates the two parts effectively, we construct a multimodal dataset based on the popular MNIST dataset. Moreover, we evaluate our model on three publicly available real-world datasets on the task of cross-modal retrieval. Extensive experiments demonstrate the effectiveness of our CSRBM.
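The architecture described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the energy function, the training objective, or any layer sizes, so the class name `TwoGroupRBM`, the helper `rank_by_common_code`, the dimensions, and the use of squared Euclidean distance for ranking are all assumptions made for illustration. The sketch only shows the forward (visible-to-hidden) pass of two RBMs whose hidden units are split into a common group and a modality-specific group, and how retrieval could compare the common codes alone.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used for RBM hidden-unit activation probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


class TwoGroupRBM:
    """One RBM with two hidden groups: 'common' and 'specific'.

    Hypothetical layout for illustration; training is not shown.
    """

    def __init__(self, n_visible, n_common, n_specific, rng):
        # Small random weights, zero hidden biases (a common RBM initialization).
        self.W_common = 0.01 * rng.standard_normal((n_visible, n_common))
        self.W_specific = 0.01 * rng.standard_normal((n_visible, n_specific))
        self.b_common = np.zeros(n_common)
        self.b_specific = np.zeros(n_specific)

    def hidden_probs(self, v):
        """Return P(h=1 | v) for the common and the specific hidden group."""
        h_common = sigmoid(v @ self.W_common + self.b_common)
        h_specific = sigmoid(v @ self.W_specific + self.b_specific)
        return h_common, h_specific


def rank_by_common_code(query_code, candidate_codes):
    """Rank candidates by squared Euclidean distance to the query code.

    The distance measure is an assumption; only common codes are compared.
    """
    distances = np.sum((candidate_codes - query_code) ** 2, axis=1)
    return np.argsort(distances)


rng = np.random.default_rng(0)
image_rbm = TwoGroupRBM(n_visible=784, n_common=64, n_specific=32, rng=rng)
text_rbm = TwoGroupRBM(n_visible=10, n_common=64, n_specific=32, rng=rng)

images = rng.random((5, 784))   # five placeholder image vectors
text_query = rng.random(10)     # one placeholder text vector

image_common, _ = image_rbm.hidden_probs(images)
query_common, _ = text_rbm.hidden_probs(text_query)
ranking = rank_by_common_code(query_common, image_common)
```

In this sketch the modality-specific codes are computed but ignored at retrieval time, which mirrors the abstract's point that only the shared part should drive cross-modal comparison. With untrained weights the ranking is meaningless; a coordinated training objective that pulls the two common codes of a matched pair together would be needed, and its exact form is not stated in the abstract.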