Uniting Image and Text Deep Networks via Bi-directional Triplet Loss for Retrieval

Yan Hua, Jianhe Du
DOI: https://doi.org/10.1109/iceiec.2019.8784629
2019-01-01
Abstract: Images and text are heterogeneous data, so it is difficult to retrieve images with a text query or texts with an image query. Thanks to the success of deep learning in recent years, feature representations of images and text have advanced greatly. However, distances between the two still cannot be compared directly, since the features come from different modalities. In this paper, we propose a bi-directional triplet constraint for learning image and text deep networks by simultaneously 1) minimizing the distance between relevant image-text pairs, and 2) pushing both the distance between an image and its irrelevant text and the distance between a text and its irrelevant image to be larger than the distance of the relevant pair. Our triplet loss can be seen as a cross-modal, bi-directional extension of the large margin nearest neighbor method, which was designed for single-modal data classification. For raw images, a fully-connected subnetwork built on ResNet is designed for image representation learning, and the same architecture is designed for text representation learning. The two deep models are learned jointly with the bi-directional triplet loss in an end-to-end manner. Experiments on a widely used dataset verify the effectiveness of the proposed model.
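The two constraints in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' formulation: the abstract does not give the distance function, the margin, the weighting between the two terms, or how irrelevant samples are chosen. The sketch assumes squared Euclidean distance, a hypothetical `margin` of 0.2, a hypothetical `pull_weight`, and treats every non-matching item in the batch as irrelevant.

```python
import numpy as np

def bidirectional_triplet_loss(img, txt, margin=0.2, pull_weight=1.0):
    """Illustrative loss; img and txt are (n, d) arrays where row i of img
    is the relevant partner of row i of txt. All defaults are assumptions."""
    # d[i, j] = squared Euclidean distance between image i and text j.
    diff = img[:, None, :] - txt[None, :, :]
    d = np.sum(diff ** 2, axis=2)
    pos = np.diag(d)  # distances of the relevant image-text pairs
    n = d.shape[0]
    off_diag = ~np.eye(n, dtype=bool)  # mask selecting irrelevant pairs

    # Image as anchor: irrelevant text j should be farther from image i
    # than the relevant text i, by at least `margin`.
    i2t = np.maximum(0.0, margin + pos[:, None] - d)
    # Text as anchor: irrelevant image i should be farther from text j
    # than the relevant image j, by at least `margin`.
    t2i = np.maximum(0.0, margin + pos[None, :] - d)

    push = (i2t[off_diag].sum() + t2i[off_diag].sum()) / max(n * (n - 1), 1)
    pull = pos.mean()  # term 1: shrink distances of relevant pairs
    return pull_weight * pull + push

# Toy embeddings standing in for the outputs of the two subnetworks.
image_emb = np.eye(3) * 10.0
text_emb = image_emb.copy()
aligned = bidirectional_triplet_loss(image_emb, text_emb)
misaligned = bidirectional_triplet_loss(image_emb, text_emb[[1, 2, 0]])
```

In this toy setup, `aligned` is zero because each pair coincides and all irrelevant items lie beyond the margin, while `misaligned` is positive. In an end-to-end setting, the same expression would be written in an autodiff framework so that gradients reach both networks.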