Joint embeddings with multimodal cues for video-text retrieval

Niluthpol C. Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury
DOI: https://doi.org/10.1007/s13735-018-00166-3
2019-01-12
International Journal of Multimedia Information Retrieval
Abstract: For multimedia applications, constructing a joint representation that could carry information for multiple modalities could be very conducive for downstream use cases. In this paper, we study how to effectively utilize available multimodal cues from videos in learning joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are often very limited by their size considering the enormous amount of diversity the visual world contains. This makes it extremely difficult to develop a robust video-text retrieval system based on deep neural network models. In this regard, we propose a framework that simultaneously utilizes multimodal visual cues by a “mixture of experts” approach for retrieval. We conduct extensive experiments to verify that our system is able to boost the performance of the retrieval task compared to the state of the art. In addition, we propose a modified pairwise ranking loss function in training the embedding and study the effect of various loss functions. Experiments on two benchmark datasets show that our approach yields significant gain compared to the state of the art.
computer science, artificial intelligence, software engineering