A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval.

Junting Li, Dehao Wu, Yuesheng Zhu, Zhiqiang Bai
DOI: https://doi.org/10.1007/978-3-030-92310-5_55
2021-01-01
Abstract: With the explosive growth of videos on the internet, video-text retrieval is receiving increasing attention. Most existing approaches map videos and texts into a shared latent vector space and then measure their similarities. However, for video encoding, most methods ignore the interactions among frames in a video. In addition, many works obtain features of various aspects but lack a proper module to fuse them: they use simple concatenation, gate units, or average pooling, which may not fully exploit the interactions among different features. To solve these problems, we propose the Multi-Interaction Model (MIM). Concretely, we propose a multi-scale interaction module to exploit interactions among frames. In addition, a fusion module is designed to combine representations from different branches by encoding them into various subspaces and capturing interactions among them. Furthermore, to learn more discriminative representations, we propose an improved loss function and design a new mining strategy that selectively retains informative pairs. Extensive experiments conducted on the MSR-VTT, TGIF, and VATEX datasets demonstrate the effectiveness of the proposed video-text retrieval model.
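The shared-space retrieval setup that the abstract attributes to most existing approaches can be illustrated with a minimal sketch. This is not the MIM model or its loss. It assumes that some encoders have already produced fixed-length embeddings, and it assumes cosine similarity as the similarity measure; the function names and toy vectors are hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_videos(text_embedding, video_embeddings):
    """Return video indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, v) for v in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings in a shared 3-dimensional space (made up, not from the paper).
text_query = [1.0, 0.0, 1.0]
videos = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_videos(text_query, videos)
```

In this toy setup the video pointing in the same direction as the query ranks first and the orthogonal one ranks last. The paper's contributions concern how the embeddings are produced and trained, which this sketch does not model.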